I have a website with 500k users (running on SQL Server 2008). I now want to add activity streams of users and their friends. After testing a few things on SQL Server, it became apparent that an RDBMS is not a good choice for this kind of feature: it’s slow, even when I heavily denormalized my data. After looking at NoSQL solutions, I’ve figured that I can use MongoDB for this. I’ll be following a data structure based on the activitystrea.ms JSON specification for activity streams.
So my question is: what would be the best schema design for an activity stream in MongoDB? With this many users, you can pretty much predict that it will be very heavy on writes, hence my choice of MongoDB, which has great write performance. I’ve thought about three types of structures; please tell me if they make sense or whether I should use other schema patterns.
1 – Store each activity with all friends/followers in this pattern:
{
  _id: 'activ123',
  actor: {
    id: 'person1'
  },
  verb: 'follow',
  object: {
    objecttype: 'person',
    id: 'person2'
  },
  updatedon: Date(),
  consumers: [
    'person3', 'person4', 'person5', 'person6' // ... and so on
  ]
}
2 – Second design, in a collection named activity_stream_fanout, with one document per person holding that person’s activities:
{
  _id: 'activ_fanout_123',
  personId: 'person3',
  activities: [
    {
      _id: 'activ123',
      actor: {
        id: 'person1'
      },
      verb: 'follow',
      object: {
        objecttype: 'person',
        id: 'person2'
      },
      updatedon: Date()
    },
    {
      // activity feed 2
    }
  ]
}
3 – This approach would be to store the activity items in one collection, and the consumers in another. In activities, you might have a document like:
{ _id: "123",
actor: { person: "UserABC" },
verb: "follow",
object: { person: "someone_else" },
updatedOn: Date(...)
}
And then, for followers, I would have the following “notifications” documents:
{ activityId: "123", consumer: "someguy", updatedOn: Date(...) }
{ activityId: "123", consumer: "otherguy", updatedOn: Date(...) }
{ activityId: "123", consumer: "thirdguy", updatedOn: Date(...) }
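To make the read paths of the three options concrete, here is a rough in-memory Python sketch. Plain lists and dicts stand in for the collections, so nothing here talks to MongoDB; the function names, the `make_activity` helper, and the `activities` / `notifications` collection names mentioned in the comments are my own placeholders.

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc)

def make_activity(activity_id):
    """Hypothetical helper: builds one activity document like the ones above."""
    return {"_id": activity_id, "actor": {"id": "person1"}, "verb": "follow",
            "object": {"objecttype": "person", "id": "person2"}, "updatedon": now}

# Option 1: one document per activity, consumers embedded as an array.
option1_activities = [dict(make_activity("activ123"), consumers=["person3", "person4"])]

def feed_option1(person_id):
    # Roughly what db.activities.find({consumers: person_id}) would match.
    return [a for a in option1_activities if person_id in a["consumers"]]

# Option 2: one fan-out document per person, activities embedded as an array.
option2_fanout = {
    "person3": {"_id": "activ_fanout_123", "personId": "person3",
                "activities": [make_activity("activ123")]},
}

def feed_option2(person_id):
    # A single document lookup by personId returns the whole feed.
    doc = option2_fanout.get(person_id)
    return doc["activities"] if doc else []

# Option 3: activities in one collection, per-consumer notifications in another.
option3_activities = {"123": make_activity("123")}
option3_notifications = [
    {"activityId": "123", "consumer": "someguy", "updatedOn": now},
    {"activityId": "123", "consumer": "otherguy", "updatedOn": now},
]

def feed_option3(consumer):
    # Step 1: find this consumer's notifications, newest first.
    mine = [n for n in option3_notifications if n["consumer"] == consumer]
    mine.sort(key=lambda n: n["updatedOn"], reverse=True)
    # Step 2: fetch the referenced activities.
    return [option3_activities[n["activityId"]] for n in mine]
```

As I understand the trade-off: option 1 needs a single write per activity but the consumers array grows with the follower count; option 2 reads a feed in one lookup but copies each activity into every follower's document; option 3 writes one small notification per follower and needs a two-step read.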
Your answers are greatly appreciated.
I’d go with the following structure:
Use one collection for all actions that happened, Actions.
Use another collection for who follows whom, Subscribers.
Use a third collection, Newsfeed, for a certain user’s news feed; items are fanned out from the Actions collection.
The Newsfeed collection will be populated by a worker process that asynchronously processes new Actions. Therefore, news feeds won’t populate in real time. I disagree with Geert-Jan in that real time is important; I believe most users don’t care about even a minute of delay in most (not all) applications (for real time, I’d choose a completely different architecture).
If you have a very large number of consumers, the fan-out can take a while, true. On the other hand, putting the consumers right into the object won’t work with very large follower counts either, and it will create overly large objects that take up a lot of index space.
Most importantly, however, the fan-out design is much more flexible and allows relevancy scoring, filtering, etc. I have just recently written a blog post about news feed schema design with MongoDB where I explain some of that flexibility in greater detail.
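A minimal in-memory sketch of that asynchronous fan-out, assuming plain Python containers in place of the three collections. The names `record_action`, `fan_out_pending`, and the `pending` queue are hypothetical and only illustrate the flow; a real worker would read from and write to MongoDB instead.

```python
from collections import defaultdict, deque

actions = []                     # stand-in for the Actions collection
subscribers = defaultdict(set)   # Subscribers: followed user -> set of followers
newsfeed = defaultdict(list)     # Newsfeed: user -> list of feed items
pending = deque()                # actions the worker has not fanned out yet

def record_action(actor, verb, obj):
    """Write path: store the action once and queue it for the worker."""
    action = {"_id": len(actions) + 1, "actor": actor, "verb": verb, "object": obj}
    actions.append(action)
    pending.append(action)
    return action

def fan_out_pending():
    """Worker step: copy each queued action into its actor's followers' feeds.

    Returns the number of actions processed in this run.
    """
    processed = 0
    while pending:
        action = pending.popleft()
        for follower in subscribers[action["actor"]]:
            # Filtering or relevancy scoring could be applied here, per follower.
            newsfeed[follower].append(
                {"actionId": action["_id"], "actor": action["actor"],
                 "verb": action["verb"], "object": action["object"]}
            )
        processed += 1
    return processed
```

Because `record_action` only appends and enqueues, the write stays cheap no matter how many followers the actor has; the delay between recording and calling `fan_out_pending` is the non-real-time gap mentioned above.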
Speaking of flexibility, I’d be careful about that activitystrea.ms spec. It seems to make sense as a specification for interop between different providers, but I wouldn’t store all that verbose information in my database unless you intend to aggregate activities from various applications.