I need some advice on MongoDB schema design for a natural language database.
For each language, I need to store texts and words, like this:
lang: {
  _id: "English",
  texts : [
    { text : "This is a first text",
      date : Date("2011-09-19T04:00:10.112Z"),
      tag : "test1"
    },
    { text : "Second One",
      date : Date("2011-09-19T04:00:10.112Z"),
      tag : "test2"
    }
  ],
  words : [
    { word : "This" },
    { word : "is" },
    { word : "a" },
    { word : "first" },
    { word : "text" },
    { word : "second" },
    { word : "one" }
  ]
}
I also need to know which words and texts each user has associated. The number of words and texts tends to be huge, and I need to list all words in a language as well as all words a user has associated for that language.
From my perspective, storing the user_ids associated with a given word in an array on that word may be a good approach, like:
lang: {
  _id: "English",
  texts : [
    ...
  ],
  words : [
    { word : "This",
      users: [user1,user2,user3]
    },
    { word : "is",
      users: [user1,user2]
    },
    ...
  ]
}
Bearing in mind that a word can be associated with hundreds of thousands of users, that the document size limit (as I read) is 4MB, and that I need to:
- List all words for a given user and language
is this a good approach? Or can you think of a better one?
I hope this question is clear enough and that someone can give me some help with this 😉
Thank you all!
I don’t think this is a good approach, for just the reason you mention: the document size limit. With hundreds of thousands of user IDs in a single word’s array, you are almost certain to run up against it. I would go for a flatter approach (which should also make your collection easier to query): store one small document per word/user association, with fields such as word, user, and lang.
In other words, grow vertically by adding documents rather than horizontally by adding more data to one document. You can query the words for a given user with db.collection.find({ user: "user1", lang: "en" }), where collection stands for whatever you name the collection.
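To make the flat layout concrete, here is a minimal sketch in plain JavaScript. It uses an in-memory array as a stand-in for the collection, so it runs without a server. The field names word, user, and lang are assumptions inferred from the query above, and findMatching is a hypothetical helper that only mimics a simple equality filter.

```javascript
// Hypothetical flat layout: one small document per (word, user, language).
// Field names `word`, `user`, and `lang` are assumptions, not a fixed schema.
const associations = [
  { word: "This",  user: "user1", lang: "en" },
  { word: "is",    user: "user1", lang: "en" },
  { word: "This",  user: "user2", lang: "en" },
  { word: "first", user: "user2", lang: "en" },
  { word: "Hallo", user: "user1", lang: "de" },
];

// In-memory stand-in for an equality query such as
//   db.collection.find({ user: "user1", lang: "en" })
// A document matches when every key in `filter` has an equal value.
function findMatching(docs, filter) {
  return docs.filter((doc) =>
    Object.entries(filter).every(([key, value]) => doc[key] === value)
  );
}

// All words user1 has associated in English.
const wordsForUser = findMatching(associations, { user: "user1", lang: "en" })
  .map((doc) => doc.word);

// All distinct words in a language, regardless of user.
const wordsInLang = [
  ...new Set(findMatching(associations, { lang: "en" }).map((doc) => doc.word)),
];

console.log(wordsForUser); // [ 'This', 'is' ]
console.log(wordsInLang);  // [ 'This', 'is', 'first' ]
```

Each document stays tiny no matter how many users share a word, so the per-document size limit stops being a concern. In a real collection you would probably also want an index covering user and lang so this query does not scan every document.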
This approach isn’t “normalized”, of course, so if you’re concerned about space you might want to create separate collections for users, words, and languages and reference them from the main collection by ID. But since there are no join queries in MongoDB, resolving those references takes extra queries from your application, so you have to weigh query performance against space efficiency.
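The trade-off can be sketched the same way. The example below is only an illustration under assumed names (users, languages, words, userWords, and the wordsFor helper are all hypothetical), again using in-memory arrays in place of collections. It shows the two-step lookup the application has to perform when the link documents hold only IDs.

```javascript
// Hypothetical normalized layout: lookup collections plus a link collection.
const users = [{ _id: 1, name: "user1" }, { _id: 2, name: "user2" }];
const languages = [{ _id: 1, code: "en" }];
const words = [
  { _id: 10, lang_id: 1, word: "This" },
  { _id: 11, lang_id: 1, word: "is" },
  { _id: 12, lang_id: 1, word: "first" },
];
// Link documents store only IDs, so each one is small.
const userWords = [
  { user_id: 1, word_id: 10 },
  { user_id: 1, word_id: 11 },
  { user_id: 2, word_id: 10 },
  { user_id: 2, word_id: 12 },
];

// Application-side "join": every step here would be a separate query
// against a real database.
function wordsFor(userName, langCode) {
  const user = users.find((u) => u.name === userName);
  const lang = languages.find((l) => l.code === langCode);
  if (!user || !lang) return [];

  // Step 1: collect the word IDs linked to this user.
  const wordIds = new Set(
    userWords.filter((link) => link.user_id === user._id)
             .map((link) => link.word_id)
  );

  // Step 2: resolve those IDs to words in the requested language.
  return words
    .filter((w) => w.lang_id === lang._id && wordIds.has(w._id))
    .map((w) => w.word);
}

console.log(wordsFor("user1", "en")); // [ 'This', 'is' ]
```

The link documents are smaller than in the flat layout, but listing a user's words now needs several round trips instead of one, which is the performance cost mentioned above.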