I am working on building a very large inverted index of terms. Which design would you suggest?
First
termId -> docId
a doc2[locations],doc5[locations],doc12[locations]
b doc5[locations],doc7[locations],doc4[locations]
Second
termId -> docId
a doc2[locations]
a doc5[locations]
a doc12[locations]
b doc5[locations]
b doc7[locations]
b doc4[locations]
P.S. Lucene is not an option.
The right table design depends on how you plan on using the data. If you plan on using strings like "doc2[locations],doc5[locations],doc12[locations]" as is, without any further postprocessing, then your First design is fine.
But if, as your question tacitly suggests, you may at times want to regard doc2[locations], doc5[locations], etc. as separate entities, then you should definitely use your Second design.
Here are some use cases which show why the Second design is better:
If you use First and ask for all docs with termID = a, then you get back a string like doc2[locations],doc5[locations],doc12[locations], which you then have to split.
If you use Second, you get each doc as a separate row. No splitting!
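To make the difference concrete, here is a minimal sketch of both lookups. It is only an illustration under stated assumptions: SQLite (via Python's sqlite3 module) stands in for MySQL, and the table names, column names, and helper functions are hypothetical, not taken from the question.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for MySQL.
# Table and column names below are invented for illustration.
conn = sqlite3.connect(":memory:")

# First design: one row per term, all postings packed into one string.
conn.execute(
    "CREATE TABLE first_design (term_id TEXT PRIMARY KEY, postings TEXT)"
)
conn.executemany(
    "INSERT INTO first_design VALUES (?, ?)",
    [
        ("a", "doc2[locations],doc5[locations],doc12[locations]"),
        ("b", "doc5[locations],doc7[locations],doc4[locations]"),
    ],
)

# Second design: one row per (term, doc) pair.
conn.execute(
    "CREATE TABLE second_design ("
    " term_id TEXT, doc_id TEXT, locations TEXT,"
    " PRIMARY KEY (term_id, doc_id))"
)
conn.executemany(
    "INSERT INTO second_design VALUES (?, ?, ?)",
    [
        ("a", "doc2", "[locations]"),
        ("a", "doc5", "[locations]"),
        ("a", "doc12", "[locations]"),
        ("b", "doc5", "[locations]"),
        ("b", "doc7", "[locations]"),
        ("b", "doc4", "[locations]"),
    ],
)


def docs_first(conn, term_id):
    """First design: fetch the packed string, then split it in application code."""
    row = conn.execute(
        "SELECT postings FROM first_design WHERE term_id = ?", (term_id,)
    ).fetchone()
    # The split assumes the locations themselves never contain a comma.
    return row[0].split(",") if row else []


def docs_second(conn, term_id):
    """Second design: each doc already arrives as its own row."""
    rows = conn.execute(
        "SELECT doc_id FROM second_design WHERE term_id = ? ORDER BY rowid",
        (term_id,),
    )
    return [doc_id for (doc_id,) in rows]


print(docs_first(conn, "a"))
print(docs_second(conn, "a"))
```

In the first helper the application has to parse the packed string itself; in the second the database returns one doc per row, so the result can be filtered, joined, or counted directly in SQL.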
The Second structure is more convenient.
Or, suppose at some point doc5[locations] changes and you need to update your table. If you use the First design, you’d have to use some relatively complicated MySQL string function to find and replace the substring in all rows that contain it. (Note that MySQL versions before 8.0 do not come with regex substitution built in; REGEXP_REPLACE was added in 8.0.)
If you use the Second design, updating is easy: