I need to put together a data structure that will efficiently provide keyword search facilities.
My metrics are:
- Circa 500,000 products.
- Circa 20+ keywords per product (a guess).
- Products are identified by an ID of about 10 digits, but IDs may become arbitrary ASCII strings in future.
I would like to fit the data structure in memory if possible. It will run on a server, so I can assume a significant amount of memory is available.
Speed is important. Using LIKE database queries will not be an acceptable solution.
Any ideas for a data structure?
My thoughts:
TrieMap
Very efficient for the keywords, but every leaf would need a list of product IDs hanging off it, which would be seriously memory hungry. Any ideas that could help with that?
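To make the memory concern concrete: with the estimates above, 500,000 products times 20 keywords is about 10 million (keyword, product) pairs. If each product ID were replaced by a small integer ordinal and each keyword held a sorted int[] of ordinals, the lists themselves would cost roughly 10,000,000 x 4 bytes = 40 MB, plus array headers and the keyword strings. Below is a minimal sketch of that idea, not a finished design. The class and method names (KeywordIndex, add, search, keywordsWithPrefix) are hypothetical, and a TreeMap stands in for the trie because its sorted keys also allow prefix lookups.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: every name here is illustrative, not from a library.
final class KeywordIndex {
    private final List<String> productIds = new ArrayList<>();       // ordinal -> product ID
    private final Map<String, Integer> ordinals = new HashMap<>();   // product ID -> ordinal
    private final TreeMap<String, int[]> postings = new TreeMap<>(); // keyword -> sorted ordinals

    /** Registers a product under each of its keywords. */
    void add(String productId, String... keywords) {
        Integer known = ordinals.get(productId);
        int ordinal = (known != null) ? known : productIds.size();
        if (known == null) {
            productIds.add(productId);
            ordinals.put(productId, ordinal);
        }
        for (String keyword : keywords) {
            String key = keyword.toLowerCase(Locale.ROOT);
            int[] old = postings.getOrDefault(key, new int[0]);
            int pos = Arrays.binarySearch(old, ordinal);
            if (pos >= 0) continue; // product already listed for this keyword
            // Insert at the sorted position. A real daily rebuild would
            // probably buffer all pairs and sort once instead of copying.
            int at = -pos - 1;
            int[] grown = new int[old.length + 1];
            System.arraycopy(old, 0, grown, 0, at);
            grown[at] = ordinal;
            System.arraycopy(old, at, grown, at + 1, old.length - at);
            postings.put(key, grown);
        }
    }

    /** Product IDs that carry every one of the given keywords (AND query). */
    List<String> search(String... keywords) {
        int[] result = null;
        for (String keyword : keywords) {
            int[] list = postings.getOrDefault(keyword.toLowerCase(Locale.ROOT), new int[0]);
            result = (result == null) ? list : intersect(result, list);
        }
        List<String> ids = new ArrayList<>();
        if (result != null) {
            for (int ordinal : result) ids.add(productIds.get(ordinal));
        }
        return ids;
    }

    /** Indexed keywords starting with the prefix, in sorted order. */
    List<String> keywordsWithPrefix(String prefix) {
        String from = prefix.toLowerCase(Locale.ROOT);
        return new ArrayList<>(
                postings.subMap(from, true, from + Character.MAX_VALUE, false).keySet());
    }

    // Linear merge of two sorted arrays, keeping only the common values.
    private static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[n++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, n);
    }
}
```

Storing ordinals rather than ID strings also means the index would not care if IDs later stop being 10-digit numbers, since each ID string is kept only once in the ordinal table.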
Compression
Various compression schemes come to mind, but none stands out as offering significant value.
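One candidate, sketched here only as an illustration: if each keyword's product list were held as a sorted array of small integer ordinals instead of ID strings (an assumption, not something I have built), the gaps between neighbouring ordinals could be stored as variable-byte integers. This is the generic delta plus variable-byte technique; PostingCodec and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch of delta + variable-byte encoding for one posting list.
final class PostingCodec {

    /** Encodes an ascending array of non-negative ints as a count followed by gaps. */
    static byte[] encode(int[] sorted) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarInt(out, sorted.length);
        int previous = 0;
        for (int value : sorted) {
            writeVarInt(out, value - previous); // small gaps need fewer bytes
            previous = value;
        }
        return out.toByteArray();
    }

    /** Reverses encode(): reads the count, then sums the gaps back up. */
    static int[] decode(byte[] bytes) {
        int[] cursor = {0}; // read position, shared with readVarInt
        int[] result = new int[readVarInt(bytes, cursor)];
        int previous = 0;
        for (int i = 0; i < result.length; i++) {
            previous += readVarInt(bytes, cursor);
            result[i] = previous;
        }
        return result;
    }

    // 7 payload bits per byte, low bits first; the high bit means "more bytes follow".
    private static void writeVarInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    private static int readVarInt(byte[] bytes, int[] cursor) {
        int value = 0, shift = 0, b;
        do {
            b = bytes[cursor[0]++] & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    public static void main(String[] args) {
        int[] ordinals = {3, 7, 8, 500, 499999};
        byte[] packed = encode(ordinals);
        System.out.println(packed.length + " bytes, round trip ok: "
                + Arrays.equals(ordinals, decode(packed)));
    }
}
```

The gain depends entirely on how dense the lists are: a keyword shared by many products has small gaps and packs well, while a keyword used by a handful of products saves little. Decoding also costs CPU on every query, so this trades speed for memory.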
Has anyone else put something like this together? Could you share your experiences?
The data may change but not often. It would be reasonable to rebuild the structure on a daily basis to accommodate changes.
Have you thought about using Lucene, either in memory or as a file system index?
It is quite fast and leaves plenty of room for further requirements that might arise in the future.