I don’t think this is a very obscure Lucene problem, but somehow I just don’t seem to be able to find a good solution to it. I will use an example.
Let’s say I am building a news article website. Registered users can bookmark articles they are interested in. I want to let each user search only the articles that he or she has bookmarked. For the sake of example, let’s also assume that a user can potentially bookmark thousands of articles, and that we have hundreds of thousands of users in our database. How do I build a scalable solution for this problem?
Thanks a lot!
This is a very typical Lucene problem, as it does not support joins. More specifically, there’s no first-class support and you have to find your way around it. I can suggest a few options:
1. You could have a database with users, articles and bookmarks tables (the latter would have foreign keys pointing to the first two). You would also have articles indexed in Lucene. When running a search against articles, you could write a Lucene Filter which would exclude all articles not bookmarked by the current user.
2. You could index all articles and bookmarks in Lucene, probably best done using separate indices. Then you could run a query for bookmarks (to retrieve which articles the current user has bookmarked) and then run another, separate query for articles. As in the previous option, you could use the results of the first query to exclude all articles which are not bookmarked by the current user.
I personally prefer option #1, as this is a classic relational structure and databases are designed for exactly this purpose. With option #2 you would have to modify both the user storage and the Lucene index whenever a user is deleted.
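Both options boil down to the same two steps: fetch the current user's bookmarked article IDs, then restrict the article search to that set. Here is a rough sketch of that flow in plain Java. Everything in it is a hypothetical stand-in made up for illustration: the `Article` class and the in-memory list play the role of the Lucene index, and the `Map` plays the role of the bookmarks table (or bookmarks index). It does not use the Lucene API at all. In a real setup the restriction would be a Lucene filter; in newer Lucene versions that would probably be something like a `TermInSetQuery` on an article ID field added with `BooleanClause.Occur.FILTER`, but check that against the version you are running.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-ins; none of this is Lucene or JDBC API.
class BookmarkSearchSketch {

    // Minimal article: an ID plus the text the full-text query runs against.
    static final class Article {
        final int id;
        final String text;

        Article(int id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    // Step 1: look up which article IDs the user has bookmarked.
    // Option #1 would do this with SQL on the bookmarks table,
    // option #2 with a query against a separate bookmarks index.
    static Set<Integer> bookmarkedIds(Map<String, Set<Integer>> bookmarks, String userId) {
        return bookmarks.getOrDefault(userId, Set.of());
    }

    // Step 2: run the text query, keeping only articles in the allowed set.
    // The substring match is a placeholder for a real Lucene query.
    static List<Integer> search(List<Article> index, String term, Set<Integer> allowedIds) {
        List<Integer> hits = new ArrayList<>();
        String needle = term.toLowerCase();
        for (Article a : index) {
            if (allowedIds.contains(a.id) && a.text.toLowerCase().contains(needle)) {
                hits.add(a.id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Article> index = List.of(
            new Article(1, "Lucene index tuning"),
            new Article(2, "Election results"),
            new Article(3, "Scaling a Lucene cluster"));
        Map<String, Set<Integer>> bookmarks = Map.of("alice", Set.of(2, 3));

        // Article 1 matches the text but is not bookmarked, so only 3 comes back.
        System.out.println(search(index, "lucene", bookmarkedIds(bookmarks, "alice"))); // prints [3]
    }
}
```

The point of the sketch is only the ordering: the bookmark lookup runs first and is keyed by user, so its cost depends on how many bookmarks that one user has, not on how many users exist.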