I am looking into indexing engines, specifically Apache Lucene Solr. We are willing to use it for our searches, yet one of the problems solved by our frameworks search is row-level access.
Solr does not provide record access out of the box:
<…> Solr does not concern itself with security either at the document level or the communication level.
And in the section about document level security: http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
There are few suggestions – either use Manifold CF (which is highly undocumented and seems in a very pre-beta stage) or write your own request handler/search component (that part is marked as stub) – I guess that the later one would have bigger impact on performance.
So I assume not much is being done in this field.
In the recently released 4.0 version of Solr, they have introduced joining two indexed entities. Joining might seem a nice idea, since our framework also does a join to know whether the record is accessible for the user. The problem here is that sometimes we do a inner join, and sometimes and outer (depending on the optimistic (everything what’s not forbidden is allowed) or pessimistic (everything is forbidden only what is explicitly allowed) security setting in the scope).
To give a better understanding of what our structure looks like:
Documents
DocumentNr | Name
------------------
1 | Foo
2 | Bar
DocumentRecordAccess
DocumentNr | UserNr | AllowRead | AllowUpdate | AllowDelete
------------------------------------------------------------
1 | 1 | 1 | 1 | 0
So for example the generated query for the Documents in pessimistic security setting would be:
SELECT * FROM Documents AS d
INNER JOIN DocumentRecordAccess AS dra ON dra.DocumentNr=d.DocumentNr AND dra.AllowRead=1 AND dra.UserNr=1
This would return only the foo, but not the bar. And in optimistic setting:
SELECT * FROM Documents AS d
LEFT JOIN DocumentRecordAccess AS dra ON dra.DocumentNr=d.DocumentNr AND dra.AllowRead=1 AND dra.UserNr=1
Returning both – the Foo and the Bar.
Coming back to my question – maybe someone has already done this and can share their insight and experience?
I am afraid there’s no easy solution here. You will have to sacrifice something to get ACLs working together with the search.
If your corpus size is small (I’d say up to 10K documents), you could create a cached bit set of forbidden (or allowed, whichever less verbose) documents and send relevant filter query
(+*:* -DocumentNr:1 ... -DocumentNr:X). Needless to say, this doesn’t scale. Sending large queries will make the search a bit slower, but this is manageable (up to a point of course). Query parsing is cheap.If you can somehow group these documents and apply ACLs on document groups, this would allow cutting on query length and the above approach would fit perfectly. This is pretty much what we are using – our solution implements taxonomy and has taxonomy permissions done via
fqquery.If you don’t need to show the overall result set count, you can run your query and filter the result set on the client side. Again, not perfect.
You can also denormalize your data structures and store both tables flattened in a single document like this:
DocumentNr: 1
Name: Foo
Allowed_users: u1, u2, u3 (or Forbidden_users: …)
The rest is as easy as sending user id with your query.
Above is only viable if the ACLs are rarely changing and you can afford reindexing the entire corpus when they do.
You could write a custom query filter which would have cached
BitSets of allowed or forbidden documents by user(group?) retrieved from the database. This would require not only providing DB access for Solr webapp but also extending/repackaging the .war which comes with Solr. While this is relatively easy, the harder part would be cache invalidation: main app should somehow signal Solr app when ACL data gets changed.Options 1 and 2 are probably more reasonable if you can put Solr and your app onto the same JVM and use javabin driver.
It’s hard to advice more without knowing the specifics of the corpus/ACLs.