What I’m looking for is an operation which would “partially collapse” my results so that documents which have a certain field are grouped, removing what could be seen as near duplicates, but all docs missing this field remain unaffected by the field collapsing.
(Specifically, the docs in question are individual posts in a discussion forum, which in turn is organized in threads. Since the forum displays a whole thread per page, multiple hits in the same thread are essentially duplicates as far as the user is concerned, and as a thread grows long such hits become inevitable if the users stick to the subject. However, there are many other types of docs for which this collapsing does not make any sense at all.)
Using Solr 3.5, the closest I’ve gotten is
...&group=true&group.main=true&group.field=threadid&group.limit=3
but it appears that Solr is treating “missing” as a value and collapses everything else into 3 hits – I would like it to treat missing values as “unique”.
Can this be done or should I consider revising the schema design?
I don’t think this is directly possible with the existing query parameters in Solr.
You have two options which might work:
1. Ensure each post has a threadid, such that one-off posts have a unique threadid which does not conflict with the ‘normal’ threadids. When grouping on this field, these posts will show up in their own groups.
2. Run two queries: one with the grouping enabled, but with an fq parameter which filters out posts without a threadid (e.g. fq=threadid:[* TO *]), then another query for only the non-threaded posts with an inverse fq (fq=-threadid:[* TO *]), then merge these results in your own code.
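To make the two options a bit more concrete, here is a rough Python sketch. It is only an illustration under some assumptions that are not in the question: the unique key field is called id, threadid is a string field (so a prefixed value like single-7 is legal), and the helper names and the single- prefix are made up. For the merge step I am assuming the scores of the two queries can be compared, since both use the same q and filter queries should not influence scoring; fl=*,score is added so each doc carries its score.

```python
from urllib.parse import urlencode


def with_fallback_threadid(doc, id_field="id", prefix="single-"):
    """Option 1 (index time): give a doc without a threadid its own group key.

    Assumes threadid is a string field and that no real thread id starts
    with the (hypothetical) prefix.
    """
    if doc.get("threadid"):
        return doc
    patched = dict(doc)
    patched["threadid"] = prefix + str(doc[id_field])
    return patched


def build_query_strings(q, rows=10):
    """Option 2 (query time): build the two query strings to send to Solr."""
    common = {"q": q, "rows": rows, "fl": "*,score", "wt": "json"}
    # Grouped query, restricted to docs that have a threadid.
    threaded = {
        **common,
        "fq": "threadid:[* TO *]",
        "group": "true",
        "group.main": "true",
        "group.field": "threadid",
        "group.limit": 3,
    }
    # Plain query for everything without a threadid.
    unthreaded = {**common, "fq": "-threadid:[* TO *]"}
    return urlencode(threaded), urlencode(unthreaded)


def merge_by_score(threaded_docs, unthreaded_docs, rows=10):
    """Merge the two flat doc lists, highest score first, and cut to rows.

    This orders purely by score, so posts of the same thread are not
    guaranteed to stay next to each other in the merged list.
    """
    merged = sorted(
        threaded_docs + unthreaded_docs,
        key=lambda d: d["score"],
        reverse=True,
    )
    return merged[:rows]
```

You would append each of the two query strings to your select URL, take the docs out of both JSON responses and hand them to merge_by_score. If keeping a thread's hits together matters, the merge would have to work on groups rather than single docs.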