I encountered some very weird behaviour which I think is a bug, but I might be wrong or not understanding the documentation properly so I am asking.
I have a Solr index and am working with the new features of version 4.0.
This is the code I use (I am using the PECL SOLR extension):
<?php
$options = array (
'hostname' => '192.168.200.31',
'path' => 'solr/slave',
);
$client = new SolrClient($options);
$query = new SolrQuery();
#$query->setQuery("{!join from=id to=med_id }type:medium");
$query->setQuery("*:*");
$query->addFilterQuery('type:product');
$query->addFilterQuery("product_type:tv_free");
$query_response = $client->query($query);
$response = $query_response->getResponse();
echo '<pre>'.print_r($response,true)."</pre>";
?>
With the join query from the commented-out line, {!join from=id to=med_id }type:medium, as the main query, this code returns 38296 documents.
However, with the query set to *:*, which matches every document before the filters are applied, I get 21867 documents returned, which I think is the correct number.
If you want to know a bit more about the use case and the reasoning behind it, you may read on, but it is only background information:
I am indexing two types of documents that I distinguish by the value of the field type:
- medium – In my case this is a movie (like avatar, casablanca, etc)
- product – Those are offers for the movies like a DVD on amazon
The reason for this split is that I want filter/facet queries that enable the user for example to search for:
- a movie that has been released between 1990 and 1995 (this metadata is stored in the medium document)
- and that is available on amazon as dvd for 5.00 or less (this information is stored in the product document)
- and that has the word “jungle” in the movie title (stored in the medium document)
I am doing a search (using dismax) on all documents of type “medium” with “jungle” in the title:
$query->setQuery("{!type=dismax qf='$qf' mm='1' q.alt='*:*'}jungle");
Then I add filter queries like this:
$query->addFilterQuery("{!join from=med_id to=id}provider:amazon");
$query->addFilterQuery("{!join from=med_id to=id}price:[0 TO 500]"); // price is in cents
$query->addFilterQuery("release_year:[1990 TO 1995]");
Note that the first two filter queries need to be joins to the documents of type product, which have a field called med_id that holds the id of the medium document associated with them.
This all works fine!
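The medium-side query above can be put together with a small helper so the join syntax is written in one place. This is only a sketch: joinFilter() is a hypothetical helper, not part of the PECL extension, and the qf value 'title' is a placeholder because the real value is not shown here. The SolrQuery calls are guarded so the snippet also runs where the extension is not loaded.

```php
<?php
// Hypothetical helper (not part of the PECL Solr extension): wraps a query
// in Solr's {!join} local-params syntax.
function joinFilter(string $from, string $to, string $query): string
{
    return sprintf('{!join from=%s to=%s}%s', $from, $to, $query);
}

// Filters from the example: the first two go through the product documents.
$filters = [
    joinFilter('med_id', 'id', 'provider:amazon'),
    joinFilter('med_id', 'id', 'price:[0 TO 500]'), // price is in cents
    'release_year:[1990 TO 1995]',                  // field of the medium itself
];

// Only touch SolrQuery when the extension is actually loaded.
if (class_exists('SolrQuery')) {
    $qf = 'title'; // assumption: the real qf value is not shown in the post
    $query = new SolrQuery();
    $query->setQuery("{!type=dismax qf='$qf' mm='1' q.alt='*:*'}jungle");
    foreach ($filters as $fq) {
        $query->addFilterQuery($fq);
    }
}
```

Each joined filter selects products first and then maps them to media through med_id, so only medium documents come back.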
However, I want to facet the search by metadata held in the documents of type product, for example the country where they are available (where I can order the DVD).
I get the facet counts for all fields that are contained in the medium documents from this query. However, a join query returns only the documents on the "to" side, so the fields of the product documents used to filter the join are not available for faceting in the result. So I need a second query:
I do exactly the same as above, but this time I swap the joined and the non-joined queries:
So my dismax query now becomes a join query:
$query->setQuery("{!join from=id to=med_id }{!type=dismax qf='$qf' mm='1' q.alt='*:*'}jungle");
My joined filter queries become normal filter queries:
$query->addFilterQuery("provider:amazon");
$query->addFilterQuery("price:[0 TO 500]");
And my conventional filter query becomes a joined one – this time from the field id to med_id:
$query->addFilterQuery("{!join from=id to=med_id}release_year:[1990 TO 1995]");
This now returns all products that match our filters. For one medium there may be more than one product, but I only want my facet counts to reflect the number of movies, not the number of products, so I also group by med_id and set group.truncate to true like this:
$query->addParam("group","true");
$query->addParam("group.field","med_id");
$query->addParam("group.truncate","true");
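Putting the swapped pieces together, the product-side request might look roughly like the sketch below. The function name productSideParams() and the qf value 'title' are illustrative assumptions; the parameter values themselves are the ones from the lines above.

```php
<?php
// Sketch of the product-side query; the function name and the qf value
// passed in below are illustrative assumptions.
function productSideParams(string $term, string $qf): array
{
    return [
        // The dismax search on medium documents, joined over to their products.
        'q' => "{!join from=id to=med_id}{!type=dismax qf='$qf' mm='1' q.alt='*:*'}$term",
        'fq' => [
            'provider:amazon',                                      // plain: product field
            'price:[0 TO 500]',                                     // plain: product field
            '{!join from=id to=med_id}release_year:[1990 TO 1995]', // joined: medium field
        ],
        // Collapse all products of the same medium into one group.
        'group' => [
            'group'          => 'true',
            'group.field'    => 'med_id',
            'group.truncate' => 'true',
        ],
    ];
}

$params = productSideParams('jungle', 'title');

// Only touch SolrQuery when the extension is actually loaded.
if (class_exists('SolrQuery')) {
    $query = new SolrQuery();
    $query->setQuery($params['q']);
    foreach ($params['fq'] as $fq) {
        $query->addFilterQuery($fq);
    }
    foreach ($params['group'] as $name => $value) {
        $query->addParam($name, $value);
    }
}
```

Compared with the medium-side query, every condition on a medium field is now wrapped in a join from id to med_id, and every condition on a product field is plain.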
The only problem with this is that the join query doing the search in the medium fields somehow makes my query return more results, not fewer, which I boiled down to the minimal code at the beginning of the question to reproduce it.
I think I worked around my problem by adding my query as a filter query and not as the main query, like this:
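The snippet for this step is missing here. Going by the minimal example at the top, it presumably looked something like the following, with the join moved from setQuery() into addFilterQuery() and *:* as the main query. Treat this as a reconstruction under that assumption, not the original code.

```php
<?php
// Reconstruction (assumption): match everything in the main query and
// apply the join as a filter query instead.
$mainQuery = '*:*';
$filters = [
    '{!join from=id to=med_id }type:medium', // previously passed to setQuery()
    'type:product',
    'product_type:tv_free',
];

// Only touch SolrQuery when the extension is actually loaded.
if (class_exists('SolrQuery')) {
    $query = new SolrQuery();
    $query->setQuery($mainQuery);
    foreach ($filters as $fq) {
        $query->addFilterQuery($fq);
    }
}
```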
It seems to work in small test cases. However, I still have some mismatches in my data, so I need to double-check my data sources for problems and set up a test case where I can prove the difference.
I am also still interested in why it causes problems when set as the main query…
Edit:
The method described in this answer effectively solves the problem, however I am not sure why it existed in the first place.
However, the effect on the facet counts is not the desired one, because with field collapsing Solr facets only on the most relevant document in each group.
Meaning:
Without collapsing (grouping) the count may be more than the actual result count of mediums (because several matching products may exist).
With collapsing it may be less (because only the values of one document are taken into account).
So facet counts won't work this way. The only thing you really know is which facet values WILL return at least 1 result, plus a number that is an upper bound (without collapsing) or a lower bound (with collapsing) but may not be the actual number of results.
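A tiny simulation in plain PHP (no Solr involved) makes the two bounds concrete. The data is made up, and "first product of a medium" stands in for "most relevant document in the group", which is an assumption of this sketch.

```php
<?php
// Made-up product documents: [med_id, country]. Medium 1 has three offers.
$products = [
    [1, 'DE'], [1, 'DE'], [1, 'US'],
    [2, 'US'],
];

$ungrouped = []; // every matching product counts (no grouping)
$truncated = []; // only the first product of each medium counts
$mediaSets = []; // distinct media per country (the number actually wanted)
$seen = [];      // media that already contributed to $truncated

foreach ($products as [$medId, $country]) {
    $ungrouped[$country] = ($ungrouped[$country] ?? 0) + 1;
    if (!isset($seen[$medId])) {
        $seen[$medId] = true;
        $truncated[$country] = ($truncated[$country] ?? 0) + 1;
    }
    $mediaSets[$country][$medId] = true;
}

// Number of distinct media per country.
$actual = array_map('count', $mediaSets);

print_r(['ungrouped' => $ungrouped, 'truncated' => $truncated, 'actual' => $actual]);
```

Here the ungrouped counts are DE=2, US=2, the truncated counts are DE=1, US=1, and the real number of media is DE=1, US=2. For DE the ungrouped count overshoots, and for US the truncated count undershoots.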