I am using Lucene in PHP (the Zend Framework implementation, Zend_Search_Lucene). My problem is that I cannot search on a field that contains a number.
Here is the data in the index:
    ts            | contents
    --------------+-----------------
    1236917100    | dog cat gerbil
    1236630752    | cow pig goat
    1235680249    | lion tiger bear
    nonnumeric    | bass goby trout
My problem: a query for 'ts:1236630752' returns no hits. However, a query for 'ts:nonnumeric' returns a hit.
I am storing 'ts' as a keyword field, which according to the documentation 'is not tokenized, but is indexed and stored. Useful for non-text fields, e.g. date or url.' I have tried treating it as a 'text' field, but the behavior is the same, except that a query for 'ts:*' returns nothing when ts is text.
I’m using Zend Framework 1.7 (just downloaded the latest 3 days ago) and PHP 5.2.9. Here is my code:
    <?php
    //=========================================================
    // Initializes Zend Framework (Zend_Loader).
    //=========================================================
    set_include_path(realpath('../library') . PATH_SEPARATOR . get_include_path());
    require_once('Zend/Loader.php');
    Zend_Loader::registerAutoload();

    //=========================================================
    // Delete existing index and create a new one
    //=========================================================
    define('SEARCH_INDEX', 'test_search_index');
    if(file_exists(SEARCH_INDEX))
        foreach(scandir(SEARCH_INDEX) as $file)
            // Test the path inside the index directory, not the bare file name
            if(!is_dir(SEARCH_INDEX . "/$file"))
                unlink(SEARCH_INDEX . "/$file");
    $index = Zend_Search_Lucene::create(SEARCH_INDEX);

    //=========================================================
    // Create this data in index:
    //   ts            | contents
    //   --------------+-----------------
    //   1236917100    | dog cat gerbil
    //   1236630752    | cow pig goat
    //   1235680249    | lion tiger bear
    //   nonnumeric    | bass goby trout
    //=========================================================
    function add_to_index($index, $ts, $contents) {
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::Keyword('ts', $ts));
        $doc->addField(Zend_Search_Lucene_Field::Text('contents', $contents));
        $index->addDocument($doc);
    }
    add_to_index($index, '1236917100', 'dog cat gerbil');
    add_to_index($index, '1236630752', 'cow pig goat');
    add_to_index($index, '1235680249', 'lion tiger bear');
    add_to_index($index, 'nonnumeric', 'bass goby trout');

    //=========================================================
    // Run some test queries and output results
    //=========================================================
    echo '<html><body><pre>';
    function run_query($index, $query) {
        // Double quotes so that $query and \n are interpreted
        echo "Running query: $query\n";
        $hits = $index->find($query);
        echo 'Got ' . count($hits) . " hits.\n";
        foreach($hits as $hit)
            echo "   ts='$hit->ts', contents='$hit->contents'\n";
        echo "\n";
    }
    run_query($index, 'pig');                            //1 hit
    run_query($index, 'ts:1236630752');                  //0 hits
    run_query($index, '1236630752');                     //0 hits
    run_query($index, 'ts:pig');                         //0 hits
    run_query($index, 'contents:pig');                   //1 hits
    run_query($index, 'ts:[1236630700 TO 1236630800]');  //0 hits (range query)
    run_query($index, 'ts:*');                           //4 hits if ts is keyword, 1 hit otherwise
    run_query($index, 'nonnumeric');                     //1 hits
    run_query($index, 'ts:nonnumeric');                  //1 hits
    run_query($index, 'trout');                          //1 hits
Output
    Running query: pig
    Got 1 hits.
       ts='1236630752', contents='cow pig goat'

    Running query: ts:1236630752
    Got 0 hits.

    Running query: 1236630752
    Got 0 hits.

    Running query: ts:pig
    Got 0 hits.

    Running query: contents:pig
    Got 1 hits.
       ts='1236630752', contents='cow pig goat'

    Running query: ts:[1236630700 TO 1236630800]
    Got 0 hits.

    Running query: ts:*
    Got 4 hits.
       ts='1236917100', contents='dog cat gerbil'
       ts='1236630752', contents='cow pig goat'
       ts='1235680249', contents='lion tiger bear'
       ts='nonnumeric', contents='bass goby trout'

    Running query: nonnumeric
    Got 1 hits.
       ts='nonnumeric', contents='bass goby trout'

    Running query: ts:nonnumeric
    Got 1 hits.
       ts='nonnumeric', contents='bass goby trout'

    Running query: trout
    Got 1 hits.
       ts='nonnumeric', contents='bass goby trout'
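The find() method tokenizes the query string, and the default analyzer only keeps letters as tokens, so the digits in your query are dropped before the search runs. The keyword field itself is indexed correctly (it is not tokenized at index time, which is why 'ts:*' still lists the numeric values); it is the query side that loses the number. If you want to search for a number, you have to either construct the query programmatically or use an alternate analyzer that accepts numeric characters.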
http://framework.zend.com/manual/en/zend.search.lucene.searching.html
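For illustration, here is a rough sketch of both options described above. It assumes Zend Framework 1.x is on the include path and that the 'test_search_index' index from the question already exists. I have not run it against this exact setup, so treat it as a starting point rather than a verified fix.

```php
<?php
// Sketch only: assumes Zend Framework 1.x is on the include path and that
// the index built by the question's script ('test_search_index') exists.
require_once 'Zend/Loader.php';
Zend_Loader::registerAutoload();

$index = Zend_Search_Lucene::open('test_search_index');

// Option 1: build the query objects by hand. Terms created this way are
// not passed through the analyzer, so the digits are kept as-is.
$term  = new Zend_Search_Lucene_Index_Term('1236630752', 'ts');
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$hits  = $index->find($query);
echo 'Term query: ' . count($hits) . " hit(s)\n";

// A range can be built the same way. Terms are compared as strings, so
// this relies on all timestamps having the same number of digits.
$from  = new Zend_Search_Lucene_Index_Term('1236630700', 'ts');
$to    = new Zend_Search_Lucene_Index_Term('1236630800', 'ts');
$range = new Zend_Search_Lucene_Search_Query_Range($from, $to, true); // true = inclusive
$hits  = $index->find($range);
echo 'Range query: ' . count($hits) . " hit(s)\n";

// Option 2: switch the default analyzer to one that treats digits as
// token characters, then keep using string queries.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive()
);
$hits = $index->find('ts:1236630752');
echo 'String query with TextNum analyzer: ' . count($hits) . " hit(s)\n";
```

With option 2, the analyzer should be set before both indexing and searching. The keyword field is stored untokenized either way, but numbers inside tokenized fields such as 'contents' will only become searchable after the index is rebuilt with the same analyzer.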