I have a problem and Google hasn’t helped me much. I’m trying to figure out a way to ignore HTML while searching a Solr index in ColdFusion 9.
For example, if I search for microsoft and my index contains Microsoft© makes Windows®, I’m prompted to search for “Microsoft© makes Windows®” instead of being shown the actual result.
As you can see below, I’m just passing the string into the criteria attribute of cfsearch, but doing this produces (what I consider to be) a “dirty” result.
<cfsearch
    collection="mycollection"
    criteria="microsoft"
    name="results"
    maxrows="100"
    suggestions="always"
    contextHighlightBegin="<strong>"
    contextHighlightEnd="</strong>"
    contextPassages="3"
/>
I’ve been looking at the documentation for Solr’s query syntax but I don’t see anything that jumps out at me on how to avoid this problem.
Should I look at providing the index a “flat” version of text or is there a way to avoid HTML strings such as © / ® / ™?
I’m open to suggestions.
— Brian.
For anyone that might be faced with the same question:
The solution for this question was to use an alternate method of indexing rather than trying to work around the HTML within the index.
Within the database I created a new field called index_search, and in my application's insert method I used a regex to strip any special(er) characters: "[^[:word:].[:space:]-]"
From there, I passed the index_search field to the body of cfindex and used the HTML name as the title.
Using this method produced the expected output when searching for words or phrases close to, or wrapped in, HTML. For example, searching for microsoft lists all results that contain Microsoft©.
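For anyone who wants to see the two steps together, here is a minimal sketch of how they might look. It is not my exact code: the datasource myDatasource, the table pages, and the columns id and html_name are hypothetical names used for illustration; only the regex, the index_search field, and the collection name come from the post above.

```cfml
<!--- Hypothetical input: plain text containing the literal (c) and (R) symbols --->
<cfset rawText = "Microsoft" & Chr(169) & " makes Windows" & Chr(174)>

<!--- Strip every character that is not a word character, a period,
      whitespace, or a hyphen. The "all" scope replaces every match. --->
<cfset indexSearch = REReplace(rawText, "[^[:word:].[:space:]-]", "", "all")>

<!--- ... store indexSearch in the index_search column during the insert ... --->

<!--- Assumed table and column names; adjust to your own schema --->
<cfquery name="pages" datasource="myDatasource">
    SELECT id, html_name, index_search
    FROM pages
</cfquery>

<!--- Index the cleaned text as the body and the HTML name as the title --->
<cfindex
    collection="mycollection"
    action="update"
    type="custom"
    query="pages"
    key="id"
    title="html_name"
    body="index_search"
/>
```

One thing to watch: if the stored text holds entity-encoded symbols (an ampersand, a name, and a semicolon) rather than the literal characters, this regex removes only the ampersand and semicolon and leaves the entity name attached to the word, so it may be worth decoding or removing entities before applying it.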