I’m using Elasticsearch, and writing my own wrapper using WebRequest since NEST (the usual choice) bafflingly seems to lack the ability to insert an item and have the generated ID returned.
Anyway – no problems with the general method. But, any HTML content is indexed as-is, i.e. if I have <strong>test</strong> in a field, then a search for the query “strong” returns the item.
I’ve put this in elasticsearch.yml, based on a random message board post I found:
index:
  analysis:
    analyzer:
      htmlContentAnalyzer:
        type: custom
        tokenizer: standard
        filter: standard
        char_filter: html_strip
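To make the intent of that analyzer concrete, here is a rough Python sketch of what an HTML-stripping char filter followed by a tokenizer is expected to do. It is only an approximation: `strip_html` and `tokenize` are hypothetical helper names, the standard library's `HTMLParser` stands in for html_strip, and a plain word regex stands in for the standard tokenizer, so edge cases will differ from what Elasticsearch actually produces.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only text nodes; tags and attribute values are dropped."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Approximate an HTML-stripping char filter (assumption, not Lucene's code)."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Crude stand-in for a tokenizer: split on runs of word characters."""
    return re.findall(r"\w+", text)


raw = "<strong>test</strong>"
without_filter = tokenize(raw)             # tag names leak into the tokens
with_filter = tokenize(strip_html(raw))    # only the text content survives
```

Without the stripping step the tag name becomes a searchable token, which matches the symptom described above; with it, only "test" is left.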
Then, I create a mapping thusly for my index ‘content’, item type ‘news’:
PUT http://localhost:9200/content/news/_mapping
{
  "news" : {
    "properties" : {
      "TextContent" : {
        "type" : "string",
        "index" : "analyzed",
        "analyzer" : "htmlContentAnalyzer",
        "store" : "yes"
      }
    }
  }
}
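For anyone writing their own wrapper, the same request can be assembled in code. This is a minimal Python sketch that mirrors the PUT above, assuming the same host, index, type and field names as in this post; `build_mapping_request` is a hypothetical helper, and it only constructs the request without sending it.

```python
import json
from urllib.request import Request


def build_mapping_request(base_url, index, doc_type, field, analyzer):
    """Build (but do not send) a PUT request for the mapping shown above."""
    body = {
        doc_type: {
            "properties": {
                field: {
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": analyzer,
                    "store": "yes",
                }
            }
        }
    }
    url = f"{base_url}/{index}/{doc_type}/_mapping"
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = build_mapping_request(
    "http://localhost:9200", "content", "news", "TextContent", "htmlContentAnalyzer"
)
```

Serializing the body from a dictionary rather than typing the JSON by hand rules out mismatched braces. Passing `req` to `urllib.request.urlopen` would actually send it.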
The store/yes is just for “fun”; it makes no difference. The above gives me a 200 OK.
However, a search for “strong” still returns the item.
What doesn’t help is that elasticsearch documentation seems appalling. Check out this page:
http://www.elasticsearch.org/guide/reference/api/admin-indices-put-mapping.html
it gives you a brief rundown of what mapping is, and says more details are in the mapping section, i.e. this page:
http://www.elasticsearch.org/guide/reference/mapping/
…which seems to be truly terrible. There’s nothing referring to the format/object graph I found – no mention of “properties”, “type”, “analyzer”, “index” etc. There are some sections on the menu on the right, e.g. “_index”, but they seem to refer to the item as a whole? And where is that pointed out?
So my question is on two fronts:
- How do I stop HTML tags (and entities, attribute values I guess) being indexed? – I still want the HTML stored, mind you
- Is there a better source for elasticsearch info/documentation? Or am I looking at it without the super-secret decoder glasses?
With all credit to chrismale on #elasticsearch (freenode IRC) –
Searching against _all is no good: that field is indexed with its own analyzer. Querying on my TextContent field specifically worked as expected.
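To illustrate the difference, here is a small Python sketch that builds the two search URLs, one unqualified and one restricted to the TextContent field through the `q` query-string parameter. `search_url` is a hypothetical helper, and the host, index and type are the ones used in this post. Which field an unqualified query hits depends on the Elasticsearch version; in my setup it was _all.

```python
from urllib.parse import urlencode


def search_url(base_url, index, doc_type, term, field=None):
    """Build a URI search; prefix the term with 'field:' to target one field."""
    query = f"{field}:{term}" if field else term
    return f"{base_url}/{index}/{doc_type}/_search?" + urlencode({"q": query})


# Unqualified: searched against the default field (here, _all).
all_fields = search_url("http://localhost:9200", "content", "news", "strong")

# Field-specific: goes through the analyzer mapped on TextContent.
text_only = search_url(
    "http://localhost:9200", "content", "news", "strong", field="TextContent"
)
```

Only the second form exercises htmlContentAnalyzer, so that is the one to use when checking whether the tags were stripped.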