I am developing a Java application that parses an XML file, extracts keywords from it, and stores them in my database. Users can then search for these keywords and retrieve the related data.
The problem is that the XML file contains words like “literacy_male”, “infantmortalityrate_female”, etc. For the first one I can split the word at “_” before storing it, but for the second one I am not sure how I can split the word into meaningful words.
I am using Apache Lucene to do the full text search.
One possibility is to increase the index size by also indexing every substring of each string. So for “abc” you would store “a”, “b”, “c”, “ab”, “bc”, “abc” (that is O(n^2) strings for a string of length n).
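A minimal sketch of how those substrings could be generated before indexing. The class and method names (SubstringTerms, allSubstrings) are made up for illustration and are not part of Lucene; feeding the results into the index is left out.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not a Lucene class.
class SubstringTerms {

    // Returns every distinct non-empty contiguous substring of word,
    // shortest first. A word of length n has at most n*(n+1)/2 of them.
    static Set<String> allSubstrings(String word) {
        Set<String> out = new LinkedHashSet<>();
        for (int len = 1; len <= word.length(); len++) {
            for (int start = 0; start + len <= word.length(); start++) {
                out.add(word.substring(start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allSubstrings("abc")); // [a, b, c, ab, bc, abc]
    }
}
```

For a 14-letter word like "lifeexpectancy" this already produces up to 105 terms, which is why the index grows quickly.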
Another possibility is to use wildcards. Index whatever you have, and search for:
<term>*, a*<term>*, …, z*<term>* instead of for <term>. It will take a LOT more time, but it will not increase the index size.
Note: it is necessary to search for so many terms because you CANNOT use a wildcard as the first character of a term.
a*<term>* means: search for all terms that start with a, followed by zero or more characters, then <term>, and then zero or more characters again.
More info about terms and wildcards in Lucene: http://lucene.apache.org/java/2_0_0/queryparsersyntax.html
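As a rough sketch, the list of wildcard patterns for one term could be built like this. It only produces the query text; passing it to Lucene's query parser is not shown. The names WildcardPatterns, patternsFor and toQuery are hypothetical, and the code assumes the indexed terms are lowercase and start with a letter a-z, and that the term contains no query-syntax special characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene class.
class WildcardPatterns {

    // Builds term*, a*term*, b*term*, ..., z*term* (27 patterns).
    // Assumption: indexed terms are lowercase and begin with a-z.
    static List<String> patternsFor(String term) {
        List<String> patterns = new ArrayList<>();
        patterns.add(term + "*");
        for (char c = 'a'; c <= 'z'; c++) {
            patterns.add(c + "*" + term + "*");
        }
        return patterns;
    }

    // Joins the patterns with OR so they can be sent as a single query string.
    static String toQuery(String term) {
        return String.join(" OR ", patternsFor(term));
    }

    public static void main(String[] args) {
        System.out.println(toQuery("expectancy"));
    }
}
```

Every user term expands into 27 wildcard clauses, which is where the extra search time comes from.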
EDIT:
A combination of the two will provide (in my opinion) the best solution:
Index all suffixes of each string, and then for each term (and not for the whole query!) search for <term>* instead of <term>. If the term occurs as a substring, it is also the prefix of at least one suffix, so the search will find it.
For example, if you have "lifeexpectancy", you will index: "lifeexpectancy", "ifeexpectancy", "feexpectancy", "eexpectancy", ..., "y".
For the same example, when you want to search for life expectancy, you will search for life* expectancy*
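The combined approach can be sketched in plain Java as follows. The names SuffixIndexing, suffixes, toPrefixQuery and matches are invented for this example; matches only simulates what a prefix search over the indexed suffixes would do and is not how Lucene evaluates the query internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene class.
class SuffixIndexing {

    // All suffixes of word, longest first: only n strings for a word of length n.
    static List<String> suffixes(String word) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            out.add(word.substring(i));
        }
        return out;
    }

    // Appends * to each whitespace-separated term of the user's query.
    static String toPrefixQuery(String userQuery) {
        StringBuilder sb = new StringBuilder();
        for (String term : userQuery.trim().split("\\s+")) {
            if (term.isEmpty()) continue;
            if (sb.length() > 0) sb.append(' ');
            sb.append(term).append('*');
        }
        return sb.toString();
    }

    // Simulates the search: term* hits if any indexed suffix starts with term.
    static boolean matches(String indexedWord, String term) {
        for (String suffix : suffixes(indexedWord)) {
            if (suffix.startsWith(term)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(toPrefixQuery("life expectancy")); // life* expectancy*
        System.out.println(matches("lifeexpectancy", "expectancy")); // true
    }
}
```

Compared with indexing every substring, this stores n strings per word instead of O(n^2), and each user term becomes a single trailing-wildcard search.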