If I add a custom Attribute (for example, part of speech) to a TokenStream, is it used in the indexing process?
Can I retrieve this attribute from the index? Is it stored for every token?
If I understand what you are looking for, I think you would need to create your own custom TokenStream (probably by extending a standard one), decide how you want to store the extra information, and work out how to retrieve it meaningfully from the index.
I know of no way to accomplish something like that out-of-the-box.
Off the top of my head, I’d think you’d need to write a new document for each token coming through your custom TokenStream. Then, when searching, use a highlighter or something similar to find which terms a query matches, and query the index again to retrieve the metadata documents for those terms. This assumes that any token reused by this or another document will have the same metadata assigned to it. If that’s not the case, you’d have to work out a way to identify the documents you are looking for that isn’t sensitive to collisions.
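To make the collision concern concrete, here is a rough plain-Java sketch of that first idea. It does not use any Lucene classes: a map stands in for the index of per-token metadata documents, and `TermMetadataStore`, `put`, and `lookup` are names made up for illustration. It only shows what happens when the same term arrives with different metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an index of "metadata documents", one per term.
// A real version would write and query index documents instead of a map.
class TermMetadataStore {
    private final Map<String, String> posByTerm = new LinkedHashMap<>();

    // Records the tag for a term. Returns false when the term already has a
    // different tag, which is the collision case described above.
    boolean put(String term, String pos) {
        String existing = posByTerm.putIfAbsent(term, pos);
        return existing == null || existing.equals(pos);
    }

    // Returns the tag stored for a term, or null if the term is unknown.
    String lookup(String term) {
        return posByTerm.get(term);
    }
}
```

If `put` ever returns false, keying the metadata by term alone is not enough, and the key would need something extra (document id, position, or similar).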
Or you could write another field on the same document, containing an ordered list of metadata for each token that parallels the structure of the text. Store both fields, use a highlighter again to find the matched terms, and read the metadata at the matching position in the list your TokenStream created.
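The parallel-field idea might look roughly like this. Again this is plain Java with no Lucene calls, and `ParallelTagField`, `encodeTags`, and `tagAt` are hypothetical names. It assumes tags contain no spaces and that you can get a 0-based token position for each match from somewhere (the highlighter step is not shown).

```java
import java.util.List;

// Hypothetical helper: builds and reads a second field whose entries line up
// one-to-one with the tokens of the main text field.
class ParallelTagField {
    // Joins one tag per token, in token order, into a single stored string.
    static String encodeTags(List<String> tagsInTokenOrder) {
        return String.join(" ", tagsInTokenOrder);
    }

    // Reads the tag at a 0-based token position from the stored string.
    static String tagAt(String storedTags, int position) {
        String[] tags = storedTags.split(" ");
        if (position < 0 || position >= tags.length) {
            throw new IndexOutOfBoundsException("no tag at position " + position);
        }
        return tags[position];
    }
}
```

For the text "the dog runs", the second field would hold something like "DET NOUN VERB", and a match on the token at position 2 would read back "VERB". This only works if the tokenizer that produced the tags and the one that produced the positions agree on token boundaries.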
Well, that’s a couple of thoughts anyway.