I need to index some XML documents with Lucene, but before that, I need


Asked: May 23, 2026


I need to index some XML documents with Lucene, but before that, I need to parse those XML files and extract some information from their tags.

The XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<tt xml:lang="es" xmlns="http://www.w3.org/2006/04/ttaf1"  xmlns:tts="http://www.w3.org/2006/04/ttaf1#styling">
  <head>
        <styling>
            <style id="bl" tts:fontWeight="bold" tts:color="#FFFFFF" tts:fontSize="15" tts:fontFamily="sansSerif"/>
       </styling>
  </head>

  <body>
    <div xml:lang="es">
            <p begin="00:00.50" end="00:04.02" style="bl">Info</p>
            <p begin="00:04.32" end="00:07.68" style="bl">Different words,<br />and phrases to index</p>
            <p begin="00:11.76" end="00:16.04" style="bl">Text</p>
            <p begin="00:18.52" end="00:22.88" style="bl">More and<br />more text</p>
   </div>
  </body>
</tt>

I need to extract only the timestamps in the begin and end attributes, and then index the text inside the p tags. The goal is to query the indexed text and know which timestamp range each hit falls in.

For example, if I query the word “Text”, the output should say something like: “2 hits, 00:11.76-00:16.04, 00:18.52-00:22.88”

I started by indexing the entire XML with Lucene. Now I want to parse the file, but I'm not sure what the best approach to this problem is.

Any help or advice is welcome 🙂
Thank you all!


1 Answer

Editorial Team · Answer 1 · 2026-05-23T08:17:04+00:00

I used the SAX API (that is, a subclass of org.xml.sax.helpers.DefaultHandler) to parse XML files, extracted the desired information from each XML document into my own Document class, and then indexed that Document instance. (The indirection was there because I had multiple document formats that had to be parsed separately but indexed in the same index.)

In your case, if the contents of each <body> element represent one logical document, you can store the timing information (the begin and end values) as payloads associated with specific tokens. Parse the XML down to the <p> level, enumerate the paragraph instances, and for each one add a new Field instance with the same name, where the value is the text and the payload is the timing information, suitably represented. (Payloads are binary, so you could, for example, store the two long values corresponding to the start and end times.)

When you add multiple Field instances with the same name to a document, they are indexed as the same field, but you can assign a different payload to each instance, adjust the position at which its text starts, and so on.
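As a rough illustration of the parsing step, here is a minimal sketch of a DefaultHandler subclass that collects the begin and end attributes and the text of every <p> element. It uses only the JDK. The names SubtitleSaxSketch, Cue, CueHandler and parse are made up for this example, and it assumes the default (non-namespace-aware) parser, so elements are matched by their raw name "p".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SubtitleSaxSketch {

    /** Hypothetical holder for one <p> element: its timestamps and its text. */
    public static final class Cue {
        public final String begin;
        public final String end;
        public final String text;

        Cue(String begin, String end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /** Collects one Cue per <p> element encountered. */
    static final class CueHandler extends DefaultHandler {
        final List<Cue> cues = new ArrayList<>();
        private StringBuilder text; // non-null only while inside a <p>
        private String begin;
        private String end;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("p".equals(qName)) {
                begin = atts.getValue("begin");
                end = atts.getValue("end");
                text = new StringBuilder();
            } else if ("br".equals(qName) && text != null) {
                text.append(' '); // treat <br /> as a word separator
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append rather than assign.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(qName)) {
                cues.add(new Cue(begin, end, text.toString().trim()));
                text = null;
            }
        }
    }

    /** Parses the XML held in a string and returns the cues in document order. */
    public static List<Cue> parse(String xml) throws Exception {
        CueHandler handler = new CueHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.cues;
    }
}
```

Each Cue can then be turned into whatever Lucene needs: one field instance per cue inside a single document, or one document per cue.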

If you don’t need the contents of each <body> element as a single document, you can treat each <p> as a separate document and set the payload on that. Alternatively, you can store the timestamps in a separate field.
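Whichever layout you pick, the begin and end values need some concrete representation. The sketch below shows one possibility for the "two long values" idea: convert each timestamp to milliseconds and pack the pair into a 16-byte array. It assumes the timestamps have the form minutes:seconds.fraction, as in the sample file, with at most three fractional digits. The class and method names (TimestampPayloadSketch, toMillis, encode, decode) are hypothetical, and nothing here calls Lucene itself.

```java
import java.math.BigDecimal;
import java.nio.ByteBuffer;

public class TimestampPayloadSketch {

    /**
     * Converts a timestamp such as "00:11.76" to milliseconds.
     * Assumes the form minutes:seconds.fraction; throws ArithmeticException
     * if the fraction has more than three digits.
     */
    public static long toMillis(String timestamp) {
        String[] parts = timestamp.split(":");
        long minutes = Long.parseLong(parts[0]);
        BigDecimal seconds = new BigDecimal(parts[1]);
        return minutes * 60_000L + seconds.movePointRight(3).longValueExact();
    }

    /** Packs begin and end (in milliseconds) into 16 big-endian bytes. */
    public static byte[] encode(long beginMillis, long endMillis) {
        return ByteBuffer.allocate(2 * Long.BYTES)
                .putLong(beginMillis)
                .putLong(endMillis)
                .array();
    }

    /** Reverses encode: returns {beginMillis, endMillis}. */
    public static long[] decode(byte[] payload) {
        ByteBuffer buffer = ByteBuffer.wrap(payload);
        return new long[] { buffer.getLong(), buffer.getLong() };
    }
}
```

A byte array like this could serve as the payload, or the two millisecond values could be stored as ordinary fields on a per-<p> document. Storing them as numbers rather than strings also makes it easy to sort or filter hits by time.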
