The Archive Base

Editorial Team
Asked: May 23, 2026 (2026-05-23T08:17:04+00:00)

I need to index some XML documents with Lucene, but before that I need to parse those XML files and extract some information from their tags.

The XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<tt xml:lang="es" xmlns="http://www.w3.org/2006/04/ttaf1"  xmlns:tts="http://www.w3.org/2006/04/ttaf1#styling">
  <head>
        <styling>
            <style id="bl" tts:fontWeight="bold" tts:color="#FFFFFF" tts:fontSize="15" tts:fontFamily="sansSerif"/>
       </styling>
  </head>

  <body>
    <div xml:lang="es">
            <p begin="00:00.50" end="00:04.02" style="bl">Info</p>
            <p begin="00:04.32" end="00:07.68" style="bl">Different words,<br />and phrases to index</p>
            <p begin="00:11.76" end="00:16.04" style="bl">Text</p>
            <p begin="00:18.52" end="00:22.88" style="bl">More and<br />more text</p>
   </div>
  </body>
</tt>

I need to extract only the timestamps from the begin and end attributes, and then index the text inside the p elements. The goal is to query the indexed text and find out which timestamp range each hit falls in.

For example, if I query the word “Text”, the output should say something like: “2 hits, 00:11.76-00:16.04, 00:18.52-00:22.88”

I started by indexing the entire XML file with Lucene. Now I want to parse the file, but I'm not sure what the best approach to this problem is.

Any help or advice is welcome 🙂
Thank you all!


1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 8:17 am

    I used SAX (that is, a subclass of org.xml.sax.helpers.DefaultHandler) to parse the XML files, extracted the desired information from each XML document into my own Document class, and then indexed that Document instance. (The indirection was there because I had multiple document formats that had to be parsed separately but indexed in the same index.)

    In your case, if the contents of each <body> element represent one logical document, you can store the timestamp information as payloads associated with specific tokens. Parse the XML down to the <p> level, enumerate the <p> elements, and for each one add a new Field instance with the same name, where the value is the text and the payload is the timestamp information, suitably represented. (Payloads are binary, so you could, for example, store two long values corresponding to the begin and end times.)

    When you add multiple Field instances with the same name to a document, they are indexed as the same field, but you can assign different payloads to each instance, adjust the position at which its text starts, and so on.
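The parsing step described above could look roughly like the sketch below. It uses only the JDK's SAX classes, and the names SubtitleSaxSketch, Cue, CueHandler, toMillis and toPayload are made up for illustration. It assumes every <p> carries begin and end attributes in the mm:ss.ff form shown in the question, and it treats <br /> as a space. The last helper packs the two long values into a byte array; how those bytes get attached to tokens as a payload depends on your Lucene version and analysis chain and is not shown here.

```java
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical names throughout; this is a sketch, not the answerer's actual code.
class SubtitleSaxSketch {

    /** One <p> element: its time range in milliseconds plus its text. */
    static final class Cue {
        final long beginMs;
        final long endMs;
        final String text;

        Cue(long beginMs, long endMs, String text) {
            this.beginMs = beginMs;
            this.endMs = endMs;
            this.text = text;
        }
    }

    /** Collects one Cue per <p>; a nested <br/> is replaced by a space. */
    static final class CueHandler extends DefaultHandler {
        final List<Cue> cues = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inParagraph;
        private long beginMs;
        private long endMs;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("p".equals(qName)) {
                inParagraph = true;
                text.setLength(0);
                beginMs = toMillis(atts.getValue("begin"));
                endMs = toMillis(atts.getValue("end"));
            } else if (inParagraph && "br".equals(qName)) {
                text.append(' ');
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inParagraph) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(qName)) {
                cues.add(new Cue(beginMs, endMs, text.toString().trim()));
                inParagraph = false;
            }
        }
    }

    /** Converts "mm:ss.ff" (the format assumed from the sample) to milliseconds. */
    static long toMillis(String stamp) {
        String[] parts = stamp.split(":");
        long minutes = Long.parseLong(parts[0]);
        double seconds = Double.parseDouble(parts[1]);
        return minutes * 60_000 + Math.round(seconds * 1000);
    }

    /** Packs begin and end as two big-endian longs (16 bytes in total). */
    static byte[] toPayload(Cue cue) {
        return ByteBuffer.allocate(2 * Long.BYTES).putLong(cue.beginMs).putLong(cue.endMs).array();
    }

    /** Parses an XML string with the default (non-namespace-aware) SAX parser. */
    static List<Cue> parse(String xml) {
        try {
            CueHandler handler = new CueHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.cues;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse subtitle XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body><p begin=\"00:18.52\" end=\"00:22.88\">More and<br />more text</p></body>";
        for (Cue cue : parse(xml)) {
            System.out.println(cue.beginMs + "-" + cue.endMs + " ms: " + cue.text);
        }
    }
}
```

Storing the times as milliseconds keeps the payload at a fixed 16 bytes; reading it back is two getLong() calls on a ByteBuffer wrapped around the array.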

    If you don’t need the contents of each <body> element as a single document, you can treat each <p> as a separate document and set the payload on that. Alternatively, you can store the timestamps in separate fields.
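To make the one-document-per-<p> idea concrete, here is a small plain-Java stand-in for an index. It is not Lucene code: CueIndexSketch and CueDoc are hypothetical names, the "analysis" is just lowercasing and splitting on non-alphanumeric characters, and begin and end are kept as plain strings next to the text, the way stored fields would sit next to an indexed text field. The sample data is the four <p> elements from the question, with <br /> replaced by a space.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java stand-in for an index; none of these names are Lucene API.
class CueIndexSketch {

    /** One "document" per <p>: searchable text plus begin/end kept alongside it. */
    static final class CueDoc {
        final String begin;
        final String end;
        final String text;

        CueDoc(String begin, String end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    // token -> documents containing it, in insertion order
    private final Map<String, List<CueDoc>> postings = new HashMap<>();

    /** Crude analysis: lowercase, split on non-alphanumerics, one posting per document. */
    void add(CueDoc doc) {
        Set<String> seen = new HashSet<>();
        for (String token : doc.text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && seen.add(token)) {
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(doc);
            }
        }
    }

    /** Returns the hit count followed by the begin-end range of every matching document. */
    String search(String term) {
        List<CueDoc> hits = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<CueDoc>emptyList());
        StringBuilder out = new StringBuilder(hits.size() + " hits");
        for (CueDoc doc : hits) {
            out.append(", ").append(doc.begin).append('-').append(doc.end);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        CueIndexSketch index = new CueIndexSketch();
        index.add(new CueDoc("00:00.50", "00:04.02", "Info"));
        index.add(new CueDoc("00:04.32", "00:07.68", "Different words, and phrases to index"));
        index.add(new CueDoc("00:11.76", "00:16.04", "Text"));
        index.add(new CueDoc("00:18.52", "00:22.88", "More and more text"));
        // prints: 2 hits, 00:11.76-00:16.04, 00:18.52-00:22.88
        System.out.println(index.search("Text"));
    }
}
```

With this layout each hit already is a time range, so no payload decoding is needed at query time. The trade-off is that a phrase spanning two <p> elements can no longer match inside a single document.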

© 2021 The Archive Base. All Rights Reserved