I am trying to parse a plain text file using Tika but getting inconsistent behavior


Editorial Team

Asked: May 23, 2026 (2026-05-23T14:45:37+00:00)

I am trying to parse a plain text file using Tika but getting inconsistent
behavior.

More specifically, I have defined a simple handler as follows:

public class MyHandler extends DefaultHandler
{
     @Override
     public void characters(char ch[], int start, int length) throws SAXException
     {
        System.out.println(new String(ch));
     }
}

Then, I parse the file ("myfile.txt") as follows:

Tika tika = new Tika();
InputStream is = new FileInputStream("myfile.txt");

Metadata metadata = new Metadata();
ContentHandler handler = new MyHandler();

Parser parser = new TXTParser();
ParseContext context = new ParseContext();

String mimeType = tika.detect(is);
metadata.set(HttpHeaders.CONTENT_TYPE, mimeType);

parser.parse(is, handler, metadata, context);

I would expect all the text in the file to be printed on screen, but a
small part at the end is not. More specifically, the characters() callback
keeps receiving 4,096 characters per call, but at the end it apparently
leaves out the last 5,083 characters of this particular file (which is a few
MB long), so more than just the final callback is missing.

Also, when I test with another, smaller file of about 5,000 characters,
no callback seems to take place at all!

The MIME type is correctly detected as text/plain in both cases.

Any ideas?

Thanks!

1 Answer

Editorial Team · Answer 1 · 2026-05-23T14:45:37+00:00

What version of Tika are you using? Looking at the source code, TXTParser reads the text in chunks of 4096 characters (line 129) and invokes the characters(...) routine once for each chunk (line 132).

In short, the relevant code is:

   char[] buffer = new char[4096];
   int n = reader.read(buffer);
   while (n != -1) {
       xhtml.characters(buffer, 0, n);
       n = reader.read(buffer);
   }

where reader is a BufferedReader. I cannot see any flaw in this code, so I suspect you might be working with an older version.
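
Two details in the question's own code, rather than in this loop, could also produce these symptoms. First, the loop reuses one buffer and reports the valid range as (0, n), so a handler that builds new String(ch) from the whole array prints stale characters after a short final read. Second, FileInputStream does not support mark/reset, so bytes read from it during tika.detect(is) may no longer be available when parse() runs; I am assuming, without having checked that code path, that detection pulls a buffered chunk off the front of such a stream. Below is a minimal JDK-only sketch of both effects, not Tika's code: the class name and the helpers pump, noMark, and bytesLeftAfterPeek are made up for illustration, and the 4-character buffer and 8192-byte peek buffer are chosen only to keep the demo small.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ChunkedCallbackSketch {

    // Appends only the range the caller reports as valid.
    static class RangeHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Appends the whole array on every callback, as the handler in the question does.
    static class WholeBufferHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(new String(ch));
        }
    }

    // Same shape as the loop quoted above; bufferSize is a parameter (an assumption
    // of this sketch) so the effect is visible on a short input.
    static void pump(Reader reader, DefaultHandler handler, int bufferSize)
            throws IOException, SAXException {
        char[] buffer = new char[bufferSize];
        int n = reader.read(buffer);
        while (n != -1) {
            handler.characters(buffer, 0, n);
            n = reader.read(buffer);
        }
    }

    // Hypothetical stand-in for a stream without mark/reset support, such as FileInputStream.
    static InputStream noMark(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
    }

    // Reads a single byte through a temporary BufferedInputStream, then counts
    // how many bytes the original stream can still deliver.
    static int bytesLeftAfterPeek(byte[] data) throws IOException {
        InputStream raw = noMark(data);
        new BufferedInputStream(raw, 8192).read(); // fills its buffer from raw
        return raw.readAllBytes().length;
    }

    public static void main(String[] args) throws Exception {
        RangeHandler good = new RangeHandler();
        pump(new StringReader("abcdefghij"), good, 4);
        System.out.println(good.text); // prints abcdefghij

        WholeBufferHandler bad = new WholeBufferHandler();
        pump(new StringReader("abcdefghij"), bad, 4);
        System.out.println(bad.text); // prints abcdefghijgh (stale "gh" from the previous chunk)

        System.out.println(bytesLeftAfterPeek(new byte[5000]));  // prints 0
        System.out.println(bytesLeftAfterPeek(new byte[20000])); // prints 11808
    }
}
```

If the second effect is what is happening here, it may be worth wrapping the FileInputStream in a BufferedInputStream before calling detect, so that the stream can be marked and reset. Independently of that, building the string with new String(ch, start, length) in the handler respects the range the parser reports.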
