I have a flat file of e-mail header data that I’m trying to parse

Asked: June 9, 2026 (2026-06-09T06:18:22+00:00)

I have a flat file of e-mail header data that I’m trying to parse for analysis. The file will always have its fields in the following order: a record number (1 or 2 bytes), "From:" followed by the sender’s name, and "Sent:" followed by the date sent.

1 From: Person.Name Sent: April 12, 2010
2 From:<tab>Person.Name Sent: April 30, 2011
10 From: Person.Name Sent: June 29, 2012
11 From:<tab>Person.Name Sent: July 8, 2012

Using BufferedReader, I am reading the file line by line and defining a substring of the name based on all characters between the indices of "From:" and "Sent:".

String sender = inputLine.substring((inputLine.indexOf("From:")+6),(inputLine.indexOf("Sent:")-1));

In this case, I’m grabbing everything AFTER “From: ” (sixth byte excludes the word, the colon, and the space/single byte after the colon) through one LESS than the position of “Sent: ” (the space before the S).

However, I’m getting unexpected output when I run the job. Some of my input data appears to have a tab after "From:" and some lines do not. When a tab is present, my output includes the last two or three bytes of "From:" (when the record number is a single digit, I get m:<tab>; for double-digit record numbers it’s om:<tab>).

Person.Name
m:<tab>Person.Name        <-- single digit record number
Person.Name        
om:<tab>Person.Name       <-- double digit record number

EDIT: When I amend my substring to

String sender = inputLine.substring((inputLine.indexOf("From:\t")+6),(inputLine.indexOf("Sent:")-1));

ONLY the records with a space (and not a tab) prepend the end of the From: to the output.

Person.Name        <-- records with From:<tab>
om: Person.Name    <-- records with From:<space>

I’m now wondering if I understand substring correctly. My statement above is based on an understanding of substring(x,y) where x is the start and y is the end of the string. Is that correct?

Since indexOf("From:") should return an integer value of 2 or 3 (depending on a 1- or 2-byte record number, e.g., 1 From: or 10 From:), I would think that adding 6 would give me an index that falls AFTER the colon, at index 8 or 9 from the front of the line. So why does it appear to treat this as an index of 5, regardless?

           111111111122222222222  |
 0123456789012345678901234567890  + index values
 1 From: Person.Name Sent: June
 10 From: Person.Name Sent: July
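
To make the expected arithmetic concrete, here is a minimal sketch of the statement above applied to two of the sample lines, written as Java string literals. The class name `SubstringDemo` and the helper `sender` are hypothetical, and the sketch assumes the lines contain exactly the characters shown in the sample data (a single space or a single tab after the colon). In `substring(begin, end)`, `begin` is inclusive and `end` is exclusive.

```java
class SubstringDemo {
    // Mirrors the statement from the question: skip "From:" plus one
    // separator character, and stop one character before "Sent:".
    static String sender(String inputLine) {
        int from = inputLine.indexOf("From:"); // index of the 'F'
        int sent = inputLine.indexOf("Sent:"); // index of the 'S'
        // substring(begin, end): begin is inclusive, end is exclusive.
        return inputLine.substring(from + 6, sent - 1);
    }

    public static void main(String[] args) {
        String spaced = "1 From: Person.Name Sent: April 12, 2010";
        String tabbed = "11 From:\tPerson.Name Sent: July 8, 2012";

        System.out.println(spaced.indexOf("From:")); // prints 2
        System.out.println(tabbed.indexOf("From:")); // prints 3
        System.out.println(sender(spaced));          // prints Person.Name
        System.out.println(sender(tabbed));          // prints Person.Name
    }
}
```

Under these assumptions a tab counts as one character, just like a space, so both lines yield the bare name. If the real file behaves differently, the lines or the search string likely differ from what is shown here.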

The tab is the only difference in the records, and while I understand that a tab character may need to be counted differently than an ASCII space character, SUBTRACTING from the index seems a little strange.

Even more interesting, if I remove the “adjustments” from the statement,

     String sender = inputLine.substring((inputLine.indexOf("From:")),(inputLine.indexOf("Sent:")));

I get a -1 out of range exception.
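
One reading that fits both the index of 5 and the -1 exception: `String.indexOf` returns -1 when the search string is not found, and -1 + 6 = 5. The sketch below illustrates that case only. It assumes the literal actually being searched for does not occur in the tabbed lines (here a hypothetical `"From: "` with a trailing space, which is not what the statement as posted shows); the class `NotFoundDemo` and the `marker` parameter are made up for the illustration.

```java
class NotFoundDemo {
    // The question's statement, with the search literal passed in so the
    // effect of a failed search can be observed.
    static String sender(String inputLine, String marker) {
        return inputLine.substring(inputLine.indexOf(marker) + 6,
                                   inputLine.indexOf("Sent:") - 1);
    }

    public static void main(String[] args) {
        String single = "1 From:\tPerson.Name Sent: April 30, 2011";
        String twoDigit = "11 From:\tPerson.Name Sent: July 8, 2012";

        // Assumed search string: "From: " with a space, which the tabbed
        // lines do not contain, so indexOf returns -1.
        System.out.println(single.indexOf("From: "));     // prints -1

        // -1 + 6 = 5, so the substring starts at index 5 on every line.
        System.out.println(sender(single, "From: "));     // prints m:<tab>Person.Name
        System.out.println(sender(twoDigit, "From: "));   // prints om:<tab>Person.Name

        // Without the +6 adjustment the begin index is -1 itself.
        try {
            single.substring(single.indexOf("From: "), single.indexOf("Sent:"));
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("begin index -1 is out of range");
        }
    }
}
```

Index 5 is the `m` when the record number has one digit and the `o` when it has two, which matches the `m:<tab>` and `om:<tab>` prefixes shown earlier.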

Can someone please explain what’s happening here? I am baffled, and can’t find answers this specific in Oracle’s Java documentation.


1 Answer

Editorial Team · Answer 1 · 2026-06-09T06:18:24+00:00


I ended up creating new input fields that replaced \t with a space. Then everything worked fine. What it was about the tab character that threw things off is still a mystery.
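
A rough sketch of that workaround, for anyone hitting the same thing. The class `SenderParser` and its `sender` method are hypothetical names, not the code actually used. It replaces tabs with spaces before searching, checks for the -1 that `indexOf` returns when a marker is missing, and trims whitespace instead of relying on fixed offsets.

```java
class SenderParser {
    // Returns the sender name between "From:" and "Sent:", or null if
    // either marker is missing or they appear in the wrong order.
    static String sender(String inputLine) {
        // Normalize tabs so the separator after "From:" is always a space.
        String normalized = inputLine.replace('\t', ' ');

        int from = normalized.indexOf("From:");
        int sent = normalized.indexOf("Sent:");

        // indexOf returns -1 when the marker is not found; check before slicing.
        if (from < 0 || sent < 0 || sent < from) {
            return null;
        }

        // Start right after the colon and let trim() drop the surrounding
        // whitespace, so no +6 / -1 adjustments are needed.
        return normalized.substring(from + "From:".length(), sent).trim();
    }

    public static void main(String[] args) {
        System.out.println(sender("2 From:\tPerson.Name Sent: April 30, 2011"));  // prints Person.Name
        System.out.println(sender("10 From: Person.Name Sent: June 29, 2012"));   // prints Person.Name
    }
}
```

Returning null for a line with no markers is only one option; throwing or logging the offending line may suit a batch job better.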
