I have a flat file of e-mail header data that I’m trying to parse for analysis. Each line always has its fields in the following order: the record number (1 or 2 bytes), "From:" followed by the sender’s name, and "Sent:" followed by the date sent.
1 From: Person.Name Sent: April 12, 2010
2 From:<tab>Person.Name Sent: April 30, 2011
10 From: Person.Name Sent: June 29, 2012
11 From:<tab>Person.Name Sent: July 8, 2012
Using BufferedReader, I am reading the file line by line and defining a substring for the name based on all characters between the indices of "From:" and "Sent:".
String sender = inputLine.substring(inputLine.indexOf("From:") + 6, inputLine.indexOf("Sent:") - 1);
In this case, I’m grabbing everything AFTER “From: ” (adding 6 skips the word, the colon, and the single space after the colon) through one LESS than the position of “Sent:” (that is, up to the space before the S).
However, I’m getting unexpected output when I run the job. Some of my input lines appear to have a tab after "From:" and some have a space. When a tab is present, my output includes the last two or three bytes of "From:" (when the record number is a single digit, I get m:<tab>; for double-digit record numbers it’s om:<tab>).
Person.Name
m:<tab>Person.Name <-- single digit record number
Person.Name
om:<tab>Person.Name <-- double digit record number
EDIT: When I amend my substring to
String sender = inputLine.substring(inputLine.indexOf("From:\t") + 6, inputLine.indexOf("Sent:") - 1);
ONLY the records with a space (and not a tab) prepend the end of "From:" to the output.
Person.Name <-- records with From:<tab>
om: Person.Name <-- records with From:<space>
I’m now wondering if I understand substring correctly. My statement above is based on an understanding of substring(x,y) where x is the start and y is the end of the string. Is that correct?
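For reference, here is a minimal sketch of how `substring(beginIndex, endIndex)` and `indexOf` behave on one of the sample lines. The class and method names (`SubstringDemo`, `between`) are made up for illustration. `beginIndex` is inclusive and `endIndex` is exclusive, and both are zero-based character positions.

```java
class SubstringDemo {
    // Hypothetical helper: returns the untrimmed text between "From:" and "Sent:".
    static String between(String line) {
        int from = line.indexOf("From:"); // index of the 'F'
        int sent = line.indexOf("Sent:"); // index of the 'S'
        // "From:" is 5 characters long, so from + 5 is the first character after the colon.
        // The end index is exclusive, so the 'S' of "Sent:" is not included.
        return line.substring(from + 5, sent);
    }

    public static void main(String[] args) {
        String line = "1 From: Person.Name Sent: April 12, 2010";
        System.out.println(line.indexOf("From:"));     // 2
        System.out.println(line.substring(2, 7));      // From:
        System.out.println("[" + between(line) + "]"); // [ Person.Name ]
    }
}
```

The brackets in the last line make the surrounding whitespace visible. Calling `trim()` on the result would remove a leading space or tab alike.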
Since indexOf("From:") is intended to return an integer value of 2 or 3 (depending on a 1- or 2-byte record number, e.g., 1 From: or 10 From:), I would think that adding 6 would give me an index of 8 or 9 from the front of the line, which falls AFTER the colon. So why does it appear to be treating this as an index of 5 regardless?
          111111111122222222223
0123456789012345678901234567890  <-- index values
1 From: Person.Name Sent: June
10 From: Person.Name Sent: July
The tab is the only difference in the records, and while I understand that a tab character may need to be counted differently than an ASCII space character, SUBTRACTING from the index seems a little strange.
Even more interesting, if I remove the “adjustments” from the statement,
String sender = inputLine.substring(inputLine.indexOf("From:"), inputLine.indexOf("Sent:"));
I get a -1 out of range exception.
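One thing that may be worth checking here: `String.indexOf` returns -1 when the search string does not occur in the line, and -1 + 6 is 5. The sketch below assumes, purely for illustration, a search string with a trailing space ("From: "), which is not the literal shown in the statements above. The names `IndexOfDemo`, `startIndex`, and `senderOrNull` are hypothetical. It shows how a failed search would yield a start index of 5 and how a guard could avoid slicing on a missing marker.

```java
class IndexOfDemo {
    // Computes the start index the same way as the original statement: indexOf(marker) + 6.
    static int startIndex(String line, String marker) {
        return line.indexOf(marker) + 6;
    }

    // Guarded variant: returns null instead of slicing when a marker is missing.
    static String senderOrNull(String line, String marker) {
        int from = line.indexOf(marker);
        int sent = line.indexOf("Sent:");
        if (from < 0 || sent < 0 || from + marker.length() > sent) {
            return null;
        }
        // trim() strips leading and trailing whitespace, including tabs.
        return line.substring(from + marker.length(), sent).trim();
    }

    public static void main(String[] args) {
        String tabLine = "2 From:\tPerson.Name Sent: April 30, 2011";
        String marker = "From: "; // assumed marker with a trailing space; absent from tabLine

        System.out.println(tabLine.indexOf(marker));        // -1
        int start = startIndex(tabLine, marker);            // -1 + 6 = 5
        int end = tabLine.indexOf("Sent:") - 1;
        System.out.println(tabLine.substring(start, end));  // m:, a tab, then Person.Name
        System.out.println(senderOrNull(tabLine, marker));  // null
        System.out.println(senderOrNull(tabLine, "From:")); // Person.Name
    }
}
```

Printing the raw `indexOf` results for a failing line is a quick way to confirm or rule this out.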
Can someone please explain what’s happening here? I am baffled, and can’t find answers this specific in Oracle’s Java documentation.
I ended up creating new input fields that replaced \t with a space. Then everything worked fine. What it was about the tab character that threw things off is still a mystery.
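As an alternative to rewriting the input, a regular expression can accept either a space or a tab after "From:". This is only a sketch under the assumption that every line follows the layout described at the top of the post. The class name `HeaderLineParser` and the method `sender` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderLineParser {
    // Record number, "From:", optional whitespace (space or tab), the name,
    // whitespace, "Sent:", optional whitespace, then the date.
    private static final Pattern LINE =
            Pattern.compile("^(\\d+)\\s+From:\\s*(.*?)\\s+Sent:\\s*(.*)$");

    // Returns the sender name, or null if the line does not match the assumed layout.
    static String sender(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(sender("1 From: Person.Name Sent: April 12, 2010")); // Person.Name
        System.out.println(sender("11 From:\tPerson.Name Sent: July 8, 2012")); // Person.Name
    }
}
```

Because `\s` matches both spaces and tabs, the same pattern handles both kinds of line without any fixed offsets.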