I’m using VB.NET to process a long fixed-length record. The simplest option seems to be loading the whole record into a string and using Substring to access the fields by position and length. But it seems like there will be some redundant processing within the Substring method that happens on every single invocation. That led me to wonder whether I might get better results using a stream- or array-based approach.
The content starts out as a byte array containing UTF-8 character data. A few other approaches I’ve thought of are listed below.
- Loading the string into a StringReader and reading blocks of it at a time
- Converting the byte array into a char array and accessing the characters positionally within the array
- (This one seems dumb but I’ll throw it out there) Copying the byte array to a memory stream and using a StreamReader
This is definitely premature optimization; the substring approach may be perfectly acceptable even if it’s a few milliseconds slower. But I thought I’d ask before coding it, just to see if anyone could think of a reason to use one of the other approaches.
The primary cost with `Substring` is the excising of the substring into a new string. Using Reflector you can see that the copy happens inside an internal helper (notice that it is not `Substring()` itself), and to get there `Substring` has to go through five checks on length and such. If you are referencing the same substring multiple times then it may well be worth pulling everything out once and dumping the giant string, though you will incur some overhead in the arrays that store all these substrings.
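For reference, the decompiled method looks roughly like this — a paraphrase from memory of the .NET Framework source, so the exact checks and helper names may differ between versions:

```csharp
// Approximate shape of String.Substring as seen in a decompiler
// (paraphrased; not an exact copy of any framework version).
public string Substring(int startIndex, int length)
{
    if (startIndex < 0)
        throw new ArgumentOutOfRangeException("startIndex");
    if (length < 0)
        throw new ArgumentOutOfRangeException("length");
    if (startIndex > this.Length - length)
        throw new ArgumentOutOfRangeException("length");
    if (length == 0)
        return string.Empty;
    if (startIndex == 0 && length == this.Length)
        return this;
    // The real cost is here: a new string is allocated and
    // the characters are copied into it.
    return this.InternalSubString(startIndex, length);
}
```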
If it’s generally a “one off” access then `Substring` it, otherwise consider partitioning up. Perhaps `System.Data.DataTable` would be of use? If you’re doing multiple accesses and parsing to other data types then `DataTable` looks more attractive to me. If you only need one record in memory at a time then a `Dictionary<string, object>` should be sufficient to hold one record (field names have to be unique).

Alternatively, you could write a custom, generic class that handles fixed-length record reading for you. Indicate the start index of each field and the type of the field. The length of each field is inferred from the start of the next field (the exception is the last field, whose length can be inferred from the total record length). The types can be auto-converted using the likes of `int.Parse()`, `double.Parse()`, `bool.Parse()`, etc.

If reflection suits your fancy:
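A sketch of what that could look like — the `FixedField` and `RecordLength` attributes here are hypothetical names of my own, not from any library:

```csharp
using System;

// Hypothetical attribute marking where each field starts in the record.
[AttributeUsage(AttributeTargets.Property)]
public class FixedFieldAttribute : Attribute
{
    public int Offset { get; private set; }
    public FixedFieldAttribute(int offset) { Offset = offset; }
}

// Hypothetical attribute giving the total record length, which is
// needed to infer the length of the last field.
[AttributeUsage(AttributeTargets.Class)]
public class RecordLengthAttribute : Attribute
{
    public int Length { get; private set; }
    public RecordLengthAttribute(int length) { Length = length; }
}

// Example layout: each field's length is the gap to the next offset.
[RecordLength(30)]
public class CustomerRecord
{
    [FixedField(0)]  public string Name     { get; set; }  // chars 0-19
    [FixedField(20)] public int    Age      { get; set; }  // chars 20-22
    [FixedField(23)] public bool   IsActive { get; set; }  // chars 23-29
}
```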
Simply run through the properties, where you can get the `PropertyInfo.PropertyType` to know how to deal with the substring pulled from the record; you can pull the offsets and total length out of the attributes; and return an instance of your class with the data populated. Essentially, you could use reflection to pull out the information needed to call `RecordParser.AddField()` and `RecordLength()` from my previous suggestion.

Then wrap it all up into a neat little, no-fuss class:
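One way such a class might look — `RecordParser`, `AddField()`, and `RecordLength()` are only named in outline above, so the signatures here are my guesses at a minimal sketch:

```csharp
using System;
using System.Collections.Generic;

// Sketch: register each field's name, start index, and type;
// each field's length is inferred from the next field's start.
public class RecordParser
{
    private readonly List<Tuple<string, int, Type>> fields =
        new List<Tuple<string, int, Type>>();
    private int recordLength;

    public void AddField(string name, int startIndex, Type type)
    {
        fields.Add(Tuple.Create(name, startIndex, type));
    }

    // Total record length, used to infer the last field's length.
    public void RecordLength(int length)
    {
        recordLength = length;
    }

    public Dictionary<string, object> Parse(string record)
    {
        var result = new Dictionary<string, object>();
        for (int i = 0; i < fields.Count; i++)
        {
            int start = fields[i].Item2;
            int end = (i + 1 < fields.Count) ? fields[i + 1].Item2
                                             : recordLength;
            string raw = record.Substring(start, end - start).Trim();
            // Convert.ChangeType handles int, double, bool, string, etc.
            result[fields[i].Item1] = Convert.ChangeType(raw, fields[i].Item3);
        }
        return result;
    }
}
```

Usage would be along the lines of `parser.AddField("Age", 20, typeof(int))` for each field, `parser.RecordLength(30)`, then `parser.Parse(line)` per record.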
You could even go so far as to call `r.EnumerateFile("path\to\file")` and use the `yield return` enumeration syntax to parse out records.
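That enumeration might be sketched like so — the surrounding class and its `Parse()` call are placeholders for whichever record representation you settle on:

```csharp
using System.Collections.Generic;
using System.IO;

public partial class RecordReader
{
    // Lazily yields one parsed record per fixed-length line, so only
    // a single record needs to be held in memory at a time.
    public IEnumerable<Dictionary<string, object>> EnumerateFile(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Parse() stands in for however you chose to split
                // the record into fields (see earlier suggestions).
                yield return Parse(line);
            }
        }
    }
}
```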