I have a huge data file (~8 GB / ~80 million records). Every record has 6-8 attributes separated by a single tab. For starters, I would like to copy some given attributes to another file. I'm looking for more elegant code than the following, for example if I want only the second and the last token out of a total of 4:
StringTokenizer st = new StringTokenizer(line, "\t");
st.nextToken(); //get rid of the first token
System.out.println(st.nextToken()); //show me the second token
st.nextToken(); //get rid of the third token
System.out.println(st.nextToken()); //show me the fourth token
Keep in mind that it’s a huge file, so I have to avoid any redundant if checks.
Your question got me wondering about performance. Lately I’ve been using Guava’s Splitter where possible, just because I dig the syntax. I’d never measured its performance, so I put together a quick test of four parsing styles. It was thrown together in a hurry, so pardon any mistakes in style and edge-case correctness. All four assume that we’re only interested in the second and fourth items.
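To give a rough idea of what "parsing styles" means here, below is a minimal sketch of two JDK-only variants: `String.split` and the `StringTokenizer` approach from the question. The class and method names (`ParsingStyles`, `withSplit`, `withTokenizer`) are made up for illustration and are not the code that was benchmarked. Both assume every line has at least four columns and do no bounds checking.

```java
import java.util.StringTokenizer;

// Illustrative only: hypothetical names, no bounds checking.
public final class ParsingStyles {
    private ParsingStyles() {}

    // Style A: String.split. The limit of -1 keeps trailing empty columns,
    // which the default split(regex) would silently drop.
    public static String[] withSplit(String line) {
        String[] parts = line.split("\t", -1);
        return new String[] { parts[1], parts[3] };
    }

    // Style B: StringTokenizer, as in the question.
    // Note that StringTokenizer never returns empty tokens, so a line with
    // an empty column (two tabs in a row) shifts the remaining columns left.
    public static String[] withTokenizer(String line) {
        StringTokenizer st = new StringTokenizer(line, "\t");
        st.nextToken();                  // skip the first column
        String second = st.nextToken();  // keep the second column
        st.nextToken();                  // skip the third column
        String fourth = st.nextToken();  // keep the fourth column
        return new String[] { second, fourth };
    }
}
```

The two variants agree on well-formed lines but differ when a column is empty, which is worth knowing before picking one purely on speed.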
What I found interesting is that the “homeGrown” solution (really crude code) was the fastest when parsing a 350MB tab-delimited text file with four columns, e.g.:
When operating over 350MB of data on my laptop, I got the following results:
Given that, I think I’ll stick with Guava’s Splitter for most work and consider custom code for larger data sets.
Note that all of these would likely be slower with proper bounds checking and a more elegant implementation.
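For reference, here is a minimal sketch of what a hand-rolled version with basic bounds checking might look like. It is not the "homeGrown" code that was timed; the names `TabFields` and `secondAndFourth` are hypothetical, and returning `null` for short lines is just one possible convention. It scans for tab positions with `indexOf` and only creates the two substrings that are actually needed.

```java
// Illustrative only: hypothetical names, not the benchmarked code.
public final class TabFields {
    private TabFields() {}

    /**
     * Returns the second and fourth tab-separated columns of a line,
     * or null if the line has fewer than four columns.
     * Empty columns are preserved (unlike StringTokenizer).
     */
    public static String[] secondAndFourth(String line) {
        int t1 = line.indexOf('\t');          // end of column 1
        if (t1 < 0) return null;
        int t2 = line.indexOf('\t', t1 + 1);  // end of column 2
        if (t2 < 0) return null;
        int t3 = line.indexOf('\t', t2 + 1);  // end of column 3
        if (t3 < 0) return null;
        int t4 = line.indexOf('\t', t3 + 1);  // end of column 4, or -1 if it is the last column

        String second = line.substring(t1 + 1, t2);
        String fourth = (t4 < 0)
                ? line.substring(t3 + 1)
                : line.substring(t3 + 1, t4);
        return new String[] { second, fourth };
    }
}
```

Each `if` here is one integer comparison per tab, so whether the checks cost anything noticeable on a multi-gigabyte file is something to measure rather than assume.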