Consider a class with a static factory method that takes a CSV (or TSV) line as input (variable names changed for convenience):
    String[] fields = StringUtils.split(tsvLine, '\t');
    return new MyObject(
        Integer.parseInt(fields[0]),
        StringUtils.strip(fields[1], "\"").intern(), // Many duplicates
        StringUtils.strip(fields[2], "\""),          // Unique
        StringUtils.strip(fields[4], "\"").intern(), // Many duplicates
        Double.parseDouble(fields[7]),
        Double.parseDouble(fields[6]));
This method parses around 5 million records from a file of about 500 MB. To save memory, I've tried the following optimization, which stores the three Strings concatenated in a single char array:
public MyObject(int i1, String str0, String str1, String str2,
double d1, double d2)
{
...
this.tsvStrings = (str0+'\t'+str1+'\t'+str2).toCharArray();
...
}
(These are split, of course, in the appropriate getters and setters).
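For readers following along, here is a minimal sketch of what such a getter could look like. The class name `PackedStrings` and the method `field` are hypothetical and not taken from the original code; the sketch assumes exactly three tab-separated fields and that none of them contains a tab.

```java
// Hypothetical helper, not the asker's actual class.
final class PackedStrings {
    private final char[] tsvStrings;

    PackedStrings(String str0, String str1, String str2) {
        // Same packing as in the constructor shown above.
        this.tsvStrings = (str0 + '\t' + str1 + '\t' + str2).toCharArray();
    }

    // Returns field 0, 1 or 2 as a freshly allocated String.
    String field(int index) {
        int start = 0;
        for (int i = 0; i < index; i++) {
            start = indexOfTab(start) + 1; // skip past the preceding field
        }
        int end = indexOfTab(start);
        return new String(tsvStrings, start, end - start);
    }

    // Index of the next tab at or after 'from', or the array length if none.
    private int indexOfTab(int from) {
        for (int i = from; i < tsvStrings.length; i++) {
            if (tsvStrings[i] == '\t') {
                return i;
            }
        }
        return tsvStrings.length;
    }
}
```

One thing worth checking with this layout: the concatenation copies the characters of str0 and str2 into every record's array, so the sharing gained by calling `intern()` on those two fields is probably lost once they are packed this way.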
The process size is still well over 1 GB, although most of its contents are ignored. What's the best way to optimize this? Am I keeping unnecessary references?
EDIT: str0 and str2 have duplicates, str1 is unique.
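If you have a file that is encoded in UTF-8 and contains mostly ASCII text, it will take about double its size in memory by default, because Java uses UTF-16 in memory: String and StringBuilder use two bytes per char (and two chars for characters outside the Basic Multilingual Plane). Java 9 and later can store strings that contain only Latin-1 characters at one byte per character, which reduces this overhead.
If you manipulate that data, you may need double that amount of memory or more.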
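To make the size difference concrete, here is a small sketch that compares the encoded size of the same text in UTF-8 and UTF-16. The class name `EncodingSize` is made up for illustration, and the numbers cover only the character payload, not the per-object overhead of a String. `UTF_16LE` is used so that no byte-order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for comparing encoded sizes of a string.
final class EncodingSize {
    // Number of bytes the text occupies when encoded as UTF-8 (as in the file).
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies as UTF-16 code units (two bytes per char).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }
}
```

For an ASCII-only line such as `"hello\t42"` (8 characters), `utf8Bytes` gives 8 and `utf16Bytes` gives 16.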
You can make the processing more compact by using memory-mapped files and plain bytes instead of Strings, but given that 16 GB of memory costs about £100, it may be a better use of your time to simply use more memory.
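If you do want to try the memory-mapped route, the following is a rough sketch of the idea rather than a tuned implementation. The class `MappedTsv` and its `field` method are hypothetical. It assumes a UTF-8 file smaller than 2 GB (the limit of a single mapping), `\n` line endings, and that the caller already knows the byte offset at which each line starts. Fields are decoded into a String only when requested, so nothing but the mapping and the offsets has to stay on the heap.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: read TSV fields straight from a memory-mapped file.
final class MappedTsv {
    private final MappedByteBuffer buffer;

    MappedTsv(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            buffer = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // Decodes field 'index' of the line that starts at byte offset 'lineStart'.
    // Assumes the line really has that many fields.
    String field(int lineStart, int index) {
        int start = lineStart;
        for (int i = 0; i < index; i++) {
            start = endOfField(start) + 1; // skip the field and its delimiter
        }
        int end = endOfField(start);
        byte[] bytes = new byte[end - start];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = buffer.get(start + i); // absolute get, no shared position
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Offset of the next tab or newline at or after 'from', or the file end.
    private int endOfField(int from) {
        int i = from;
        while (i < buffer.limit()) {
            byte b = buffer.get(i);
            if (b == '\t' || b == '\n') {
                break;
            }
            i++;
        }
        return i;
    }
}
```

Scanning raw bytes for tab and newline is safe in UTF-8 because those byte values never occur inside a multi-byte sequence. Per record you would then keep only an `int` line offset (plus any parsed numbers) instead of several String objects.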