I need to optimize the loading and parsing of a CSV file (all strings). The best approach I know is a load-in-place algorithm, which I got working through JNI and a C++ DLL that loads the data directly from a file built from the already-parsed CSV data.
That would have been fine if it stopped there, but the scheme only made loading 15% faster (the data no longer has to be parsed). One of the reasons it is not as fast as I expected is that the Java client uses jstring, so I still need to convert the data from char* to jstring.
The best would be to skip that conversion step and load the data in place directly into the jstring objects. Instead of duplicating the loaded-in-place data, the jstring would point directly into that chunk of memory (note that the data would be made of jchars instead of chars). The real downside is that we would need to make sure the garbage collector doesn’t collect that data (by keeping a reference to it, maybe?), but it should be feasible... no?
I think I have two options to do that:
1- Load the data in Java (no more JNI) and use chars that point into the loaded data to create the strings, but I need to find a way to avoid duplicating the data when creating a String.
2- Continue using JNI to “manually” create and set the jstring variable, and make sure the garbage collector options are set so that it leaves the data alone. For instance:
jstring str;
str.data = loadedinplacedata; // assign data pointer
return str;
I’m not sure if that’s possible, but I wouldn’t mind just saving the jstring directly into the file and reloading it like this:
jstring * str = (jstring *)&loadedinplacedata[someoffset];
return * str;
I’m aware that this is not the usual Java thing, but I’m pretty sure Java is extensible enough to be able to do that. And it’s not like I really have a choice in the matter… the project is already 3 years old and it needs to work. =S
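To make option 1 a bit more concrete, here is a minimal Java sketch of reading cells without duplicating them. As far as I know, every public String constructor copies its characters, so the sketch hands out CharSequence views (CharBuffer slices) over a memory-mapped file instead of String objects. The class name MappedCells, the little-endian UTF-16 encoding and the use of char offsets are assumptions made for the illustration, not part of my actual file format.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: exposes \0-terminated UTF-16 cells as views, not copies.
final class MappedCells {
    private final CharBuffer chars; // 16-bit view over the underlying bytes

    // Assumes the bytes hold little-endian UTF-16 text (an assumption of this sketch).
    MappedCells(ByteBuffer bytes) {
        this.chars = bytes.duplicate().order(ByteOrder.LITTLE_ENDIAN).asCharBuffer();
    }

    // Memory-maps the whole file; the mapping stays valid after the channel is closed.
    static MappedCells map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return new MappedCells(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        }
    }

    // Returns a view of the cell starting at charOffset, up to (not including) the \0.
    CharSequence cellAt(int charOffset) {
        int end = charOffset;
        while (end < chars.limit() && chars.get(end) != '\0') {
            end++;
        }
        return chars.subSequence(charOffset, end); // shares storage with the mapping
    }
}
```

The catch is that any client code that insists on a real java.lang.String still has to call toString() on the view, which copies the characters again, so this only helps where the client can work with CharSequence.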
This is the JNI code (C++):
const jchar * data = GetData(id, row, col); // pointer to the cell's string, terminated by \0
// wcslen only gives the right length where wchar_t is 16 bits wide (true for a Windows DLL).
jsize len = (jsize)wcslen( (const wchar_t*)data );
// The best would be to prevent this function from duplicating the data.
jstring str = env->NewString( data, len );
return str;
Note: The code above made it 20% faster (instead of 15%) by using UTF-16 data instead of UTF-8 (NewString instead of NewStringUTF). This shows that if I can remove or optimize that step, I’d get quite a good performance increase.
Well... it seems that what I wanted to do is not “supported” by Java unless I hack it. I believe it would be possible by using GetStringCritical to get the address of the actual char array and then working out the number of characters and so on, but this is way beyond “safe” programming.
The best workaround I found was to create a hash table in Java keyed by a unique identifier computed while creating my data file (acting similarly to .intern()). If the string is not in the hash table, I query it through the DLL and save it in the hash table.
data file:
numrow,numcols,
for each cell, add an integer value (in my case the offset in memory pointing to the string)
for each cell, add string ending with \0
By using the offset value, I can somewhat reduce the number of string creations and string queries. I also tried using a global reference to keep the string inside the DLL, but that made it 4 times slower.
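The Java side of that workaround can be sketched roughly as follows. This is a simplified illustration, not my exact code: the class name CellStringCache is made up, and the loader function is a hypothetical stand-in for the native method that asks the DLL for the string at a given offset.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical cache: one String per distinct offset, created on first use.
final class CellStringCache {
    private final Map<Integer, String> byOffset = new HashMap<>();
    private final IntFunction<String> loader; // stand-in for the JNI call into the DLL

    CellStringCache(IntFunction<String> loader) {
        this.loader = loader;
    }

    // Returns the cached String for this offset, querying the loader only on a miss.
    String get(int offset) {
        return byOffset.computeIfAbsent(offset, off -> loader.apply(off));
    }
}
```

With a native method such as `private static native String queryCell(int offset);` (name assumed), the cache would be built as `new CellStringCache(MyClass::queryCell)`. Cells that share an offset then cost one JNI call and one String creation in total, which is where the saving comes from.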