I have a massive text file of strings ordered by line length descending. I would like to load the entire thing into a string array, perform Levenshtein on each one, create a group UUID and put that into an array. So the second array would be a hashtable where the key is the memory address of the former string and the value is a UUID.
I would like to perform pointer arithmetic when iterating over the strings to get the best performance.
After running Levenshtein gazillions of times, I would like to write another text file in which each line is simply the UUID of the group, a colon, and the line from the original text file.
I have the levenshtein algorithm from wikibooks:
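    #include <algorithm>  // std::min
    #include <cstddef>    // std::size_t
    #include <vector>

    // Levenshtein (edit) distance using two rolling rows; T needs size() and operator[].
    template<class T> unsigned int levenshtein_distance(const T &s1, const T &s2) {
        const std::size_t len1 = s1.size(), len2 = s2.size();
        std::vector<unsigned int> col(len2 + 1), prevCol(len2 + 1);
        // Distance from the empty prefix of s1 to each prefix of s2.
        for (unsigned int i = 0; i < prevCol.size(); i++)
            prevCol[i] = i;
        for (unsigned int i = 0; i < len1; i++) {
            col[0] = i + 1;
            for (unsigned int j = 0; j < len2; j++)
                col[j + 1] = std::min(std::min(1 + col[j], 1 + prevCol[1 + j]),  // insert / delete
                                      prevCol[j] + (s1[i] == s2[j] ? 0 : 1));   // substitute / match
            col.swap(prevCol);
        }
        return prevCol[len2];
    }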
I have done some C++, some C, loads of Obj-C. I’m using Windows 7. How do you recommend I do this? What kind of string array? How do I convert text strings from a text file to be consumed by the function provided?
I’m basically looking for as many tips as possible, as strings confuse me in C++. Oh and C++ does too!
thanks
For sheer access time, you would be hard pressed to beat a full read into memory followed by a single indexing pass that builds a pointer list and hard-writes a null terminator at each CR/LF you encounter. The line number would be the index into the container holding all those pointers, and for that I’d likely use std::deque<>. The boost:: guys will likely carry this further, but for quick access it’s hard to beat a big ol’ stack of memory and a raft of pointers indexing it. Of course, this entire thing assumes you can fit it into memory. If you can’t, this gets significantly more complicated, but if you can (and can assume you always can), malloc/walk-and-terminate/push-ptr-into-deque seems pretty clean.
To truly make it smoke, I’d also store the length of each string with the pointer, so your std::deque<> would be of struct { char* ptr; size_t len; }. Doing so would eliminate a copious number of unneeded strlen() calls and such. It would also eliminate the need to null-terminate anything.
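A rough sketch of that read-then-index approach, in case it helps. The names LineRef, read_whole_file and index_lines are made up for illustration, and it assumes the whole file fits in memory. It uses a std::string as the buffer rather than a raw malloc block, and it stores lengths instead of writing null terminators, so the buffer is never modified.

```cpp
#include <cstddef>
#include <deque>
#include <fstream>
#include <iterator>
#include <string>

// One indexed line: a pointer into the shared buffer plus its length.
// (Hypothetical name; the buffer must outlive every LineRef that points into it.)
struct LineRef {
    const char* ptr;
    std::size_t len;
};

// Read a whole file into one contiguous buffer. Binary mode keeps CR/LF bytes
// untouched. Returns an empty string if the file cannot be opened.
std::string read_whole_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Single pass over the buffer: record (pointer, length) for every line.
// The deque index is the zero-based line number.
std::deque<LineRef> index_lines(const std::string& buffer) {
    std::deque<LineRef> lines;
    const char* p = buffer.data();
    const char* const end = p + buffer.size();
    while (p < end) {
        const char* eol = p;
        while (eol < end && *eol != '\n') ++eol;   // walk to LF or end of buffer
        std::size_t len = static_cast<std::size_t>(eol - p);
        if (len > 0 && p[len - 1] == '\r') --len;  // drop the CR of a CR/LF pair
        lines.push_back(LineRef{p, len});
        p = (eol < end) ? eol + 1 : end;           // skip past the LF
    }
    return lines;
}
```

To feed a line to a template like the levenshtein_distance in the question, anything with size() and operator[] should do, so std::string(ref.ptr, ref.len) works at the cost of a copy. A small non-owning wrapper around ptr and len (or std::string_view, if your compiler has it) would avoid that copy.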