Question: What is the fastest method to convert a 10 GB BYTE array to a standard string with hex format in Visual C++?
What I am doing: I am using std::fread(…) to read a very large file into a large buffer and then formatting it to hex format and then converting it to std::string. I hope I make sense.
I am currently using this piece of code (not written by me…) which is slow.
std::string ByteToHexFormatStdStr( __in ::BYTE *ByteArray, __in int ArraySize, __in bool AddSpace )
{
    ::BYTE Byte = NULL;
    const char HexCharacters[ 16 ] = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' };
    std::string Return = "";
    for( ::UINT Index = 0; Index < ArraySize; ++ Index )
    {
        Byte = ( ::BYTE )( ByteArray[ Index ] & 0xF0 );
        Byte = ( ::BYTE )( Byte >> 4 );
        Byte = ( ::BYTE )( Byte & 0x0F );
        Return += HexCharacters[ ( int )Byte ];
        Byte = ( ::BYTE )( ByteArray[ Index ] & 0x0F );
        Return += HexCharacters[ ( int )Byte ];
        if( AddSpace ) Return += ' ';
    }
    return ( Return );
}
The problem here is unlikely to be in the routine that converts the data to hexadecimal.
The problem is almost certainly that you’re just using way too much memory. Each byte of input becomes two bytes of hexadecimal. If you add spaces between them, that makes three bytes of output for each one of input.
If you’re starting with 10 gigabytes of input, that means you’re producing 20 or 30 gigabytes of output. Since you’re expanding your destination string incrementally, chances are good that it’s going to resize its buffer and copy the data several times before it gets to the full 30 gigabytes. During a resize/copy operation, it needs memory space for the old copy and the new one simultaneously. Depending on what factor it uses when it resizes, chances are good that you’re using (or trying to use) somewhere around 60 gigabytes of RAM. Unless you actually have at least 64 gigabytes of physical RAM, that’s almost certainly going to be quite slow.
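If you do keep everything in memory, the repeated resize/copy at least can be avoided by sizing the destination once up front. The following is only a sketch of that idea: BytesToHex is a hypothetical name, and it takes a std::size_t length because an int cannot describe a 10 gigabyte buffer. It produces the same text as the routine in the question, including the trailing space when spaces are requested.

```cpp
#include <cstddef>
#include <string>

// Hypothetical replacement for ByteToHexFormatStdStr (name and signature are
// assumptions). The destination is allocated once, so the string never has to
// reallocate and copy itself while it grows.
std::string BytesToHex(const unsigned char* data, std::size_t size, bool addSpace)
{
    static const char digits[] = "0123456789ABCDEF";
    const std::size_t perByte = addSpace ? 3 : 2;  // "AB " or "AB"

    // Pre-fill with spaces: the separator positions are then already correct.
    std::string out(size * perByte, ' ');

    std::size_t pos = 0;
    for (std::size_t i = 0; i < size; ++i) {
        out[pos]     = digits[data[i] >> 4];    // high nibble
        out[pos + 1] = digits[data[i] & 0x0F];  // low nibble
        pos += perByte;                         // steps over the space, if any
    }
    return out;
}
```

This removes the transient old-copy-plus-new-copy cost, but the result still needs the full 20 or 30 gigabytes, so it only helps if that much physical RAM is actually available.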
Chances are pretty good that you’d be better off doing the processing by reading from one file and writing to another. In fairness, this still isn’t going to be extremely fast unless you have really fast hard drives, and preferably two of them, so you can read from one and write to the other.
Unless you do have that 64 gigabytes of physical RAM, processing from file to file will still almost certainly be faster than using virtual memory though.
For the equivalent of your AddSpace being true, change the second parameter of the std::ostream_iterator (its delimiter) from "" to " ".
For files this large, you might want to do your own file handling though. Since you’re apparently running on Windows, you can probably gain quite a bit by using CreateFile directly and specifying FILE_FLAG_NO_BUFFERING to avoid thrashing the cache as you do this. Read in chunks of, say, 4 megabytes or so, transform each chunk into another buffer, and write out the result. If you have two (or more) disks so you can read from one as you write to the other, you could also consider using overlapped I/O to allow reading from one file, writing to the other, and processing to happen simultaneously. If you’re only using one disk, that would still allow processing and I/O to happen in parallel, but the processing will be enough faster than the I/O that it probably won’t gain enough to justify the effort.