I need a unique ID for a document read into a C++ program that will carry into a database. The ID needs to be the same regardless of whether the document it is tied to is run through the program first, by itself, or in the middle of a stack of other documents, so that I can honor overwrites of the document in the database.
I considered using the ASCII value of the document name such as
Employee Spec Page.doc 358
but it has the same value as
Answer Warnings.doc 358
which means that when I run the second doc through my program, it overwrites the first doc.
The ID must be a number and must be unique, but it also has to be consistently regeneratable without cross-referencing the database itself (since this program runs separately from the database import program).
Hoping someone has some ideas because I’m stumped.
EDIT: I tried to use MD5 to convert “Employee Spec Page.doc” and “Answer Warnings.doc” and got the following char representations:
Answer Warnings: 2dcb2503c48f5472bfdbafe28d565a9d
Employee Spec Page: a9be4c1428c11b406072c0bd3dab2dee
However, when I then convert the char* into an unsigned int:
char* docID = md5.digestString(pDocument->m_csDocumentName.GetBuffer());
pDocument->m_csDocID.Format("%i",(unsigned int)docID);
I get both being:
Answer Warnings: 1634456
Employee Spec Page: 1634456
I got the md5 class from here: http://bobobobo.wordpress.com/2010/10/17/md5-c-implementation/
What am I doing wrong? I need it to be an integer or else I won’t be able to store the ID in the database.
What you need is a hash function that generates a number big enough to avoid collisions. MD5 (as piokuc mentioned above) should be OK.
You can generate shorter keys by simply truncating the MD5 result, but be aware that this increases the chance of collisions. 128 bits gives more than 10^38 different keys, 64 bits more than 10^19, and 32 bits more than 10^9 (4,294,967,296). So with 32 bits, a collision between two specific documents is about as likely as winning a lottery, but for 10,000 documents there is already a chance of about 1% of at least one collision. Which key length is acceptable depends on your requirements. You can of course also implement collision detection and collision resolution.
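To check the 1% figure for your own document counts, here is a rough sketch using the usual birthday approximation, p ≈ 1 − exp(−n(n−1)/2N), where N is the number of possible keys. The function name collisionProbability is made up for this example, and it assumes the truncated hash behaves like a uniformly random value.

```cpp
#include <cmath>
#include <cstdint>

// Approximate probability that at least two of `n` documents share an ID
// when IDs are spread uniformly over 2^bits values (birthday approximation).
double collisionProbability(std::uint64_t n, unsigned bits) {
    const double space = std::ldexp(1.0, static_cast<int>(bits)); // 2^bits
    const double pairs = 0.5 * static_cast<double>(n) * static_cast<double>(n - 1);
    return -std::expm1(-pairs / space); // equals 1 - exp(-pairs / space)
}
```

Calling collisionProbability(10000, 32) gives a bit over 1%, while the same call with 64 bits is vanishingly small, which is why keeping 64 bits of the digest is usually the more comfortable choice if the database column allows it.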
If your 'database' allows only a short key, you have to implement collision resolution. For an idea of how to do that, see the Collision resolution section of the Wikipedia article on hash tables.
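As a minimal sketch of one such strategy (linear probing), assuming you can keep a table of the IDs already handed out: IdRegistry and assign are hypothetical names for this example and not part of any library.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical registry: remembers which document name owns each assigned ID.
class IdRegistry {
public:
    // Returns the ID for `name`, starting at `hashed` and trying
    // hashed+1, hashed+2, ... until a free ID or the same name is found.
    std::uint32_t assign(const std::string& name, std::uint32_t hashed) {
        std::uint32_t id = hashed;
        for (;;) {
            auto it = owner_.find(id);
            if (it == owner_.end()) {   // free slot: claim it
                owner_.emplace(id, name);
                return id;
            }
            if (it->second == name) {   // same document seen before
                return id;
            }
            ++id;                       // taken by another document; wraps at 2^32
        }
    }

private:
    std::unordered_map<std::uint32_t, std::string> owner_;
};
```

Note that the resolved ID then depends on which colliding document was registered first, so the table has to be stored and shared between runs. That conflicts with the requirement of not consulting the database, which is another reason to prefer a key long enough that resolution is never needed in practice.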
From Wikipedia: '10^−18 to 10^−15 is the uncorrectable bit error rate of a typical hard disk. In theory, MD5 hashes or UUIDs, being 128 bits, should stay within that range until about 820 billion documents.'
As for your specific library:
if you look into the md5 header file, there is
so you can retrieve the binary digest any time
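If getting at the binary digest in that class is inconvenient, the hex string you already print can be converted instead. The cast in the question, (unsigned int)docID, appears to convert the pointer (the address of the buffer) rather than the text it points to, which would explain why both documents end up with the same number. A minimal sketch, assuming the digest is the 32-character hex string shown in the question; idFromHexDigest is a name made up for this example:

```cpp
#include <cstdint>
#include <string>

// Interprets the first 16 hex characters (64 bits) of an MD5 hex digest
// as an unsigned integer. Assumes the input really is a hex string;
// std::stoull throws std::invalid_argument if nothing can be parsed.
std::uint64_t idFromHexDigest(const std::string& hexDigest) {
    return std::stoull(hexDigest.substr(0, 16), nullptr, 16);
}
```

With the two digests from the question this yields 0x2dcb2503c48f5472 and 0xa9be4c1428c11b40, which are different. If the database column is a signed 64-bit integer, you would probably want to clear the top bit or keep fewer hex characters so the value fits.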