I posted yesterday about computing the similarity between two files as a percentage, based on the number of words that appear in one file but not the other. That was a poor way of doing the job, so I thought a better way would be to make an MD5 or CRC checksum of both files and compute the difference from that. Making the checksum is the easy part, but I'm unsure how to go about determining the difference. I know getting the percentage goes something along these lines:
double sameWordPercentage = (1.0 * n / m) * 100;
Console.WriteLine(Math.Round(sameWordPercentage, 2) + "% Similar");
Thanks for any help. I just don't have a clear picture of how I'm going to do this, so maybe some pseudocode would help as well.
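Both MD5 and CRC output very different results for similar inputs (for a cryptographic hash like MD5, that's by design), so comparing the checksums only tells you whether the files are identical, not how similar they are.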
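To see this for yourself, here is a minimal sketch that hashes two nearly identical strings with MD5 and counts how many hex digits line up. The class and method names (`Md5Demo`, `Md5Hex`, `MatchingHexDigits`) are just made up for illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class Md5Demo
{
    // Returns the MD5 digest of a string (UTF-8 encoded) as 32 lowercase hex digits.
    public static string Md5Hex(string text)
    {
        using (var md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(text));
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // Counts the positions at which two equal-length hex strings hold the same digit.
    public static int MatchingHexDigits(string a, string b)
    {
        int matches = 0;
        for (int i = 0; i < a.Length && i < b.Length; i++)
        {
            if (a[i] == b[i]) matches++;
        }
        return matches;
    }
}
```

If you call `Md5Demo.Md5Hex` on "The quick brown fox jumps over the lazy dog" and on the same sentence with a trailing period, the two digests have nothing visibly in common. `MatchingHexDigits` will typically report only a handful of matches out of 32, about what you'd get from two unrelated inputs, so there is no percentage to extract from it.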
I think you'd be better off looking at a locality-sensitive hashing algorithm such as MinHash, as recommended in this question.
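Since you asked for pseudocode, here is a rough MinHash sketch in C#. It is only an illustration under some assumptions of mine: files are treated as sets of lowercased, whitespace-separated words, each word is hashed with FNV-1a, and the hash family is the usual (a*x + b) mod p with random a and b. All names (`MinHashSketch`, `Tokenize`, `Signature`, `EstimateSimilarity`) and the parameters (128 hashes, seed 42) are hypothetical choices, not taken from any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class MinHashSketch
{
    // 2^31 - 1, a Mersenne prime used as the modulus of the hash family.
    const long Prime = 2147483647;

    // Splits text into lowercased words on whitespace (an assumed tokenization).
    public static string[] Tokenize(string text)
    {
        return text.ToLowerInvariant()
                   .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Deterministic 32-bit FNV-1a hash; string.GetHashCode can vary between runs.
    static uint Fnv1a(string s)
    {
        uint hash = 2166136261u;
        foreach (byte b in Encoding.UTF8.GetBytes(s))
        {
            hash ^= b;
            hash *= 16777619u; // wraps around on overflow (unchecked by default)
        }
        return hash;
    }

    // Builds the signature: for each of numHashes hash functions,
    // keep the smallest hash value seen over all distinct words.
    // Both files must use the same numHashes and seed to be comparable.
    public static long[] Signature(IEnumerable<string> words, int numHashes, int seed)
    {
        var rng = new Random(seed);
        var a = new long[numHashes];
        var b = new long[numHashes];
        for (int i = 0; i < numHashes; i++)
        {
            a[i] = rng.Next(1, int.MaxValue); // 1 .. Prime - 1
            b[i] = rng.Next(0, int.MaxValue); // 0 .. Prime - 1
        }

        long[] signature = Enumerable.Repeat(long.MaxValue, numHashes).ToArray();
        foreach (string word in new HashSet<string>(words))
        {
            long x = Fnv1a(word) % Prime;
            for (int i = 0; i < numHashes; i++)
            {
                long h = (a[i] * x + b[i]) % Prime;
                if (h < signature[i]) signature[i] = h;
            }
        }
        return signature;
    }

    // The fraction of positions where the signatures agree
    // estimates the Jaccard similarity of the two word sets.
    public static double EstimateSimilarity(long[] sigA, long[] sigB)
    {
        int equal = 0;
        for (int i = 0; i < sigA.Length; i++)
        {
            if (sigA[i] == sigB[i]) equal++;
        }
        return (double)equal / sigA.Length;
    }
}
```

To get your percentage, build a signature for each file with the same settings, for example `MinHashSketch.Signature(MinHashSketch.Tokenize(text), 128, 42)`, and then print `Math.Round(MinHashSketch.EstimateSimilarity(sigA, sigB) * 100, 2) + "% Similar"`. The result is an estimate, and more hash functions make it steadier at the cost of more work. Note that two empty files come out as 100% similar with this sketch.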