We are trying to use Python to reproduce a hash that was generated by Oracle’s MD5 hash algorithm. According to their forums, everything is converted to AL32UTF8 prior to hashing:
-- Prior to encryption, hashing or keyed hashing, CLOB datatype is
-- converted to AL32UTF8. This allows cryptographic data to be
-- transferred and understood between databases with different
-- character sets, across character set changes and between
-- separate processes (for example, Java programs).
--
I thought at first that UTF-8 was good enough, but when I use it, my hashes still don’t match. After additional digging, I found this article, which quotes Oracle’s Database Companion CD Installation Guide:
AL32UTF8 is the Oracle Database character set that is appropriate for XMLType data. It is equivalent to the IANA registered standard UTF-8 encoding, which supports all valid XML characters.
Do not confuse the Oracle Database database character set UTF8 (no hyphen) with the database character set AL32UTF8 or with character encoding UTF-8. Database character set UTF8 has been superseded by AL32UTF8. Do not use UTF8 for XML data. UTF8 supports only Unicode version 3.1 and earlier; it does not support all valid XML characters. AL32UTF8 has no such limitation.
So it looks like I can’t use plain UTF8 (no hyphen), and I can’t figure out how to get Python’s codecs module to differentiate between utf-8 and utf8. If I try AL32UTF8, it throws an error. Has anyone else ever encoded in AL32UTF8 in Python?
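For what it's worth, a quick check suggests that Python's codecs module treats "utf8" and "utf-8" as two names for the same standard UTF-8 codec, which is what the quoted guide says AL32UTF8 is equivalent to. The sketch below also shows that pure-ASCII text encodes to the same bytes under either encoding, so the codec name alone should not change the MD5. The sample line is the last record of the file shown further down, and md5_hex is a made-up helper name.

```python
import codecs
import hashlib

# Both spellings resolve to the same codec in Python.
same_codec = codecs.lookup("utf8").name == codecs.lookup("utf-8").name

# Oracle's character set name is not a Python codec name.
try:
    codecs.lookup("AL32UTF8")
    al32_known = True
except LookupError:
    al32_known = False

def md5_hex(text, encoding):
    """Hypothetical helper: encode text, then return its MD5 hex digest."""
    return hashlib.md5(text.encode(encoding)).hexdigest()

sample = u"WC999|1000000000|1000000000|4000000000|"
ascii_digest = md5_hex(sample, "ascii")
utf8_digest = md5_hex(sample, "utf-8")

print(same_codec)                   # True
print(al32_known)                   # False
print(ascii_digest == utf8_digest)  # True
```

If the data really is plain ASCII, this points away from the encoding and toward the bytes of the file itself as the source of the mismatch.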
My codecs code looks like this:
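import codecs

sourceFmt = "ascii"
targetFmt = "utf8"
utfFile = "kesa_utf8.dat"
# `old` holds the path to the source file (defined elsewhere)
with codecs.open(old, "r", sourceFmt) as sourceFile:
    with codecs.open(utfFile, "w", targetFmt) as targetFile:
        # Decode the source as ASCII and re-encode it as UTF-8
        targetFile.write(sourceFile.read())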
The file itself looks like this:
WC000|IC |KESA |KESA | | | |2012-07-31-15.12.36 |0090| | |\c\n
WC001|100534 |W.47212-0100534 |2012-07-31-15.12.36 | 00000000001270.00|USD|\c\n
WC002|100534 |W.47212-0100534 |Sally |H |Klass |1235 14th St. W. || |Palma Sola ||FL |USA |34209 | | | | | | | | |9412587545 | | |O | | ||20800426|645858741 |SSN | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | || | | | || | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |KESAPC | | | | | |N| | | || | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |\c\n
WC999|1000000000|1000000000|4000000000|
The hash should be 86D993FA7121E3B9EE1657A23345FE21
Anyway, I hash it using hashlib:
import hashlib

# `path` is the file to hash; read raw bytes so nothing gets translated
with open(path, "rb") as f:
    data = f.read()
mdhash = hashlib.md5(data).hexdigest()
print(mdhash)
which results in 8421877dd9cdf7235eec47765821998c
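One small thing to rule out when comparing: the expected hash above is written in uppercase hex, while hexdigest() returns lowercase, so the comparison should ignore case. A minimal sketch, assuming the file is hashed as raw bytes; file_md5 and matches are hypothetical names, not part of hashlib.

```python
import hashlib

def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of the file's raw bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as f:  # binary mode: no newline translation
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches(path, expected_hex):
    """Compare against an expected digest, ignoring hex case."""
    return file_md5(path) == expected_hex.strip().lower()
```

For example, matches("kesa_utf8.dat", "86D993FA7121E3B9EE1657A23345FE21") would report whether the file on disk reproduces the expected hash.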
It turns out that whatever the client was doing changed the data AFTER they hashed it: the file ended up with “\c\n” line endings, and every line was padded with trailing spaces so that all lines were the same length. Once we got the client to stop feeding us bad data, we were able to replicate the hash. Thanks for the help though!
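To illustrate why the altered file could never match: MD5 is computed over the exact bytes, so trailing padding or a different line terminator produces a different digest. The sketch below uses shortened, made-up records and assumes, purely for illustration, that the data as hashed used "\n" terminators with no trailing spaces. strip_padding is a hypothetical helper, and the terminator is a parameter because the exact bytes behind “\c\n” are not clear from the file dump.

```python
import hashlib

def md5_hex(data):
    """Return the MD5 hex digest of a bytes object."""
    return hashlib.md5(data).hexdigest()

def strip_padding(data, terminator=b"\r\n"):
    """Hypothetical cleanup: drop trailing spaces and use "\\n" line endings."""
    lines = data.split(terminator)
    return b"\n".join(line.rstrip(b" ") for line in lines)

# Made-up sample records, not the real file contents.
original = b"WC000|IC|\nWC999|1000000000|\n"
padded = b"WC000|IC|         \r\nWC999|1000000000|\r\n"

print(md5_hex(original) == md5_hex(padded))                 # False
print(md5_hex(original) == md5_hex(strip_padding(padded)))  # True
```

In our case the fix was on the client side, but a cleanup step like this might help confirm a similar diagnosis before asking a sender to change their export.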