So I have about 4,000 Word docs that I’m attempting to extract the text from and insert into a db table. This works swimmingly until the processor encounters a document that has the *.doc file extension but is actually an RTF file. Now, I know POI doesn’t support RTFs, which is fine, but I do need a way to determine if a *.doc file is actually an RTF so that I can choose to ignore the file and continue processing.
I’ve tried several techniques to overcome this, including ColdFusion’s MimeTypeUtils. However, it seems to base its guess at the MIME type on the file extension, and it still classifies the RTF as application/msword. Is there any other way to determine if a *.doc is an RTF? Any help would be hugely appreciated.
With CF8 and compatible:
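A minimal sketch of the check, relying on the fact that RTF files begin with the characters `{\rtf`. The variable names `filePath` and `isRtf` are made up for illustration, and this version reads the entire file into memory:

```cfm
<!--- Assumption: filePath (hypothetical name) holds the absolute path to the *.doc file --->
<cfset fileText = FileRead(filePath) />

<!--- RTF documents start with the literal characters {\rtf --->
<cfset isRtf = (Left(fileText, 5) EQ "{\rtf") />

<cfif isRtf>
    <!--- Skip this file and carry on with the next one --->
<cfelse>
    <!--- Hand the file to POI as usual --->
</cfif>
```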
For earlier versions:
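Roughly the same check using the `cffile` tag, which is available in older versions. Again, `filePath`, `fileText` and `isRtf` are illustrative names rather than anything required:

```cfm
<!--- Assumption: filePath (hypothetical name) holds the absolute path to the *.doc file --->
<cffile action="read" file="#filePath#" variable="fileText" />

<!--- RTF documents start with the literal characters {\rtf --->
<cfset isRtf = (Left(fileText, 5) EQ "{\rtf") />
```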
Update: A better CF8/compatible answer. To avoid loading the whole file into memory, you can do the following to load just the first few characters:
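Something along these lines should do it, using the file-object form of `FileRead`, which takes a buffer size. `filePath`, `fileObj`, `header` and `isRtf` are illustrative names, and error handling is left out to keep the sketch short:

```cfm
<!--- Assumption: filePath (hypothetical name) holds the absolute path to the *.doc file --->
<cfset fileObj = FileOpen(filePath, "read") />

<!--- Read only the first five characters instead of the whole file --->
<cfset header = FileRead(fileObj, 5) />
<cfset FileClose(fileObj) />

<!--- RTF documents start with the literal characters {\rtf --->
<cfset isRtf = (header EQ "{\rtf") />
```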
Based on the comments:
Here’s a very quick way you might write a generic ‘what format is this’ type of function. Not perfect, but it gives you the idea…
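As a rough sketch only: the function name `guessFileFormat`, its return values, and the handful of signatures checked are all picked for illustration, and it only covers formats whose header is plain text. A binary Word document has a binary header, so it would come back as `unknown` here.

```cfm
<cffunction name="guessFileFormat" returntype="string" output="false">
    <cfargument name="filePath" type="string" required="true" />

    <!--- Read just the first few characters of the file --->
    <cfset var fileObj = FileOpen(arguments.filePath, "read") />
    <cfset var header = FileRead(fileObj, 8) />
    <cfset FileClose(fileObj) />

    <!--- Compare() is case-sensitive and returns 0 when the strings match --->
    <cfif Compare(Left(header, 5), "{\rtf") EQ 0>
        <cfreturn "rtf" />
    <cfelseif Compare(Left(header, 4), "%PDF") EQ 0>
        <cfreturn "pdf" />
    <cfelseif Compare(Left(header, 4), "GIF8") EQ 0>
        <cfreturn "gif" />
    <cfelseif Compare(Left(header, 2), "PK") EQ 0>
        <!--- ZIP container, which also covers zip-based formats such as docx --->
        <cfreturn "zip" />
    </cfif>

    <cfreturn "unknown" />
</cffunction>
```

You would then call it with something like `<cfif guessFileFormat(filePath) EQ "rtf">` and skip the file when it matches.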
Of course, it’s worth pointing out that none of this will work on ‘headerless’ formats, including many common text-based ones (CFM, CSS, JS, etc.).