I have some .doc binary files stored in my database and i would like to now search them all (without converting them to .doc) to see which one contains the word “hello” for instance.
Is there any way to do this search in the binary file?
Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.
Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.
Lost your password? Please enter your email address. You will receive a link and will create a new password via email.
Please briefly explain why you feel this question should be reported.
Please briefly explain why you feel this answer should be reported.
Please briefly explain why you feel this user should be reported.
Not without a lot of pain, as far as I can tell. According to Wikipedia, Microsoft has within the past few years finally released the .doc specification. So you could create a parser based on the spec if you have the time, assuming all of your documents are in the same version of the .doc format.
Of course you could just search for the text you’re looking for amid all the binary data, on the assumption that the actual text is stored as plain text. But even if that assumption were true, how could you be sure that the plain text you found was the actual document text, and not some of the document meta data that’s also stored in plain text? And there’s always the off chance that the binary data will match your text pattern.
If the Word libraries are available to you, I would go that route. If not, a homegrown parser may be your least bad option.