I am after a general-purpose, unspecialised extractor that pulls plain text out of arbitrary files.
First, before people shout "look at Apache Tika": my response is that it only supports some popular binary file formats, such as Office documents, BMPs, etc.
Back to the problem: many binary files have text strings embedded in them, which I would like to extract without the surrounding binary byte noise. This would mean it could find simple text string sequences in exes and so on, with the result holding only ASCII words. I tried googling but could not find anything that did this. My basic idea is that if a file is not handled by Tika, this simple binary file handler would try its best to find these text strings.
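To make the idea concrete, here is a minimal sketch of that kind of fallback handler. It is an illustration only: the class name AsciiStringExtractor, the method extract and the minimum run length of 4 are my own assumptions. It scans the bytes and keeps every run of printable ASCII (0x20 to 0x7E) that is at least minLength long, discarding everything else.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical name; a sketch of the idea, not a finished handler.
public class AsciiStringExtractor {

    /** Returns every run of printable ASCII bytes that is at least minLength long. */
    public static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            // Java bytes are signed, so values above 0x7F are negative and fail this test.
            if (b >= 0x20 && b <= 0x7E) {
                run.append((char) b);
            } else {
                // A non-printable byte ends the current run; keep it only if long enough.
                if (run.length() >= minLength) {
                    found.add(run.toString());
                }
                run.setLength(0);
            }
        }
        // Flush a run that reaches the end of the data.
        if (run.length() >= minLength) {
            found.add(run.toString());
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Reads the whole file into memory, which is fine for a sketch but not for huge files.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (String s : extract(data, 4)) {
            System.out.println(s);
        }
    }
}
```

The minimum length is the main tuning knob: a short value lets through more random byte noise that happens to look printable, while a long value drops genuine short words. This sketch also only sees single-byte ASCII, so text stored as UTF-16 (common in Windows exes) would be missed.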
I ended up writing my own class to solve my problem.
Important features/considerations.