Detecting the MIME type of a file with PHP is trivial – just use PEAR’s MIME_Type package, PHP’s fileinfo or call file -i on a Unix machine.
This works really well for binary files and all others that have some kind of “magic bytes” through which they can be detected easily.
What I’m failing at is detecting the correct MIME type of plain text files:
- CSS
- Diff
- INI (configuration)
- JavaScript
- rST
- SQL
All of them are identified as “text/plain”, which is correct, but not specific enough for me. I need the real type, even if it costs some time to analyze the file content.
So my question: Which solutions exist to detect the MIME type of such plain text files? Any libraries? Code snippets?
Note that I neither have a filename nor a file extension, but I have the file content.
If I used Ruby, I could integrate GitHub’s linguist. Ohloh’s ohcount is written in C, but has a command line tool to detect the type: ohcount -d $file
What I’ve tried
ohcount
Detects XML and PHP files correctly, but none of the others.
Apache Tika
Detects XML and HTML; all other test files were only seen as text/plain.
Since I didn’t find a proper library, I wrote my own magic file that detects all of my test files properly.
My application first tries my custom magic file for detection and falls back to the normal/system magic file if no type is detected.
The code is on GitHub, see https://github.com/cweiske/MIME_Type_PlainDetect .
The magic file is at data/programming.magic and can be used with
file -m programming.magic /path/to/source
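The "custom rules first, generic detection as fallback" order described above can be sketched in a few lines. This is only an illustration in Python, not the PHP code from the repository. The regular expressions, the MIME type names and the function names (detect_custom, detect_generic, detect_mime) are assumptions made up for this example and are much cruder than real magic rules. detect_generic is a stand-in for a lookup against the system magic file.

```python
import re
from typing import Callable, Optional

# Hypothetical content rules standing in for a custom magic file.
# They are NOT the rules from data/programming.magic.
CUSTOM_RULES = [
    ("text/x-diff",
     re.compile(r"^diff --git |^--- .*\n\+\+\+ ", re.MULTILINE)),
    ("application/sql",
     re.compile(r"^\s*(CREATE TABLE|INSERT INTO|SELECT .+ FROM)\b",
                re.IGNORECASE | re.MULTILINE)),
    ("text/css",
     re.compile(r"^[^{}\n]+\{\s*[\w-]+\s*:[^{}]*\}", re.MULTILINE)),
]


def detect_custom(content: str) -> Optional[str]:
    """Return the first matching custom type, or None if no rule matches."""
    for mime, pattern in CUSTOM_RULES:
        if pattern.search(content):
            return mime
    return None


def detect_generic(content: str) -> str:
    """Stand-in for the system magic database (assumption for this sketch)."""
    return "text/plain"


def detect_mime(content: str,
                fallback: Callable[[str], str] = detect_generic) -> str:
    """Try the custom rules first; use the fallback only if none match."""
    return detect_custom(content) or fallback(content)


if __name__ == "__main__":
    print(detect_mime("body { color: red; }"))
    print(detect_mime("Just some prose."))
```

The fallback is passed in as a parameter so that the generic detector can be swapped for a real one, such as a wrapper around fileinfo or the file command, without touching the custom rules.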