I’m planning a .NET project that involves automated upload of files of widely varying types, from various distributed clients to a constellation of servers. Sometimes the file extension may not match the real file type (long story).
HTTP compression will not always be an option, and for this project it is preferable to spend client processing time rather than bandwidth or server storage. But it would be much better if we could skip the compression step whenever we can tell in advance that compressing the file would not give a worthwhile size reduction.
I know that there is no “right answer”, but we would appreciate any ideas.
Given what you say about extensions, I can see a couple of ways:
First: can you determine the type of the file without using the extension? Lots of file types have standard headers, so you could parse the header and determine whether the file is one of the dozen or so common file types you have implemented filters for.
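A minimal sketch of that header check, in Python for brevity (the same logic ports directly to .NET with a `FileStream` and a byte comparison). The signature table is a small illustrative sample of well-known magic numbers for formats that are already compressed, not a complete list, and the names `COMPRESSED_SIGNATURES`, `sniff_compressed_type` and `looks_already_compressed` are made up for this example:

```python
# Leading "magic" bytes of a few formats that are already compressed.
# Illustrative only: a real table would cover whatever types you expect.
COMPRESSED_SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip (also docx, xlsx, jar)",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"7z\xbc\xaf\x27\x1c": "7z",
    b"Rar!\x1a\x07": "rar",
}


def sniff_compressed_type(header: bytes):
    """Return the name of a known compressed format, or None if no match."""
    for magic, name in COMPRESSED_SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


def looks_already_compressed(path) -> bool:
    """Read only the first few bytes of the file and check them."""
    with open(path, "rb") as f:
        return sniff_compressed_type(f.read(16)) is not None
```

A file that matches could be uploaded as-is; anything that does not match would fall through to compression or to a second heuristic.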
Second: a simpler heuristic would be to grab, say, 100 bytes from the middle of the file and see whether they are standard ASCII, e.g. each byte has a value between 9 and 126. This will be wrong some percentage of the time, and it will not work for text in many languages or for Unicode text (UTF-16, or UTF-8 with non-ASCII characters).
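A rough sketch of that sampling heuristic, again in Python for brevity. The 100-byte sample size and the 9 to 126 range come straight from the description above; the function names are hypothetical, and treating an empty sample as "not text" is an assumption of this example:

```python
import os


def sample_looks_like_ascii(sample: bytes) -> bool:
    """True if every byte is in the 9..126 range (tab through '~')."""
    # An empty sample is treated as "not text": nothing worth compressing.
    return len(sample) > 0 and all(9 <= b <= 126 for b in sample)


def middle_sample(path, size: int = 100) -> bytes:
    """Read up to `size` bytes centred on the middle of the file."""
    length = os.path.getsize(path)
    start = max(0, (length - size) // 2)
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(size)


def probably_text(path, size: int = 100) -> bool:
    """Guess that the file is plain text and so likely to compress well."""
    return sample_looks_like_ascii(middle_sample(path, size))
```

Sampling the middle rather than the start avoids being fooled by a textual header in front of binary data, but a single sample can still land on an unrepresentative stretch of the file.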