I'm looking for articles, documentation, or first-hand knowledge of how different source control systems detect whether a file is binary or text. I'm particularly interested in how Git does it compared with Mercurial.
Do they look at:
File extensions?
File signatures or content (i.e., is this file UTF-8)?
A mix of things?
SVN:
When you first add or import a file into Subversion, the file is examined to determine if it is a binary file. Currently, Subversion just looks at the first 1024 bytes of the file; if any of the bytes are zero, or if more than 15% are not ASCII printing characters, then Subversion calls the file binary. This heuristic might be improved in the future, however.
http://subversion.apache.org/faq.html#binary-files
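To make that rule concrete, here is a rough Python sketch of the heuristic as the FAQ describes it. It is not Subversion's actual code. The function name is made up, and the FAQ does not say exactly which bytes count as "ASCII printing characters", so the set used here (0x20 through 0x7E plus tab, LF and CR) is an assumption, as is treating an empty file as text.

```python
SAMPLE_SIZE = 1024              # the FAQ says only the first 1024 bytes are examined
MAX_NON_PRINTING_RATIO = 0.15   # "more than 15%" non-printing means binary

# Assumption: "ASCII printing characters" means 0x20..0x7E plus tab, LF and CR.
# The FAQ does not spell out the exact set.
_TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def looks_binary_svn_style(data: bytes) -> bool:
    """Hypothetical helper that mimics the heuristic described in the SVN FAQ."""
    sample = data[:SAMPLE_SIZE]
    if not sample:
        # Assumption: an empty file is treated as text.
        return False
    if 0 in sample:
        # Any zero byte in the sample marks the file as binary.
        return True
    non_printing = sum(1 for b in sample if b not in _TEXT_BYTES)
    return non_printing / len(sample) > MAX_NON_PRINTING_RATIO
```

With this sketch, a file whose first 1024 bytes are plain ASCII is reported as text even if later bytes are zero, because only the sample is inspected.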
Git works in a similar way. It usually guesses correctly whether a blob contains text or binary data by examining the beginning of the contents: it checks for any occurrence of a zero byte (NUL “character”) in the first 8000 bytes.
http://git-scm.com/docs/gitattributes
And from the Git source:
http://git.kernel.org/?p=git/git.git;a=blob;f=xdiff-interface.c;h=0e2c169227ad29b5bf546c6c1b97e1a1d8ed7409;hb=HEAD
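The check described above is simple enough to sketch in a few lines of Python. This only illustrates the rule as quoted (a NUL byte in the first 8000 bytes), it is not a port of Git's C code, and the function and constant names here are my own.

```python
FIRST_BYTES_TO_CHECK = 8000  # sample size quoted from the gitattributes docs


def looks_binary_git_style(data: bytes) -> bool:
    """Hypothetical helper: binary if a NUL byte appears in the first 8000 bytes."""
    return b"\x00" in data[:FIRST_BYTES_TO_CHECK]
```

One consequence of a rule like this: text saved as UTF-16 is full of zero bytes, so `looks_binary_git_style("hi".encode("utf-16"))` returns True even though the content is text.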
And @tonfa makes a good point: “Also note that the only place where it cares about a file being text vs. binary is for displaying diffs and for doing merges. The storage format does not care about it.”