For the purposes of this example, suppose there are two binary files, A and B, each containing a variation of, say, a YouTube video, where:
- A contains a 5 second ad
- B contains no ad
- With the exception of the ad, A contains the same content as B
- Total length of file A is 60 seconds
- Total length of file B is 55 seconds
As a general rule, if we were to compare the bit patterns of the two files, would we arrive at the same conclusion, namely that the files contain 55 seconds' worth of common bits?
If we extend the problem further, say to two jars whose only difference is their comments, would it be appropriate to compare the order of the bits and, based on what we find, determine the degree of likeness?
It’s easy to determine whether files are identical or not. Will the approach of comparing bits help accurately determine the degree to which files are close to one another?
The question is not about video files specifically, but about binary files in general. I mention video files above for example purposes only.
It depends on the file format, but in your examples, no, probably not.
Video with and without initial ad: videos are usually encoded by breaking them into small time-blocks, and then encoding and compressing those blocks; if you insert an ad at the beginning, then you will most likely cause the block-transitions to happen at different time offsets within the main video.
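The block-boundary effect can be sketched with a toy model. This is only an illustration under stated assumptions: `zlib` stands in for a video codec, the 1000-byte `BLOCK_SIZE` and the helper `encode_in_blocks` are hypothetical, and real codecs are far more involved. The point is that an inserted prefix whose length is not a multiple of the block size shifts every later boundary, so no encoded block is shared between the two files.

```python
import random
import zlib
from typing import List

BLOCK_SIZE = 1000  # hypothetical block size, in bytes


def encode_in_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Split data into fixed-size blocks and compress each block on its own."""
    return [
        zlib.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


# Deterministic pseudo-random "main content" shared by both files.
rng = random.Random(0)
words = [b"frame", b"pixel", b"audio", b"motion", b"key", b"delta"]
main_content = b" ".join(rng.choice(words) for _ in range(2000))

# File B has no ad. File A has an ad whose length is not a multiple of BLOCK_SIZE.
ad = (b"buy now! " * 100)[:137]
blocks_b = encode_in_blocks(main_content)
blocks_a = encode_in_blocks(ad + main_content)
shared_blocks = len(set(blocks_a) & set(blocks_b))

# Control case: an ad of exactly one block leaves the later boundaries unchanged.
aligned_ad = (b"buy now! " * 200)[:BLOCK_SIZE]
blocks_a_aligned = encode_in_blocks(aligned_ad + main_content)
shared_when_aligned = len(set(blocks_a_aligned) & set(blocks_b))

print(f"blocks in B: {len(blocks_b)}")
print(f"shared with unaligned ad: {shared_blocks}")
print(f"shared with block-aligned ad: {shared_when_aligned}")
```

In the control case the ad happens to fill exactly one block, so every block of B reappears in A. With an arbitrary ad length that does not happen, which is the situation described above.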
Jar file with and without comments (or with different comments): same story; changing the length of a comment within a file will affect the splitting of the entire file into compressible blocks, so all blocks after an altered comment will be compressed differently. (This is, of course, assuming that the jar file actually includes the comments. Just because comments were in the source code, that doesn’t mean the jar file will have them; compiled class files normally do not contain source comments, so they would be present only if the jar also packages the source files.)
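A rough way to see the effect of compression is to measure similarity before and after compressing two inputs that differ only in a comment. The sketch below makes some assumptions: `zlib` (DEFLATE) stands in for the compression used inside a jar, the source text is made up, and `aligned_similarity` is a hypothetical helper built on `difflib.SequenceMatcher`, which aligns matching runs of bytes instead of comparing position by position.

```python
import difflib
import zlib


def aligned_similarity(a: bytes, b: bytes) -> float:
    """Similarity ratio in [0, 1] based on the longest matching byte runs."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


# Hypothetical source text: identical bodies, different leading comments.
body = b"".join(
    b"int field%d = compute(%d) + offset;\n" % (i, i * 7) for i in range(100)
)
source_a = b"// short note\n" + body
source_b = b"// a much longer, reworded comment about the fields below\n" + body

# Compress each variant as a whole, as an archive entry would be.
packed_a = zlib.compress(source_a)
packed_b = zlib.compress(source_b)

raw_similarity = aligned_similarity(source_a, source_b)
packed_similarity = aligned_similarity(packed_a, packed_b)

print(f"similarity before compression: {raw_similarity:.3f}")
print(f"similarity after compression:  {packed_similarity:.3f}")
```

The uncompressed inputs are almost identical, while the compressed bytes share far less. If the goal is a meaningful degree of likeness, it is usually more useful to decode or decompress both files first and compare the content they represent.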