This might be a broad question, but I would like to see answers and discuss this with SO users.
So far, my understanding is that an audio file (WAV) has a sample rate, usually 44100 or 48000 Hz (those are the two I've seen most). From that we can determine that a single second of the file (e.g. second 00:00:01) contains exactly 44100 integer values, i.e. an int[], so a file with a duration of 5 seconds holds 5 * 44100 = 220500 integers (samples).
So my question is: how can we calculate the difference (or similarity) of content between two time spans, for example Audio1.wav and Audio2.wav at 00:00:01, given the same sample rate?
There are a couple of assumptions in your reasoning:
1. The file is the raw uncompressed (PCM encoded) data.
2. There is only one channel (mono).
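Both assumptions can be checked from the file header before doing any arithmetic on sample counts. Here is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM and raises `wave.Error` for formats it does not understand. The function name `describe_wav` and the dictionary keys are made up for illustration.

```python
import wave


def describe_wav(path):
    """Return basic PCM parameters of a WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()      # 1 = mono, 2 = stereo
        sample_width = wf.getsampwidth()  # bytes per sample, e.g. 2 for 16-bit
        rate = wf.getframerate()          # frames per second, e.g. 44100
        frames = wf.getnframes()          # one frame = one sample per channel
    return {
        "channels": channels,
        "sample_width_bytes": sample_width,
        "sample_rate": rate,
        "frames": frames,
        "duration_s": frames / rate,
        # Total integer values in the file: frames times channels.
        "total_samples": frames * channels,
    }
```

For a stereo file, one second at 44100 Hz therefore holds 88200 integers, not 44100, and the two channels are interleaved in the raw data.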
It's better to start by reading some format descriptions and sample implementations, then search for audio comparison algorithms (1, 2, 3).
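As a starting point only, here is a rough sketch of the most naive comparison: read the same one-second window from both files and compute a sample-wise RMS difference and a cosine similarity. It assumes mono 16-bit PCM and equal sample rates, and the helper names (`read_window`, `rms_difference`, `cosine_similarity`) are invented for this example. Note that sample-wise measures are very sensitive to small time offsets and volume changes, so practical algorithms usually compare spectral features instead.

```python
import math
import struct
import wave


def read_window(path, start_s, length_s):
    """Read a window of a mono 16-bit PCM WAV file as a list of ints."""
    with wave.open(path, "rb") as wf:
        # Assumption of this sketch: mono, 16-bit samples.
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit PCM")
        rate = wf.getframerate()
        wf.setpos(int(start_s * rate))             # seek to the start frame
        raw = wf.readframes(int(length_s * rate))  # may be shorter near EOF
    # WAV stores 16-bit samples as little-endian signed integers.
    samples = list(struct.unpack("<%dh" % (len(raw) // 2), raw))
    return rate, samples


def rms_difference(a, b):
    """Root-mean-square of the sample-wise difference (0.0 means identical)."""
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("empty window")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / n)


def cosine_similarity(a, b):
    """Cosine of the angle between the two windows, in [-1, 1]."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a silent window has no direction to compare
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def compare_at(path1, path2, start_s=1.0, length_s=1.0):
    """Compare the same time span of two files; sample rates must match."""
    rate1, a = read_window(path1, start_s, length_s)
    rate2, b = read_window(path2, start_s, length_s)
    if rate1 != rate2:
        raise ValueError("sample rates differ; resample first")
    return rms_difference(a, b), cosine_similarity(a, b)
```

Calling `compare_at("Audio1.wav", "Audio2.wav")` would return both measures for the span starting at 00:00:01.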
Linked Q: Compare two spectogram to find the offset where they match algorithm