I’m using a shell script to automatically create a zipped backup of various directories every hour. If I haven’t worked on any of them for a while, this creates a lot of duplicate archives. The MD5 hashes of the archives don’t match, because they have different filenames, creation dates, etc.
Other than making sure there won’t be duplicates in the first place, another option is checking whether file sizes match, but a match doesn’t necessarily mean the archives are duplicates.
Filenames look like this:
Qt_2012-03-15_23_00.tgz
Qt_2012-03-16_00_00.tgz
So maybe it would be an option to check whether consecutive files have identical file sizes.
Pseudo code:
int previoussize = 0;
String previouspath = null;
String Filename = null;
String workDir = "/path/to/workDir";
String processedDir = "/path/to/processedDir";
//Loop over all files
for file in workDir
{
    //Match
    if(file.size() == previoussize)
    {
        if(previouspath!=null) //skip first loop
        {
            rm previouspath; //Delete file
        }
    }
    else //No Match
    {
        /*If there's no match, we can move the previous file
        to another directory so it doesn't get checked again*/
        if(previouspath!=null) //skip first loop
        {
            mv previouspath processedDir/Filename;
        }
    }
    previoussize = file.size();
    previouspath = file.path();
    Filename = file.name();
}
Example:
Qt_2012-03-15_23_00.tgz 10KB
Qt_2012-03-16_00_00.tgz 10KB
Qt_2012-03-16_01_00.tgz 10KB
Qt_2012-03-16_02_00.tgz 15KB
Qt_2012-03-16_03_00.tgz 10KB
Qt_2012-03-16_04_00.tgz 10KB
If I’m correct, this would delete only the first two and the second-to-last one. The third and the fourth should be moved to processedDir.
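One way to check that expectation without touching any files is to simulate the loop. The sketch below is a Python rendering of the pseudo code above, not a replacement for it: the function name `dedupe_consecutive` is made up for illustration, it works on a plain list of (name, size) pairs, and it assumes the files are visited in chronological order (which the date-based filenames give you when sorted by name).

```python
def dedupe_consecutive(files):
    """Simulate the pseudo code on (name, size) pairs given in loop order.

    Returns three lists of names: deleted, moved, remaining.
    Nothing on disk is touched; this only records what would happen.
    """
    deleted, moved = [], []
    previous_name, previous_size = None, None
    for name, size in files:
        if previous_name is not None:  # skip the first iteration
            if size == previous_size:
                deleted.append(previous_name)  # same size as its successor: rm
            else:
                moved.append(previous_name)    # size changed: mv to processedDir
        previous_name, previous_size = name, size
    # The last file is never compared with a successor, so it stays in workDir.
    remaining = [previous_name] if previous_name is not None else []
    return deleted, moved, remaining


example = [
    ("Qt_2012-03-15_23_00.tgz", 10),
    ("Qt_2012-03-16_00_00.tgz", 10),
    ("Qt_2012-03-16_01_00.tgz", 10),
    ("Qt_2012-03-16_02_00.tgz", 15),
    ("Qt_2012-03-16_03_00.tgz", 10),
    ("Qt_2012-03-16_04_00.tgz", 10),
]
deleted, moved, remaining = dedupe_consecutive(example)
print(deleted)    # the 15_23_00, 16_00_00 and 16_03_00 archives
print(moved)      # the 16_01_00 and 16_02_00 archives
print(remaining)  # the 16_04_00 archive
```

So the logic does what the example describes, with one detail worth noting: the newest archive always stays in workDir until a later run has a successor to compare it with. A real script would also compare exact byte counts rather than rounded KB values.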
So I guess I have 2 questions:
- Would my pseudo code work the way I intend it to? (I find these things rather confusing.)
- Is there a better/simpler/faster way? Because even though the chance of accidentally deleting non-identicals like that is very small, it’s still a chance.
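One possible way to remove that remaining chance is to compare what is inside two archives instead of their sizes. The following is a minimal sketch, assuming the backups are gzip-compressed tar files and that "duplicate" means "same member names with the same bytes"; timestamps, ownership and permissions are deliberately ignored. The helper names `archive_fingerprint` and `same_contents` are hypothetical.

```python
import hashlib
import tarfile


def archive_fingerprint(path):
    """Return a hash over member names and file contents of a .tgz archive.

    Timestamps and other metadata are ignored, so two archives of an
    unchanged directory get the same fingerprint even if their bytes differ.
    """
    entries = []
    with tarfile.open(path, "r:gz") as archive:
        for member in archive:
            content_hash = ""
            if member.isfile():
                data = archive.extractfile(member)
                hasher = hashlib.sha256()
                # Read in 1 MiB chunks so large members do not fill memory.
                for chunk in iter(lambda: data.read(1 << 20), b""):
                    hasher.update(chunk)
                content_hash = hasher.hexdigest()
            entries.append((member.name, content_hash))
    digest = hashlib.sha256()
    for name, content_hash in sorted(entries):
        line = f"{name}\0{content_hash}\n"
        digest.update(line.encode("utf-8", errors="surrogateescape"))
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """True if both archives hold the same names with the same contents."""
    return archive_fingerprint(path_a) == archive_fingerprint(path_b)
```

With something like this, the size check can stay as a cheap first filter, and `same_contents` is only called on neighbours whose sizes match before anything is deleted.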
I can think of a couple of alternatives:
- Deploy a version control system such as Git, Subversion, etc., and write a script that periodically checks in any changes. This will save a lot of space because only files that have actually changed get saved, and because changes to text files will be stored as diffs.
- Use an incremental backup tool. This article lists a number of alternatives.
Normal practice is to put the version control system / backups on a different machine, but you don’t have to do that.
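As a rough illustration of the first alternative, here is a sketch of a "check in any changes" script using Git. It assumes the directory has already been initialised with `git init` and that a commit identity is configured; the function name `snapshot` and its `run` parameter (a hook for substituting a fake runner in tests) are made up for this example.

```python
import subprocess
from datetime import datetime


def snapshot(repo_dir, run=subprocess.run):
    """Commit all changes in repo_dir; return True if a commit was made."""
    git = ["git", "-C", repo_dir]
    # --porcelain prints one line per changed path and nothing when clean.
    status = run(git + ["status", "--porcelain"],
                 capture_output=True, text=True, check=True)
    if not status.stdout.strip():
        return False  # nothing changed since the last commit
    run(git + ["add", "-A"], check=True)
    message = "Automatic snapshot " + datetime.now().strftime("%Y-%m-%d %H:%M")
    run(git + ["commit", "-q", "-m", message], check=True)
    return True
```

Run hourly (for example from cron) in place of the archive script, this records nothing when no file has changed, so the duplicate problem does not arise in the first place.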