My test equipment generates large text files which tend to grow in size over a period of several days as data is added.
But the text files are transferred to a PC for backup purposes daily, where they’re compressed with gzip, even before they’ve finished growing.
This means I frequently have both file.txt and a compressed form file.txt.gz where the uncompressed file may be more up to date than the compressed version.
I decide which to keep with the following bash script gzandrm:
#!/usr/bin/bash
# Given an uncompressed file, look in the same directory for
# a gzipped version of the file and delete the uncompressed
# file if zdiff reveals they're identical. Otherwise, the
# file can be compressed.
# eg: find . -name '*.txt' -exec gzandrm {} \;
if [[ -e $1 && -e $1.gz ]]
then
    # simple check: use zdiff and count the characters
    DIFFS=$(zdiff "$1" "$1.gz" | wc -c)
    if [[ $DIFFS -eq 0 ]]
    then
        # difference is '0', delete the uncompressed file
        echo "'$1' already gzipped, so removed"
        rm "$1"
    else
        # difference is non-zero, check manually
        echo "'$1' and '$1.gz' are different"
    fi
else
    # go ahead and compress the file
    echo "'$1' not yet gzipped, doing it now"
    gzip "$1"
fi
This has worked well, but it would make more sense to compare the modification dates of the files: gzip does not change the modification date when it compresses, so two files with the same date are really the same file, even if one of them is compressed.
How can I modify my script to compare files by date, rather than size?
It’s not entirely clear what the goal is, but it seems to be simple efficiency, so I think you should make two changes: 1) check modification times, as you suggest, and don’t bother comparing content if the uncompressed file is no newer than the compressed file, and 2) use zcmp instead of zdiff.

Taking #2 first, your script captures the output of zdiff "$1" "$1.gz" | wc -c in DIFFS and then tests whether DIFFS is zero. That performs a full diff of potentially large files, counts the characters in diff’s output, and examines the count. But all you really want to know is whether the content differs.

cmp is better for that, since it will scan byte by byte and stop if it encounters a difference. It doesn’t take the time to format a nice textual comparison (which you will mostly ignore); its exit status tells you the result. zcmp isn’t quite as efficient as raw cmp, since it’ll need to do an uncompress first, but zdiff has the same issue.

So you could switch to zcmp (and remove the use of a subshell, eliminate wc, not invoke [[, and avoid generating potentially large textual diff output) just by replacing those two lines with a single if that runs zcmp on "$1" and "$1.gz" and tests its exit status directly.

To go a step further and check modification times first, you can use the -nt (newer than) option to the test command (also known as square bracket), rewriting that if so that it succeeds when the uncompressed version is no newer than the compressed version OR when they have the same content. In either case $1 is already gzipped and you can remove it. Note that if the uncompressed file is no newer, zcmp won’t run at all, saving some cycles.

The rest of your script should work as is.
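To make the two changes concrete, here is a rough sketch of what the revised test could look like. The function names same_content and already_gzipped are made up for illustration (the original script has no functions), and the -s flag assumes your zcmp passes options through to cmp, as the gzip version does. Treat it as a starting point rather than a definitive rewrite.

```shell
# Sketch only: same_content and already_gzipped are hypothetical helper names.

# Change 2 on its own: succeed (exit status 0) if "$1" and "$1.gz" hold the
# same content. -s is handed to cmp to suppress its output (assumption: your
# zcmp forwards options to cmp).
same_content() {
    zcmp -s "$1" "$1.gz"
}

# Changes 1 and 2 together: succeed if "$1" is no newer than "$1.gz",
# OR if the two have the same content. When the first test succeeds,
# || short-circuits and zcmp never runs.
already_gzipped() {
    [ ! "$1" -nt "$1.gz" ] || same_content "$1"
}
```

In gzandrm itself, the body of already_gzipped would go straight into the if, taking the place of the DIFFS= assignment and the [[ $DIFFS -eq 0 ]] test.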
One caveat: modification times are very easy to change. Just moving the compressed file from one machine to another could change its modtime, so you’ll have to consider your own case to know whether the modtime check is a valid optimization or more trouble than it’s worth.
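If modtimes do get lost in transit, one option is to copy with a tool that preserves them. As a hedged illustration only: cp -p, scp -p and rsync -t all keep the source's modification time. The helper names below are hypothetical, and same_mtime relies on the -nt and -ot file tests that bash (and most sh implementations) provide.

```shell
# Hypothetical helpers, for illustration only.

# Copy a file while preserving its modification time
# (cp -p keeps timestamps, mode and, where permitted, ownership).
copy_keeping_mtime() {
    cp -p -- "$1" "$2"
}

# Succeeds if neither file is newer than the other, i.e. the modtimes
# match (both files are assumed to exist).
same_mtime() {
    [ ! "$1" -nt "$2" ] && [ ! "$1" -ot "$2" ]
}
```

Whether the transfer step in your setup already preserves modtimes is something only you can check; comparing the two files with same_mtime right after a transfer is one quick way to find out.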