I have a perl script (or any executable) E which will take a file foo.xml and write a file foo.txt. I use a Beowulf cluster to run E for a large number of XML files, but I’d like to write a simple job server script in shell (bash) which doesn’t overwrite existing txt files.
I’m currently doing something like
#!/bin/sh
PATTERN="[A-Z]*0[1-2][a-j]"; # this matches foo in all cases
todo=`ls *.xml | grep $PATTERN -o`;
isdone=`ls *.txt | grep $PATTERN -o`;
whatsleft=todo - isdone; # what's the unix magic?
#tack on the .xml prefix with sed or something
#and then call the job server;
jobserve E "$whatsleft";
and then I don’t know how to get the difference between $todo and $isdone. I’d prefer using sort/uniq to something like a for loop with grep inside, but I’m not sure how to do it (pipes? temporary files?)
As a bonus question, is there a way to do lookahead search in bash grep?
To clarify/extend the problem:
I have a bunch of programs that take input from sources like (but not necessarily) data/{branch}/special/{pattern}.xml and write output to another directory results/special/{branch}-{pattern}.txt (or data/{branch}/intermediate/{pattern}.dat, e.g.). I want to check in my jobfarming shell script if that file already exists.
So E transforms data/{branch}/special/{pattern}.xml->results/special/{branch}-{pattern}.dat, for instance. I want to look at each instance of the input and check if the output exists. One (admittedly simpler) way to do this is just to touch *.done files next to each input file and check for those results, but I’d rather not manage those, and sometimes the jobs terminate improperly so I wouldn’t want them marked done.
N.B. I don’t need to check concurrency yet or lock any files.
So a simple, clear way to solve the above problem (in pseudocode) might be
for i in `/bin/ls *.xml`
do
replace xml suffix with txt
if [that file exists]
add to whatsleft list
end
done
but I’m looking for something more general.
This assumes that you want a single
jobserve Ecall with all the remaining files as arguments; it’s rather unclear from the specification if such is the case.Note the use of extended globs rather than parsing ls, which is considered very poor practice.
To transform input to output names without using anything other than shell builtins, consider the following: