Say I have two log files (input.log and output.log) with the following format:
2012-01-16T12:00:00 12345678
The first field is the processing timestamp and the second is a unique ID. I’m trying to find:
- The records from input.log which don't have a corresponding record for that ID in output.log
- The records from input.log which have a record for that ID, but where the difference in the timestamps exceeds 5 seconds
I have a workaround solution with MySQL, but I’d ideally like to remove the database component and handle it with a shell script.
I have the following, which returns the lines of input.log with an added column if output.log contains the ID:
join -a1 -j2 -o 0 1.1 2.1 <(sort -k2,2 input.log) <(sort -k2,2 output.log)
Example output:
10111 2012-01-16T10:00:00 2012-01-16T10:00:04
11562 2012-01-16T11:00:00 2012-01-16T11:00:10
97554 2012-01-16T09:00:00
Main question:
Now that I have this information, how can I go about computing the differences between the 2 timestamps and discarding those over 5 seconds apart? I hit some problems processing the ISO 8601 timestamp with date (specifically the T) and assumed there must be a better way.
Edit: GNU coreutils has supported ISO 8601 timestamps since late 2011, not long after this question was asked, so this is likely no longer an issue for anyone. See this answer
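As a quick illustration of that point, here is a minimal sketch assuming a reasonably recent GNU date is installed. The helper name iso_to_epoch is made up for this example; -u pins the conversion to UTC so the result does not depend on the local timezone.

```shell
# Hypothetical helper: convert an ISO 8601 timestamp (with the T separator)
# to seconds since the epoch. Assumes GNU date; -u interprets the input as UTC.
iso_to_epoch() {
    date -u -d "$1" +%s
}

iso_to_epoch 2012-01-16T12:00:00   # prints 1326715200

# Difference in seconds between two of the timestamps from the example output.
echo $(( $(iso_to_epoch 2012-01-16T11:00:10) - $(iso_to_epoch 2012-01-16T11:00:00) ))   # prints 10
```

With the conversion working, the comparison itself is plain integer arithmetic.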
Secondary question:
Is there perhaps a way to rework the entire approach, for instance into a single awk script? My knowledge of processing multiple files and setting up the correct inequalities for the output conditions was the limiting factor here, hence the approach above.
If you have GNU awk, then you can try something like this:
Test:
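Pieced together from the three pattern-action statements quoted in the explanation that follows, a minimal sketch of such a script might look like this. The wrapper name check_logs, the sample data, and the close() calls are assumptions added for illustration. close() is there so that each date pipe is released and a repeated timestamp is converted again rather than hitting end-of-file. The sketch assumes GNU date is available.

```shell
# Hypothetical wrapper: $1 = input log, $2 = output log.
# The output log is passed to awk first so that NR==FNR matches it.
check_logs() {
    awk '
    # First file (output.log): remember the timestamp for each ID.
    NR == FNR { a[$2] = $1; next }

    # Second file (input.log): ID never appeared in output.log.
    !($2 in a) { print $2, $1; next }

    # ID present in both files: convert both timestamps with date.
    {
        cmd = "date +%s -d " $1;    cmd | getline var1; close(cmd)
        cmd = "date +%s -d " a[$2]; cmd | getline var2; close(cmd)
        var3 = var2 - var1
        if (var3 > 4) print $2, $1, a[$2]
    }' "$2" "$1"
}

# Sample data invented to match the example output in the question.
tmp=$(mktemp -d)
printf '%s\n' '2012-01-16T10:00:00 10111' \
              '2012-01-16T11:00:00 11562' \
              '2012-01-16T09:00:00 97554' > "$tmp/input.log"
printf '%s\n' '2012-01-16T10:00:04 10111' \
              '2012-01-16T11:00:10 11562' > "$tmp/output.log"

check_logs "$tmp/input.log" "$tmp/output.log"
# prints:
# 11562 2012-01-16T11:00:00 2012-01-16T11:00:10
# 97554 2012-01-16T09:00:00
```

ID 10111 is not printed because its two timestamps are only 4 seconds apart.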
Explanation:
NR==FNR{a[$2]=$1;next}
We start off by storing the first field of each line of output.log in an array indexed by the second field (the ID). We use next to prevent the other pattern{action} statements from running. NR==FNR is true only while the first file is being read, which lets us slurp output.log completely.
!($2 in a) {print $2,$1; next}
Once output.log has been read, we move on to input.log. We check whether the second field of each input.log line is missing from our array (i.e. from output.log). If it is, we print the ID and its timestamp and skip to the next line. This repeats for every line, so every ID without a match gets printed.
($2 in a) {"date +%s -d " $1 | getline var1; "date +%s -d " a[$2] | getline var2; var3=var2-var1; if (var3 > 4) print $2,$1,a[$2] }
Here we handle the IDs that are present in both files and compute the difference. We run the external date command to convert each timestamp to seconds since the epoch. A command started from awk normally writes straight to STDOUT, where we cannot use the result, so we pipe its output into awk's getline function and store it in a variable (var1 and var2). Once both values are stored, we subtract them and keep the result in var3. If var3 is greater than 4 (that is, 5 whole seconds or more), we print the record in the format you want.
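Spawning date twice per matching record can get slow on large logs. As an alternative sketch (a swapped-in technique, not part of the approach explained above), the conversion can be done with plain awk arithmetic. The names check_logs_portable and epoch are hypothetical, and so is the optional third argument for the threshold. It assumes every timestamp has exactly the form YYYY-MM-DDThh:mm:ss, that both logs use the same timezone, and that no daylight-saving change falls between two compared timestamps. With GNU awk, the built-in mktime function would be another way to do the conversion in-process.

```shell
# Hypothetical wrapper: $1 = input log, $2 = output log,
# $3 = threshold in seconds (defaults to 5, compared with a strict ">").
check_logs_portable() {
    awk -v limit="${3:-5}" '
    # Seconds since 1970-01-01T00:00:00, treating the timestamp as zone-free.
    # Uses the usual March-based day count for the proleptic Gregorian calendar.
    function epoch(ts,    p, y, m, days) {
        split(ts, p, "[-T:]")
        y = p[1]; m = p[2]
        if (m <= 2) { y--; m += 12 }   # count Jan/Feb as months 13/14 of the previous year
        days = 365 * y + int(y / 4) - int(y / 100) + int(y / 400) \
             + int((153 * (m - 3) + 2) / 5) + p[3] - 719469
        return days * 86400 + p[4] * 3600 + p[5] * 60 + p[6]
    }

    # First file (output log): remember the timestamp for each ID.
    NR == FNR { seen[$2] = $1; next }

    # Second file (input log): ID missing from the output log.
    !($2 in seen) { print $2, $1; next }

    # ID present in both: print when the gap exceeds the threshold.
    epoch(seen[$2]) - epoch($1) > limit { print $2, $1, seen[$2] }
    ' "$2" "$1"
}

# Usage, with the file names from the question:
#   check_logs_portable input.log output.log
#   check_logs_portable input.log output.log 10   # custom threshold
```

Because the threshold is a strict comparison, a gap of exactly 5 seconds is not reported with the default setting, which matches the "exceeds 5 seconds" wording of the question.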