I’m searching through input to pull out specific info about each record. The sad thing is that each record is spread out over multiple lines, e.g. (simplified excerpt)
01238584 (other info) more info, more info
[age=81][otherinfo][etc, etc]
The only thing I really care about is the identifier and the age (01238584 and 81, in the example). To be crystal-clear, the only regex I can reliably search for in the input to get close to these two lines is
\[age=[0-9]+\]
… and of course I want to print out that age along with the identifying record number from the line above it, e.g.
01238584 81
With all my sysadmin shell experience and decent awk mastery, I haven’t come up with a solution yet. I can of course use grep -B1 to get each set of lines, but then what? I always use awk for these kinds of things… but the associated data has always been on the same line. (sigh) This is definitely beyond my current awk skills.
Thanks for reading. Got any pointers?
EDIT
I’m going with Charlie’s suggestion and changing awk’s record separator, which I had never done before. It’s not pretty, but neither is the input. Job is done.
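grep -E -B1 '\[age=[0-9]+\]' inputfile |
awk '
# grep -B1 prints "--" between non-adjacent groups of lines; use it as the record separator
BEGIN { RS = "--" }
# $1 is the record ID; gensub (a gawk extension) pulls out the digits of the age
{ printf "%s %s\n", $1, gensub(/.*\[age=([0-9]+)\].*/, "\\1", 1) }'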
Can you show more of the input file? For example, if the data records are separated by blank lines, you can change the record separator using the RS special variable in Awk to have it treat multiple lines as one record. (See, e.g., http://www.staff.science.uu.nl/~oostr102/docs/nawk/nawk_19.html)
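As a rough sketch of the RS idea, assuming (and this is only an assumption, since the full input was not shown) that a blank line separates each record: setting RS to the empty string puts awk into paragraph mode, so both lines of a record arrive as one $0. The second sample record and the names sample and paragraph_result are made up for illustration.

```shell
# Hypothetical sample: two records separated by a blank line (an assumption about the input).
sample=$(mktemp)
cat > "$sample" <<'EOF'
01238584 (other info) more info, more info
[age=81][otherinfo][etc, etc]

01238585 (other info) more info
[age=7][otherinfo]
EOF

paragraph_result=$(awk '
    # Paragraph mode: blank lines separate records, newlines also separate fields.
    BEGIN { RS = "" }
    # The whole multi-line record is in $0, so the ID ($1) and the age tag are both visible.
    match($0, /\[age=[0-9]+\]/) {
        # Skip the 5 characters of "[age=" and drop the closing "]".
        print $1, substr($0, RSTART + 5, RLENGTH - 6)
    }
' "$sample")

printf '%s\n' "$paragraph_result"
rm -f "$sample"
```

This uses only POSIX awk features (match, RSTART, RLENGTH), so it should not need gawk.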
In any case, I’d be tempted to do something that puts all your data records onto one line or in one logical record.
If you can’t do that but you know the record ID is always on the line before the age tag, then it’s easy to do in Python with readlines, which reads the whole file into a list of lines: walk the list, and whenever a line matches the age tag, take the ID from the line just before it.
Or, of course, you can always just keep the previous line in memory in Awk and use it when the current line matches the age tag.
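A minimal sketch of the keep-the-previous-line approach, assuming the ID is the first whitespace-separated field of the line directly above the age line. The second sample record and the names sample, result and id are invented for illustration.

```shell
# Sample input in the format from the question (the second record is made up).
sample=$(mktemp)
cat > "$sample" <<'EOF'
01238584 (other info) more info, more info
[age=81][otherinfo][etc, etc]
01238585 (other info) more info
[age=7][otherinfo]
EOF

result=$(awk '
    # On an age line, print the ID saved from the previous line, then the digits of the age.
    match($0, /\[age=[0-9]+\]/) {
        # RSTART points at "["; skip the 5 characters of "[age=" and drop the closing "]".
        print id, substr($0, RSTART + 5, RLENGTH - 6)
    }
    # Remember the first field of every line; it is the ID when the next line holds the age.
    { id = $1 }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Because the saved field is only used on a matching line, this needs no grep -B1 pre-filter and does not depend on any record separator appearing in the input.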