I work with XML files containing book data. When investigating data corruption issues I often have to extract the whole records which include a particular string.
I am struggling to do this with my very limited knowledge of bash scripting and total lack of knowledge of other programming languages such as Perl.
I have standard user access to a Linux box (RHEL 4) with no prospect of getting permission to install anything not already present.
Using standard tools/languages available on this box, can anyone explain how I might look for a particular string and extract any whole records from the file which might contain it?
E.g. to extract the whole records which contain ‘Smith’ from the following file.
Example data:
<File>
<Product>
<Ref>1</Ref>
<Title>My Life</Title>
<Series>Life Stories</Series>
<Author>John Smith</Author>
<Price>5.99</Price>
</Product>
<Product>
<Ref>2</Ref>
<Title>A Story</Title>
<Author>Fred Bloggs</Author>
<Price>16.99</Price>
</Product>
<Product>
<Ref>3</Ref>
<Title>Book 1</Title>
<Author>Jane Smith</Author>
<Price>10.99</Price>
</Product>
</File>
Required output:
<Product>
<Ref>1</Ref>
<Title>My Life</Title>
<Series>Life Stories</Series>
<Author>John Smith</Author>
<Price>5.99</Price>
</Product>
<Product>
<Ref>3</Ref>
<Title>Book 1</Title>
<Author>Jane Smith</Author>
<Price>10.99</Price>
</Product>
That is to say everything between the <Product> </Product> tags for the records containing the string ‘Smith’.
The records may contain different numbers of tags but will always be enclosed in <Product> </Product> tags.
I appreciate the perfect result may not be possible every time without using more specialist tools but I simply don’t have them available to me. Anything which gets me close would be great.
I’m thinking the script would read each record in the file, look for the string within each record in turn and redirect those records which match to an output. However, I am struggling to find the answer anywhere.
Many thanks for any help you can offer.
This should work for your example:
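One way to do it with tools that should already be on the box is awk. What follows is only a minimal sketch under a few assumptions, not a definitive solution: it assumes awk is installed (it normally is on RHEL), that each <Product> and </Product> tag sits on its own line as in the example data, and that the search string is plain text rather than a regular expression. The function name extract_records and the file name books.xml used below are made up for illustration.

```shell
#!/bin/sh
# extract_records STRING FILE
# Prints every <Product>...</Product> block in FILE that contains STRING.
# Assumes the opening and closing Product tags are each on their own line.
extract_records() {
    awk -v pat="$1" '
        # Opening tag: start collecting a fresh record.
        /<Product>/   { rec = ""; inrec = 1 }
        # While inside a record, append the current line to the buffer.
        inrec         { rec = rec $0 "\n" }
        # Closing tag: print the buffered record only if it contains the
        # string (index() does a literal match, not a regex match).
        /<\/Product>/ { if (inrec && index(rec, pat)) printf "%s", rec; inrec = 0 }
    ' "$2"
}
```

After pasting the function into a shell (or a script), something like `extract_records Smith books.xml > smith_records.xml` would write the matching records to a new file. This buffers each record as it is read, checks it for the string when the closing tag arrives, and prints it only on a match, which is the approach described in the question. It is not a real XML parser, so records whose tags share a line with other content may need the data reformatted first.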