I need to extract requests from a log file that look like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<vehicleRegistration>
.... XML in between ....
.... XML in between ....
.... XML in between ....
.... XML in between ....
... at nth line there is line like this <vehicle id="2312313"></vehicle>
.... XML in between ....
.... XML in between ....
</vehicleRegistration>
The main problem is that a vehicleRegistration block can be 5 lines long and sometimes 17; the length varies. This is where my current grep fails, because it always prints a fixed number of lines. I used:
grep -A 13 "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>" vehicle.log
Another issue is that a request can be sent two or more times because the service might be unavailable for some reason, so the same request may appear several times in the file.
I also need to rule out duplicate requests. A request is a duplicate if the vehicle id on its nth line (not the last line), <vehicle id="2312313"></vehicle>, has already appeared in an earlier request.
What is the way you would solve this? Suggestions, code, pseudo-code, anything is welcome.
EDIT:
The log file is not an XML file. Only a small percentage of it consists of XML requests, so I can't parse the whole file as XML.
EDIT II:
I extracted only the vehicle registration part using @eugene y's one-line command perl -nle 'm{<vehicleRegistration>} .. m{</vehicleRegistration>} and print' logfile. How can I get rid of duplicates, that is, nodes that have the same vehicle id? I want to keep only one copy of each.
Use XPath to recover XML element nodes. There are lots of frameworks for various modern scripting languages.
With Perl, you might do something like:
If you need to, parse your log file to extract the XML document portion, and then run the XPath expression on it to recover the element and data you want.
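The two steps above (pull the XML portions out of the log, then query each one for the vehicle id) could be sketched in Python as follows. This is only an illustration under a few assumptions: the opening and closing tags appear exactly as <vehicleRegistration> and </vehicleRegistration> with no attributes or namespaces, log lines are not interleaved inside a request, and the whole file fits in memory. The function name unique_registrations is made up for this example, and dropping blocks that have no vehicle id is a choice of this sketch, not something the question specifies.

```python
import re
import xml.etree.ElementTree as ET

# Non-greedy match from an opening tag to the nearest closing tag.
# re.DOTALL lets "." span newlines, so the block may be 5 lines or 17.
BLOCK_RE = re.compile(
    r"<vehicleRegistration>.*?</vehicleRegistration>", re.DOTALL
)


def unique_registrations(log_text):
    """Return vehicleRegistration blocks, keeping the first one per vehicle id."""
    seen_ids = set()
    unique_blocks = []
    for match in BLOCK_RE.finditer(log_text):
        block = match.group(0)
        try:
            root = ET.fromstring(block)
        except ET.ParseError:
            # Assumption: malformed or truncated requests are skipped.
            continue
        # XPath-style lookup: the first <vehicle> element at any depth.
        vehicle = root.find(".//vehicle")
        vehicle_id = vehicle.get("id") if vehicle is not None else None
        if vehicle_id is None or vehicle_id in seen_ids:
            # Assumption: blocks without an id are dropped, as are repeats.
            continue
        seen_ids.add(vehicle_id)
        unique_blocks.append(block)
    return unique_blocks
```

For the file in the question, you would read vehicle.log into a string, pass it to unique_registrations, and print each returned block. Because only the text between the tags is handed to the XML parser, the non-XML log lines around each request are never parsed.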