I have to write a shell script that counts the occurrences of an XML tag (say, Code) in an XML file. The XML file can be in any one of the following formats:
Format #1:
<Code>value1</Code> <Code>value2</Code>
Format #2:
<Code Attr1=va>value1</Code> <Code Attr1=va
Attr2=va>value1</Code>
Format #3:
<Code>value1</Code><Code>value2</Code> (All Codes can be in
a single line or multiple lines)
Format #4:
<Code Attr1=va>value1</Code><Code Attr2=va>value1</Code>
Format #5:
<Cod
e>Value1</Code
<Code Attr=1> </C
ode>
In short, the XML file can be in any format and can have newlines anywhere.
Please help me, I need to do this soon..
Thanks in advance.
Regular expressions are a bad way to parse XML; using some sort of XML parser is better.
If you really want to use sed/awk/shell/grep etc., the first thing I can think of is:
cat tst | xargs | grep -o '<\s*C\s*o\s*d\s*e[^>]*>' | wc -l
I don’t know awk very well, but I’m sure there are awk ninjas out there who can do it more elegantly than this.
It only counts occurrences of <Code> (and variations), not the closing tag, so if you have (for example) 10 <Code> in your file but only 9 </Code>, it will return 10 and not 9.
Basically:
cat tst | xargs cats ‘tst’ to the shell all on one line (so I don’t have to worry about new lines);
grep -o '<\s*C\s*o\s*d\s*e[^>]*>' prints all matches of <Code{optional other stuff}> where you can have newlines/spaces in between all letters of Code (the -o prints just the matches to the regex, one per line);
wc -l counts the lines.
Try each bit successively to see what I mean.
For me, tst was just a copy-paste of what you have above.
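If you want the same idea as a reusable function, here is a rough sketch. The name count_open_tags is made up for illustration, and two things differ from the one-liner above: it joins lines with tr instead of xargs (xargs can trip over quote characters in attribute values), and it requires a space, / or > right after the tag name so that something like <Codex> is not counted. It assumes the tag name contains only letters and digits, and that your grep supports -o (GNU and BSD grep both do).

```shell
#!/bin/sh
# count_open_tags NAME FILE  (hypothetical helper, not a standard tool)
# Prints how many opening <NAME ...> tags FILE contains, tolerating
# newlines or spaces between the letters of NAME. Closing tags are ignored.
count_open_tags() {
    name=$1
    file=$2
    # Turn "Code" into "C[[:space:]]*o[[:space:]]*d[[:space:]]*e[[:space:]]*"
    # (assumes NAME has no regex metacharacters).
    pattern=$(printf '%s' "$name" | sed 's/./&[[:space:]]*/g')
    # Join all lines into one, print each opening tag on its own line,
    # then count those lines (tr -d strips the padding some wc versions add).
    tr '\n' ' ' < "$file" |
        grep -oE "<[[:space:]]*${pattern}(>|/>|[[:space:]][^>]*>)" |
        wc -l | tr -d ' '
}
```

Call it as count_open_tags Code tst. As with the one-liner, this is still pattern matching and not real XML parsing, so it will also count tags inside comments or CDATA sections.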