I found a script online for XML parsing on Linux that I want to use, and I was hoping to get some help understanding how it works and how to edit it for my own use.
Here is the script (credit)
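#!/bin/bash
# Print the text of every <title>...</title> element found on each input line.
awk '
{
    # Reset the scan state for each input line.
    xml = $0; len = length(xml); pos = 1
    while (pos <= len) {
        if (substr(xml, pos, 7) == "<title>") {
            pos = pos + 7          # skip past the opening tag
            endp = pos
            # Advance endp to the start of the closing tag (or the end of the line).
            while ((substr(xml, endp, 8) != "</title>") && (endp < len)) {
                endp++
            }
            print " ", substr(xml, pos, endp - pos), " * "
            pos = endp + 7         # with the pos++ below, skips the 8-character closing tag
        }
        pos++
    }
}' "$1"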
Here is a simplified sample of the xml data I will be using
I have already removed the extra characters on both sides of the tags and made a few other adjustments by changing the script to this:
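#!/bin/bash
# Print the CDATA text of every <sport> element found on each input line.
awk '
{
    # Reset the scan state for each input line.
    xml = $0; len = length(xml); pos = 1
    while (pos <= len) {
        if (substr(xml, pos, 16) == "<sport><![CDATA[") {
            pos = pos + 16         # skip past the opening tag and CDATA marker
            endp = pos
            # Advance endp to the start of the closing sequence (or the end of the line).
            while ((substr(xml, endp, 11) != "]]></sport>") && (endp < len)) {
                endp++
            }
            print "", substr(xml, pos, endp - pos), ""
            pos = endp + 10        # with the pos++ below, skips the 11-character closing sequence
        }
        pos++
    }
}' "$1"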
So using this script leaves me with a plain text file with this result
Women's Soccer
Men's Soccer
Women's Soccer
Ultimately I’d like to have a script output the following
Women's Soccer Away @ South Carolina (Exhibition) at 7:00 PM
Men's Soccer Home vs. Ohio State at 7:00 PM
Women's Soccer Away @ William and Mary at 7:00 PM
For those wondering, this is the shell script that calls the parse script (ignore the file names and locations):
wget -O rss.xml http://en-us.fxfeeds.mozilla.com/en-US/firefox/headlines.xml
~dsl/bin/rssparse! rss.xml > headlines_$$.tmp
cd /tmp/ldmtrx
split --lines=30 /tmp/headlines_$$.tmp ldmtrxnews
cd /tmp
rm headlines_$$.tmp rss.xml
While it would be greatly appreciated, I don’t expect anyone to complete this task for me. I’d just really like some tips and help getting started. I’m not sure how to run this script on a different element and then print both elements (for example <sport> and <homeaway>). I could run the script again, but then the elements would be printed on different lines.
Lastly, I’d like to know how to exclude all data that does not have a <date> matching today’s date. Thanks for your help.
Note that your example does not validate as XML. It needs some tweaks:
check this pastie instead of that pastie.
Then, using xmlstarlet, you can supersede everything that this script does.
That outputs:
Once the output is what you need, you can pass -C to xmlstarlet to show the XSLT template it generates, which you can reuse from any language in which you need that particular parsing.
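If installing xmlstarlet is not an option, the two requests in the question (several elements on one line, and only the events for today) can also be sketched in awk alone. This is a rough sketch under assumptions rather than a tested fit for the real feed: only <sport>, <homeaway> and <date> are named in the question, so the <event> wrapper, the <opponent> and <time> elements, the YYYY-MM-DD date format, and the "Away means @, otherwise vs." rule are all guesses, and parse_events is a made-up function name.

```shell
# Sketch only: <event>, <opponent>, <time>, the YYYY-MM-DD date format and the
# "@" / "vs." rule are assumptions, not taken from the real feed.
parse_events() {
    # $1 = XML file, $2 = date to keep (defaults to the current date)
    awk -v today="${2:-$(date +%Y-%m-%d)}" '
    # Return the text inside <tag><![CDATA[ ... ]]></tag> within rec, or "".
    function field(rec, tag,    otag, ctag, s, e) {
        otag = "<" tag "><![CDATA["
        ctag = "]]></" tag ">"
        s = index(rec, otag)
        if (s == 0) return ""
        rec = substr(rec, s + length(otag))
        e = index(rec, ctag)
        if (e == 0) return ""
        return substr(rec, 1, e - 1)
    }
    { xml = xml $0 }    # join all lines so that tags may span line breaks
    END {
        # Walk through the document one <event>...</event> block at a time.
        while ((s = index(xml, "<event>")) > 0) {
            xml = substr(xml, s + 7)
            e = index(xml, "</event>")
            if (e == 0) break
            rec = substr(xml, 1, e - 1)
            xml = substr(xml, e + 8)
            # Skip events whose <date> is not the requested date.
            if (field(rec, "date") != today) continue
            where = field(rec, "homeaway")
            sep = (where == "Away") ? "@" : "vs."
            # All fields go into one print, so they end up on one line.
            print field(rec, "sport"), where, sep, field(rec, "opponent"), "at", field(rec, "time")
        }
    }' "$1"
}
```

With the function defined, something like parse_events rss.xml > headlines_$$.tmp could stand in for the rssparse line of the calling script. Collecting every field of one event before printing is what keeps them on a single line, and comparing <date> against the current date before printing is what drops the other events.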