So I found a script online for xml parsing in linux that I am

Asked: June 9, 2026 (2026-06-09T05:10:15+00:00)

I found a script online for XML parsing on Linux that I want to use. I was hoping to get some help understanding how the script works and how to edit it for my own use.

Here is the script (credit)

#!/bin/bash

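cat "$1" | awk '

# Runs for every input line: reset the scan state.
{
   pos = 1            # current scan position in the line
   xml = $0           # the line being scanned
   len = length(xml)
   endp = 1           # will mark the end of the current match
}

# Walk the line one character at a time, looking for <title>.
{
   while (pos <= len)
   {
      if (substr(xml, pos, 7) == "<title>")
      {
         pos = pos + 7;       # skip past the opening tag
         endp = pos;
         # advance endp until it sits on the closing tag
         while ((substr(xml, endp, 8) != "</title>") && (endp < len))
         {
            endp++;
         }
         print "   ", substr(xml, pos, endp - pos), " * ";
         pos = endp + 7;      # with the pos++ below, lands just past </title>
      }
      pos++;
   }
}'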

Here is a simplified sample of the xml data I will be using

I have already gotten rid of the extra characters on both sides of the tags and made a few other adjustments by changing the script to this:

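#!/bin/bash

cat "$1" | awk '

# Runs for every input line: reset the scan state.
{
   pos = 1
   xml = $0
   len = length(xml)
   endp = 1
}

# Walk the line, printing the CDATA text of every <sport> element.
{
   while (pos <= len)
   {
      if (substr(xml, pos, 16) == "<sport><![CDATA[")
      {
         pos = pos + 16;      # skip past the opening tag and CDATA marker
         endp = pos;
         while ((substr(xml, endp, 11) != "]]></sport>") && (endp < len))
         {
            endp++;
         }
         print "", substr(xml, pos, endp - pos), "";
         pos = endp + 10;     # with the pos++ below, lands just past ]]></sport>
      }
      pos++;
   }
}'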

So using this script leaves me with a plain text file with this result

Women's Soccer
Men's Soccer
Women's Soccer

Ultimately I’d like to have a script output the following

Women's Soccer Away @ South Carolina (Exhibition) at 7:00 PM
Men's Soccer Home vs. Ohio State at 7:00 PM
Women's Soccer Away @ William and Mary at 7:00 PM

For those wondering, this is the shell that calls the parse script (ignore file names and locations)

wget -O rss.xml http://en-us.fxfeeds.mozilla.com/en-US/firefox/headlines.xml
~dsl/bin/rssparse! rss.xml > headlines_$$.tmp
cd /tmp/ldmtrx
split --lines=30 /tmp/headlines_$$.tmp ldmtrxnews
cd /tmp
rm headlines_$$.tmp rss.xml

While it would be greatly appreciated, I don't expect anyone to complete this task for me; I'd just really like some tips and help getting started. I'm not sure how to run this script on a different element and then print both elements (for example <sport> and <homeaway>) together. I could run the script again, but then the elements would be printed on different lines.

Lastly, I’d like to know how to exclude all data that does not have a <date> matching today’s date. Thanks for your help.

1 Answer

Editorial Team · Answer 1 · 2026-06-09T05:10:16+00:00

Note that your example does not validate as XML. It needs some tweaks.

Check this pastie instead of that pastie.

Then, using xmlstarlet, you can replace everything this script does.

$ wget --output-document - http://pastie.org/pastes/4408130/download | xmlstarlet sel -t -m rss/channel/item -v sport -o ' Away @ ' -v opponent -o ' at ' -v time -n

That outputs:

Women's Soccer Away @ South Carolina (Exhibition) at 7:00 PM
Men's Soccer Away @ Ohio State (Exhibition) at 7:00 PM
Women's Soccer Away @ William and Mary at 7:00 PM

Once the output is what you need, you can pass -C to xmlstarlet sel to print the XSLT stylesheet it generates, which you can then reuse from any language where you need this particular parsing.
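If installing xmlstarlet is not an option, the two remaining requests in the question (printing several elements on one line, and keeping only entries for a given date) can also be sketched in plain awk. This is only a sketch under assumptions: it supposes every entry is wrapped in <item>...</item> and contains <sport>, <homeaway>, <opponent>, <time> and <date> elements (names taken from the question and the command above), that <homeaway> holds either Home or Away, and that the date string you pass in has the same format as the feed's <date>. The names parse_items and field are made up for illustration.

```shell
#!/bin/bash

# Hypothetical helper: print one line per <item> whose <date> equals $2.
# Usage: parse_items FILE DATE
parse_items() {
  awk -v today="$2" '
    # Return the text between <tag> and </tag> in chunk,
    # with an optional CDATA wrapper stripped.
    function field(chunk, tag,    stag, etag, s, e, val) {
      stag = "<" tag ">"
      etag = "</" tag ">"
      s = index(chunk, stag)
      if (s == 0) return ""
      val = substr(chunk, s + length(stag))
      e = index(val, etag)
      if (e == 0) return ""
      val = substr(val, 1, e - 1)
      sub(/^ *<!\[CDATA\[/, "", val)
      sub(/\]\]> *$/, "", val)
      return val
    }

    # Join all input lines so an item may span several lines.
    { doc = doc $0 " " }

    END {
      n = split(doc, items, "</item>")
      for (i = 1; i <= n; i++) {
        s = index(items[i], "<item>")
        if (s == 0) continue                     # no item in this chunk
        it = substr(items[i], s)
        if (field(it, "date") != today) continue # keep matching dates only
        ha = field(it, "homeaway")
        sep = (ha == "Home") ? "vs." : "@"       # assumed Home/Away values
        printf "%s %s %s %s at %s\n", field(it, "sport"), ha, sep,
               field(it, "opponent"), field(it, "time")
      }
    }
  ' "$1"
}
```

It could be called as parse_items rss.xml "$(date +%m/%d/%Y)", where the date format string is a guess that has to be adjusted to whatever the feed actually uses. Because all fields of an item are read before anything is printed, they end up on the same line, which is what running the original script once per element cannot do. Unlike xmlstarlet, this does plain string matching and will not cope with attributes on the tags or with nested elements of the same name.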
