Hi I am working on scraping the XML file. For HTML I have used

Asked: June 8, 2026

Hi, I am working on scraping an XML file. For HTML I have used Scrapy, and for XML I decided to parse it using xml.sax.

The following is example code (don't treat it as a real example), just to illustrate my question:

from xml.sax.handler import ContentHandler
import xml.sax

xmlFilePath = 'users/documents/jobstext.xml'

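try:
    parser = xml.sax.make_parser()
    # Open in binary mode so the parser uses the encoding declared in the XML prolog
    with open(xmlFilePath, 'rb') as xml_file:
        parser.parse(xml_file)

except xml.sax.SAXParseException as e:
    # The exception text includes the file name, line, and column of the failure
    print("*** PARSER error: %s" % e)
    print(e, "What is the error actually >>>>")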

The following is the XML:

<?xml version="1.0" encoding="utf-8"?>
<jobs>
  <reader><![CDATA[Identity Group]]></reader>
  <readerUrl><![CDATA[http://www.example.com]]></readerUrl>

  <job>
    <title><![CDATA[Architect - OT]]></title>
    <category><![CDATA[LTC/SNF]]></category>
    <jobId><![CDATA[139693]]></jobId>
    <specialization><![CDATA[LTC/SNF]]></specialization>
    <positionType><![CDATA[Travel]]></positionType>
    <description><![CDATA[<DIV>OT&nbsp;needed for a SNF in&nbsp;Oregon.&nbsp; Oregon is a dramatic land of many changes. From the rugged Oregon seacoast, the high mountain passes of the country for Travel Allied Professionals and Travel Nurses. Our clients are among the most prestigious healthcare facilities in the country.</DIV>
<DIV>&nbsp;</DIV>
 </description>
<P style="MARGIN: 0in 0in 0pt" class=MsoNormal><FONT size=3><SPAN style="FONT-FAMILY: Symbol; COLOR: black; mso-ascii-font-family: 'Times New Roman'">�</SPAN><SPAN style="COLOR: black"><FONT face="Times New Roman"><SPAN style="mso-spacerun: yes">&nbsp; </SPAN>Position will manage 24 ED Rooms with 24/7 accountability<o:p></o:p></FONT></SPAN></FONT></P>
<P style="MARGIN: 0in 0in 0pt" class=MsoNormal><FONT size=3><SPAN style="FONT-FAMILY: Symbol; COLOR: black; mso-ascii-font-family: 'Times New Roman'">�</SPAN><SPAN style="COLOR: black"><FONT face="Times New Roman"> <SPAN style="mso-spacerun: yes">&nbsp;</SPAN>55 FTEs <o:p></o:p></FONT></SPAN></FONT></P>
  </job>
</jobs>

Result:

*** PARSER error: users/documents/jobstext.xml:13:150: not well-formed <invalid token>
users/documents/jobstext.xml:13:150: not well-formed <invalid token> What is the error actually >>>>

What is happening here? When the parser reaches the <P> tag at line 13, column 150, it reports an invalid token. I suspect this is because of the ? tag, as you can see in the error above.

Can anyone please tell me how to solve this "not well-formed <invalid token>" error in XML parsing?

I am sorry if I explained this in the wrong format, but I hope the problem is clear.

Edited Code:

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 10pt; COLOR: black; FONT-FAMILY: Arial">THE MOST COMPETITIVE RATES IN NM .....<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 10pt; COLOR: black; FONT-FAMILY: Arial">Busy <?xml:namespace prefix = st1 ns = "urn:schemas-microsoft-com:office:smarttags" /><st1:place w:st="on"><st1:PlaceName w:st="on">Acute</st1:PlaceName> <st1:PlaceName w:st="on">Care</st1:PlaceName> <st1:PlaceType w:st="on">Hospital</st1:PlaceType></st1:place> needs Occupational Therapists.&nbsp; Experience with </SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Ortho, Neuro, vestibular balance, aquatic a plus!<SPAN style="COLOR: black">&nbsp; New grads welcome.<SPAN style="mso-spacerun: yes">&nbsp; </SPAN>Signon Bonus and help with relocation.<SPAN style="mso-spacerun: yes">&nbsp; </SPAN>For more details please call or email Carole 800 995 2673 X1329 or <A href="mailto:cs@coremedicalgroup.com"><SPAN style="mso-bidi-font-weight: bold; mso-bidi-font-size: 12.0pt">cs@coremedicalgroup.com</SPAN></A><o:p></o:p></SPAN></SPAN></P>


1 Answer

Editorial Team · Answer 1 · 2026-06-08T05:44:30+00:00


Since the question has changed…

XML attributes must be quoted.

For example: class=MsoNormal should be class="MsoNormal"
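
A minimal sketch of the point above, assuming Python 3 and the standard-library xml.sax module. The helper name is_well_formed and the two one-line sample documents are made up for illustration; they are not taken from the feed in the question. It only shows that the same element is rejected with an unquoted attribute value and accepted once the value is quoted:

```python
import xml.sax
from xml.sax.handler import ContentHandler


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message).

    Hypothetical helper for illustration only.
    """
    try:
        # ContentHandler() is the do-nothing base handler; we only care
        # whether the parser accepts the input.
        xml.sax.parseString(xml_text.encode("utf-8"), ContentHandler())
        return True, None
    except xml.sax.SAXParseException as exc:
        return False, exc.getMessage()


# Same element, once with an unquoted and once with a quoted attribute value.
unquoted = '<P class=MsoNormal>55 FTEs</P>'
quoted = '<P class="MsoNormal">55 FTEs</P>'

print(is_well_formed(unquoted))  # first item is False: the parser rejects it
print(is_well_formed(quoted))    # (True, None)
```

Note that this only covers the attribute-quoting issue named in the answer. HTML fragments such as the ones in the question may contain other constructs that a strict XML parser rejects.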
