Hi, I am currently using xml.sax.handler to parse XML files.
Below is my file.xml:
<?xml version="1.0" encoding="utf-8"?>
<sturp>
<gear>
<UL>
<LI><I>Free Private Housing or a Generous Housing Allowance</I></LI>
<LI><I>$50K in Free Life Insurance coverage</I></LI>
</UL>
<P style="MARGIN: 0in 0in 0pt" class="MsoNormal"><FONT size="3"><SPAN style="FONT-FAMILY: Symbol; COLOR: black; mso-ascii-font-family: 'Times New Roman'">�</SPAN><SPAN style="COLOR: black"><FONT face="Times New Roman"><SPAN style="mso-spacerun: yes"> </SPAN>Position will manage 24 ED Rooms with 24/7 accountability<o:p></o:p></FONT></SPAN></FONT></P>
<DIV> </DIV>
</gear>
</sturp>
Below is my code:
import xml.sax

xmlFilePath = 'user/documents/file.xml'
try:
    parser = xml.sax.make_parser()
    handler = FeedHandler(conn, clientSiteId, clientId, documentElementName, jobElementName)
    handler.setMapping(mapping)
    parser.setContentHandler(handler)
    parser.setEntityResolver(handler)
    parser.parse(open(xmlFilePath))
except xml.sax.SAXParseException as e:
    print("*** PARSER error: %s" % e)
Output:
*** PARSER error: user/documents/file.xml:8:150: not well-formed (invalid token)
*** PARSER error: user/documents/file.xml:9:1: not well-formed (invalid token)
The source XML file given to me is not valid XML, but I still need to parse it.
How can I strip entities and the � character from the XML file (and get past any other errors and invalid XML tokens) before feeding it to the parser in the code above?
Thanks in advance.
If you're simply looking to replace &[a-z]+; entities in your input, you could use my hacked-up solution below. But note that you should still give the parser a valid XML file if you want it to work correctly.
For the parser:
Result
Untested code.
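For illustration, here is a minimal sketch of that kind of pre-filtering. It is not the original answer's code, and sanitize_xml, TextCollector and parse_dirty are hypothetical names. It assumes the file is meant to be UTF-8, drops bytes that do not decode, replaces every named entity other than the five that XML predefines with a space, and removes characters that XML 1.0 does not allow. It does not repair other well-formedness problems such as unclosed tags or bare ampersands.

```python
import re
import xml.sax

# The only named entities XML defines without a DTD.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Named entity references such as &nbsp; (numeric ones like &#160; are left alone).
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Anything outside the XML 1.0 Char production, plus U+FFFD,
# which usually marks text that was already damaged upstream.
ILLEGAL_RE = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFC\U00010000-\U0010FFFF]"
)


def sanitize_xml(raw_bytes, encoding="utf-8"):
    """Return text with undefined entities and illegal characters removed."""
    # errors="ignore" silently drops bytes that are invalid in `encoding`.
    text = raw_bytes.decode(encoding, errors="ignore")
    # Keep the predefined entities; replace any other named entity by a space.
    text = ENTITY_RE.sub(
        lambda m: m.group(0) if m.group(1) in PREDEFINED else " ", text
    )
    return ILLEGAL_RE.sub("", text)


class TextCollector(xml.sax.ContentHandler):
    """Stand-in handler that just gathers character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_dirty(raw_bytes, handler):
    """Clean the input, then hand it to the SAX parser as UTF-8 bytes."""
    clean = sanitize_xml(raw_bytes)
    xml.sax.parseString(clean.encode("utf-8"), handler)
```

With the code from the question, the idea would be to read the file in binary mode, for example open(xmlFilePath, 'rb').read(), and pass the result to parse_dirty together with the FeedHandler instance instead of calling parser.parse on the raw file. Because the cleaned text is re-encoded as UTF-8, this only lines up if the XML declaration also says utf-8, as it does in file.xml.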