I’m using libxml2 to parse HTML:
static htmlSAXHandler simpleSAXHandlerStruct = {
NULL, /* internalSubset */
NULL, /* isStandalone */
NULL, /* hasInternalSubset */
NULL, /* hasExternalSubset */
NULL, /* resolveEntity */
NULL, /* getEntity */
NULL, /* entityDecl */
NULL, /* notationDecl */
NULL, /* attributeDecl */
NULL, /* elementDecl */
NULL, /* unparsedEntityDecl */
NULL, /* setDocumentLocator */
NULL, /* startDocument */
NULL, /* endDocument */
NULL, /* startElement */
NULL, /* endElement */
NULL, /* reference */
charactersFoundSAX, /* characters */
NULL, /* ignorableWhitespace */
NULL, /* processingInstruction */
NULL, /* comment */
NULL, /* warning */
errorEncounteredSAX, /* error */
NULL, /* fatalError: unused, error() gets all the errors */
NULL, /* getParameterEntity */
NULL, /* cdataBlock */
NULL, /* externalSubset */
XML_SAX2_MAGIC, /* initialized */
NULL, /* _private */
startElementSAXP, /* startElementNs */
endElementSAXP, /* endElementNs */
NULL, /* serror */
};
The charactersFoundSAX and errorEncounteredSAX functions do get called, but the startElementSAXP and endElementSAXP functions never get called.
If I switch from parsing HTML to parsing XML (and rename every ‘html’ definition to its ‘xml’ counterpart, e.g. htmlSAXHandler to xmlSAXHandler), the functions do get called correctly.
Why is that?
HTML is not namespace aware and hence using just the startElementNs/endElementNs function slots in a SAX parser will result in your observed behaviour.
Simple fix: Fill in the startElement/endElement slots.
You can easily use wrappers to match the different signature and then call just the one underlying function in both XML and HTML mode.
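A minimal sketch of that wrapper idea, under a few assumptions: ParseState, handleStart, handleEnd, startElementSAX and endElementSAX are hypothetical names that are not in the question, and the context pointer the parser passes back is assumed to be your own state struct. The xmlChar typedef is only a stand-in so the snippet compiles without libxml2; in real code drop it and include <libxml/HTMLparser.h>. The parameter lists follow libxml2's startElementSAXFunc/endElementSAXFunc and startElementNsSAX2Func/endElementNsSAX2Func typedefs.

```c
#include <stdio.h>

/* Stand-in so this sketch compiles on its own. In real code, delete this
   typedef and include <libxml/HTMLparser.h> instead. */
typedef unsigned char xmlChar;

/* Hypothetical user state, assumed to be what the parser hands back as ctx. */
typedef struct {
    int depth;      /* current element nesting depth */
    int started;    /* number of start-element events seen */
    int ended;      /* number of end-element events seen */
    char last[64];  /* name of the most recently started element */
} ParseState;

/* The one underlying pair of functions, shared by HTML and XML mode. */
static void handleStart(ParseState *st, const xmlChar *name)
{
    st->depth++;
    st->started++;
    snprintf(st->last, sizeof st->last, "%s", (const char *)name);
}

static void handleEnd(ParseState *st, const xmlChar *name)
{
    (void)name;
    st->depth--;
    st->ended++;
}

/* SAX1-style wrappers: these go in the startElement/endElement slots,
   which are the ones the HTML parser calls. */
static void startElementSAX(void *ctx, const xmlChar *name,
                            const xmlChar **atts)
{
    (void)atts; /* attributes are ignored in this sketch */
    handleStart((ParseState *)ctx, name);
}

static void endElementSAX(void *ctx, const xmlChar *name)
{
    handleEnd((ParseState *)ctx, name);
}

/* SAX2-style wrappers: these stay in the startElementNs/endElementNs slots
   for XML mode. */
static void startElementSAXP(void *ctx, const xmlChar *localname,
                             const xmlChar *prefix, const xmlChar *URI,
                             int nb_namespaces, const xmlChar **namespaces,
                             int nb_attributes, int nb_defaulted,
                             const xmlChar **attributes)
{
    (void)prefix; (void)URI; (void)nb_namespaces; (void)namespaces;
    (void)nb_attributes; (void)nb_defaulted; (void)attributes;
    handleStart((ParseState *)ctx, localname);
}

static void endElementSAXP(void *ctx, const xmlChar *localname,
                           const xmlChar *prefix, const xmlChar *URI)
{
    (void)prefix; (void)URI;
    handleEnd((ParseState *)ctx, localname);
}
```

In the handler table from the question, startElementSAX and endElementSAX would replace the two NULL entries commented startElement and endElement. If you also need attributes, note that the two interfaces lay them out differently (name/value pairs in the SAX1 callback, five entries per attribute in the SAX2 one), so each wrapper would have to convert them before calling the shared function.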