If I’m parsing something like the Apache website, I have one specific problem that seems to be hard to get around. When the parser gets to HTML like:
<li><a href="http://xml.apache.org/" title="XML solutions focused on the web">XML</a></li>
<li><a href="http://xmlbeans.apache.org/" title="XML-Java binding tool">XMLBeans</a></li>
<li><a href="http://xmlgraphics.apache.org/" title="Conversion from XML to graphical output">XML Graphics</a></li>
It fails. I call PHP's strip_tags function, which correctly removes all the HTML tags. If it had worked the way I expected, the result would have been:
XMLXML BeansXML Graphics
This result would be the string generated by taking this text and simply deleting all tags. Fortunately (in one way), strip_tags actually seems to space out the text correctly, giving:
XML XML Beans XML Graphics
Here’s my problem: when I tokenize this string by spaces (e.g. by passing " " as the second argument to strtok), these words aren’t split up. The whole website gets tokenized correctly, except for this one spot. Does anyone know what character may be between the words when strip_tags is finished with them, so I can tokenize by it?
You could probably try stripping any whitespace that sits before a tag (for instance with a regular-expression replace) before calling strip_tags, but I would not rely on this. You should consider using a real HTML parser for this task, or think again about whether the parser you have in mind really produces what you want.
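For what it's worth, here is a rough sketch of how you might check which character is actually sitting between the words, and how to tokenize regardless of it. It assumes the page source has a line break between the list items (as the quoted markup suggests), so the separator left behind by strip_tags would be a newline rather than a space. The $html sample and the variable names are made up for illustration, and the preg_replace pattern is only my guess at the "whitespace before a tag" idea, not a tested recipe for arbitrary pages.

```php
<?php
// Hypothetical sample input, modelled on the list markup quoted in the question.
// Assumption: the real page has a newline between the <li> elements.
$html = "<li><a href=\"http://xml.apache.org/\">XML</a></li>\n"
      . "<li><a href=\"http://xmlbeans.apache.org/\">XMLBeans</a></li>\n"
      . "<li><a href=\"http://xmlgraphics.apache.org/\">XML Graphics</a></li>";

// strip_tags removes the tags but leaves the text between them untouched,
// including the line breaks that separated the list items.
$text = strip_tags($html);

// Make invisible characters visible: json_encode escapes a newline as \n.
$visible = json_encode($text);
echo $visible, PHP_EOL;

// Option 1: split on any run of whitespace (space, tab, newline, ...).
$tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

// Option 2: keep using strtok, but pass every whitespace character
// you want to treat as a delimiter, not just the space.
$delimiters = " \t\r\n";
$viaStrtok = [];
for ($tok = strtok($text, $delimiters); $tok !== false; $tok = strtok($delimiters)) {
    $viaStrtok[] = $tok;
}

// Option 3 (a guess at the regex idea): turn any whitespace directly
// before a tag into a single space, then strip the tags.
// This collapses rather than deletes, so neighbouring words stay separated.
$normalized = strip_tags(preg_replace('/\s+</', ' <', $html));

print_r($tokens);
echo $normalized, PHP_EOL;
```

If the dump shows something other than \n (a tab, \r\n, or a non-breaking space from an &nbsp; entity), add that character to the delimiter list. Note that a UTF-8 non-breaking space is not matched by \s unless you use the /u modifier, so that case needs extra care.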