If I’m parsing something like the Apache website, I have one specific problem that

Question

Asked: May 25, 2026 (2026-05-25T18:28:26+00:00)

If I’m parsing something like the Apache website, I have one specific problem that seems to be hard to get around. When the parser gets to HTML like:

<li><a href="http://xml.apache.org/" title="XML solutions focused on the web">XML</a></li> 
<li><a href="http://xmlbeans.apache.org/" title="XML-Java binding tool">XMLBeans</a></li> 
<li><a href="http://xmlgraphics.apache.org/" title="Conversion from XML to graphical output">XML Graphics</a></li>

It fails. The problem seems to be that I call the PHP strip_tags function, which correctly removes all HTML tags. The result I expected was:

XMLXMLBeansXML Graphics

This result would be the string generated by taking this text and simply deleting all tags. Fortunately (in one way), strip_tags actually seems to space out the text correctly, giving:

XML XMLBeans XML Graphics

Here’s my problem: when I tokenize this string by spaces (e.g. by passing " " as the second argument to strtok), these words aren’t split up. The whole website gets tokenized correctly, except for this one spot. Does anyone know what character may be between the words when strip_tags is finished with them, so I can tokenize by it?

1 Answer

Editorial Team · Answer 1 · 2026-05-25T18:28:27+00:00

You could probably try something like

$html = preg_replace('/(>?)\s+</', '\1<', $html);

to strip any whitespace before a tag, but I would not rely on this. You should consider using a real HTML parser for this task, or reconsider whether the parser you have in mind really produces what you want.
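As for which character sits between the words: strip_tags removes only the tags, so whatever whitespace separated the <li> lines in the source (most likely a newline, possibly with a trailing space) is still there afterwards. A minimal sketch, assuming the markup is laid out one <li> per line as in the question, is to tokenize on any whitespace rather than on a single space. The variable names ($text, $delims, $words, $tokens) are only illustrative.

```php
<?php
// Sample markup from the question: one <li> per line, so the items are
// separated by newlines rather than spaces.
$html = <<<'MARKUP'
<li><a href="http://xml.apache.org/" title="XML solutions focused on the web">XML</a></li>
<li><a href="http://xmlbeans.apache.org/" title="XML-Java binding tool">XMLBeans</a></li>
<li><a href="http://xmlgraphics.apache.org/" title="Conversion from XML to graphical output">XML Graphics</a></li>
MARKUP;

// strip_tags drops the tags but keeps the newlines between the lines.
$text = strip_tags($html);

// Option 1: strtok treats every character of its second argument as a
// delimiter, so list space, tab, carriage return and newline together.
$delims = " \t\r\n";
$words = [];
for ($tok = strtok($text, $delims); $tok !== false; $tok = strtok($delims)) {
    $words[] = $tok;
}

// Option 2: split on any run of whitespace and drop empty pieces.
$tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
print_r($tokens);
```

Both options split "XML Graphics" into two words, which is the usual caveat of tokenizing stripped text: the boundaries between list items are lost. If those boundaries matter, that is where a real HTML parser becomes the better choice.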
