I’m building a small parser that scrapes web pages and logs the data on them. One of the things I log is the title of each forum post. I’m using an XML parser to look through the DOM for this information, and I’m storing it like this:
// Strip out the post's title
$title = $page->find('a[rel=bookmark]', 0);
$title = htmlspecialchars_decode(html_entity_decode(trim($title->plaintext)));
This works for the most part, but some posts contain special HTML character codes such as &#8211;, which is a dash (–). How would I go about converting these character codes back into the characters they represent?
Thanks.
Use html_entity_decode. Here’s a quick example.
You should see…
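As a rough sketch of the html_entity_decode suggestion (not the answerer's original example, which is missing here): the sample string in $raw is made up for illustration and stands in for $title->plaintext from the question. The sketch assumes PHP 5.4 or later (for ENT_HTML5) and that you want UTF-8 output. Passing the charset explicitly matters because older PHP versions default to ISO-8859-1, which has no en dash, so &#8211; would be left undecoded.

```php
<?php
// Hypothetical sample input; in the question this would be $title->plaintext.
$raw = '  Foo &#8211; Bar &amp; Baz  ';

// Decode once, asking for UTF-8 so that characters outside Latin-1
// (such as the en dash behind &#8211;) can be represented.
$title = html_entity_decode(trim($raw), ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $title, PHP_EOL; // prints: Foo – Bar & Baz
```

With ENT_QUOTES set, html_entity_decode already covers everything htmlspecialchars_decode handles, so the extra htmlspecialchars_decode call in the question is probably unnecessary, and decoding twice can over-decode text such as &amp;lt;.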