I have an HTML fragment that is not well-formed.
Example:
<div id='notrequired'>
<div>
<h3>Some examples :-)</h3>
STL is a library, not a framework.
</div>
</p>
</a>
<a target='_blank' href='http://en.wikipedia.org/wiki/Library_%28computing%29'>Read more</a>;
</div>
<a target='_blank' href='http://en.wikipedia.org/wiki/Library_%28computing%29'>Read more</a>
As you can see, the fragment contains stray </p> and </a> closing tags.
I tried a snippet that removes the <div id='notrequired'> element. The removal works, but I cannot get exactly the output I want.
Here is the snippet:
function DOMRemove(DOMNode $from) {
    $from->parentNode->removeChild($from);
}
$dom = new DOMDocument();
@$dom->loadHTML($text); //$text contains the above mentioned HTML
$selection = $dom->getElementById('notrequired');
if($selection == NULL){
    $text = $dom->saveXML();
}else{
    $refine = DOMRemove($selection);
    $text = $dom->saveXML($refine);
}
The problem is that $dom->saveXML() returns a complete document, including the XML declaration, the doctype, and the <html> and <body> wrappers:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<a target="_blank" href="http://en.wikipedia.org/wiki/Library_%28computing%29">Read more</a>
</body>
</html>
All I need is:
<a target='_blank' href='http://en.wikipedia.org/wiki/Library_%28computing%29'>Read more</a>
without the <html> and <body> tags.
What am I missing? Is there a better way to do this?
OK, I think I have figured out a solution. The approach may not be the right one, but it does the job!
Hakre pointed out that this is a duplicate of "innerHTML in PHP’s DomDocument?". It is not an exact duplicate, but it gave me the idea I needed. Thanks for the suggestion.
It helped me put together the solution below:
If the input is:
The output, as expected, is:
The solution works but may not be optimal. Any thoughts?
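For reference, the innerHTML-style idea from the linked question could be sketched roughly as follows. This is only an illustration under stated assumptions, not necessarily the code used in this post: the helper names innerHTML and removeById are hypothetical, and it assumes PHP 7+ with the DOM extension enabled. The key point is that saveHTML() accepts a node argument and serializes only that node, so the doctype and the <html>/<body> wrappers are never emitted.

```php
<?php
// Hypothetical helper (not from the original post): serialize only the
// children of a node, similar to reading innerHTML in a browser.
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes just that node, without the
        // doctype or the <html>/<body> wrappers.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Hypothetical helper: remove the element with the given id from an HTML
// fragment and return whatever markup is left inside <body>.
function removeById(string $text, string $id): string
{
    $dom = new DOMDocument();

    // Collect parser warnings (e.g. for stray closing tags) internally
    // instead of silencing them with the @ operator.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($text);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $dom->getElementById($id);
    if ($node !== null) {
        $node->parentNode->removeChild($node);
    }

    // loadHTML() wraps fragments in <html><body>; read the body's content.
    $body = $dom->getElementsByTagName('body')->item(0);
    return $body === null ? '' : trim(innerHTML($body));
}
```

Calling removeById($text, 'notrequired') on the fragment from the question should then return only the markup that remains after the div is removed. Note that the parser normalizes the markup, so attribute quoting in the result may differ from the input.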