I have started learning C# recently. MSDN has an example where you build an RSS application by fetching the XML file directly, so I tried something of my own and, like most times, I didn't get it right. Put the sigh sound here.
As the pages are HTML, I looked for HTML-to-XHTML converters and found this one (which is pretty interesting), called HTML-Cleaner.
It replaces unwanted tags with a <dd> tag, but I wish to skip those tags, so I made a modification of my own:
    public override bool Read()
    {
        bool status = base.Read();
        if (status)
        {
            if (base.NodeType == XmlNodeType.Element)
            {
                dowrite = false;
                // Keep the element if it is in AllowedTags, or if it is
                // the first "html" element encountered.
                foreach (string line in AllowedTags)
                {
                    if (base.Name == line ||
                        (base.Name == "html" && first == false))
                    {
                        dowrite = true;
                        first = true;
                    }
                }
                // Got a node with prefix. This must be one of those "<o:p>"
                // or something else. Skip this node entirely. We want prefix-
                // less nodes so that the resultant XML requires no namespace.
                if (base.Name.IndexOf(':') > 0)
                    dowrite = false;
                // Note: Skip() passes over the element and everything inside it.
                if (!dowrite)
                    base.Skip();
            }
        }
        return status;
    }
The problem is it only prints one <dd> tag and nothing else. Even if allowed tags are present, it skips them.
Why is this happening? Any help will be greatly appreciated. If you have alternative approaches, please feel free to suggest them.
EDIT: Is there any way to achieve this?
It looks like the Read method returns XML nodes, not tags, so the entire contents of any non-matching node will be dropped.
If the input is a typical HTML file, at some point during the recursive Read method the 'head' element will be found. This is not in the AllowedTags list, so it and all its descendent nodes will be skipped.
The same applies to the body element. It and all its descendents will be skipped.
That leaves the html element, which matches in your code and so gets inserted into the XML DOM.
Since html is not in the AllowedTags list, during the HTMLWriter phase the html tags will get converted to dd tags, which is what you describe as your output.
I actually don't go a bundle on the html2xhtmlcleaner code, but I think you need to adapt the writer code rather than the reader code to achieve what you are trying to do.
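To make the difference concrete, here is a minimal sketch using only the standard System.Xml classes, not HTML-Cleaner's own reader and writer. The class name TagFilterSketch and both method names are made up for illustration. VisitedWithSkip mimics the reader-side filtering from the question and shows that Skip() drops allowed descendants too; Unwrap shows one possible writer-side alternative that omits only the disallowed tags and keeps what is inside them. It assumes well-formed input, prefix-less allowed names, and it drops attributes for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper for illustration; not part of HTML-Cleaner.
public static class TagFilterSketch
{
    // Reader-side filtering: call Skip() on every element outside `allowed`
    // and return the names of the elements that are still visited.
    public static List<string> VisitedWithSkip(string xml, ICollection<string> allowed)
    {
        var visited = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            bool more = reader.Read();
            while (more)
            {
                if (reader.NodeType == XmlNodeType.Element && !allowed.Contains(reader.Name))
                {
                    // Skip() moves past the whole element, children included,
                    // and leaves the reader on the node that follows it.
                    reader.Skip();
                    continue; // examine the node we landed on before reading on
                }
                if (reader.NodeType == XmlNodeType.Element)
                    visited.Add(reader.Name);
                more = reader.Read();
            }
        }
        return visited;
    }

    // Writer-side filtering: read every node, but only write the start and
    // end tags of allowed elements. Text is always copied through.
    public static string Unwrap(string xml, ICollection<string> allowed)
    {
        var output = new StringBuilder();
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            ConformanceLevel = ConformanceLevel.Fragment // root tag may be dropped
        };
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            // For each open element, remembers whether its start tag was written.
            var written = new Stack<bool>();
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        bool keep = allowed.Contains(reader.Name);
                        bool empty = reader.IsEmptyElement; // e.g. <br/>, no end tag follows
                        if (keep) writer.WriteStartElement(reader.Name);
                        if (empty) { if (keep) writer.WriteEndElement(); }
                        else written.Push(keep);
                        break;
                    case XmlNodeType.EndElement:
                        if (written.Pop()) writer.WriteEndElement();
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                }
            }
        }
        return output.ToString();
    }
}
```

With the input <html><body><div><p>hi</p><span><b>x</b></span></div></body></html> and the allowed set { "html", "p", "b" }, VisitedWithSkip returns only "html", because skipping body throws away p and b along with it. Unwrap on the same input returns <html><p>hi</p><b>x</b></html>. Note that unwrapping also keeps the text of dropped elements (a title inside head, for example), so if that content should disappear you would still want to skip those particular elements.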