I want to get the text of a page using HtmlAgilityPack. I have some code for this:
HtmlAgilityPack.HtmlWeb TheWebLoader = new HtmlWeb();
HtmlAgilityPack.HtmlDocument TheDocument = TheWebLoader.Load(textBox1.Text);
List<string> TagsToRemove = new List<string>() { "script", "style", "link", "br", "hr" };
var Strings = (from n in TheDocument.DocumentNode.DescendantsAndSelf()
               where !TagsToRemove.Contains(n.Name.ToLower())
               select n.InnerText).ToList();
textBox2.Lines = Strings.ToArray();
The problem is, it returns the content of the script tag too. I don’t know why that happens. Can anybody help me?
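Your problem comes from the fact that InnerText does not return what you expect. For any node, InnerText is the concatenated text of that node and all of its descendants, so every ancestor of a script element (html, head, the document root) still includes the script's text. The text node inside the script is also named #text, not script, so your filter does not exclude it.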
In:
It returns:
Then, for example, for the root node, doing document.DocumentNode.InnerText will give you all the text inside script, etc.
I suggest you remove all the tags you don't want:
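The answer's original snippet did not survive here, so the following is only a minimal sketch of one way to do this step. The class and method names (HtmlCleaner, RemoveUnwantedTags) and the default tag list are made up for illustration; Descendants() and Remove() are regular HtmlAgilityPack members.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class HtmlCleaner
{
    // Hypothetical helper: removes every element whose name is in tagsToRemove.
    public static void RemoveUnwantedTags(HtmlDocument document, params string[] tagsToRemove)
    {
        // Assumed default; adjust the list to whatever you want to drop.
        if (tagsToRemove == null || tagsToRemove.Length == 0)
            tagsToRemove = new[] { "script", "style" };

        // ToList() takes a snapshot first, so the tree is not modified
        // while Descendants() is still being enumerated.
        var unwanted = document.DocumentNode
            .Descendants()
            .Where(n => tagsToRemove.Contains(n.Name, StringComparer.OrdinalIgnoreCase))
            .ToList();

        foreach (var node in unwanted)
            node.Remove(); // detaches the element together with its child text
    }
}
```

With the code from the question, this could be called as HtmlCleaner.RemoveUnwantedTags(TheDocument); right after TheWebLoader.Load(...). Removing the elements themselves, rather than filtering by name in the query, also takes their text out of every ancestor's InnerText.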
Then to get the list of the text elements:
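Again, the original snippet is missing, so this is a sketch under assumptions rather than the answerer's exact code. It assumes the unwanted elements have already been removed from the document, and the names TextExtractor and GetTextLines are hypothetical. The idea is to select only the text nodes, so each piece of text is returned once instead of being repeated for every ancestor.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class TextExtractor
{
    // Hypothetical helper: returns the non-empty text nodes of the document, in order.
    public static List<string> GetTextLines(HtmlDocument document)
    {
        return document.DocumentNode
            .Descendants()
            // Keep only text nodes; element nodes would repeat their children's text.
            .Where(n => n.NodeType == HtmlNodeType.Text
                        && !string.IsNullOrWhiteSpace(n.InnerText))
            // Decode entities such as &amp; and trim surrounding whitespace.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .ToList();
    }
}
```

In the question's form this would replace the LINQ query, for example: textBox2.Lines = TextExtractor.GetTextLines(TheDocument).ToArray();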