Please note: I do not want to read the HTML source of a page; I want to read the text that the page displays. Consider the following example:
A PHP script echoes “Hello User X” onto the current page, so that the user is now looking at a mostly blank page with the words “Hello User X” printed in the top left corner. From my C# application, I would like to read that text into a string.
String strPageData = functionToReadPageData("http://www.myURL.com/file.php");
Console.WriteLine(strPageData); // Outputs "Hello User X" to the Console.
In VB6 I was able to do this by using the following WinINet API functions:
- InternetOpen
- InternetOpenUrl
- InternetReadFile
- InternetCloseHandle
I attempted to port my VB6 code to C# but I am having no luck – so I would very much appreciate a C# method for completing the above task.
I am not aware of any part of the .NET Framework that lets you automagically extract all the text from an HTML file. I very much doubt it exists.
You can try the HtmlAgilityPack (third party) for accessing text elements etc. in an HTML document.
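That said, in the specific case described in the question, where the PHP script echoes plain text with no markup, the response body is already the text, so downloading it as a string may be all that is needed. Below is a minimal sketch using `System.Net.WebClient`; the class name `PageReader` is an assumption for illustration, and `FunctionToReadPageData` is named after the placeholder in the question. On current .NET versions `WebClient` is marked obsolete in favour of `HttpClient`, but it still works and is the closest match to the WinINet calls.

```csharp
using System;
using System.Net;

public static class PageReader
{
    // Hypothetical helper named after the placeholder in the question.
    // Returns the raw response body, which is the visible text only
    // when the server sends plain text with no HTML markup.
    public static string FunctionToReadPageData(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    public static void Main()
    {
        // URL taken from the question; it is a placeholder, not a real endpoint.
        string strPageData = FunctionToReadPageData("http://www.myURL.com/file.php");
        Console.WriteLine(strPageData);
    }
}
```

If the response does contain markup, the string will include the tags, and you are back to needing an HTML parser.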
You will still need to write logic to find the correct HTML element, though.
For a page like the one in your example, you would need to locate the body tag with an XPath expression (for instance //body) and read its content.
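A rough sketch of locating the body tag and reading its text with the HtmlAgilityPack might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `BodyTextExample`, the method name `ExtractBodyText`, and the sample markup are made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // third-party package, not part of .NET

public static class BodyTextExample
{
    // Hypothetical helper: returns the text inside <body>, or an empty
    // string when the document has no body element.
    public static string ExtractBodyText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null)
        {
            return string.Empty;
        }

        // InnerText drops the tags; DeEntitize turns entities such as
        // &amp; back into the characters they stand for.
        return HtmlEntity.DeEntitize(body.InnerText).Trim();
    }

    public static void Main()
    {
        // Sample markup invented for this sketch.
        string html = "<html><head><title>Greeting</title></head>"
                    + "<body>Hello User X</body></html>";
        Console.WriteLine(ExtractBodyText(html));
    }
}
```

To read a live page instead of a string, `new HtmlWeb().Load(url)` returns an `HtmlDocument` that can be queried the same way.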
Following that pattern, you can read every element on the page. You might need to do some post-processing to remove line breaks, comments, etc.
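As an illustration of that kind of clean-up, here is a small sketch that uses only the base class library. It decodes HTML entities and collapses runs of whitespace (including line breaks) into single spaces; it does not deal with comments, which are easier to drop while walking the parsed document. The names `TextCleanup` and `Normalize` are assumptions for this example.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TextCleanup
{
    // Hypothetical helper: tidies text that was extracted from HTML.
    public static string Normalize(string rawText)
    {
        // Turn entities such as &amp; into the characters they represent.
        string decoded = WebUtility.HtmlDecode(rawText);

        // Replace every run of whitespace, line breaks included, with one space.
        string collapsed = Regex.Replace(decoded, @"\s+", " ");

        return collapsed.Trim();
    }

    public static void Main()
    {
        // Prints: Hello User X & friends
        Console.WriteLine(Normalize("  Hello\r\n   User X &amp; friends \n"));
    }
}
```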
http://htmlagilitypack.codeplex.com/wikipage?title=Examples