I am working on a scraping app and ran into a problem while trying to get it to work. For testing, I have replaced the original scraping target in the code below with Google's web page. It seems that my download doesn't get everything: I notice that the body and html tags are missing their closing tags. How do I get it to download everything? What's wrong with my sample code?
// Requires the System.IO, System.Net, System.Text and System.Web namespaces.
string filename = "test.html";
string searchTerm = HttpUtility.UrlEncode(textBox2.Text);
string data;
using (WebClient client = new WebClient())
{
    // WebClient appends these as ?q=...&hl=en to the request URL.
    client.QueryString.Add("q", searchTerm);
    client.QueryString.Add("hl", "en");
    data = client.DownloadString("http://www.google.com/search");
}
// Disposing the writer flushes and closes the file.
using (StreamWriter writer = new StreamWriter(filename, false, Encoding.Unicode))
{
    writer.Write(data);
}
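Google's web pages are now in HTML5, where the closing tags for BODY and HTML are optional and can be omitted, which is why Google leaves them out (believe it or not, it saves them bandwidth). Your download is therefore complete; the closing tags were never sent. See this article.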
You can write HTML5 in either "HTML/SGML" syntax (which allows certain closing tags to be omitted, as HTML did prior to XHTML) or "XHTML" syntax, which follows the rules of XML and requires all tags to be closed.
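To see the difference between the two syntaxes, here is a minimal sketch that runs markup through a strict XML parser. The class and method names (MarkupCheck, IsWellFormedXml) are made up for illustration, and the check only tests XML well-formedness, not whether the markup is valid HTML.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class MarkupCheck
{
    // Hypothetical helper: returns true if the markup parses as well-formed XML,
    // which is what the XHTML syntax requires.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for unclosed elements, among other XML errors.
            return false;
        }
    }
}
```

With this, MarkupCheck.IsWellFormedXml("<html><body><p>Hello</p>") should return false because the body and html elements are never closed, while the same markup with </body></html> appended should return true. A browser parsing the first string as text/html accepts it anyway.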
Which syntax the browser uses to parse the page depends on whether you send a Content-Type header of text/html (HTML/SGML syntax) or application/xhtml+xml (XHTML syntax). (Source: HTML5 syntax – HTML vs XHTML)
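If you want to check which syntax a page was served as, you can look at the Content-Type header of the response. The sketch below is only an illustration: SyntaxGuess and FromContentType are hypothetical names, and the mapping simply follows the rule described above.

```csharp
using System;

static class SyntaxGuess
{
    // Hypothetical helper: maps a Content-Type header value to the syntax
    // a browser would use to parse the page.
    public static string FromContentType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return "unknown";

        // Drop parameters such as "; charset=UTF-8".
        string mediaType = contentType.Split(';')[0].Trim();

        if (mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase))
            return "XHTML";
        if (mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase))
            return "HTML";
        return "unknown";
    }
}
```

After a call to client.DownloadString(...), the header should be available as client.ResponseHeaders["Content-Type"], so SyntaxGuess.FromContentType(client.ResponseHeaders["Content-Type"]) would tell you whether omitted closing tags are to be expected.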