I am writing a DownloadString function to retrieve HTML data (since WebClient is quite slow =/).
Here’s what I have so far:
public static string DownloadString(string url)
{
    TcpClient client = new TcpClient();
    client.Client.ReceiveTimeout = 5;
    string dns = UrlToDNS(url);
    byte[] buffer = new byte[51200];
    client.Client.Connect(dns, 80);
    string getVal = url.Substring(url.IndexOf(dns) + dns.Length);
    string HTTPHeader = "GET " + getVal + " HTTP/1.1\nHost: " + dns + "\nConnection: close\nUser-Agent: Pastebin API 0.1\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\nAccept-Charset: ISO-8859-1,UTF-8;q=0.7,*;q=0.7\nCache-Control: no-cache\nAccept-Language: en;q=0.7,en-us;q=0.3\n\n";
    client.Client.Send(s2b(HTTPHeader));
    client.Client.Receive(buffer);
    return b2s(buffer);
}

private static string b2s(byte[] ba)
{
    string ret = "";
    foreach (byte b in ba)
        ret += Convert.ToChar(b);
    return ret;
}
(s2b is not shown; it isn’t relevant, since the HTTP server returns OK.)
However, when I run the code (with http://www.google.com/ as a test), it seems that some of the data is dropped or never read:
HTTP/1.1 200 OK
Date: Sat, 20 Aug 2011 15:18:28 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=3714446c9ffb56bf:FF=0:TM=1313853508:LM=1313853508:S=mu1XpTcwqFTwgwJM; expires=Mon, 19-Aug-2013 15:18:28 GMT; path=/; domain=.google.com
Set-Cookie: NID=50=B8YKlYj7eK84obqC5YO10AKF9jJNcQ5w4NkzidRL9of0Sc24EpbWeP-w7HVfm-eBCfE2NX2QMZAfEBpsqsgjhWqylFUIXU-bs6ObkLQbXJ59sa_daivfBLYJkQvq_WH; expires=Sun, 19-Feb-2012 15:18:28 GMT; path=/; domain=.google.com; HttpOnly
Server: gws
X-XSS-Protection: 1; mode=block
Connection: close
<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><meta name="description" content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."><meta name="robots" content="noodp"><title>Google</title><script>window.google={kEI:"RNBPTvPcI5C_gQeywpHfBg",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute("eid"))))a=a.parentNode;return b||google.kEI},kEXPI:"28936,29049,29774,30465,30542,31760",kCSI:{e
To add another complication, the amount of dropped data varies from run to run: sometimes only a small amount is lost, and sometimes (as in the example) a larger amount.
Any ideas on what is causing it? (Or is there a better way to retrieve the source of a web page without WebClient?)
(Also, please ignore the fact that the input and output data haven’t been sanitized.)
You should use WebClient.DownloadString. I very much doubt that this method is what is slow and causing your performance problems. But if you want to reinvent the wheel, here’s a cleaner approach:
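A minimal sketch of such an approach, assuming plain HTTP, a server that closes the connection after responding, and an ISO-8859-1 response. The names `SimpleHttp`, `BuildRequest`, and `ReadToEnd` are illustrative and not part of any library. It sends an HTTP/1.0 request (so the server should not answer with chunked transfer encoding), uses CRLF line endings, sets the timeout in milliseconds, and reads in a loop until the server closes the connection, because a single `Receive` or `Read` call returns only the bytes that have arrived so far:

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

public static class SimpleHttp
{
    // Builds a minimal GET request. HTTP expects CRLF line endings
    // and an empty line terminating the header block.
    public static string BuildRequest(string host, string path)
    {
        return "GET " + path + " HTTP/1.0\r\n" +
               "Host: " + host + "\r\n" +
               "Connection: close\r\n" +
               "\r\n";
    }

    // Reads until Read returns 0 (the remote side closed the connection),
    // keeping only the bytes actually received on each call.
    public static byte[] ReadToEnd(Stream stream)
    {
        var buffer = new byte[8192];
        using (var result = new MemoryStream())
        {
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                result.Write(buffer, 0, read);
            }
            return result.ToArray();
        }
    }

    // Returns the raw response (status line, headers, and body) as a string.
    public static string DownloadString(string url)
    {
        var uri = new Uri(url);
        using (var client = new TcpClient())
        {
            client.ReceiveTimeout = 5000; // milliseconds, not seconds
            client.Connect(uri.Host, uri.Port);
            using (NetworkStream stream = client.GetStream())
            {
                byte[] request = Encoding.ASCII.GetBytes(
                    BuildRequest(uri.Host, uri.PathAndQuery));
                stream.Write(request, 0, request.Length);

                byte[] response = ReadToEnd(stream);
                // Assumption: the response is ISO-8859-1, as in the question's output.
                return Encoding.GetEncoding("ISO-8859-1").GetString(response);
            }
        }
    }
}
```

Calling `SimpleHttp.DownloadString("http://www.google.com/")` would return the headers followed by the page body in one string.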
Obviously, this code doesn’t follow HTTP redirects from the server. It is very basic. Much more would be required to get all the functionality you get from WebClient.DownloadString.
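As one example of that extra work, a caller would have to split the raw response into status line, headers, and body, and check for a 3xx status with a `Location` header before issuing a second request. A rough sketch, assuming the whole response is already in a string; `ParsedResponse` and `HttpResponseParser` are hypothetical names, and chunked encoding, header folding, and repeated headers are not handled:

```csharp
using System;
using System.Collections.Generic;

public sealed class ParsedResponse
{
    public int StatusCode;
    // Header names are case-insensitive in HTTP.
    public Dictionary<string, string> Headers =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    public string Body = "";

    // True for a 3xx response that names a new location.
    public bool IsRedirect
    {
        get { return StatusCode >= 300 && StatusCode < 400 && Headers.ContainsKey("Location"); }
    }
}

public static class HttpResponseParser
{
    public static ParsedResponse Parse(string raw)
    {
        // Headers and body are separated by an empty line (CRLF CRLF).
        int split = raw.IndexOf("\r\n\r\n", StringComparison.Ordinal);
        if (split < 0) throw new FormatException("No header/body separator found.");

        var result = new ParsedResponse();
        result.Body = raw.Substring(split + 4);

        string[] lines = raw.Substring(0, split)
                            .Split(new[] { "\r\n" }, StringSplitOptions.None);

        // Status line looks like "HTTP/1.1 200 OK".
        string[] status = lines[0].Split(' ');
        if (status.Length < 2) throw new FormatException("Malformed status line.");
        result.StatusCode = int.Parse(status[1]);

        for (int i = 1; i < lines.Length; i++)
        {
            int colon = lines[i].IndexOf(':');
            if (colon <= 0) continue; // skip lines that are not "Name: value"
            string name = lines[i].Substring(0, colon).Trim();
            string value = lines[i].Substring(colon + 1).Trim();
            result.Headers[name] = value; // for repeated headers, the last one wins
        }
        return result;
    }
}
```

If `Parse(...).IsRedirect` is true, the caller could request the URL in `Headers["Location"]` instead, ideally with a cap on the number of hops to avoid redirect loops.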