I fetch webpages in order to feed data to my application. However, the pages contain a lot of images which I don’t require at all. I only need the text data.
My problem is that the web requests take an unacceptable amount of time. I think the images also are fetch during a web request. Is there any way to eliminate the images and download only the text data?
The following is the code that I am using currently.
var httpWebRequest = HttpWebRequest.Create(url) as HttpWebRequest;
httpWebRequest.Method = "GET";
httpWebRequest.ProtocolVersion = HttpVersion.Version11;
httpWebRequest.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
httpWebRequest.Proxy = null;
httpWebRequest.KeepAlive = true;
httpWebRequest.Accept = "text/html";
string responseString = null;
var httpWebResponse = httpWebRequest.GetResponse() as HttpWebResponse;
using (var responseStream = httpWebResponse.GetResponseStream())
{
using (var streamReader = new StreamReader(responseStream))
{
responseString = streamReader.ReadToEnd();
}
}
Also, any other optimization suggestions are most welcome.
That is incorrect.
HttpWebRequestdoes not know anything about HTML or images; it just sends raw HTTP requests.You can use Fiddler to see exactly what’s going on.