I am working on a web crawler, and I am using the WebBrowser control for this purpose. I have a list of URLs stored in a database, and I want to visit those URLs one by one and parse the HTML of each page.
I used the following logic:
foreach (string href in hrefs)
{
    webBrowser1.Url = new Uri(href);
    webBrowser1.Navigate(href);
}
I want to do some work in the “webBrowser1_DocumentCompleted” event handler once each page has loaded completely. But the handler never runs while the loop is executing. It only runs after the last URL in “hrefs” has been navigated to and the loop has exited.
What’s the best way to handle this problem?
Store the list somewhere in your state, as well as the index of where you’ve got to. Then in the DocumentCompleted event, parse the HTML and then navigate to the next page.
(Personally I wouldn’t use the WebBrowser control for web crawling… I know it means it’ll handle the JavaScript for you, but it’ll be a lot harder to parallelize nicely than using multiple WebRequest or WebClient objects.)
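For what it's worth, the "list plus index" idea could look roughly like the sketch below. It is only an illustration under some assumptions: the form class CrawlerForm, the helpers NavigateToNext and ParsePage, the currentIndex field, and the example URLs are all made up here and are not from the question. The point is that the loop is gone: each navigation is started only after the previous page has raised DocumentCompleted.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical form that visits a list of URLs one at a time.
public class CrawlerForm : Form
{
    private readonly WebBrowser webBrowser1 = new WebBrowser();
    private readonly List<string> hrefs;   // the URL list, kept as state
    private int currentIndex = -1;         // index of the page being loaded

    public CrawlerForm(IEnumerable<string> urls)
    {
        hrefs = new List<string>(urls);
        webBrowser1.Dock = DockStyle.Fill;
        webBrowser1.ScriptErrorsSuppressed = true;
        webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
        Controls.Add(webBrowser1);

        // Start the first navigation once the form is up and pumping messages.
        Load += (sender, e) => NavigateToNext();
    }

    // Advance the index and start loading the next URL, if there is one.
    private void NavigateToNext()
    {
        currentIndex++;
        if (currentIndex < hrefs.Count)
        {
            webBrowser1.Navigate(hrefs[currentIndex]);
        }
    }

    private void webBrowser1_DocumentCompleted(object sender,
        WebBrowserDocumentCompletedEventArgs e)
    {
        // DocumentCompleted can fire once per frame; a common heuristic is to
        // act only when the event URL matches the top-level document URL.
        if (!e.Url.Equals(webBrowser1.Url))
        {
            return;
        }

        ParsePage(webBrowser1.DocumentText);
        NavigateToNext();
    }

    // Placeholder for whatever parsing the crawler actually needs.
    private void ParsePage(string html)
    {
        Console.WriteLine("Loaded {0}: {1} characters",
            hrefs[currentIndex], html.Length);
    }

    [STAThread]
    public static void Main()
    {
        // Example URLs only; in the question these come from a database.
        var urls = new[] { "http://example.com/", "http://example.org/" };
        Application.Run(new CrawlerForm(urls));
    }
}
```

The original loop fails because Navigate returns immediately and the event is delivered through the UI message loop, which cannot run until the foreach has finished. Driving the next navigation from the handler avoids that. Pages that fail to load or that never match the URL check would stall this sketch, so a real crawler would probably want a timeout as well.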