I need to retrieve a value from a website (the site can vary and I have no control over it). I currently have some code that works… but it takes a very long time to run. I am sure there is a much better way of doing this, I just don’t know what it is.
I have considered several alternatives like Regex and the HTMLAgilityPack (seems complex and possibly overkill?) but without trying each of them I am not sure what would be most efficient. And I am sure there are many more possibilities as well.
The problem may even be with how I am retrieving the page rather than how I am processing it.
Dim GETURL As WebRequest
GETURL = WebRequest.Create("http://www.example.com")
Dim objStream As Stream = GETURL.GetResponse.GetResponseStream()
Dim objReader As New StreamReader(objStream)
Dim sLine As String = ""
Dim a As Integer = 0
Dim result As String = ""
Do While Not sLine Is Nothing
a += 1
sLine = objReader.ReadLine
If Not sLine Is Nothing Then
result += sLine
End If
Loop
Dim startTag As String = "<some html tag>"
Dim endTag As String = "<closing tag>"
Dim firstIndex As Integer = result.IndexOf(startTag) + startTag.Length
result = result.Substring(firstIndex, result.Length - firstIndex)
Dim RequiredVal As String = result.Substring(0, result.IndexOf(endTag))
Please note, I do realise just how hideously inefficient this code is, but rather than try loads of different permutations (and probably still have fairly inefficient code), I thought I would ask some experts for their advice first 🙂
UPDATE:
As I didn’t get any response (perhaps my question was a little too vague?) I have been trying to improve efficiency on my own. I have managed to cut the run time by ~50% by using WebClient.DownloadString(). This is good, but I suspect I can still improve how I extract the data from the page. Please see the updated code below:
Dim client As New WebClient()
Dim result As String = client.DownloadString("http://www.example.com")
Dim startTag As String = "<some html tag>"
Dim endTag As String = "<closing tag>"
Dim firstIndex As Integer = result.IndexOf(startTag) + startTag.Length
result = result.Substring(firstIndex, result.Length - firstIndex)
Dim RequiredVal As String = result.Substring(0, result.IndexOf(endTag))
Any suggestions would be greatly appreciated.
If your problem is waiting for the response from the web request, then the engine or technique you use to parse the page probably matters far less for performance than the time spent waiting for each response synchronously. If you have a long list of pages to scrape, you can do better by running the requests simultaneously and asynchronously. It’s not clear that this is what is going on, though.
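If the job does involve many pages, the simultaneous-request idea could be sketched roughly like this. It is only an illustration under some assumptions: it swaps in HttpClient and Task.WhenAll instead of the WebClient used in the question, and the module name, DownloadAllAsync and the URL list are all hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Net.Http
Imports System.Threading.Tasks

Module ParallelDownloadSketch

    ' One shared HttpClient for every request (assumption: no per-site settings are needed).
    Private ReadOnly Client As New HttpClient()

    ' Starts all downloads up front, then waits for them together,
    ' so the total wait is roughly that of the slowest page rather than the sum.
    Async Function DownloadAllAsync(urls As IEnumerable(Of String)) As Task(Of String())
        Dim downloads As List(Of Task(Of String)) =
            urls.Select(Function(u) Client.GetStringAsync(u)).ToList()
        Return Await Task.WhenAll(downloads)
    End Function

    Sub Main()
        ' Hypothetical list of pages; replace with the real URLs.
        Dim urls As String() = {"http://www.example.com", "http://www.example.org"}

        ' Blocking here is acceptable in a console entry point.
        Dim pages As String() = DownloadAllAsync(urls).GetAwaiter().GetResult()

        For i As Integer = 0 To pages.Length - 1
            Console.WriteLine("{0}: {1} characters", urls(i), pages(i).Length)
        Next
    End Sub

End Module
```

With a single page this buys nothing; the gain only appears when several requests can overlap. The IndexOf/Substring extraction from the question can then be run on each element of the returned array.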
Try CsQuery (also on NuGet), a new C# port of jQuery which should do what you want. It has methods for grabbing data both synchronously and asynchronously, so if you did want to start parallel web requests, it can do that out of the box. At the most basic level, though, the code to do it synchronously would be this:
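A minimal sketch of such a call, written in VB.NET to match the question. It assumes the CsQuery NuGet package is referenced and that CQ.CreateFromUrl, Select, Contents and RenderSelection behave as described here; "sometag" is a placeholder selector and the module and variable names are hypothetical.

```vbnet
Imports System
Imports CsQuery

Module CsQuerySketch

    Sub Main()
        ' Download and parse the page in one synchronous call.
        Dim dom As CQ = CQ.CreateFromUrl("http://www.example.com")

        ' "sometag" is a placeholder selector. Contents() returns the children
        ' of each matched element; RenderSelection() renders them as HTML text.
        Dim requiredVal As String = dom.Select("sometag").Contents().RenderSelection()

        Console.WriteLine(requiredVal)
    End Sub

End Module
```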
It works like jQuery. The “CQ” object is the same as a jQuery object.
Contents is the jQuery method to return all children of an element; RenderSelection is a CsQuery method that renders the full HTML of every element in the selection set. So this would return the full text & HTML of everything inside every sometag block. Also, it indexes each document for all common selector types and is much faster than HTML Agility Pack.