I’m using the HtmlAgilityPack to parse HTML pages. However, at some point I try to parse data that isn’t HTML (in this specific case an image), which of course fails.
Private Sub parseHtml(ByVal content As String, ByVal url As String)
    Try
        Dim contentHash As String = hashGenerator.ComputeHash(content, "SHA1")
        Dim doc As HtmlDocument = New HtmlDocument()
        doc.Load(New StringReader(content))
        Dim root As HtmlNode = doc.DocumentNode
        Dim anchorTags As New List(Of String)
        For Each link As HtmlNode In root.SelectNodes("//a")
            cururl = link.OuterHtml
            If link.Attributes("href") Is Nothing Then Continue For
            If Uri.IsWellFormedUriString(link.Attributes("href").Value, UriKind.Absolute) Then
                urlQueue.Enqueue(link.Attributes("href").Value)
            Else
                Dim myUri As New Uri(url)
                urlQueue.Enqueue(myUri.Scheme & "://" & myUri.Host & link.Attributes("href").Value)
            End If
        Next
    Catch ex As Exception
        MsgBox(ex.Message, MsgBoxStyle.Critical, "Error (parseHtml(" & url & "))")
    End Try
End Sub
The error I get is:
A first chance exception of type ‘System.NullReferenceException’ occurred in Webcrawler.exe
Object reference not set to an instance of an object.
It occurs with the following content:
�����Iޥ�+�: 8�0�x�
How can I check whether the content is parseable before trying to parse it, so that I can prevent the error?
For now it is an image that triggers the error popup, but I think it could be anything that isn’t (X)HTML.
Thanks in advance, oh great community 🙂
You need to check the returned Content-Type header before trying to parse the returned data. For an HTML page this should be text/html; for XHTML it would be application/xhtml+xml.
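A minimal sketch of that check, assuming the page is downloaded with System.Net.WebRequest (the question does not show how content is fetched). The names IsHtmlContentType and DownloadIfHtml are hypothetical helpers made up for this illustration, not part of HtmlAgilityPack or the original crawler.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module ContentTypeCheck
    ' Hypothetical helper: returns True when a Content-Type header value
    ' denotes HTML or XHTML. Parameters such as "; charset=utf-8" are ignored.
    Function IsHtmlContentType(ByVal contentType As String) As Boolean
        If String.IsNullOrEmpty(contentType) Then Return False
        ' Keep only the media type, i.e. the part before the first ";".
        Dim mediaType As String = contentType.Split(";"c)(0).Trim()
        Return mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase) _
            OrElse mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase)
    End Function

    ' Hypothetical helper: downloads the resource at url and returns its body
    ' only when the server reports it as (X)HTML; otherwise returns Nothing.
    Function DownloadIfHtml(ByVal url As String) As String
        Dim request As WebRequest = WebRequest.Create(url)
        Using response As WebResponse = request.GetResponse()
            ' Skip images, PDFs and anything else that is not (X)HTML.
            If Not IsHtmlContentType(response.ContentType) Then Return Nothing
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```

With something like this in place, the crawler would call parseHtml only when DownloadIfHtml returns a string and skip the URL otherwise. Separately, as far as I know SelectNodes returns Nothing when no node matches the XPath expression, so a check for Nothing before the For Each loop is worth adding as well, since a server can send text/html and still deliver a page without any anchors.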