I want to write an application that takes a list of URLs.
For each of them I need to monitor periodically if the content has changed.
I thought:
- to use HtmlAgilityPack to fetch the HTML content (any other recommendation?)
- I don't need to spot the change itself, so I thought to hash the content, save it in the DB and re-compare the hash in the future.
How would you suggest hashing? .NET's GetHashCode()?
I saw this documentation, http://support.microsoft.com/kb/307020, which advises using
tmpSource = ASCIIEncoding.ASCII.GetBytes(sSourceData);
Why?
You should absolutely not use GetHashCode() for this. The documentation explicitly warns against it: the results of GetHashCode can change between runs. All that's guaranteed is that calling it on two equal objects in the same process (possibly AppDomain) will give the same hash code. Indeed, String.GetHashCode's algorithm has changed over time, and in .NET 4 the 32-bit implementation is different from the 64-bit implementation.
If you want to use hashing, use MD5, SHA1 etc: something with a specified algorithm which will not change. (Note that these operate on binary data rather than string data, which is probably more appropriate too: you don't need to bother decoding the data as text.)
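As a rough illustration of hashing the raw bytes with a fixed algorithm, here is a minimal sketch. It uses SHA-256 rather than the MD5 or SHA1 mentioned above (any specified algorithm would do), and the names ContentHasher, HashBytes and HasChanged are made up for this example. It assumes the response body has already been downloaded as a byte array, for instance with HttpClient.GetByteArrayAsync.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical helper class, not part of any library.
static class ContentHasher
{
    // Returns the lowercase hex SHA-256 digest of the raw response bytes.
    // Working on bytes means the content never has to be decoded as text.
    public static string HashBytes(byte[] content)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(content);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // True when the hash of the new content differs from the stored hash.
    public static bool HasChanged(string storedHash, byte[] newContent)
    {
        return !string.Equals(storedHash, HashBytes(newContent), StringComparison.Ordinal);
    }
}
```

The hex string is what you would store in the DB next to the URL. Unlike GetHashCode(), the digest for a given byte sequence is the same on every machine and every run.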
It’s not clear to me whether refetching periodically is really the best idea though – do these servers not support last modified times, etags etc?
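If the servers do support these validators, a conditional GET lets the server answer 304 Not Modified instead of resending the body. The following is only a sketch under that assumption: ConditionalFetcher, BuildRequest and FetchIfChangedAsync are hypothetical names, and the etag argument is assumed to be the quoted value saved from an earlier response.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Hypothetical helper class, not part of any library.
static class ConditionalFetcher
{
    private static readonly HttpClient Client = new HttpClient();

    // Builds a GET request carrying the validators saved from the previous fetch.
    public static HttpRequestMessage BuildRequest(string url, string etag, DateTimeOffset? lastModified)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        if (!string.IsNullOrEmpty(etag))
            request.Headers.IfNoneMatch.Add(EntityTagHeaderValue.Parse(etag));
        if (lastModified.HasValue)
            request.Headers.IfModifiedSince = lastModified;
        return request;
    }

    // Returns null when the server answers 304 Not Modified,
    // otherwise the new body as raw bytes.
    public static async Task<byte[]> FetchIfChangedAsync(string url, string etag, DateTimeOffset? lastModified)
    {
        using (var request = BuildRequest(url, etag, lastModified))
        using (var response = await Client.SendAsync(request))
        {
            if (response.StatusCode == HttpStatusCode.NotModified)
                return null;
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsByteArrayAsync();
        }
    }
}
```

After a successful fetch you would save response.Headers.ETag and response.Content.Headers.LastModified for the next round. Not every server sends these headers, so comparing a stored hash of the body remains a reasonable fallback.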