My Python program goes to a user-supplied URL and then does stuff on the page. Ideally, mistyped URLs would be recognized and trigger an error. But if they have the right syntax and just don’t point anywhere, then either an ISP error page or an ad site is loaded instead.
For example:
“http://washingtonn.edu” –> http://search5.comcast.com/?cat=dnsr&con=dsqcy&url=washingtonn.edu
“http://www.amazdon.com/” –> http://www.amazdon.com/
Is there any way to detect these without knowing all the possible pages? The second one might be pretty hard because it’s an actual site, but I’d be happy with catching the first.
Thanks!
Unless I am misunderstanding your question, what you ask for is impossible, doesn’t make sense, or is far far from trivial.
If you think about it, apart from a 404 error, which tells you that a page does not exist, there is no way of knowing whether a page that does exist is “good” or “bad”, as this is subjective. It might be possible to apply some general rules, but you can’t cover all the possibilities.
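As one example of such a general rule, here is a rough sketch that flags a request whose final URL ends up on a different host than the one the user typed, which is what happens in the Comcast case from the question. The function names are made up for illustration, and the rule is only a heuristic: it will also flag legitimate cross-site redirects, and it cannot catch the amazdon.com case, where the typo is itself a real site.

```python
from urllib.parse import urlsplit


def normalized_host(url):
    """Return the lowercased host of a URL, ignoring a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def looks_hijacked(requested_url, final_url):
    """Return True when the final page lives on a different host
    than the one that was requested (heuristic only)."""
    return normalized_host(requested_url) != normalized_host(final_url)
```

To get the final URL after redirects, something like urllib.request.urlopen(requested_url).geturl() should work, and the result can then be passed in as final_url.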
The only way would be something like what Google does with its suggestions, but this would require a huge database of websites ranked by popularity, plus a proximity test against it for every URL. That is far from trivial and probably not necessary.
For handling 404 statuses in Python, you could use a library like httplib (named http.client in Python 3).
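A minimal sketch of that check, assuming Python 3, where httplib is called http.client. The helper name head_status and the 10-second default timeout are my own choices, not anything standard. It sends a HEAD request and returns the status code, or None if the host can’t be reached at all.

```python
import http.client
from urllib.parse import urlsplit


def head_status(url, timeout=10):
    """Send a HEAD request and return the HTTP status code,
    or None if the connection or the request fails."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn_cls = http.client.HTTPSConnection
    else:
        conn_cls = http.client.HTTPConnection
    conn = conn_cls(parts.hostname, parts.port, timeout=timeout)

    # Rebuild the request target from the path and query string.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query

    try:
        conn.request("HEAD", target)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # DNS failure, refused connection, timeout, malformed response...
        return None
    finally:
        conn.close()
```

A result of 404 means the server answered but the page is missing. This does not follow redirects, so a 301 or 302 comes back as-is, and an ISP that hijacks failed DNS lookups may answer where you would otherwise get None.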
Good luck!