I am using urllib (note: not urllib2) to get the titles of pages from user-supplied URLs. Unfortunately, sometimes the URL does not point to an HTML page but to some huge file or a very long-running process on the remote site.
I have checked the Python docs, but urllib is limited. Looking at the source, it seems I could change it, but I cannot do so on the server. There is a mention of info(), but no example of how to use it.
I am using FancyURLopener, which I guess is not available in urllib2, and I don't know whether urllib2 can solve the problem.
- Is there a way to define a socket timeout?
- More importantly, how do I limit the request to the HTML/XHTML content types only and ignore anything else entirely? That is, I want to make sure the full content is never downloaded.
I am still going through the urllib source and checking urllib2, but I am no expert on these tools.
Here, it states that the info() method returns meta-information associated with the URL. You could use this to get the headers and see what the Content-Type is (text/html), and if it's not what you want, discard the request.
I've hacked together something quick to allow specifying a HEAD request for you in urllib. 🙂
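Roughly, the idea looks like the sketch below. It assumes Python 3's urllib.request rather than the old urllib with FancyURLopener, so treat it as an illustration of the approach and not as a drop-in patch. The names is_html and HTML_TYPES and the 5-second default timeout are made up for this example.

```python
import http.client
import urllib.request

# Content types treated as "HTML" here (an assumption for this sketch).
HTML_TYPES = ("text/html", "application/xhtml+xml")


def is_html(url, timeout=5.0):
    """Return True if a HEAD request reports an HTML/XHTML Content-Type.

    Only the headers are requested, so the body is not downloaded.
    Any network error, timeout, or malformed URL yields False.
    """
    try:
        # method="HEAD" asks the server for the headers only.
        request = urllib.request.Request(url, method="HEAD")
        # timeout (in seconds) bounds blocking socket operations.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # info() returns the response headers; get_content_type()
            # strips parameters such as "; charset=utf-8".
            content_type = response.info().get_content_type()
    except (OSError, ValueError, http.client.HTTPException):
        return False
    return content_type in HTML_TYPES
```

Calling is_html(url) before the real fetch lets you skip anything that is not HTML, and the timeout argument covers the socket timeout part of the question. Note that some servers handle HEAD badly or report a wrong Content-Type, so it is still worth capping how many bytes you read when you fetch the page itself.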