The Archive Base

Editorial Team
Asked: May 27, 2026 (2026-05-27T11:22:53+00:00)

I am using urllib (note not urllib2) and getting title of pages from user


I am using urllib (note: not urllib2) to fetch the titles of pages from user-supplied URLs. Unfortunately, sometimes the URL does not point to an HTML page but to a huge file or a very long-running process on the remote site.

I have checked the Python docs, but urllib is limited. Looking at the source, it seems I could change it, but I cannot do so on the server. There is a mention of info(), but no example of how to use it.

I am using FancyURLopener, which I guess is not available in urllib2, and I don't know whether urllib2 can solve the problem.

  1. Is there a way to define a socket timeout?
  2. More importantly, how do I limit the request to the HTML/XHTML content type only and ignore anything else entirely? That is, I want to ensure the entire content is not downloaded.

I am still going through the urllib source and checking urllib2, but I am no expert on these tools.

1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 11:22 am

    The documentation states that the info() method returns meta-information associated with the URL. You can use this to read the headers and check whether the Content-Type is the one you want (text/html); if it is not, discard the response without reading its body.

    >>> import urllib
    >>> d = urllib.urlopen('http://www.google.com/')
    >>> try:
    ...     if d.info()['content-type'].startswith('text/html'):
    ...             print 'its html'
    ...     else:
    ...             print 'its not html'
    ... except KeyError:
    ...     print 'its not html'
    ... 
    its html
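
The check above matches only text/html, while the question also asks about XHTML, and real Content-Type values often carry parameters such as a charset. A minimal sketch of a stricter check follows; the helper name is_html_content_type is hypothetical, and treating application/xhtml+xml as the XHTML media type is an assumption about what the asker wants to accept.

```python
# Media types accepted as "HTML or XHTML" (an assumption for this sketch).
HTML_TYPES = ('text/html', 'application/xhtml+xml')


def is_html_content_type(header_value):
    """Return True if a Content-Type header value names HTML or XHTML.

    header_value is the raw header string, for example
    'text/html; charset=UTF-8', or None when the header is missing.
    """
    if not header_value:
        return False
    # Drop parameters such as "; charset=UTF-8" and normalise the case.
    media_type = header_value.split(';', 1)[0].strip().lower()
    return media_type in HTML_TYPES
```

With the response object from the example above, this could be called as is_html_content_type(d.info().get('content-type')), which also avoids the KeyError handling because get() returns None when the header is absent.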
    

    I’ve hacked together something quick to allow specifying a HEAD request for you in urllib. 🙂

    import sys      # needed for sys.exc_info() in open()
    import socket
    import urllib
    from urllib import (unwrap, toBytes, quote, splittype, splithost,
                        splituser, unquote, addinfourl, proxy_bypass)
    
    class MyURLOpener(urllib.FancyURLopener):
        def open_http(self, url, data=None, method=None):
            """Use HTTP protocol."""
            import httplib
            user_passwd = None
            proxy_passwd= None
            if isinstance(url, str):
                host, selector = splithost(url)
                if host:
                    user_passwd, host = splituser(host)
                    host = unquote(host)
                realhost = host
            else:
                host, selector = url
                # check whether the proxy contains authorization information
                proxy_passwd, host = splituser(host)
                # now we proceed with the url we want to obtain
                urltype, rest = splittype(selector)
                url = rest
                user_passwd = None
                if urltype.lower() != 'http':
                    realhost = None
                else:
                    realhost, rest = splithost(rest)
                    if realhost:
                        user_passwd, realhost = splituser(realhost)
                    if user_passwd:
                        selector = "%s://%s%s" % (urltype, realhost, rest)
                    if proxy_bypass(realhost):
                        host = realhost
    
                #print "proxy via http:", host, selector
            if not host: raise IOError, ('http error', 'no host given')
    
            if proxy_passwd:
                import base64
                proxy_auth = base64.b64encode(proxy_passwd).strip()
            else:
                proxy_auth = None
    
            if user_passwd:
                import base64
                auth = base64.b64encode(user_passwd).strip()
            else:
                auth = None
            h = httplib.HTTP(host)
    
            if method is not None:
                h.putrequest(method, selector)
            else:
                h.putrequest('GET', selector)
    
            if data is not None:
                #h.putrequest('POST', selector)
                h.putheader('Content-Type', 'application/x-www-form-urlencoded')
                h.putheader('Content-Length', '%d' % len(data))
    
            if proxy_auth: h.putheader('Proxy-Authorization', 'Basic %s' % proxy_auth)
            if auth: h.putheader('Authorization', 'Basic %s' % auth)
            if realhost: h.putheader('Host', realhost)
            for args in self.addheaders: h.putheader(*args)
            h.endheaders(data)
            errcode, errmsg, headers = h.getreply()
            fp = h.getfile()
            if errcode == -1:
                if fp: fp.close()
                # something went wrong with the HTTP status line
                raise IOError, ('http protocol error', 0,
                                'got a bad status line', None)
            # According to RFC 2616, "2xx" code indicates that the client's
            # request was successfully received, understood, and accepted.
            if (200 <= errcode < 300):
                return addinfourl(fp, headers, "http:" + url, errcode)
            else:
                if data is None:
                    return self.http_error(url, fp, errcode, errmsg, headers)
                else:
                    return self.http_error(url, fp, errcode, errmsg, headers, data)
    
        def open(self, fullurl, data=None, method=None):
            """Use URLopener().open(file) instead of open(file, 'r')."""
            fullurl = unwrap(toBytes(fullurl))
            # percent encode url, fixing lame server errors for e.g, like space
            # within url paths.
            fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]|")
            if self.tempcache and fullurl in self.tempcache:
                filename, headers = self.tempcache[fullurl]
                fp = open(filename, 'rb')
                return addinfourl(fp, headers, fullurl)
            urltype, url = splittype(fullurl)
            if not urltype:
                urltype = 'file'
            if urltype in self.proxies:
                proxy = self.proxies[urltype]
                urltype, proxyhost = splittype(proxy)
                host, selector = splithost(proxyhost)
                url = (host, fullurl) # Signal special case to open_*()
            else:
                proxy = None
            name = 'open_' + urltype
            self.type = urltype
            name = name.replace('-', '_')
            if not hasattr(self, name):
                if proxy:
                    return self.open_unknown_proxy(proxy, fullurl, data)
                else:
                    return self.open_unknown(fullurl, data)
            try:
                return getattr(self, name)(url, data, method)
            except socket.error, msg:
                raise IOError, ('socket error', msg), sys.exc_info()[2]
    
    
    opener = MyURLOpener()
    
    # NOTE: including any data no longer implicitly makes the method POST,
    #       so you must now specify the method to POST if you include data
    # NOTE: this overrides only open_http, and not open_https, but you can
    #       use a similar technique, and override open_https as well
    
    d = opener.open('http://www.google.com/', method='HEAD')
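
The first part of the question (a socket timeout) is not covered by the code above. One option that does not require changing urllib itself is the process-wide default from the standard socket module, which applies to sockets created after it is set. A minimal sketch, assuming a global setting is acceptable for the application; the helper name set_fetch_timeout and the 10-second value are illustrative only.

```python
import socket


def set_fetch_timeout(seconds):
    """Set the default timeout for newly created sockets.

    Returns the previous default so the caller can restore it later.
    Pass None to disable the timeout again.
    """
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    return previous


# Apply a 10 second limit around the fetch, then restore the old value.
previous = set_fetch_timeout(10.0)
try:
    pass  # the opener.open(...) call would go here
finally:
    socket.setdefaulttimeout(previous)
```

Because the setting is global, it also affects any other code in the same process that opens sockets while it is in force, which is why the sketch restores the previous value afterwards.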
    
© 2021 The Archive Base. All Rights Reserved