I am downloading URLs in Python and need to detect 404s, so after some

Asked: May 20, 2026 (2026-05-20T07:53:26+00:00)

I am downloading URLs in Python and need to detect 404s, so after some search I came up with:

import urllib
class MyUrlOpener(urllib.FancyURLopener):
    def retrieve(self, url, filename=None, reporthook=None, data=None):
        self.file_was_found = True
        val = urllib.FancyURLopener.retrieve(self, url, filename, reporthook, data)        
        return val

    def http_error_404(url, fp, errcode, errmsg, headers, data):
        url.file_was_found = False


def download_file(url, saveas):
    urlaccess = MyUrlOpener()
    localFile, headers = urlaccess.retrieve(url, saveas)
    return urlaccess.file_was_found

My question is that if you look at the source code (Python 2.7) for FancyURLopener then you see:

def http_error(self, url, fp, errcode, errmsg, headers, data=None):
    """Handle http errors.
    Derived class can override this, or provide specific handlers
    named http_error_DDD where DDD is the 3-digit error code."""
    # First check if there's a specific handler for this error
    name = 'http_error_%d' % errcode
    if hasattr(self, name):
        method = getattr(self, name)
        if data is None:
            result = method(url, fp, errcode, errmsg, headers)
        else:
            result = method(url, fp, errcode, errmsg, headers, data)
        if result: return result
    return self.http_error_default(url, fp, errcode, errmsg, headers)

This passes url as the first argument to the handler, not self. I thought the first parameter of a method was always a reference to the class instance (by convention), and my code confirms this. So what happens to the url value?

UPDATE: It turns out that data was None, so http_error was making the first call (the one without data). This foiled my attempts to manually add the self parameter, because my handler still required data. As soon as I added the =None default to data in my http_error_404 signature, all was well (because it used the default).

The fixed / correct signature is def http_error_404(self, url, fp, errcode, errmsg, headers, data=None):

1 Answer

Editorial Team · Answer 1 · 2026-05-20T07:53:27+00:00

In Python, when you call a method on an instance, the interpreter passes that instance as the first argument (self) automatically, and the arguments you supplied are shifted down one place.

In other words, the Python interpreter treats:

urlaccess.retrieve(url, saveas)

as if it were written like this:

MyUrlOpener.retrieve(urlaccess, url, saveas)

So you don’t have to pass it yourself. However, since "explicit is better than implicit", any instance method you declare must explicitly list the instance as its first parameter, even though Python passes that argument without any action on the part of the programmer.

The first argument does not have to be called self … that is only a convention.
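As a minimal sketch of the point above, the following compares a call made through the instance with the same call made through the class. The Greeter class and its method names are hypothetical, invented only for this illustration.

```python
class Greeter(object):
    # Hypothetical class, used only to illustrate method binding.
    def __init__(self, name):
        self.name = name

    # The first parameter receives the instance; `self` is only the
    # conventional name for it.
    def greet(self, greeting):
        return "%s, %s" % (greeting, self.name)

    # Same behavior with an unconventional name for the first parameter.
    def greet_odd(this, greeting):
        return "%s, %s" % (greeting, this.name)


g = Greeter("world")

# Called through the instance: Python supplies `g` as the first argument.
bound_result = g.greet("hello")

# Called through the class: the instance has to be passed explicitly.
explicit_result = Greeter.greet(g, "hello")

print(bound_result == explicit_result)
print(g.greet_odd("hello"))
```

Both call forms return the same string, and greet_odd behaves like greet even though its first parameter is not named self.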

So, to actually answer your question (as mluebke did): you need to specify the self argument.

def http_error_404(url, fp, errcode, errmsg, headers, data):
    url.file_was_found = False
    # Python is treating `url` as `self`
    # Therefore the URL is being saved in `fp`, `fp` in `errcode`, etc.

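To fix this problem, add a first parameter to pick up the instance. Also give data a default of None, because http_error calls the handler without data when there is none (as the update to the question found).

def http_error_404(self, url, fp, errcode, errmsg, headers, data=None):
    self.file_was_found = False
    # Now everything should work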
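To see the argument shift without touching the network, here is a rough sketch that copies only the dispatch logic from the http_error source quoted in the question into a toy class. ToyOpener, its three subclasses, and the placeholder argument values are all hypothetical stand-ins, not part of urllib.

```python
class ToyOpener(object):
    """Hypothetical stand-in that mimics only the quoted dispatch logic."""

    def http_error(self, url, fp, errcode, errmsg, headers, data=None):
        name = 'http_error_%d' % errcode
        if hasattr(self, name):
            method = getattr(self, name)  # bound method: instance already attached
            if data is None:
                return method(url, fp, errcode, errmsg, headers)
            return method(url, fp, errcode, errmsg, headers, data)
        return None


class MissingSelf(ToyOpener):
    # No `self`: the instance lands in `url` and every argument shifts by one.
    def http_error_404(url, fp, errcode, errmsg, headers, data):
        return {'url': url, 'fp': fp, 'data': data}


class RequiredData(ToyOpener):
    # Has `self`, but `data` is required, so the five-argument call fails.
    def http_error_404(self, url, fp, errcode, errmsg, headers, data):
        return 'handled'


class Fixed(ToyOpener):
    # `self` is present and `data` is optional, matching both call forms.
    def http_error_404(self, url, fp, errcode, errmsg, headers, data=None):
        self.file_was_found = False
        return 'handled'


# Placeholder values; none of these is a real file object or header set.
args = ('http://example.invalid/missing', 'fake-fp', 404, 'Not Found', 'fake-headers')

shifted = MissingSelf()
seen = shifted.http_error(*args)
print(seen['url'] is shifted)  # the instance ended up in `url`
print(seen['fp'])              # the real URL ended up in `fp`
print(seen['data'])            # the headers ended up in `data`

try:
    RequiredData().http_error(*args)
    required_data_failed = False
except TypeError:
    required_data_failed = True
print(required_data_failed)

fixed = Fixed()
fixed.http_error(*args)
print(fixed.file_was_found)
```

The MissingSelf case explains why the original handler appeared to work: six values were passed to six parameters, just one position off. The RequiredData case shows why adding self alone is not enough when data is None.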
