I am using urllib2 to make an HTTP POST request with Python 2.7.3. The request raises an HTTPError exception (HTTP Error 502: Proxy Error).
Looking at the traffic with Charles, I see the following happening:
- I send the HTTP request (POST /index.asp?action=login HTTP/1.1) using urllib2
- The remote server replies with status 303 and a location header of ../index.asp?action=news
- urllib2 follows the redirect by sending a GET request: (GET /../index.asp?action=news HTTP/1.1)
- The remote server replies with status 502 (Proxy error)
The 502 reply includes this in the response body: “DNS lookup failure for: 10.0.0.30:80index.asp” (Notice the malformed URL)
So I take this to mean that a proxy server on the remote server’s network sees the “/../index.asp” URL in the request and misinterprets it, sending my request on with a bad URL.
When I make the same request with my browser (Chrome), the retry is sent to GET /index.asp?action=news. So Chrome takes off the leading “/..” from the URL, and the remote server replies with a valid response.
Is this a urllib2 bug? Is there something I can do so the retry ignores the “/..” in the URL? Or is there some other way to solve this problem? Thinking it might be a urllib2 bug, I swapped out urllib2 with requests but requests produced the same result. Of course, that may be because requests is built on urllib2.
Thanks for any help.
The Location being sent with that 303 is wrong in multiple ways.
First, if you read RFC 2616 (HTTP/1.1 Header Field Definitions) 14.30 Location, the Location must be an absoluteURI, not a relative one. And section 10.3.4 (303 See Other) makes it clear that this is the relevant definition.
Second, even if a relative URI were allowed, RFC 1808, Relative Uniform Resource Locators, 4. Resolving Relative URLs, step 6, only specifies special handling for `..` in the pattern `<segment>/../`. That means a relative URL shouldn’t use `..` to climb above the root of the base path: a `..` with no preceding segment left to cancel is kept rather than dropped. So, if the base URL is `http://example.com/foo/` and the relative URL is `../../baz/`, the resolved URL is not `http://example.com/baz/`, but `http://example.com/../baz/`. (Of course most servers will treat these the same way, but that’s up to each server.)
Finally, even if you did combine the relative and base URLs before resolving `..`, an absolute URI with a path starting with `..` is invalid.
So, the bug is in the server’s configuration.
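
To make the step 6 behavior concrete, here is a minimal sketch of the RFC 1808 path merge, applied to the paths from the question. The function name `resolve_path_rfc1808` is made up for illustration, and it only handles the path component (no scheme, host, or query string), so treat it as a rough model of the rule rather than the code `urllib2` actually runs.

```python
def resolve_path_rfc1808(base_path, rel_path):
    """Rough model of RFC 1808 section 4, step 6 (path component only)."""
    # Step 6: drop the last segment of the base path, then append the relative path.
    segments = base_path.split("/")[:-1] + rel_path.split("/")

    # 6a/6b: remove "." segments; a trailing "." leaves a trailing slash.
    if segments[-1] == ".":
        segments[-1] = ""
    segments = [s for s in segments if s != "."]

    # 6c: repeatedly remove the leftmost "<segment>/../" pair, where <segment>
    # is a real name (not empty and not ".." itself).
    changed = True
    while changed:
        changed = False
        for i in range(1, len(segments) - 1):
            if segments[i] == ".." and segments[i - 1] not in ("", ".."):
                del segments[i - 1:i + 1]
                changed = True
                break

    # 6d: a trailing "<segment>/.." is removed as well.
    if len(segments) >= 2 and segments[-1] == ".." and segments[-2] not in ("", ".."):
        segments[-2:] = [""]

    return "/".join(segments)


# The redirect from the question: there is no segment for ".." to cancel,
# so it survives in the merged path.
print(resolve_path_rfc1808("/index.asp", "../index.asp"))  # /../index.asp
```

The leading `..` stays because nothing precedes it that could be removed as a `<segment>/../` pair, which matches the `GET /../index.asp?action=news` request seen in Charles.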
Now, it just so happens that many user-agents will work around this bug. In particular, they turn `/../foo` into `/foo` to block users (or arbitrary JS running on their behalf without their knowledge) from trying to do "escape from webroot" attacks.
But that doesn’t mean that `urllib2` should do so, or that it’s buggy for not doing so. Of course `urllib2` should detect the error earlier so it can tell you "invalid path" or something, instead of running together an illegal absolute URI that’s going to confuse the server into sending you back nonsense errors. But it is right to fail.
It’s all well and good to say that the server configuration is wrong, but unless you’re the one in charge of the server, you’ll probably face an uphill battle trying to convince them that their site is broken and needs to be fixed when it works with every web browser they care about. Which means you may need to write your own workaround to deal with their site.
The way to do that with `urllib2` is to supply your own `HTTPRedirectHandler` with a `redirect_request` method that recognizes this case and returns a different `Request` than the default code would (in particular, `http://example.com/index.asp?action=news` instead of `http://example.com/../index.asp?action=news`).
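
A minimal sketch of such a handler might look like the following. The names `strip_leading_dotdot` and `DotDotRedirectHandler` are invented for this example, and the `try`/`except` import is only there so the same code loads under Python 2.7 (`urllib2`) and Python 3 (`urllib.request`). It assumes that dropping every leading `/..` is what you want for this particular site.

```python
try:
    import urllib2 as request_mod            # Python 2.7
    from urlparse import urlsplit, urlunsplit
except ImportError:
    import urllib.request as request_mod     # Python 3
    from urllib.parse import urlsplit, urlunsplit


def strip_leading_dotdot(url):
    """Drop any leading "/.." segments from the path of an absolute URL."""
    parts = urlsplit(url)
    path = parts.path
    while path.startswith("/../"):
        path = path[3:]                       # "/../x" -> "/x"
    if path == "/..":
        path = "/"
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))


class DotDotRedirectHandler(request_mod.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Clean up the target, then let the default code build the Request.
        # The base class is called explicitly because urllib2's handlers are
        # old-style classes in Python 2, so super() is not available there.
        fixed = strip_leading_dotdot(newurl)
        return request_mod.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, fixed)


# Passing the subclass replaces the default redirect handler in the opener.
opener = request_mod.build_opener(DotDotRedirectHandler)
```

You would then call `opener.open(...)` with your POST request in place of `urllib2.urlopen(...)`. The redirect to `/../index.asp?action=news` should then be retried as `/index.asp?action=news`.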