I am trying to get the final redirected URL in Scrapy. For example, suppose an anchor tag has a specific format:
<a href="http://www.example.com/index.php" class="FOO_X_Y_Z" />
Then I need to get the URL that this href ultimately redirects to (if it does not redirect and simply returns 200, the original URL is fine). I select the matching anchor tags like this:
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        anchors = hxs.select("//a[@class='FOO_X_Y_Z']/@href")
        # Let's assume each anchor contains the actual link (http://...)
        for anchor in anchors:
            final_url = get_final_url(anchor)  # << I would need something like this
            # Save final_url
So if I visited http://www.example.com/index.php and it sent me through 10 redirects before finally stopping at http://www.example.com/final.php, then that final URL is what I need get_final_url() to return.
I thought of hacking my way to a solution, but I am asking here to see whether Scrapy already provides one.
Again, assuming anchor contains an actual URL, I accomplished this with urllib2. urllib2.urlopen() returns a file-like object with two additional methods, one of which is geturl(), which returns the final URL after all redirects have been followed. It's not part of Scrapy, but it works.
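A minimal sketch of that approach, written for Python 3, where urllib2.urlopen() lives on as urllib.request.urlopen(). The function name get_final_url comes from the question; the timeout parameter and its default are assumptions added for illustration.

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def get_final_url(url, timeout=10):
    """Return the URL reached after urllib has followed all redirects.

    If the server answers 200 without redirecting, this is simply `url`.
    """
    response = urlopen(url, timeout=timeout)
    try:
        # geturl() reports the URL of the resource actually retrieved,
        # i.e. the last hop in the redirect chain.
        return response.geturl()
    finally:
        response.close()
```

Inside the spider this would be called as get_final_url(anchor) for each extracted link. Keep in mind that urlopen() is a blocking call, so it stalls Scrapy's event loop while the request is in flight; that is probably acceptable for a handful of links but not for large crawls.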