I’m writing a bot to automatically download pages from my WordPress blog. The bot gets most of the pages without a problem. For example, it can easily get the first page of the article listing of a given tag: http://example.com/myblog/index.php/archives/tag/mytag. However, for some reason it can’t get the subsequent pages, like http://example.com/myblog/index.php/archives/tag/mytag/page/2.
I’ve tried to figure out what was going on, and here’s what I found: while the server answers most requests normally, it answers these particular requests with a 301 (Moved Permanently) redirect. Peculiarly, the Location header is set to the exact same URL as the request! Basically, the server tells me to redirect my request for the page http://example.com/myblog/index.php/archives/tag/mytag/page/2 to… the very same page 😛
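For anyone who wants to reproduce this kind of check, here is a minimal Python sketch that sends a single GET without following redirects and reports whether the Location header points back at the requested URL. The function names (is_self_redirect, fetch_once) are made up for illustration, and the bot's actual HTTP client is not known, so the standard library is used here as an assumption. Comparing the two URLs character by character is worthwhile, because a small difference (a trailing slash, for instance) is easy to miss by eye.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_self_redirect(request_url, status, location):
    """Return True if a redirect response points back at the requested URL."""
    if status not in REDIRECT_CODES or location is None:
        return False
    # Location may be relative, so resolve it against the request URL first.
    return urljoin(request_url, location) == request_url


def fetch_once(url, headers=None):
    """Send one GET without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # headers can carry the values copied from the browser (hypothetical).
        conn.request("GET", path, headers=headers or {})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Calling fetch_once on the problematic URL and passing the result to is_self_redirect would confirm whether the redirect target really is identical to the request, or only looks identical.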
When trying to access the page from the browser I get the page without a problem. I thought maybe the browser sends some headers (including cookies) that my bot doesn’t send, so I copied the headers (including the cookies) from my browser’s web console, but the behaviour didn’t change.
I would appreciate any suggestions regarding what might be causing this strange behaviour, what I can do to better understand what’s going on, and of course what I can do to fetch those pages automatically, just like I fetch their brethren.
Thanks!
It seems this post hasn’t generated much public interest. However, in case somebody ever runs into the same problem and finds this post, here’s the solution I used. Important note: I still don’t understand the behaviour I witnessed, and would appreciate it if somebody could explain it.
So the solution I’ve found is basically to use the URL http://example.com/myblog/archives/tag/mytag?paged=2 instead of http://example.com/myblog/index.php/archives/tag/mytag/page/2. Funnily enough, this URL gets redirected to the original one when opened in a browser! But when the bot requested it, it got the page without any redirection. (So I managed to do what I wanted to do, but I’ve got no idea what happened there, why there was a problem in the first place, or why this solution worked: for one URL the bot gets an infinite redirection and the browser just gets the page, while for the other the browser gets redirected [finitely] and the bot gets the page. I have yet to figure this one out…)
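A small Python sketch of the workaround, assuming the bot builds its URLs itself: it takes the tag archive URL and appends the paged query parameter mentioned above. The helper name paged_url is hypothetical, and returning the plain URL for page 1 is an assumption based on the first page having been fetched without problems.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paged_url(tag_url, page):
    """Build the query-string form of a paginated tag archive URL."""
    parts = urlsplit(tag_url)
    # Drop any existing "paged" parameter so it is never duplicated.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "paged"]
    # Assumption: page 1 is the plain tag URL, which already worked.
    if page > 1:
        query.append(("paged", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    base = "http://example.com/myblog/archives/tag/mytag"
    for page in range(1, 4):
        print(paged_url(base, page))
```

The bot can then loop over page numbers and request each generated URL in turn, stopping when the server no longer returns a page.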