What I’m doing
I am writing a web crawler in OCaml. Using the function string_of_uri (below) defined by nlucaroni in a previous answer to a question I posted, I can fetch the HTML text of a URL from the web.
(* Raised when the page at the given URI cannot be fetched. *)
exception IO_ERROR of string
let string_of_uri uri =
  try
    let connection = Curl.init () and write_buff = Buffer.create 1763 in
    (* Append each chunk received by libcurl to the buffer. *)
    Curl.set_writefunction connection
      (fun x -> Buffer.add_string write_buff x; String.length x);
    Curl.set_url connection uri;
    Curl.perform connection;
    Curl.global_cleanup ();
    Buffer.contents write_buff
  with _ -> raise (IO_ERROR uri)
I’ve already written some code to extract a list of all the hyperlinks in the fetched HTML (i.e. all the [LINK] parts in anything like <A HREF="[LINK]">text</A>). This all works fine.
The Problem
The problem is that some pages redirect you and I don’t know how to follow the redirection. For example, my program reports 0 <A> tags for the page http://en.wikipedia.org because Wikipedia actually redirects you to http://en.wikipedia.org/wiki/Main_Page. If I give this last page to my program, it all works fine. But if I give it the initial one, it just returns 0 <A> tags.
Unfortunately there’s no documentation at all for ocurl, except for the names of the functions in the interface. Does anyone have an idea how I can improve the function string_of_uri above so that it follows any redirections and returns the HTML of the last page it lands on?
I noticed that applying the function Curl.get_redirectcount to a connection on http://en.wikipedia.org returns 0, which is not what I was expecting, since the page is redirected to some other page…
Thanks for any help!
All the best,
Surikator.
This question has already been answered in the comments of this answer. The solution is to add
Curl.set_followlocation connection true
just above Curl.perform connection.
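For reference, here is a minimal sketch of how string_of_uri might look with that line added. A few things in it are my own assumptions rather than part of the original function: the redirect cap via Curl.set_maxredirs (the value 10 is arbitrary), releasing the handle with Curl.cleanup connection instead of calling Curl.global_cleanup () after every request, and the restructuring so that the handle is released on failure too. It assumes the ocurl library is installed and linked.

```ocaml
(* Raised when the page at the given URI cannot be fetched. *)
exception IO_ERROR of string

let string_of_uri uri =
  let connection = Curl.init () in
  let write_buff = Buffer.create 1763 in
  let result =
    try
      (* Append each chunk received by libcurl to the buffer. *)
      Curl.set_writefunction connection
        (fun x -> Buffer.add_string write_buff x; String.length x);
      Curl.set_url connection uri;
      (* The fix: follow redirects instead of stopping at the first response. *)
      Curl.set_followlocation connection true;
      (* Assumption: cap the number of hops so a redirect loop cannot run forever. *)
      Curl.set_maxredirs connection 10;
      Curl.perform connection;
      Ok (Buffer.contents write_buff)
    with _ -> Error ()
  in
  (* Assumption: release this handle only, whether or not the fetch succeeded. *)
  Curl.cleanup connection;
  match result with
  | Ok html -> html
  | Error () -> raise (IO_ERROR uri)
```

With this version, string_of_uri "http://en.wikipedia.org" should return the HTML of the page the redirect leads to, and Curl.get_redirectcount, if queried after Curl.perform and before the handle is cleaned up, should report the number of redirects that were actually followed rather than 0.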