So I want to create a web crawler in C. There are hardly any libraries to support this.
I can use libtidy to convert HTML to XHTML and get the HTML files using libcurl (which has decent documentation).
My problem is parsing the HTML files and getting all the links present in them. I know libxml2
is there, but it's extremely hard to understand because there is no good documentation for its API.
Should I even do this in C, or go with another language like Java?
Or are there any good alternatives to libxml2?
Parsing HTML requires basically just string manipulation.
But it’s quite hard to do without an HTML or XML (if it’s XHTML) parser.
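To illustrate why the pure string-manipulation route gets hard quickly, here is a minimal sketch in plain C (standard library only). The helper names `next_href` and `print_links` are made up for this example and are not part of any library. It only handles quoted, lowercase `href` attributes, so it misses `HREF=`, unquoted values and relative-URL resolution, and it will also match links inside comments or scripts. That is exactly the kind of thing a real HTML parser takes care of.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: finds the next quoted href value at or after *cursor,
 * copies it into out (truncated to out_size - 1 bytes) and advances *cursor
 * past the closing quote. Returns 1 if a link was found, 0 otherwise. */
int next_href(const char **cursor, char *out, size_t out_size)
{
    const char *p = *cursor;

    if (out_size == 0)
        return 0;

    while ((p = strstr(p, "href")) != NULL) {
        p += 4;                                   /* skip the word "href" */
        while (isspace((unsigned char)*p)) p++;
        if (*p != '=')
            continue;                             /* not an attribute, keep searching */
        p++;
        while (isspace((unsigned char)*p)) p++;
        if (*p != '"' && *p != '\'')
            continue;                             /* unquoted values are not handled */

        char quote = *p++;
        const char *end = strchr(p, quote);
        if (end == NULL)
            break;                                /* unterminated attribute value */

        size_t len = (size_t)(end - p);
        if (len >= out_size)
            len = out_size - 1;                   /* truncate instead of overflowing */
        memcpy(out, p, len);
        out[len] = '\0';

        *cursor = end + 1;
        return 1;
    }

    *cursor += strlen(*cursor);                   /* nothing left to scan */
    return 0;
}

/* Hypothetical helper: prints every link found in html, returns the count. */
size_t print_links(const char *html)
{
    char url[256];
    size_t count = 0;

    while (next_href(&html, url, sizeof url)) {
        printf("%s\n", url);
        count++;
    }
    return count;
}
```

You would call `print_links(buffer)` on the NUL-terminated page contents you fetched. Even this small version needs manual bounds checking and cursor bookkeeping, and it is still wrong for a lot of real-world markup.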
As for the second part of the question, I wouldn’t choose C for such a task, because even basic string operations are much more complex than in the many other languages that support them natively.
I would go for a scripting language such as Python, JavaScript, PHP…
Instead of using libcurl, you can then invoke curl as a command-line tool.
Btw: libcurl documentation is very good (in my opinion).