I am currently playing around with different scraping techniques and found out that it can get complicated quickly when a lot of JavaScript is involved.
I had some success with HtmlUnit, which seems to interpret JavaScript rather well, but I am looking for a more lightweight solution.
The problem I am facing now is this: I want to retrieve the results of a specific page, which are generated by an AJAX call triggered by clicking a certain button.
The call itself is rather simple, just an HTTP POST to a certain URL with a few parameters submitted in the POST body. However, the server complains when I submit the HTTP POST to the AJAX function without first opening the containing page.
What I basically do for testing is:
curl -v -d "AJAXREQUEST=..." https://myhost/ajaxurl
And what I get is:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Ajax-Response" content="true" />
<meta name="Ajax-Expired" content="View state could't be restored - reload page ?" />
</head>
</html>
The server is running JSF 1.2. What do I have to do to get the results from the AJAX call? I am not really a JSF expert…
If I had to guess, JSF doesn’t have a session associated with the request sent by curl, and therefore the objects associated with the page (its view state) don’t exist on the server. For curl, look at section 10, on cookies, of http://curl.haxx.se/docs/httpscripting.html. You would have to fetch the page first, store the cookies it sets, and then send the HTTP POST with those cookies, which starts to be a lot of work with curl.
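To make that sequence concrete, here is a rough Python sketch using only the standard library. It is a guess at what the server expects, not a tested recipe for this site: it assumes the session is tracked with a cookie and that the page carries its view state in a hidden input named javax.faces.ViewState, which JSF pages commonly do. The function names and the page_url argument are made up for illustration, and the POST parameters are whatever your browser actually sends.

```python
import urllib.parse
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar


class ViewStateParser(HTMLParser):
    """Remembers the value of the first hidden javax.faces.ViewState input."""

    def __init__(self):
        super().__init__()
        self.view_state = None

    def handle_starttag(self, tag, attrs):
        if tag != "input" or self.view_state is not None:
            return
        attributes = dict(attrs)
        # Assumption: the view state lives in an input with this name.
        if attributes.get("name") == "javax.faces.ViewState":
            self.view_state = attributes.get("value")


def extract_view_state(html):
    """Return the view state value found in the HTML, or None if absent."""
    parser = ViewStateParser()
    parser.feed(html)
    return parser.view_state


def fetch_ajax_result(page_url, ajax_url, params):
    """GET the containing page, then POST params with the same cookies."""
    # One opener shares a cookie jar across both requests, so the session
    # cookie set by the GET is sent back with the POST.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))

    with opener.open(page_url) as response:
        page_html = response.read().decode("utf-8", errors="replace")

    form = dict(params)
    view_state = extract_view_state(page_html)
    if view_state is not None:
        form["javax.faces.ViewState"] = view_state

    # Passing data makes urllib send a POST with a form-encoded body.
    body = urllib.parse.urlencode(form).encode("ascii")
    with opener.open(ajax_url, data=body) as response:
        return response.read().decode("utf-8", errors="replace")
```

You would call fetch_ajax_result with the URL of the containing page, the AJAX URL from your curl command, and a dict of the same parameters you currently pass with -d. If the server still answers with Ajax-Expired, compare the request against what the browser sends, since the form may need additional fields.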
However, I would instead suggest looking at Selenium, which has an IDE that generates Java code for interacting with JavaScript-driven pages.