I’m writing an application to crawl some websites and scrape data from them. I’m using Ruby, Curl and Nokogiri to do this. In most cases it’s straightforward and I only need to ping a URL and parse the HTML data. The setup works perfectly fine.
However, in some scenarios, the websites retrieve data based on user input on some radio buttons. This invokes some JavaScript which fetches more data from the server. The generated URL and the posted data are determined by JavaScript code.
Is it possible to use:
1. A JavaScript library along with this setup that would be able to execute the JavaScript in the HTML page for me?
2. Apart from using a different library, some integration or a way for the HTML and JS libraries to communicate? For instance, if a button is clicked, Nokogiri needs to call the JavaScript, and then the JavaScript needs to update Nokogiri.
In case my approach doesn’t seem the best, what would you suggest for building a crawler + scraper for the web using Ruby?
EDIT: Looks like point 1 is possible using therubyracer, as it embeds the V8 engine in your code, but is there an alternative for point 2?
You are looking for Watir, which runs a real browser and allows you to perform every action you can think of on a web page. There’s a similar project called Selenium.
You can even use Watir with a so-called ‘headless’ browser on a Linux machine.
Watir headless example
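Suppose we have an HTML page and some JavaScript that dynamically inserts text into it (demo: http://jsbin.com/ivihur), and you want to get the dynamically inserted text.
First, you need a Linux box with xvfb and firefox installed, for example on Ubuntu:
sudo apt-get install xvfb firefox
You will also need the watir-webdriver and headless gems, so go ahead and install them as well:
gem install watir-webdriver headless
Then you can read the dynamic content from the page with a short script that starts a headless display, opens the page in Firefox, and reads the inserted text.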
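A minimal sketch of such a script could look like the following. It only uses calls that the headless and watir-webdriver gems provide (Headless#start, Headless#destroy, Watir::Browser#goto, #element, #close). The helper names `with_headless_browser` and `dynamic_text` are hypothetical, and the element id passed in is an assumption about the demo page, so adjust it to whatever the page really uses.

```ruby
# Runs the given block with a Firefox instance that draws to a virtual
# Xvfb display, then shuts both down again.
# (Hypothetical helper name; not part of Watir or headless.)
def with_headless_browser
  # Required lazily so dynamic_text below can be used without these gems.
  require 'headless'
  require 'watir-webdriver'

  headless = Headless.new
  headless.start                          # starts the Xvfb display
  browser = Watir::Browser.new :firefox   # real Firefox, no visible window
  yield browser
ensure
  browser.close if browser
  headless.destroy if headless
end

# Loads the page in the browser, which executes its JavaScript, and
# returns the text of the element with the given id.
# The id is an assumption about the page being scraped.
def dynamic_text(browser, url, element_id)
  browser.goto(url)
  browser.element(id: element_id).text
end
```

With the packages and gems installed, something like `with_headless_browser { |b| puts dynamic_text(b, 'http://jsbin.com/ivihur', 'text') }` would print the element's text, provided the page really has an element with that id.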
If everything went right, this will output the text that the JavaScript inserted into the page.
I know this runs a browser in the background as well, but it’s the easiest solution to your problem I could come up with. It will take quite a while to start the browser, but subsequent requests are quite fast. (Running goto and then fetching the dynamic text above multiple times took about 0.5 sec for each request on my Rackspace Cloud Server.)
Source: http://watirwebdriver.com/headless/
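Since starting the browser is the expensive part, a crawler would probably want to start it once and reuse it for every page. The sketch below shows one way to do that and to measure the per-request time yourself. It assumes `browser` is an already started Watir::Browser (or any object that responds to goto and element); `timed_fetches` and the element id are hypothetical names for illustration.

```ruby
# Visits each URL with the same, already running browser and records the
# extracted text and how long each visit took.
# (Hypothetical helper; `browser` is assumed to behave like Watir::Browser.)
def timed_fetches(browser, urls, element_id)
  urls.map do |url|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    browser.goto(url)                                  # page JavaScript runs here
    text = browser.element(id: element_id).text        # read the inserted text
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    { url: url, text: text, seconds: elapsed }
  end
end
```

Each entry in the returned array holds the URL, the text read from the page, and the elapsed seconds, so the startup cost is paid only once, outside this method.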