I’m writing an application to crawl some websites and scrape data from them. I’m

Asked: June 7, 2026 (2026-06-07T16:48:41+00:00)

I’m writing an application to crawl some websites and scrape data from them. I’m using Ruby, Curl and Nokogiri to do this. In most cases it’s straightforward and I only need to ping a URL and parse the HTML data. The setup works perfectly fine.

However, in some scenarios, the websites retrieve data based on user input on some radio buttons. Selecting one invokes JavaScript that fetches more data from the server. The generated URL and the posted data are determined by that JavaScript code.

My questions:

1. Is it possible to use a JavaScript library along with this setup that would execute the JavaScript in the HTML page for me?
2. Apart from using a different library, is there some integration or a way for the HTML and JS libraries to communicate? For instance, if a button is clicked, Nokogiri needs to call JavaScript, and then the JavaScript needs to update Nokogiri.

In case my approach doesn’t seem the best, what would you suggest for building a web crawler and scraper in Ruby?

EDIT: Looks like point 1 is possible using therubyracer, as it embeds the V8 engine in your code, but is there an alternative for point 2?

1 Answer

Editorial Team · Answer 1 · 2026-06-07T16:48:44+00:00

You are looking for Watir, which runs a real browser and lets you perform any action a user could on a web page. There’s a similar project called Selenium.

You can even use Watir with a so-called ‘headless’ browser on a Linux machine.
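For the radio-button case from the question, a rough sketch might look like the following. It assumes a recent Watir release (the watir gem) and a page with a radio input whose value is known. The URL, the 'monthly' value, the '#results' selector and the helper name html_after_radio_choice are hypothetical placeholders made up for illustration, not part of Watir. The idea is that the browser runs the page's JavaScript and Nokogiri then parses the resulting HTML, so the two libraries never need to talk to each other directly.

```ruby
# Sketch only: `browser` is expected to be a Watir::Browser (or any object
# that responds to goto, radio and html). An optional block can be used to
# wait for the dynamically loaded content before the HTML is read.
def html_after_radio_choice(browser, url, radio_value)
  browser.goto(url)
  # Select the radio button; the page's own JavaScript handlers run here.
  browser.radio(value: radio_value).set
  # Give the caller a chance to wait for the asynchronous update.
  yield browser if block_given?
  # Return the page source as the browser reports it after the update.
  browser.html
end

# Possible usage (needs the watir and nokogiri gems plus an installed browser;
# the URL, radio value and selector below are assumptions):
#
#   require 'watir'
#   require 'nokogiri'
#
#   browser = Watir::Browser.new
#   html = html_after_radio_choice(browser, 'http://example.com/form', 'monthly') do |b|
#     b.element(css: '#results').wait_until(&:present?)
#   end
#   doc = Nokogiri::HTML(html)
#   puts doc.css('#results').text
#   browser.close
```

Passing the browser in as an argument keeps one browser instance alive across many pages, which matters because starting the browser is the slow part.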

Watir headless example

Suppose we have this HTML:

<p id="hello">Hello from HTML</p>

and this JavaScript:

document.getElementById('hello').innerHTML = 'Hello from JavaScript';

(Demo: http://jsbin.com/ivihur)

and you want to get the dynamically inserted text. First, you need a Linux box with xvfb and Firefox installed. For example, on Ubuntu:

$ apt-get install xvfb firefox

You will also need the watir-webdriver and headless gems, so install them as well:

$ gem install watir-webdriver headless

Then you can read the dynamic content from the page with something like this:

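require 'rubygems'
require 'watir-webdriver'
require 'headless'

# Start a virtual X display (Xvfb) so the browser can run without a screen
headless = Headless.new
headless.start
browser = Watir::Browser.new

browser.goto 'http://jsbin.com/ivihur' # our example
el = browser.element :css => '#hello'
puts el.text # text as it stands after the page's JavaScript has run

# Shut down the browser and the virtual display
browser.close
headless.destroy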

If everything went right, this will output:

Hello from JavaScript

I know this runs a browser in the background as well, but it’s the easiest solution to your problem I could come up with. It will take quite a while to start the browser, but subsequent requests are quite fast. (Running goto and then fetching the dynamic text above multiple times took about 0.5 seconds per request on my Rackspace Cloud Server.)
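If you want to check the per-request time on your own machine, a small helper along these lines could do it. This is a minimal sketch using Ruby's standard Benchmark module; the name average_seconds is made up for illustration, and your numbers will depend on your hardware and network.

```ruby
require 'benchmark'

# Runs the given block `runs` times and returns the average wall-clock
# time per run, in seconds.
def average_seconds(runs)
  total = Benchmark.realtime do
    runs.times { yield }
  end
  total / runs
end

# Possible usage with the already started browser from the example above:
#
#   avg = average_seconds(10) do
#     browser.goto 'http://jsbin.com/ivihur'
#     browser.element(:css => '#hello').text
#   end
#   puts format('%.2f s per request', avg)
```

Start the browser before measuring, so the one-time startup cost does not skew the average.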

Source: http://watirwebdriver.com/headless/
