I am trying to scrape URLs from a page that uses JavaScript. Instead of

Asked: June 16, 2026 (2026-06-16T10:07:36+00:00)

I am trying to scrape URLs from a page that uses JavaScript. Instead of putting links on the page, the site attaches onClick events to a number of table rows, so that clicking a row takes you to the target URL.

I tried scraping the URLs using Mechanize:

require 'mechanize'

agent = Mechanize.new
page = agent.get(url)

# Print every link whose href starts with http or https
page.links_with(:href => /^https?/).each do |link|
  puts link.href
end

But looking for links by their href attribute doesn’t work here, because the URLs sit inside the onClick handler:

<tr onclick="window.open('/someurl');">

Is there a good way to use Mechanize, or some other gem, to parse the code on the page and extract the URLs embedded in the onClick event?

If there’s no good out-of-the-box solution, what might be the best regex to do that? I’m a little new to regex, so not quite able to pull together something on my own yet.

1 Answer

Editorial Team · Answer 1 · 2026-06-16T10:07:37+00:00

You should use a parser. Regular expressions and HTML/XML don’t mix well, because regular expressions are not designed to handle the irregularities that HTML and XML documents contain. A pattern might work for very simple tasks, but it will be fragile and will break easily when the HTML changes.
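To make the fragility concrete, here is a minimal sketch using only plain Ruby strings. The three rows are hypothetical variations of the same markup (an extra attribute, swapped quote styles), and the pattern is a deliberately naive one written against the first variation only; neither comes from a real site.

```ruby
# Hypothetical variations of the same table row.
rows = [
  %q(<tr onclick="window.open('/someurl');">),
  %q(<tr class="odd" onclick="window.open('/someurl');">),
  %q(<tr onclick='window.open("/someurl");'>),
]

# A naive pattern that assumes onclick is the first attribute,
# double-quoted, with a single-quoted URL inside.
naive = /<tr onclick="window\.open\('([^']+)'\)/

# String#[] with a regex and a group index returns the capture, or nil.
results = rows.map { |html| html[naive, 1] }
p results
```

Only the first row matches; the other two yield nil even though a browser treats all three the same way. A parser normalizes those differences before you ever look at the attribute value.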

Mechanize for Ruby uses Nokogiri internally, which is an excellent way to get at those parameters. You can search Mechanize’s internal Nokogiri document and, from it, find the <tr> tags:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://somesite.foo.com')

# Find every <tr> with an onclick attribute and pull out the quoted argument
page.search('tr[onclick]').map{ |n| n['onclick'][/\(['"]([^)]+)['"]\)/, 1] }

If I use Nokogiri directly to parse this fragment:

<tr onclick="window.open('/someurl');">

I can do this:

require 'nokogiri'

page = Nokogiri::HTML(%[<tr onclick="window.open('/someurl');">])
page.search('tr[onclick]').map{ |n| n['onclick'][/\(['"]([^)]+)['"]\)/, 1] }
# => ["/someurl"]

Notice that I’m searching using a CSS selector, 'tr[onclick]', which makes it pretty easy to find a particular node. If you know JavaScript, CSS or jQuery, you’ll find it pretty easy to pick up Nokogiri using its built-in support for CSS.

Also,

n['onclick'][/\(['"]([^)]+)['"]\)/, 1]

could alternatively be written as:

n['onclick'][/\(([^)]+)\)/, 1][1..-2]
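As a rough sketch of how the first pattern behaves on different onclick values, the snippet below wraps it in a small helper and runs it on plain strings, so it needs neither Mechanize nor Nokogiri. The helper name onclick_url and the sample values are hypothetical, made up for illustration.

```ruby
# Hypothetical helper: return the quoted argument of the first call
# found in an onclick value, or nil when there is none.
def onclick_url(onclick)
  return nil if onclick.nil?
  onclick[/\(['"]([^)]+)['"]\)/, 1]
end

# Made-up attribute values, as n['onclick'] might return them.
samples = [
  "window.open('/someurl');",
  'window.open("/other");',
  "highlightRow(this);", # no quoted argument, so no match
  nil,                   # a node without an onclick attribute
]

# compact drops the nil entries left by values that did not match.
urls = samples.map { |s| onclick_url(s) }.compact
p urls
```

Two caveats apply to both forms of the pattern. A call with several arguments, such as window.open('/someurl', '_blank'), is captured up to the last quote before the closing parenthesis, so the result would still need splitting. The [1..-2] variant strips the first and last character unconditionally, so it assumes the argument is always quoted and raises NoMethodError when nothing matches.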
