A page I am trying to scrape into a CSV database/Ruby array lists 470

Asked: June 4, 2026

A page I am trying to scrape into a CSV database/Ruby array lists 470 total records in unevenly sized groups, each group preceded by a date (22 unique dates total).

I am not sure how to do it, since the groups aren’t organized into HTML tables or into any DOM hierarchy where a “parent” element could lead to each group’s date. There is only a flat list of visible <div class="line"> record divs, occasionally preceded by a <span class="date">Thursday, May 24, 2012</span> holding the date that applies to the next X records, until a new date is printed.

In irb it correctly shows:

$page = $agent.get(pageurl) # gets the page with Mechanize
doc = $page.parser # returns a Nokogiri::HTML::Document

(records = doc.search('html body div#wrapper div#innerwrapper div#content div.line')).size 
=> 470
(dates = doc.search('html body div#wrapper div#innerwrapper div#content span.date')).size 
=> 22

Show the first date for example:

doc.search('html body div#wrapper div#innerwrapper div#content span.date')[0].text
=> "Wednesday, May 23, 2012"

My goal is to append the correct date as a field to each of the 470 records doc.search found above, before saving into a CSV file.

Can Nokogiri (or Mechanize) help me retrieve these records in groups based on their position in the DOM, i.e. immediately following dates[N].text but before the next <span class="date">?

I could then iterate N from 0 to 21, appending each group's records to a master array/CSV object for all 470 records and adding the appropriate date field to each record.

1 Answer

Editorial Team · Answer 1 · 2026-06-04T15:11:30+00:00

First, you can simplify your search a bit. Since content is an id, which by definition uniquely identifies that particular div, you don’t need any of the preceding path information.

records = doc.search('div#content div.line')

From each record, you can pull the date using XPath’s preceding-sibling axis. Altogether:

doc.search('div#content div.line').each do |record|
  date = record.xpath('preceding-sibling::span[@class="date"][1]').text
  #append to CSV
end

The XPath says: find the preceding spans at the same level (preceding-sibling::span) that have a class of “date” ([@class="date"]), and take the first such one ([1]). Because preceding-sibling is a reverse axis, position 1 is the span closest to the record, so you get the nearest date span.
