The page I am trying to scrape into a CSV database/Ruby array lists 470 records in unevenly sized groups, each group preceded by a date (22 unique dates in total).
I am not sure how to do this, since the groups aren’t organized into HTML tables or any DOM hierarchy where a “parent” could lead to each group’s date. There is only a flat list of <div class="line"> record divs, occasionally preceded by a <span class="date">Thursday, May 24, 2012</span> holding the date that applies to the next X records, until a new date is printed.
In irb it correctly shows:
$page = $agent.get(pageurl) # gets page with Mechanize
doc = $page.parser # returns Nokogiri::HTML
(records = doc.search('html body div#wrapper div#innerwrapper div#content div.line')).size
=> 470
(dates = doc.search('html body div#wrapper div#innerwrapper div#content span.date')).size
=> 22
Show the first date for example:
doc.search('html body div#wrapper div#innerwrapper div#content span.date')[0].text
=> "Wednesday, May 23, 2012"
My goal is to append the correct date as a field to each of the 470 records doc.search found above, before saving into a CSV file.
Can Nokogiri (or Mechanize) help me retrieve these records in groups based on their position in the DOM, i.e. immediately following dates[N].text but before the next <span class="date">?
I could then iterate N from 0 to 21, appending each group to a master array/CSV object for all 470 records and adding the appropriate date field to every record in that group.
First, you can simplify your search a bit. Since content is an id, and it by definition uniquely identifies that particular div, you don’t need any of the preceding path information.
From each record, you can pull the date using XPath’s preceding-sibling axis.
Altogether, the XPath says: find the preceding spans at the same level (preceding-sibling::span) that have a class of “date” ([@class="date"]), and take the first such one ([1]) to ensure you get the nearest date span.