The Archive Base


Editorial Team
Asked: June 4, 2026

A page I am trying to scrape into a CSV database/Ruby array lists 470


A page I am trying to scrape into a CSV database/Ruby array lists 470 total records in unevenly sized groups, each group preceded by a date (22 unique dates total).

I am not sure how to do this, because the groups are not organized into HTML tables, and there is no hierarchy in the DOM where a “parent” element would lead to each group’s date. The page is just a flat list of <div class="line"> record divs, occasionally preceded by a <span class="date">Thursday, May 24, 2012</span>. That date applies to the next X records, until a new date is printed.

In irb it correctly shows:

$page = $agent.get(pageurl) # gets page with Mechanize
doc = $page.parser # returns a Nokogiri::HTML::Document

(records = doc.search('html body div#wrapper div#innerwrapper div#content div.line')).size 
=> 470
(dates = doc.search('html body div#wrapper div#innerwrapper div#content span.date')).size 
=> 22

Show the first date for example:

doc.search('html body div#wrapper div#innerwrapper div#content span.date')[0].text
=> "Wednesday, May 23, 2012"

My goal is to append the correct date as a field to each of the 470 records doc.search found above, before saving into a CSV file.

Can Nokogiri (or Mechanize) help me retrieve these records in groups based on their position in the DOM, i.e. immediately following dates[N].text but before the next <span class="date">?

I could iterate N from 0 to 21, appending each group to a master array/CSV object that holds all 470 records and adding the appropriate date field to every record in the group.


1 Answer

  1. Editorial Team
    Added an answer on June 4, 2026 at 3:11 pm

    First, you can simplify your search a bit. Since content is an id, which by definition uniquely identifies that particular div, you don’t need any of the preceding path information.

    records = doc.search('div#content div.line')
    

    From each record, you can pull the date using XPath’s preceding-sibling axis. Altogether:

    doc.search('div#content div.line').each do |record|
      date = record.xpath('preceding-sibling::span[@class="date"][1]').text
      #append to CSV
    end
    

    The XPath says: find the preceding spans at the same level (preceding-sibling::span) that have a class of “date” ([@class="date"]), and take the first such one ([1]). Because preceding-sibling is a reverse axis, position 1 is the sibling closest to the record, so you get the nearest date span.
