This is my first attempt parsing a webpage using Nokogiri. I am trying to

Question

Asked: June 12, 2026 (2026-06-12T12:07:43+00:00)

This is my first attempt parsing a webpage using Nokogiri.

I am trying to extract the addresses from a webpage and store them in a CSV file. So far, I’ve only been able to extract the City, State, and Zip fields.

I don’t know how to extract the facility name, address, phone numbers, and company information. The address may contain one or two street components.

There may be one or more phone numbers. Each may be a regular number or a fax number, but the distinction appears only in the label text ("Phone:" or "Fax:"), not in a tag or attribute. For the company, I’d like to extract both the URL and the name.

Each address on the page is enclosed as follows:

  <!-- address entry -->

  <div id='1234' class='address'> 

    <div class='address_header'> 
      <h1 class='header_name'>
        <strong><a href='{URL}'>Facility Name</a></strong>
      </h1>
      <h2 class='header_city'>
        New York
      </h2>
    </div> 

    <div class='address_details'> 
      <div class='info'> 
        <p class='address'>
          <span class='street'>123 ABC St</span><br />
          <span class='street'>Unit 1</span><br />
          <span class='city'>New York</span>, 
          <span class='state'>NY</span> 
          <span class='zip'>10022</span>
        </p>
        <p class='phone'>
          Phone: <span class='tel'>999.999.9999</span>
        </p>
        <p class='phone'>
          Fax: <span class='tel'>888.888.8888</span>
        </p>
        <p class='company'>
          Company: <a href='{URL}'>Company Name</a>
        </p>
      </div>  
    </div> 
  </div>  
  <!-- address entry -->

  <!-- address entry -->

  <div id='4567' class='address'> 

    <div class='address_header'> 
      <h1 class='header_name'>
        <strong><a href='{URL}'>Facility Name</a></strong>
      </h1>
      <h2 class='header_city'>
        New York
      </h2>
    </div> 

    <div class='address_details'> 
      <div class='info'> 
        <p class='address'>
          <span class='street'>456 DEF Rd</span><br />
          <span class='city'>New York</span>, 
          <span class='state'>NY</span> 
          <span class='zip'>10022</span>
        </p>
        <p class='phone'>
          Phone: <span class='tel'>555.555.5555</span>
        </p>
        <p class='company'>
          Company: <a href='{URL}'>Company Name</a>
        </p>
      </div>  
    </div> 
  </div>  
  <!-- address entry -->

Here’s my very basic set-up.

require 'nokogiri'
require 'open-uri'
require 'csv'

# On Ruby 3.0+, open-uri no longer patches Kernel#open, so use URI.open.
doc = Nokogiri::HTML(URI.open('[URL]'))

cities = []
states = []
zips   = []

doc.css("p[class='address']").css("span[class='city']").each do |city|
  cities << city.content
end

doc.css("p[class='address']").css("span[class='state']").each do |state|
  states << state.content
end

doc.css("p[class='address']").css("span[class='zip']").each do |zip|
  zips << zip.content
end

CSV.open("myCSV.csv", "wb") do |csv|
  csv << ["City", "State", "Zip"]
  cities.each_index do |index|
    csv << [cities[index], states[index], zips[index]]
  end
end

Storing the information in separate arrays here seems very clunky. I’d basically like to make a row entry in a CSV table for each occurrence of the address node in the source document, and then populate it with fields if they exist:

Facility  St_1  St_2  City  State  Zip  Phone  Fax  URL  Company
========  ===== ===== ===== ====== ==== ====== ==== ==== ============
xxxxxxxx  xxxx        xxxx  xxxxx  xxxx xxxxx       xxxx xxxxxxxx
xxxxxxxx  xxxx  xxxxx xxxx  xxxxx  xxxx xxxxx  xxxx xxxx xxxxxxxx

Can someone help me?

1 Answer

Editorial Team · Answer 1 · 2026-06-12T12:07:45+00:00

You probably have some edge cases that this won’t handle, but it takes care of your example. You’ll need to change doc to read from the real page instead of the DATA segment, and you’ll need to write the CSV to a file instead of displaying it inline as I’ve done here.

require 'nokogiri'
require 'open-uri'
require 'csv'

doc = Nokogiri::HTML(DATA.read)

CompanyInfo   = Struct.new :facility, :street1, :street2, :city, :state, :zip, :phone, :fax, :url, :company
company_infos = []

doc.css("div.address").each do |address_div|
  facility         = address_div.at_css('.address_header .header_name').text.strip
  info             = address_div.css('div.address_details .info')
  street1, street2 = info.css('.street').map(&:text)
  city             = info.at_css('.city').text
  state            = info.at_css('.state').text
  zip              = info.at_css('.zip').text
  phone, fax       = info.css('.phone .tel').map(&:text)
  url              = info.at_css('.company a')['href']
  company          = info.at_css('.company a').text

  company_infos << CompanyInfo.new(facility, street1, street2, city, state, zip, phone, fax, url, company)
end

csv_text = CSV.generate do |csv|
  csv << %w[Facility Street1 Street2 City State Zip Phone Fax URL Company]
  company_infos.each do |company_info|
    csv << company_info.to_a
  end
end

csv_text # => "Facility,Street1,Street2,City,State,Zip,Phone,Fax,URL,Company\nFacility Name,123 ABC St,Unit 1,New York,NY,10022,999.999.9999,888.888.8888,{URL},Company Name\n"


__END__
<!-- address entry -->

<div id='1234' class='address'> 

  <div class='address_header'> 
    <h1 class='header_name'>
      <strong><a href='{URL}'>Facility Name</a></strong>
    </h1>
    <h2 class='header_city'>
      New York
    </h2>
  </div> 

  <div class='address_details'> 
    <div class='info'> 
      <p class='address'>
        <span class='street'>123 ABC St</span><br />
        <span class='street'>Unit 1</span><br />
        <span class='city'>New York</span>, 
        <span class='state'>NY</span> 
        <span class='zip'>10022</span>
      </p>
      <p class='phone'>
        Phone: <span class='tel'>999.999.9999</span>
      </p>
      <p class='phone'>
        Fax: <span class='tel'>888.888.8888</span>
      </p>
      <p class='company'>
        Company: <a href='{URL}'>Company Name</a>
      </p>
    </div>  
  </div> 
</div>
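
One edge case worth spelling out: the script assigns phone and fax by position, so an entry that lists only a fax number, or two phone numbers, would land in the wrong column. Since the label lives in the paragraph text rather than in a tag, one possible workaround is to classify each paragraph by its prefix. The sketch below is plain Ruby operating on strings; it assumes each string is the text of one p.phone element (for example, what info.css('p.phone').map(&:text) would return), that every label ends with a colon, and that anything not labeled "Fax" is a regular phone number. classify_numbers is a hypothetical helper name, not part of Nokogiri.

```ruby
# Hypothetical helper: groups numbers by the text label that precedes them.
# Input: the text of each <p class='phone'> element, e.g. "Phone: 999.999.9999".
# Assumption: labels end with a colon; any label not starting with "Fax"
# is treated as a regular phone number.
def classify_numbers(paragraph_texts)
  result = { phone: [], fax: [] }
  paragraph_texts.each do |text|
    label, number = text.strip.split(':', 2).map(&:strip)
    next if number.nil? || number.empty? # skip text without a "Label: number" shape

    kind = label.downcase.start_with?('fax') ? :fax : :phone
    result[kind] << number
  end
  result
end

# Sample input mimicking the whitespace around the text in the page markup.
numbers = classify_numbers(["\n  Phone: 999.999.9999\n", "\n  Fax: 888.888.8888\n"])
phone = numbers[:phone].join('; ') # several numbers share one CSV cell
fax   = numbers[:fax].join('; ')
puts "phone=#{phone} fax=#{fax}"
```

Joining multiple numbers with "; " keeps the table at one row per address; a separate column per number would be another option if the maximum count is known.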

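For the file-output change mentioned at the start of this answer, CSV.open with a block writes rows straight to disk. This is a minimal sketch using only the standard csv library; the sample rows and the addresses.csv file name are placeholders standing in for company_infos.map(&:to_a), and write_addresses is a hypothetical helper name. A nil field is written as an empty cell, which is how an entry without a second street line or a fax number leaves its column blank.

```ruby
require 'csv'
require 'tmpdir'

headers = %w[Facility Street1 Street2 City State Zip Phone Fax URL Company]

# Placeholder rows; in the script above these would be company_infos.map(&:to_a).
rows = [
  ['Facility Name', '123 ABC St', 'Unit 1', 'New York', 'NY', '10022',
   '999.999.9999', '888.888.8888', '{URL}', 'Company Name'],
  ['Facility Name', '456 DEF Rd', nil, 'New York', 'NY', '10022',
   '555.555.5555', nil, '{URL}', 'Company Name']
]

# Hypothetical helper: writes a header line followed by one line per row.
def write_addresses(path, headers, rows)
  CSV.open(path, 'w') do |csv|
    csv << headers
    rows.each { |row| csv << row }
  end
end

path = File.join(Dir.tmpdir, 'addresses.csv') # assumed output location
write_addresses(path, headers, rows)
puts File.read(path)
```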