This is my first attempt at parsing a webpage using Nokogiri.
I am trying to extract the addresses from a webpage and store them in a CSV file. So far, I’ve only been able to extract the City, State, and Zip fields.
I don’t know how to extract the facility name, street address, phone numbers, and company information. The address may contain one or two street components.
Each entry may have one or more phone numbers. A number may be a regular phone number or a fax number, but the type is indicated only by the label text, not by a tag. For the company, I’d like to extract both the URL and the name.
Each address on the page is enclosed as follows:
<!-- address entry -->
<div id='1234' class='address'>
<div class='address_header'>
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div>
<div class='address_details'>
<div class='info'>
<p class='address'>
<span class='street'>123 ABC St</span><br />
<span class='street'>Unit 1</span><br />
<span class='city'>New York</span>,
<span class='state'>NY</span>
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>999.999.9999</span>
</p>
<p class='phone'>
Fax: <span class='tel'>888.888.8888</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>
</div>
</div>
<!-- address entry -->
<!-- address entry -->
<div id='4567' class='address'>
<div class='address_header'>
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div>
<div class='address_details'>
<div class='info'>
<p class='address'>
<span class='street'>456 DEF Rd</span><br />
<span class='city'>New York</span>,
<span class='state'>NY</span>
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>555.555.5555</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>
</div>
</div>
<!-- address entry -->
Here’s my very basic setup:
require 'nokogiri'
require 'open-uri'
require 'csv'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
doc = Nokogiri::HTML(URI.open('[URL]'))
# Lowercase names: capitalized identifiers are constants in Ruby.
cities = []
states = []
zips = []
# Collect each field into its own array, in document order.
doc.css('p.address span.city').each do |city|
  cities << city.content
end
doc.css('p.address span.state').each do |state|
  states << state.content
end
doc.css('p.address span.zip').each do |zip|
  zips << zip.content
end
# Write a header row, then one row per index across the three arrays.
CSV.open('myCSV.csv', 'wb') do |csv|
  csv << %w[City State Zip]
  cities.each_index do |index|
    csv << [cities[index], states[index], zips[index]]
  end
end
Storing the information in separate arrays here seems very clunky. I’d basically like to make a row entry in a CSV table for each occurrence of the address node in the source document, and then populate it with fields if they exist:
Facility  St_1  St_2   City  State  Zip   Phone  Fax   URL   Company
========  ====  =====  ====  =====  ====  =====  ====  ====  ========
xxxxxxxx  xxxx         xxxx  xxxxx  xxxx  xxxxx        xxxx  xxxxxxxx
xxxxxxxx  xxxx  xxxxx  xxxx  xxxxx  xxxx  xxxxx  xxxx  xxxx  xxxxxxxx
Can someone help me?
You probably have some edge cases that this won’t handle, but it takes care of your example. You’ll need to change doc to read from the real page instead of the DATA segment, and change the CSV code to write to a file instead of displaying the output inline as I’ve done.