I’m trying to write a Ruby script that parses a Wikipedia article using Nokogiri and CSS selectors, but I’m a little confused about how to write conditionals inside it. Here’s what I have so far (page is the downloaded HTML, parsed with Nokogiri):
page.css('h3').each do |node|
  puts node.text
end
page.css('li').each do |node|
  if /\d|\D/.match(node)
    puts node.text.scan(/[\d]+\D*/).first
  end
end
page.css('td b').each do |node|
  puts node.text
end
This all works fine. However, what I really want is something like this:
page.css('h3, li, td b').each do |node|
  # if it's an h3 node, do one thing
  # if it's a li node, do another thing
  # else if it's a 'td b' node, do another thing
end
This would allow the page to be parsed in order, instead of going through the body three separate times. However, I’m not sure how to write those conditionals within my script.
EDIT:
So now my script is
page.css('h3, li, td b').each do |node|
  case node.name
  when 'h3', 'b'
    puts node.text
  when 'li'
    if /\d|\D/.match(node)
      puts node.text.scan(/[\d]+\D*/).first
    end
  else
    next
  end
end
However, it hasn’t changed the behavior. It processes them in the same order it did before (all the ‘h3’ elements, then all the ‘li’ elements, then all the ‘b’ elements).
EDIT 2:
Okay, I finally got it to work. Here was my final set of conditionals:
page.traverse do |node|
  case
  when 'h3' == node.name
    puts node.text
  when 'li' == node.name
    puts node.text.scan(/[\d]+\D*/).first if /\d|\D/.match(node)
  when 'b' == node.name
    puts node.text if (node.parent.name == 'p' or node.parent.name == 'td')
  end
end
Thanks!
You might be looking for traverse: