I have the following code, thanks to another SO question/answer:
page = agent.page.search("table tbody tr").each do |row|
  time = row.css("td:nth-child(1)").text.strip
  source = row.css("td:nth-child(2)").text.strip
  destination = row.css("td:nth-child(3)").text.strip
  duration = row.css("td:nth-child(4)").text.strip
  Call.create!(:time => time, :source => source, :destination => destination, :duration => duration)
end
It’s working well: when I run the rake task, it puts the data into the correct table in my Rails application. However, for some reason, after successfully creating a record for a row it also creates a blank record.
I can’t figure it out. From the looks of the code it’s issuing the create! command within each row.
You can see the full rake task at https://gist.github.com/1574942 and
the other question leading to this code is “Parse html into Rails without new record every time?”.
Based on the comment:
If you’re seeing an HTML structure like:
Then this will show the problem:
That outputs:
The reason you are getting the blank rows is that the HTML is malformed. The outside <tr> shouldn’t be there. The fix is easy and will also work with HTML that is correct.
Also, the inner css access is not quite correct, but why that is so is subtle. I’ll get to that.
To fix the first, we’ll add a conditional test:
becomes:
After running, the output is now:
That’s really all you need to fix the problem, but some of the code does things the hard way, which requires some ‘splainin’. First, though, here’s the code change:
From:
Change to:
Running that code outputs what you want:
so things are hunky-dory still.
Here’s the problem with your original code:
css works like search when given a CSS selector, and Nokogiri returns a NodeSet for both. text will return an empty string from an empty NodeSet, which is what you’d get from each of the row.css("td:nth-child(...)").text.strip calls that looked at the outer <tr>. So Nokogiri was silently failing to do what you wanted: the code was syntactically correct and logically correct given what you told it to do; it just failed to meet your expectations.
Using at, or one of its related methods such as at_css, returns the first matching node, or nil when nothing matches. So, theoretically, we could continue to use row.at("td:nth-child(1)").text.strip with a separate assignment for each accessor, and that would have immediately revealed that you had a problem with the HTML, because calling text on nil would have blown up. But that’s not zen-like enough.
Instead, we can iterate over the cells returned in the NodeSet using map, let it gather the needed cell contents and strip them, then do a parallel assignment to the variables:
Again, running this:
Gives me:
Retrofit that into your code and you get:
And you probably don’t need the page = assignment.