I am trying to understand Nokogiri. Does anyone have a link to a basic example of a Nokogiri parse/scrape that shows the resulting tree? I think it would really help my understanding.
Using IRB and Ruby 1.9.2:
Load Nokogiri:
Parse a document:
Nokogiri likes well-formed docs. Note that it added the DOCTYPE because I parsed as a document. It's possible to parse as a document fragment too, but that is pretty specialized.
Search the document to find the first <p> node using CSS and grab its content, or use a different method name to do the same thing:
Search the document for all <p> nodes inside the <body> tag, and grab the content of the first one. The search method returns a NodeSet, which is like an array of nodes.
This is an important point, and one that trips up almost everyone when first using Nokogiri. The search method and its css and xpath variants return a NodeSet. Calling text or content on a NodeSet concatenates the text of all the returned nodes into a single String, which can make it very difficult to take apart again.
Using slightly different HTML helps illustrate this:
Returning to the original HTML…
Change the content of the node:
Emit a parsed document as HTML:
Remove a node:
As for scraping, there are a lot of questions on SO about using Nokogiri for tearing apart HTML from sites. Searching StackOverflow for “nokogiri and open-uri” should help.