I’ve seen the question “How to get the raw HTML source code for a page by using Ruby or Nokogiri?”, which uses something like this:
file = open("index.html")
puts file.read
page = Nokogiri::HTML(file)
But file.read seems to move the read position to the end of the file, so Nokogiri can no longer read it. If I swap the read and Nokogiri calls:
file = open("index.html")
page = Nokogiri::HTML(file)
puts file.read
The file is no longer output. I’d like to be able to query Nokogiri for the HTML it used originally, so that I can do my own extra parsing on the raw source. Ideally, I’d like something like
file = open("index.html")
page = Nokogiri::HTML(file)
raw_html = page.html
Note: I’ve also tried page.to_html, but it seems to change the formatting slightly.
You usually pass a File instance so it can be processed in chunks, but passing a string is also OK: