I am trying to create a histogram of the letters (a, b, c, etc.) on a specified web page. I plan to build the histogram itself with a hash. However, I am having a bit of a problem actually getting the HTML.
My current code:
#!/usr/local/bin/ruby
require 'net/http'
require 'open-uri'
# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)
def open(url)
  Net::HTTP.get(URI.parse(url))
end
page_content = open('_insert_webpage_here')
page_content.each_line do |line|
  puts line
end
This does a good job of getting the HTML. However, it returns everything, tags included. For http://www.stackoverflow.com it gives me:
<body><h1>Object Moved</h1>This document may be found <a HREF="http://stackoverflow.com/">here</a></body>
Pretending that it was the right page, I don’t want the HTML tags. I’m just trying to get the text: "Object Moved" and "This document may be found here".
Is there any reasonably easy way to do this?
When you require 'open-uri', you don’t need to redefine open with Net::HTTP.
Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will have { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru’s answer).
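For illustration, here is a minimal sketch of the regular-expression route, using only the standard library. The method names strip_tags, letter_histogram and fetch_histogram are made up for this example, and the pattern is a crude stand-in rather than the one from Dru’s answer. It also assumes you only want to count the letters a-z (case-insensitively), so punctuation such as '!' is ignored, and that your Ruby is recent enough to have URI.open (on older versions open-uri exposed this as a plain open(url)).

```ruby
require 'open-uri'

# Crude tag removal: drops everything between '<' and the next '>'.
# This is not a real HTML parser; it will mishandle comments,
# <script> bodies, and '>' characters inside attribute values.
def strip_tags(html)
  html.gsub(/<[^>]*>/, ' ')
end

# Count each letter a-z, case-insensitively (an assumption about
# what "letters" should mean here).
def letter_histogram(text)
  histogram = Hash.new(0)
  text.downcase.scan(/[a-z]/) { |letter| histogram[letter] += 1 }
  histogram
end

# Hypothetical helper: fetch a page via open-uri and histogram its text.
# Defined but not called here, so the example runs without network access.
def fetch_histogram(url)
  letter_histogram(strip_tags(URI.open(url).read))
end

# The response body quoted in the question, used as sample input.
sample = '<body><h1>Object Moved</h1>This document may be found ' \
         '<a HREF="http://stackoverflow.com/">here</a></body>'

counts = letter_histogram(strip_tags(sample))
counts.sort.each { |letter, count| puts "#{letter}: #{count}" }
```

Replacing each tag with a space rather than an empty string keeps words on either side of a tag from running together, which does not matter for letter counts but helps if you later want word counts. For anything beyond a quick script, a real parser is the safer choice.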