So I am trying to extract the emails from my website using Ruby Mechanize

Asked: May 12, 2026 (2026-05-12T11:33:33+00:00)

I am trying to extract the emails from my website using Ruby, Mechanize, and Hpricot.
What I am trying to do is loop over all the pages of my administration side and parse them with Hpricot. So far so good. Then I get:

Exception `Net::HTTPBadResponse' at /usr/lib/ruby/1.8/net/http.rb:2022 - wrong status line: *SOME HTML CODE HERE*

After it has parsed a bunch of pages, it starts with a timeout and then prints the HTML code of the page.
I can't understand why. How can I debug that?
It seems like Mechanize can't get more than 10 pages in a row. Is that possible?
Thanks.



require 'logger'
require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'open-uri'

class Harvester

  def initialize(page)
    @page = page
    @agent = WWW::Mechanize.new { |a| a.log = Logger.new("logs.log") }
    @agent.keep_alive = false
    @agent.read_timeout = 15
  end

  def login
    f = @agent.get("http://****.com/admin/index.asp").forms.first
    f.set_fields(:username => "user", :password => "pass")
    f.submit
  end

  def harvest(s)
    pageNumber = 1
    #@agent.read_timeout = 
    s.upto(@page) do |pagenb|
      puts "*************************** page= #{pagenb}/#{@page}***************************************"
      begin
        #time=Time.now
        #search=@agent.get( "http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
        extract(pagenb)
      rescue => e
        # A bare rescue catches every StandardError, so the clauses below it
        # are never reached for StandardError subclasses such as
        # Net::HTTPBadResponse.
        puts "unknown #{e.to_s}"
        #puts  "url:http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}"
        #sleep(2)
        # This retry is unprotected: if it raises again, the exception
        # propagates out of harvest.
        extract(pagenb)
      rescue Net::HTTPBadResponse => e
        puts "net exception" + e.to_s
      rescue WWW::Mechanize::ResponseCodeError => ex
        puts "mechanize error: " + ex.response_code
      rescue Timeout::Error => e
        puts "timeout: " + e.to_s
      end
    end
  end

  def extract(page)
    #puts search.body
    search = @agent.get("http://***.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
    doc = Hpricot(search.body)
    #remove titles
    #~ doc.search("/html/body/div/table[2]/tr/td[2]/table[3]/tr[1]").remove

    (doc/"/html/body/div/table[2]/tr/td[2]/table[3]//tr").each do |tr|
      # Delete the phone number from the HTML: keep only the text before the first tag
      temp = tr.search("/td[2]").inner_html
      index = temp.index('<')
      email = temp[0..index-1]
      puts email
      f = File.open("./emails", 'a')
      f.puts(email)
      f.close
    end
  end
end
puts "starting extracting emails ... "
start = ARGV[0].to_i
h = Harvester.new(186)

h.login

h.harvest(start)


1 Answer

Editorial Team · Answer 1 · 2026-05-12T11:33:33+00:00

Mechanize keeps the full content of every visited page in its history, which can cause problems when you browse through many pages. To limit the size of the history, try:

@mech = WWW::Mechanize.new do |agent|
  agent.history.max_size = 1
end
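
To see why capping the history matters over a 186-page run, here is a rough sketch in plain Ruby. `BoundedHistory` is a hypothetical toy class made up for illustration, not Mechanize's actual implementation, and the 1000-byte page bodies are placeholders. It only shows the general idea: once the cap is reached, the oldest page is dropped each time a new one is stored, so the retained data stops growing.

```ruby
# Toy model of a capped page history (hypothetical; not Mechanize's code).
class BoundedHistory
  def initialize(max_size)
    @max_size = max_size
    @pages = []
  end

  # Store a page body, then evict the oldest entries beyond max_size.
  def push(body)
    @pages << body
    @pages.shift while @pages.size > @max_size
    self
  end

  # Number of pages currently retained.
  def size
    @pages.size
  end

  # Total bytes of page content currently retained.
  def bytes
    @pages.inject(0) { |sum, body| sum + body.bytesize }
  end
end

history = BoundedHistory.new(1)
# Placeholder bodies standing in for 186 fetched result pages.
186.times { history.push("x" * 1000) }
puts history.size   # 1
puts history.bytes  # 1000
```

With no cap, the same loop would hold all 186 bodies at once, which may matter when each page lists 500 members.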
