The Archive Base

Editorial Team
Asked: May 16, 2026

I am working on a Rails 3 project that relies heavily on screen scraping

I am working on a Rails 3 project that relies heavily on screen scraping to collect data, mainly using Nokogiri. I’m aggregating essentially the same data from many different sources, and as time goes on I will be adding more and more. However, I am acutely aware that screen scraping can be notoriously unreliable.

As such, I am interested in how other people have handled verifying the data and getting notified when it fails.

My current plan is as follows.

  1. I am going to have validations on my model for most of the fields. If they fail, bad data won’t get into my system, although logging the failure in a meaningful way is still a problem.

  2. I was thinking of some kind of counter so that after so many failures from a particular source I turn it off. I’m not sure how to keep track of that. I guess the only way is to have a field on my Source model that counts failures and can be reset.

  3. Logging is the 800-pound gorilla that I’m not sure how to deal with. I could just write to the standard logs, but if something fails I’d like to store the entire HTML so I can figure out what happened. I also need to notify myself somehow so I can address the issues. I thought of creating a model for all this and storing it in the database. If I did this I’d probably have to store the HTML on S3 or something similar. I’m running this on Heroku, so that influences what I can do.

  4. Set up begin and rescue blocks around every field. I was trying to figure out a way to code this in a nicer Ruby way so I don’t just have a page of them. Although some fields are just a straight doc.at_css("#whatever"), quite a number require various formatting or calculations, so I think it makes sense to rescue those so I can log what went wrong. The other option is to let the exception bubble up and catch it when I try to create the model.
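For item 4, one possible way to avoid a page of begin/rescue blocks is a small helper that wraps each field's extraction in a single rescue and collects the failures. This is only a sketch: FieldExtractor and its methods are hypothetical names, and a plain Hash stands in for the parsed Nokogiri document so the example runs on its own.

```ruby
# Hypothetical helper: runs each field's extraction block and records failures
# instead of letting the first exception abort the whole record.
class FieldExtractor
  attr_reader :values, :errors

  def initialize(doc)
    @doc = doc
    @values = {}
    @errors = {}
  end

  # Evaluate the block against the document; on failure, store the error
  # message under the field name and leave the value as nil.
  def field(name)
    @values[name] = yield(@doc)
  rescue StandardError => e
    @errors[name] = "#{e.class}: #{e.message}"
    @values[name] = nil
  end

  def ok?
    @errors.empty?
  end
end

# Assumption: a Hash stands in for a parsed document here.
doc = { "title" => " Widget ", "price" => "$12.50", "stock" => nil }

extractor = FieldExtractor.new(doc)
extractor.field(:title) { |d| d.fetch("title").strip }
extractor.field(:price) { |d| Float(d.fetch("price").delete("$")) }
extractor.field(:stock) { |d| Integer(d.fetch("stock")) } # raises TypeError on nil
```

After all fields have run, extractor.errors holds one entry per failed field, which could be logged together with the page HTML before deciding whether to save the record.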

Anyway I’m sure I’m not even thinking of everything but that is why I’m trying to figure out how other people have handled this problem.


1 Answer

    Editorial Team
    Added an answer on May 16, 2026 at 11:38 pm

    Our team does something similar to this, so here are some ideas:

    • We use a really high-level begin/rescue around a transaction to make sure we don’t get into weird half-loaded states:
    begin
      ActiveRecord::Base.transaction do
        # ...try to load a data source...
      end
    rescue StandardError => e
      # ...error handling...
    end
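As a rough sketch of what the error handling could do, the failure counter from the question can live in the same rescue. Everything here is an assumption for illustration: Source is a Struct standing in for an ActiveRecord model, the consecutive_failures and enabled fields are hypothetical, and the threshold of 3 is arbitrary. In a real loader, the block passed to load_source would contain the transaction shown above.

```ruby
# Hypothetical stand-in for an ActiveRecord Source model with a
# consecutive_failures counter and an enabled flag.
Source = Struct.new(:name, :consecutive_failures, :enabled)

MAX_FAILURES = 3 # assumed threshold

# Runs the loader block for a source. Resets the counter on success and
# disables the source once it has failed MAX_FAILURES times in a row.
def load_source(source)
  return :skipped unless source.enabled

  yield
  source.consecutive_failures = 0
  :ok
rescue StandardError => e
  source.consecutive_failures += 1
  source.enabled = false if source.consecutive_failures >= MAX_FAILURES
  warn "#{source.name} failed (#{source.consecutive_failures}): #{e.message}"
  :failed
end

source = Source.new("example-feed", 0, true)
results = 4.times.map { load_source(source) { raise "selector returned nothing" } }
```

Counting consecutive failures rather than total failures means one bad day does not permanently count against a source; resetting on success is a design choice, not a requirement.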
    
    • Email or page yourself when certain errors occur. We use exception_notifier, but if you’re on Heroku the Exceptional plugin also seems like a good option. I’ve also heard of people having success with Hoptoad.

    • Capturing state is VERY important for troubleshooting. Something that’s worked quite well for us is Gmail. Our loaders effectively have two phases:

      1. capture the data and send it to our Gmail account
      2. log into Gmail, download the latest data, and parse it

    The second phase is the complex one, and if it fails a developer can simply log into the Gmail account and inspect the failed message. This process has some limitations (per-email and per-mailbox storage limits, a two-phase pipeline, etc.), and we started out doing it because we had no other option, but it has proven shockingly resilient and convenient. Keep email in mind as a cheap and easy way to store noncritical state. We didn’t start out thinking of using it that way and are now really glad we do. Logging into Gmail feels better than digging through log files.
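The same capture-then-parse idea can be sketched without email. The example below is not our Gmail setup; it is a minimal illustration, assuming a hypothetical SnapshotStore class that writes the raw HTML and a small JSON metadata file to a local directory. On Heroku the local filesystem is ephemeral, so a real version would need to write to S3, a mailbox, or the database instead.

```ruby
require "fileutils"
require "json"
require "time"
require "tmpdir"

# Hypothetical snapshot store: saves the raw HTML plus a JSON metadata file
# so a failed parse can be inspected or replayed later.
class SnapshotStore
  def initialize(root)
    @root = root
  end

  # Writes both files and returns the path of the saved HTML file.
  def save(source_name, html, error, at: Time.now.utc)
    dir = File.join(@root, source_name)
    FileUtils.mkdir_p(dir)
    base = File.join(dir, at.strftime("%Y%m%dT%H%M%SZ"))
    File.write("#{base}.html", html)
    File.write("#{base}.json", JSON.generate(
      "source" => source_name,
      "error" => "#{error.class}: #{error.message}",
      "captured_at" => at.iso8601
    ))
    "#{base}.html"
  end
end

store = SnapshotStore.new(Dir.mktmpdir("snapshots"))
html = "<html><body><div id='price'>N/A</div></body></html>"

saved_path =
  begin
    Float("N/A") # stand-in for a parsing step that fails
    nil
  rescue ArgumentError => e
    store.save("example-feed", html, e)
  end
```

Keeping the error next to the captured page means whoever investigates sees both the input and the reason it was rejected in one place.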

    • Build a dashboard UI. We have a simple dashboard with a grid of sources by day. Each box is colored either red or green based on whether the load for that source on that day succeeded. You can go one step further and set up a monitor on this UI (mon.itor.us or equivalent) that raises an alarm if some error threshold is met.
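The data behind such a grid can be quite small. The sketch below is an assumption about one way to shape it, not our actual dashboard code: the load records, the build_grid and alarming_sources helpers, and the 50% threshold are all hypothetical.

```ruby
require "date"

# Hypothetical load records: [source name, date, succeeded?]
loads = [
  ["feed-a", Date.new(2026, 5, 14), true],
  ["feed-a", Date.new(2026, 5, 15), true],
  ["feed-b", Date.new(2026, 5, 14), false],
  ["feed-b", Date.new(2026, 5, 15), false],
]

# Build { source => { date => :green or :red } }.
def build_grid(loads)
  loads.each_with_object(Hash.new { |h, k| h[k] = {} }) do |(source, day, ok), grid|
    grid[source][day] = ok ? :green : :red
  end
end

# Sources whose share of red days meets the (assumed) alarm threshold.
def alarming_sources(grid, threshold: 0.5)
  grid.select do |_source, days|
    days.values.count(:red).fdiv(days.size) >= threshold
  end.keys
end

grid = build_grid(loads)

# Print one text row per source, oldest day first.
grid.each do |source, days|
  cells = days.sort_by { |day, _status| day }.map { |_day, status| status == :green ? "G" : "R" }
  puts "#{source}: #{cells.join(' ')}"
end
```

An external monitor could then poll a page or endpoint that reports the result of alarming_sources and alert when it is not empty.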