The Archive Base

Editorial Team
Asked: June 7, 2026

I have a rake task that is responsible for doing batch processing on millions


I have a rake task that is responsible for doing batch processing on millions of URLs. Because this process takes so long I sometimes find that URLs I’m trying to process are no longer valid — 404s, site’s down, whatever.

When I initially wrote this there was basically just one site that would continually go down while processing so my solution was to use open-uri, rescue any exceptions produced, wait a bit, and then retry.

This worked fine when the dataset was smaller, but now so much time goes by that I’m finding URLs are no longer there and return a 404.

Using the case of a 404, when this happens my script just sits there and loops infinitely — obviously bad.

How should I handle cases where a page doesn’t load successfully, and more importantly how does this fit into the “stack” I’ve built?

I’m pretty new to this, and Rails, so any opinions on where I might have gone wrong in this design are welcome!

Here is some anonymized code that shows what I have:

The rake task that makes a call to MyHelperModule:

# lib/tasks/my_app_tasks.rake
namespace :my_app do
  desc "Batch processes some stuff @ a later time."
  task :process_the_batch => :environment do
    # The dataset being processed
    # is millions of rows so this is a big job
    # and should be done in batches!
    MyModel.where(some_thing: nil).find_in_batches do |my_models|
      MyHelperModule.do_the_process my_models: my_models
    end
  end
end

MyHelperModule accepts my_models and does further stuff with ActiveRecord. It calls SomeClass:

# lib/my_helper_module.rb
module MyHelperModule
  def self.do_the_process(args = {})
    my_models = args[:my_models]

    # Parallel.each(my_models, :in_processes => 5) do |my_model|
    my_models.each do |my_model|
      # Reconnect to prevent errors with Postgres
      ActiveRecord::Base.connection.reconnect!
      # Do some active record stuff

      some_var = SomeClass.new(my_model.id)

      # Do something super interesting,
      # fun,
      # AND sexy with my_model
    end
  end
end

SomeClass will go out to the web via WebpageHelper and process a page:

# lib/some_class.rb
require_relative 'webpage_helper'
class SomeClass
  attr_accessor :some_data

  def initialize(arg)
    doc = WebpageHelper.get_doc("http://somesite.com/#{arg}")
    # do more stuff
  end
end

WebpageHelper is where the exception is caught and an infinite loop is started in the case of 404:

# lib/webpage_helper.rb
require 'nokogiri'
require 'open-uri'

class WebpageHelper
  def self.get_doc(url)
    # Initialized outside the begin block so that `retry` does not reset it
    attempts = 1
    begin
      page_content = open(url).read
      # do more stuff
    rescue Exception => ex
      puts "Failed at #{Time.now}"
      puts "Error: #{ex}"
      puts "URL: " + url
      puts "Retrying... Attempt #: #{attempts}"
      attempts = attempts + 1
      sleep(10)
      # Nothing ever stops this loop, so a permanent 404 retries forever
      retry
    end
  end
end
1 Answer

  1. Editorial Team
    Added an answer on June 7, 2026 at 9:33 pm

    TL;DR

    Use out-of-band error handling and a different conceptual scraping model to speed up operations.

    Exceptions Are Not for Common Conditions

    There are a number of other answers that address how to handle exceptions for your use case. I’m taking a different approach by saying that handling exceptions is fundamentally the wrong approach here for a number of reasons.

    1. In his book Exceptional Ruby, Avdi Grimm provides benchmarks showing exceptions to be ~156% slower than alternative coding techniques such as early returns.

    2. In The Pragmatic Programmer: From Journeyman to Master, the authors state “[E]xceptions should be reserved for unexpected events.” In your case, 404 errors are undesirable, but they are not at all unexpected. In fact, handling 404 errors is a core consideration!

    In short, you need a different approach. Preferably, the alternative approach should provide out-of-band error handling and prevent your process from blocking on retries.

    One Alternative: A Faster, More Atomic Process

    You have a lot of options here, but the one I’m going to recommend is to handle 404 status codes as a normal result. This allows you to “fail fast,” but also allows you to retry pages or remove URLs from your queue at a later time.
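As a rough illustration of treating a 404 as an ordinary result, here is a minimal sketch. It relies on the fact that `Net::HTTP.get_response` hands back a response object for a 404, whereas open-uri raises `OpenURI::HTTPError`. The `Fetcher` module, the `CrawlResult` struct, and the `http_get:` parameter are names invented for this example, not part of the original code. Network-level failures (DNS errors, timeouts, refused connections) still raise, so the sketch converts them into a result with no status code instead of retrying in place.

```ruby
require 'net/http'
require 'timeout'
require 'uri'

# Hypothetical value object: one crawl attempt, successful or not.
CrawlResult = Struct.new(:url, :status_code, :body, :error)

module Fetcher
  # Connection-level failures; HTTP statuses such as 404 never reach this list.
  NETWORK_ERRORS = [SocketError, Timeout::Error, SystemCallError].freeze

  # http_get is injectable (an assumption for this sketch) so the method
  # can be exercised without touching the network.
  def self.fetch(url, http_get: ->(uri) { Net::HTTP.get_response(uri) })
    response = http_get.call(URI(url))
    # A 404 lands here like any other status: no exception, no retry loop.
    CrawlResult.new(url, response.code.to_i, response.body, nil)
  rescue *NETWORK_ERRORS => e
    # Record the failure and return immediately instead of blocking.
    CrawlResult.new(url, nil, nil, e.message)
  end
end
```

The caller can then branch on `result.status_code` rather than on which exception was raised, and decide later what to do with the failures.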

    Consider this example schema:

    ActiveRecord::Schema.define(:version => 20120718124422) do
      create_table "webcrawls", :force => true do |t|
        t.text     "raw_html"
        t.integer  "retries"
        t.integer  "status_code"
        t.text     "parsed_data"
        t.datetime "created_at",  :null => false
        t.datetime "updated_at",  :null => false
      end
    end
    

    The idea here is that you would simply treat the entire scrape as an atomic process. For example:

    • Did you get the page?

      Great, store the raw page and the successful status code. You can even parse the raw HTML later, in order to complete your scrapes as fast as possible.

    • Did you get a 404?

      Fine, store the error page and the status code. Move on quickly!
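The two cases above can be sketched as a single pass that stores whatever came back and moves on. This is only an illustration: `CrawlStore` is an in-memory stand-in for the `webcrawls` table (in a Rails app an ActiveRecord model would presumably take its place), it is keyed by URL even though the example schema has no URL column, and `crawl_once` with its block argument is a hypothetical name for whatever performs the request.

```ruby
# Hypothetical in-memory row mirroring some of the example schema's columns.
CrawlRow = Struct.new(:url, :status_code, :raw_html, :retries)

# Stand-in for the webcrawls table, keyed by URL (an assumption).
class CrawlStore
  def initialize
    @rows = {}
  end

  def [](url)
    @rows[url]
  end

  # Stores the latest outcome for a URL; every attempt after the
  # first one bumps the retries counter.
  def record(url, status_code, raw_html)
    row = @rows[url]
    if row
      row.retries += 1
    else
      row = @rows[url] = CrawlRow.new(url, nil, nil, 0)
    end
    row.status_code = status_code
    row.raw_html = raw_html
    row
  end
end

# One pass over the URLs. The block performs the actual request and
# returns [status_code, raw_html]; nothing here waits or retries.
def crawl_once(urls, store)
  urls.each do |url|
    status_code, raw_html = yield(url)
    store.record(url, status_code, raw_html)
  end
end
```

Parsing is deliberately left out of the loop, in line with the suggestion to store the raw HTML and parse it later.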

    When your process is done crawling URLs, you can then use an ActiveRecord lookup to find all the URLs that recently returned a 404 status so that you can take appropriate action. Perhaps you want to retry the page, log a message, or simply remove the URL from your list of URLs to scrape; “appropriate action” is up to you.
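The follow-up lookup might look something like the sketch below, written against plain Ruby objects so it runs without a database. `CrawlRecord` and `recent_not_found` are hypothetical names. With ActiveRecord and a model for the table (say `Webcrawl`, also an assumption), the equivalent would be roughly `Webcrawl.where(status_code: 404).where("updated_at >= ?", since)`.

```ruby
# Hypothetical record with just the fields this lookup needs.
CrawlRecord = Struct.new(:url, :status_code, :updated_at)

# Returns the records that came back 404 at or after the given time.
def recent_not_found(records, since:)
  records.select { |r| r.status_code == 404 && r.updated_at >= since }
end
```

Each returned record can then be retried, logged, or dropped from the list of URLs.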

    By keeping track of your retry counts, you could even differentiate between transient errors and more permanent errors. This allows you to set thresholds for different actions, depending on the frequency of scraping failures for a given URL.
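One way to express such thresholds is a small decision function over the stored status code and retry count. The sketch below is illustrative only: `CrawlState`, `next_action`, and the threshold of 3 are assumptions, and real thresholds would depend on how often the target sites fail.

```ruby
# Hypothetical view of a stored crawl: URL, last status, retries so far.
CrawlState = Struct.new(:url, :status_code, :retries)

# Assumed cut-off: after this many retries a failure counts as permanent.
MAX_RETRIES = 3

# Decides what to do with a URL based on its last result.
#   :done  - last fetch succeeded (2xx)
#   :retry - failed, but possibly a transient problem
#   :drop  - failed too often; treat as permanent
def next_action(state, max_retries: MAX_RETRIES)
  # nil.to_i is 0, so a missing status code counts as a failure.
  return :done if state.status_code.to_i.between?(200, 299)
  return :drop if state.retries >= max_retries
  :retry
end
```

A later pass can feed the `:retry` URLs back into the crawl and delete or flag the `:drop` ones.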

    This approach also has the added benefit of leveraging the database to manage concurrent writes and share results between processes. This would allow you to parcel out work (perhaps with a message queue or chunked data files) among multiple systems or processes.

    Final Thoughts: Scaling Up and Out

    Spending less time on retries or error handling during the initial scrape should speed up your process significantly. However, some tasks are just too big for a single-machine or single-process approach. If your process speedup is still insufficient for your needs, you may want to consider a less linear approach using one or more of the following:

    • Forking background processes.
    • Using dRuby to split work among multiple processes or machines.
    • Maximizing core usage by spawning multiple external processes using GNU parallel.
    • Something else that isn’t a monolithic, sequential process.
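As a rough sketch of the first option, the helper below splits a list into chunks and processes each chunk in a forked child, sending results back over a pipe. `process_in_forks` and its `workers:` parameter are hypothetical names. It assumes a platform where `fork` is available (not Windows) and results that `Marshal` can serialize. In a Rails task each child would also need its own database connection, much like the reconnect in the question's code.

```ruby
# Runs the block for every item, spreading the items over forked children.
# Results come back in the original order.
def process_in_forks(items, workers:)
  slice_size = [(items.size / workers.to_f).ceil, 1].max

  readers = items.each_slice(slice_size).map do |chunk|
    reader, writer = IO.pipe
    fork do
      # Child: do the work, then send the serialized results to the parent.
      reader.close
      results = chunk.map { |item| yield item }
      writer.write(Marshal.dump(results))
      writer.close
    end
    # Parent: keep only the read end so EOF arrives when the child is done.
    writer.close
    reader
  end

  results = readers.flat_map do |reader|
    data = reader.read
    reader.close
    Marshal.load(data)
  end
  Process.waitall # reap the children
  results
end
```

For example, `process_in_forks(ids, workers: 4) { |id| fetch_status(id) }` would spread the calls over up to four child processes, where `fetch_status` stands for whatever per-item work is needed.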

    Optimizing the application logic should suffice for the common case, but if it does not, consider scaling up to more processes or out to more servers. Scaling out will certainly be more work, but will also expand the processing options available to you.
