I have been working and tinkering with Nokogiri, REXML & Ruby for a month.

Asked: May 30, 2026 (2026-05-30T20:30:10+00:00)

I have been working and tinkering with Nokogiri, REXML & Ruby for a month. I have this giant database that I am trying to crawl. The things that I am scraping are HTML links and XML files.

There are exactly 43612 XML files that I want to crawl and store in a CSV file.

My script works if I crawl maybe 500 XML files, but anything larger takes too much time and it freezes or something.

I have divided the code into pieces here so it is easy to read; the whole script is here: https://gist.github.com/1981074

I am using two libraries because I couldn't find a way to do this all in Nokogiri. I personally find REXML easier to use.

My question: how can I fix it so that it won't take a week for me to crawl all this? How do I make it run faster?

HERE IS MY SCRIPT:

Require the necessary lib:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML

Create a bunch of arrays to store the data that gets grabbed:

@urls = Array.new 
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new

Grab all the XML links from a specific site and store them in an array called @urls:

htmldoc = Nokogiri::HTML(open('http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI'))

htmldoc.xpath('//a/@href').each do |links|
  @urls << links.content
end

Loop through the @urls array and grab every element node that I want with XPath:

@urls.each do |url|
  # Loop through the XML files and grab element nodes
  xmldoc = REXML::Document.new(open(url).read)
  # Root element
  root = xmldoc.root
  # Fetch the info id
  @ID << root.attributes["id"]
  # TitleSv
  xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
    m = e.text.to_s
    next if m.empty?
    @titleSv << m
  end
end

Then store them in a CSV file.

CSV.open("eduction_normal.csv", "wb") do |row|
  (0..@ID.length - 1).each do |index|
    row << [@ID[index], @titleSv[index], @titleEn[index], @identifier[index], @typeOfLevel[index], @typeOfResponsibleBody[index], @courseTyp[index], @credits[index], @degree[index], @preAcademic[index], @subjectCodeVhs[index], @descriptionSv[index], @lastedited[index], @expires[index]]
  end
end

1 Answer

Editorial Team · Answer 1 · 2026-05-30T20:30:12+00:00

It’s hard to pinpoint the exact problem because of the way the code is structured. Here are a few suggestions to increase the speed and structure the program so that it will be easier to find what’s blocking you.

Libraries

You’re using a lot of libraries here that probably aren’t necessary.

You use both REXML and Nokogiri. They both do the same job, except that Nokogiri is much faster at it (benchmark), so you can drop REXML and parse everything with Nokogiri.

Use Hashes

Instead of storing data by index in 15 parallel arrays, keep one set of hashes, with one hash per item.

For instance,

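require 'set'

items = Set.new

# doc is the parsed index page (htmldoc in the question)
doc.xpath('//a/@href').each do |url|
  item = {}
  item[:url] = url.content
  items << item
end

# Fetch each XML file once and fill in the rest of the item
items.each do |item|
  xml = Nokogiri::XML(open(item[:url]))

  item[:id] = xml.root['id']
  ...
end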

Collect the data, then write to file

Now that you have your items set, you can iterate over it and write to the file. This is much faster than doing it line by line.
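
As a minimal sketch of that idea, assuming each item is a hash keyed by column name: the sample data, the COLUMNS constant and the items_to_csv helper below are made up for illustration and are not part of the original script. The whole collection is turned into CSV text in one pass and can then be written out with a single call.

```ruby
require 'csv'

# Column order for the output; these names are assumptions for this sketch.
COLUMNS = [:id, :title_sv, :title_en].freeze

# Build the CSV text for a collection of item hashes in one pass.
# Keys missing from an item come back as nil and become empty fields.
def items_to_csv(items, columns = COLUMNS)
  CSV.generate do |csv|
    csv << columns
    items.each { |item| csv << item.values_at(*columns) }
  end
end

# Hypothetical sample data standing in for the scraped items.
items = [
  { :id => 'i.1', :title_sv => 'Matematik', :title_en => 'Mathematics' },
  { :id => 'i.2', :title_sv => 'Fysik' }
]

csv_text = items_to_csv(items)
puts csv_text
# To persist it: File.write("eduction_normal.csv", csv_text)
```

Because every row is built from the same hash, a field that is missing for one item cannot shift the columns of the others, which can happen when separate arrays end up with different lengths.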

Be DRY

In your original code, you have the same thing repeated a dozen times. Instead of copying and pasting, try instead to abstract out the common code.

xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]"){
    |e| m = e.text 
     m = m.to_s
     next if m.empty? 
     @titleSv << m
}

Move what’s common to a method

def get_value(xml, path)
  str = ''
  xml.elements.each(path) do |e|
    text = e.text.to_s
    next if text.empty?  # skip empty nodes so they don't overwrite the result
    str = text
  end

  str
end

And move anything constant to another hash

xml_paths = {
  :title_sv => "/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]",
  :title_en => "/educationInfo/titles/title[2] | /ns:educationInfo/ns:titles/ns:title[2]",
  ...
}

Now you can combine these techniques to make for much cleaner code:

item[:title_sv] = get_value(xml, xml_paths[:title_sv])
item[:title_en] = get_value(xml, xml_paths[:title_en])
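
Taking this one step further, the path hash can drive the whole extraction, so adding a field only means adding one entry. The following is a sketch under assumptions rather than the original script: it uses REXML (which the get_value method above relies on), the extract_item helper and the inline sample document are hypothetical, and the paths are simplified to the namespace-free form.

```ruby
require 'rexml/document'

# Namespace-free paths, simplified for this sketch.
XML_PATHS = {
  :title_sv => "/educationInfo/titles/title[1]",
  :title_en => "/educationInfo/titles/title[2]"
}.freeze

# Return the text of the last non-empty node matching path, or '' if none.
def get_value(xml, path)
  str = ''
  xml.elements.each(path) do |e|
    text = e.text.to_s
    next if text.empty?
    str = text
  end
  str
end

# Hypothetical helper: build one item hash from a parsed document.
def extract_item(xml, paths = XML_PATHS)
  item = { :id => xml.root.attributes['id'] }
  paths.each { |key, path| item[key] = get_value(xml, path) }
  item
end

# Made-up sample document standing in for one downloaded XML file.
sample = <<~XML
  <educationInfo id="i.uoh.1">
    <titles>
      <title>Matematik</title>
      <title>Mathematics</title>
    </titles>
  </educationInfo>
XML

item = extract_item(REXML::Document.new(sample))
p item
```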

I hope this helps!
