I have been working and tinkering with Nokogiri, REXML & Ruby for a month. I have this giant database that I am trying to crawl. The things that I am scraping are HTML links and XML files.
There are exactly 43612 XML files that I want to crawl and store in a CSV file.
My script works if crawl maybe 500 xml files, but larger that takes too much time and it freezes or something.
I have divided the code in pieces here so it would be easy to read, the whole script/code is here: https://gist.github.com/1981074
I am using two libraries beacuse I couldn’t find a way to do this all in nokogiri. I personally find REXML easier to use.
My question: How can fix it so it wont that a week for me to crawl all this? How do I make it run faster?
HERE IS MY SCRIPT:
Require the necessary lib:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML
Create bunch of array to store that grabs data:
@urls = Array.new
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new
Grab all the xml links from a spec site and store them in a array called @urls
htmldoc = Nokogiri::HTML(open('http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI'))
htmldoc.xpath('//a/@href').each do |links|
@urls << links.content
end
Loop throw the @urls array, and grab every element node that I want to grab with xpath.
@urls.each do |url|
# Loop throw the XML files and grab element nodes
xmldoc = REXML::Document.new(open(url).read)
# Root element
root = xmldoc.root
# Hämtar info-id
@ID << root.attributes["id"]
# TitleSv
xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]"){
|e| m = e.text
m = m.to_s
next if m.empty?
@titleSv << m
}
Then store them in a CSV file.
CSV.open("eduction_normal.csv", "wb") do |row|
(0..@ID.length - 1).each do |index|
row << [@ID[index], @titleSv[index], @titleEn[index], @identifier[index], @typeOfLevel[index], @typeOfResponsibleBody[index], @courseTyp[index], @credits[index], @degree[index], @preAcademic[index], @subjectCodeVhs[index], @descriptionSv[index], @lastedited[index], @expires[index]]
end
end
It’s hard to pinpoint the exact problem because of the way the code is structured. Here are a few suggestions to increase the speed and structure the program so that it will be easier to find what’s blocking you.
Libraries
You’re using a lot of libraries here that probably aren’t necessary.
You use both
REXMLandNokogiri. They both do the same job. ExceptNokogiriis much better at it (benchmark).Use Hashes
Instead of storing data at
indexin 15 arrays, have one set of hashes.For instance,
Collect the data, then write to file
Now that you have your
itemsset, you can iterate over it and write to the file. This is much faster than doing it line by line.Be DRY
In your original code, you have the same thing repeated a dozen times. Instead of copying and pasting, try instead to abstract out the common code.
Move what’s common to a method
And move anything constant to another hash
Now you can combine these techniques to make for much cleaner codes
I hope this helps!