I am using watir-webdriver to scrape a page with a nested, table-based layout. As an example, I constructed a very small toy site at http://veryslow.staticloud.com/. To find the innermost table, which contains the elements USSR and Brazil, I use the following code:
require "rubygems"
require "watir-webdriver"
br = Watir::Browser.new
br.goto("http://veryslow.staticloud.com/")
reg = /USSR.+Brazil/m
mytable = br.table(:text,reg).table(:text,reg).table(:text,reg).table(:text,reg).table(:text, reg).table(:text, reg)
mytable.text
I have two questions:
- Is there a better way to search for these inner tables?
- Why is it so slow? Actually locating the table (which happens when I call mytable.text) takes a substantial amount of time. For complex websites with nested, table-based layouts, this is painfully long.
I know the nested table design is a bad idea, but if you have to read from them, is there a faster way to do that?
Whenever you’re using a Regexp to locate elements, we need to do the filtering on the Ruby side as opposed to in the browser itself. That means that each time you call .table(:text, reg) here, we find all the tables inside the containing element and filter through them in Ruby to find one that matches the Regexp. That’s going to be slow, especially with a page like this.
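One possible way around the Ruby-side filtering, assuming plain substring matching is good enough for your page, is to hand the browser a single XPath expression instead of chaining Regexp lookups. The sketch below builds such an expression; the helper name innermost_table_xpath is made up for illustration, and it assumes the search strings contain no single quotes. The `not(.//table)` condition restricts the match to tables that have no table nested inside them, i.e. the innermost ones.

```ruby
# Hypothetical helper (not part of Watir): builds an XPath that matches
# a table containing every given text and having no nested table.
# Assumes the texts contain no single quotes.
def innermost_table_xpath(*texts)
  conditions = texts.map { |t| "contains(., '#{t}')" }
  conditions << "not(.//table)"
  "//table[#{conditions.join(' and ')}]"
end

xpath = innermost_table_xpath("USSR", "Brazil")
puts xpath

# With the browser from the question already on the page, the lookup
# would then be a single call evaluated by the browser:
#   mytable = br.table(:xpath, xpath)
#   mytable.text
```

Note that `contains` matches substrings regardless of order, so this is slightly looser than the /USSR.+Brazil/m Regexp; whether that matters depends on the page.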