I’m trying to create a function that will scrape the filmography of actors from

Question

Asked: June 13, 2026 (2026-06-13T21:26:25+00:00)

I’m trying to create a function that will scrape the filmography of actors from Wikipedia pages. This is an example of the code:

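require 'nokogiri'
require 'open-uri'

# URI.open (from open-uri) is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
doca = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/Kevin_Bacon"))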

grandparent = doca.xpath('//div[@id="mw-content-text"]').children() 
child = []

grandparent.each {|node|
  node.children.each{|x|
    if x['id'] == "Films"
      child = node.next_element.children
      break
    end
  }
}

Each element of the child array now contains one row of the filmography table. What I really want is to save the href link for each film into an array, but I am having trouble accessing them because they are deeply nested within each row. Any help greatly appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-06-13T21:26:26+00:00

How about:

doca.xpath('//div[@id="mw-content-text"]/table//td[2]//i/a').map { |a| a['href'] }

That selects every link inside italics (i), at any depth, within the second cell (td[2]) of each row of a table that is a direct child of the div with id mw-content-text, then maps each link to its href attribute (i.e. the link target). You could make the expression more specific, depending on what you want to include or exclude.

If you want absolute rather than relative links, you can resolve each link value against the page URL:

url = "http://en.wikipedia.org/wiki/Kevin_Bacon"
doca.xpath('//div[@id="mw-content-text"]/table//td[2]//a').map { |a| URI(url).merge(a['href']) }
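Note that URI#merge returns URI objects rather than strings. The following sketch shows how a few kinds of href resolve against the page URL, using only Ruby's standard uri library. The href values are hypothetical examples, not links taken from the page.

```ruby
require 'uri'

page_url = "http://en.wikipedia.org/wiki/Kevin_Bacon"

# Hypothetical href values of the kinds typically found on a wiki page.
hrefs = [
  "/wiki/Film_A",            # root-relative path
  "http://example.com/page", # already absolute, returned unchanged
  "#cite_note-1"             # fragment on the current page
]

# merge resolves each href against the base URL; to_s turns the URI into a String.
absolute = hrefs.map { |href| URI(page_url).merge(href).to_s }

puts absolute
```

Calling to_s is optional; keep the URI objects if you want to inspect the host or path later.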

UPDATE:

Alternatively, if you want to search for the links the way you described, you could do this:

doca.xpath('//div[@id="mw-content-text"]//table[preceding-sibling::*[1][span[@id="Films"]]]//a').map { |a| a['href'] }

This says: find all links that are descendants of a table inside the div with id mw-content-text, where the table's immediately preceding sibling element has a direct child span with id “Films”. Somewhat more complicated.
