I'm trying to create a function that will scrape the filmography of actors from Wikipedia pages. This is an example of the code:
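    require 'nokogiri'
    require 'open-uri'

    # URI.open is provided by open-uri; on Ruby 3+ a bare open() no longer fetches URLs.
    doca = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/Kevin_Bacon"))
    grandparent = doca.xpath('//div[@id="mw-content-text"]').children
    child = []
    grandparent.each do |node|
      # Look for the heading whose child element has id "Films"
      node.children.each do |x|
        if x['id'] == "Films"
          # The filmography table is the element right after the heading
          child = node.next_element.children
          break
        end
      end
    end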
Each element of the child array now contains one row of the filmography table. What I really want is to save the href link for each film into an array, but I am having trouble accessing them because they are deeply nested within each row. Any help is greatly appreciated.
How about:
That selects links in italics at any depth within a column (td) in a table directly inside a div with id mw-content-text, then maps them to their href attribute (i.e. their link value). You could be more specific, depending on what you want to include/exclude.
If you want the links to be absolute and not relative, you can merge the page URL with the link value:
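One way to do the merge, sketched with URI.join from Ruby's standard library. The names page_url and hrefs, and the sample link values, are assumptions for illustration:

```ruby
require 'uri'

# Assumed inputs: the page that was scraped and a few relative hrefs from it.
page_url = 'http://en.wikipedia.org/wiki/Kevin_Bacon'
hrefs = ['/wiki/Animal_House', '/wiki/Footloose_(1984_film)']

# URI.join resolves each relative link against the page URL.
absolute_links = hrefs.map { |href| URI.join(page_url, href).to_s }
```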
UPDATE:
Alternatively, if you want to search for the links the way you described, you could do this:
This says: find all links that are children of a table inside a div with id mw-content-text whose direct preceding sibling has a direct child span tag with id "Films". Somewhat more complicated.