I’m attempting to scrape a page that has about 10 columns using Ruby and Nokogiri. Most of the columns are pretty straightforward because they have unique class names. However, some of them can only be told apart by their ids, which have long number strings appended to what would otherwise be a standard name.
For example, gametimes are all picked up with .eventLine-time, team names with .team-name, but this particular one has, for example:
<div class="eventLine-book-value" id="eventLineOpener-118079-19-1522-1">-3 -120</div>
.eventLine-book-value is not specific to this column, so it’s not useful. The 13 digits are different for every game, and trying something like:
def nodes_by_selector(filename,selector)
  file = open(filename)
  doc = Nokogiri::HTML(file)
  doc.css(^selector)
end
has left me with errors. I’ve seen ^ and ~ used in other languages, but I’m new to this, and I have tried searching for ways to pick up all data under id=eventLineOpener-XXXX to no avail.
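To pick up all data under id=eventLineOpener-XXXX, you need to pass 'div[id*=eventLineOpener]' as the selector. The *= operator matches any id that contains the given substring. Since these ids all start with the same text, the prefix form 'div[id^=eventLineOpener]' also works; the ^ belongs inside the attribute selector string, not in front of the Ruby variable.
With that selector, the above method will return a Nokogiri::XML::NodeSet, an array-like collection of the Nokogiri::XML::Element objects having id=eventLineOpener-XXXX.
Further, to extract the content of each of these Nokogiri::XML::Element objects, you need to iterate over them and call the text method on each one.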