I have a project where I have to scrape many URLs from many pages. I thought the structure of every page would remain the same, but sometimes it changes and breaks my code.
I need to extract, for example, the abstract of an article and its keywords, each of which is in its own <p> with the same class "marginB3". On the first page I scraped, I got exactly two results, one for the abstract and one for the keywords:
hxs = HtmlXPathSelector(response)
lista = hxs.select('//p[@class="marginB3"]/text()')
self.abstracto = lista[0].extract()
self.keywords = lista[1].extract()
I then tried a third page, where a new <p> with some additional information about the article appeared and altered the structure. That makes it more complicated, since there are no ids, only classes. How can I tell which <p> holds the keywords without ids, given that each <p> has its own <h2> above it:
<h2>Info</h2>
<p class="marginB3">a_url_I_want</p>
Can I do this differentiation by reading that <h2> and then the <p> below it?
You certainly can.
Try this:
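With Scrapy's selectors, the usual approach is to anchor on the heading text and step to the next sibling, for example with an XPath such as `//h2[text()="Info"]/following-sibling::p[@class="marginB3"][1]/text()`. That expression is untested against your real pages and assumes the <h2> and the <p> share the same parent. The sketch below shows the same idea using only the standard library, so it runs without Scrapy: it pairs every <h2> with the first <p class="marginB3"> that follows it. The sample HTML, the "Abstract" and "Keywords" heading texts, and the names `HeadingParagraphParser` and `extract_sections` are assumptions made up for illustration; only the "Info" block comes from your question.

```python
from html.parser import HTMLParser


class HeadingParagraphParser(HTMLParser):
    """Map each <h2> text to the text of the first <p class="marginB3"> after it."""

    def __init__(self):
        super().__init__()
        self.sections = {}        # heading text -> paragraph text
        self._current_tag = None  # "h2" or "p" while we are collecting text
        self._heading = None      # latest <h2> text still waiting for its <p>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._current_tag = "h2"
            self._buffer = []
        elif tag == "p" and self._heading is not None:
            classes = (dict(attrs).get("class") or "").split()
            if "marginB3" in classes:
                self._current_tag = "p"
                self._buffer = []

    def handle_data(self, data):
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "h2":
            self._heading = text
        else:
            # Store the paragraph under the heading that preceded it.
            self.sections[self._heading] = text
            self._heading = None
        self._current_tag = None


def extract_sections(html):
    parser = HeadingParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.sections


# Hypothetical page: only the "Info" block is taken from the question.
SAMPLE_HTML = """
<div>
  <h2>Abstract</h2>
  <p class="marginB3">Some abstract text.</p>
  <h2>Info</h2>
  <p class="marginB3">a_url_I_want</p>
  <h2>Keywords</h2>
  <p class="marginB3">scraping, xpath</p>
</div>
"""

sections = extract_sections(SAMPLE_HTML)
print(sections.get("Info"))
print(sections.get("Keywords"))
```

Because each value is looked up by its heading rather than by position, an extra <p> appearing on some pages no longer shifts the indexes the way `lista[0]` and `lista[1]` do. Using `.get()` also returns None instead of raising when a page lacks a given heading.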