I have a project where I have to scrape many URLs from many pages.

Question

Asked: June 10, 2026

I have a project where I have to scrape many URLs from many pages. I thought the structure of every page would remain the same, but sometimes it changes and breaks my code.

I need to extract, for example, the abstract of an article and its keywords, each of which is in its own <p> with the same class "marginB3". So I scraped a page and got only two results, one for the abstract and one for the keywords:

hxs = HtmlXPathSelector(response)
lista = hxs.select('//p[@class="marginB3"]/text()')
self.abstracto = lista[0].extract()
self.keywords = lista[1].extract()

I then tried a third page, where a new <p> with additional information about the article appeared and changed the structure. That makes it more complicated, since the elements have no ids, only classes. How can I tell which <p> holds the keywords without ids, given that each <p> has its own <h2> above it:

<h2>Info</h2>
<p class="marginB3">a_url_I_want</p>

Can I do this differentiation by reading that <h2> and then the <p> below it?

1 Answer

Editorial Team · Answer 1 · 2026-06-10T04:32:18+00:00

You certainly can.

Try this:

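# For each <h2>, the first following-sibling <p class="marginB3">
hxs.select('//h2/following-sibling::p[@class="marginB3"][1]/text()').extract()
# For each <h2>, the second following-sibling <p class="marginB3">
hxs.select('//h2/following-sibling::p[@class="marginB3"][2]/text()').extract()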
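Note that these selectors are evaluated relative to every <h2> on the page, so they do not by themselves say which section a result came from. To pick out one section, the heading text can be tested as well; in XPath that should look something like //h2[normalize-space()="Info"]/following-sibling::p[@class="marginB3"][1]/text(). The sketch below shows the same idea (pair each <h2> with the first marginB3 paragraph after it) using only Python's standard library, so it can be tried without Scrapy. The class name HeadingParagraphs, the helper paragraphs_by_heading, and the "Abstract" and "Keywords" headings in the sample are assumptions for illustration; only the "Info" block comes from the question.

```python
from html.parser import HTMLParser


class HeadingParagraphs(HTMLParser):
    """Map each <h2> heading text to the first <p class="marginB3"> after it."""

    def __init__(self):
        super().__init__()
        self.sections = {}     # heading text -> paragraph text
        self._heading = None   # text of the most recent <h2>
        self._capture = None   # "h2" or "p" while collecting text, else None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._capture, self._buffer = "h2", []
        elif tag == "p" and "marginB3" in (dict(attrs).get("class") or "").split():
            # Keep only the first matching <p> after each heading.
            if self._heading is not None and self._heading not in self.sections:
                self._capture, self._buffer = "p", []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = "".join(self._buffer).strip()
        if tag == "h2":
            self._heading = text
        else:
            self.sections[self._heading] = text
        self._capture = None


def paragraphs_by_heading(html):
    parser = HeadingParagraphs()
    parser.feed(html)
    return parser.sections


# Hypothetical sample page; only the "Info" block appears in the question.
SAMPLE = """
<h2>Abstract</h2>
<p class="marginB3">Some abstract text.</p>
<h2>Info</h2>
<p class="marginB3">a_url_I_want</p>
<h2>Keywords</h2>
<p class="marginB3">scraping, xpath</p>
"""

sections = paragraphs_by_heading(SAMPLE)
print(sections.get("Info"))      # a_url_I_want
print(sections.get("Keywords"))  # scraping, xpath
```

Looking values up by heading text rather than by position means an extra or missing section on a given page yields a missing key (None from .get) instead of silently shifting the abstract and keywords into the wrong fields.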
