I have an html file with urls separated with br tags e.g. <a href=example.com/page1.html>Site1</a><br/>

Question

Asked: May 23, 2026

I have an html file with urls separated with br tags e.g.

<a href="example.com/page1.html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>
<a href="example.com/page3.html">Site3</a><br/>

Note that the line break tag is <br/> rather than <br />. Scrapy parses and extracts the first URL but fails to extract anything after it. If I put a space before the slash, it works fine. The HTML is malformed, but I’ve seen this on multiple sites, and since browsers display it correctly, I’m hoping Scrapy (or the underlying lxml / libxml2 / BeautifulSoup) can parse it correctly too.

1 Answer

Editorial Team · Answer 1 · 2026-05-23T09:37:31+00:00

lxml.html parses it fine. Just use that instead of the bundled HtmlXPathSelector.

import lxml.html as lxml

bad_html = """<a href="example.com/page1.html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>
<a href="example.com/page3.html">Site3</a><br/>"""

tree = lxml.fromstring(bad_html)

for link in tree.iterfind('a'):
    print(link.attrib['href'])

Results in:

example.com/page1.html
example.com/page2.html
example.com/page3.html
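One caveat when moving from this fragment to a real page: the path 'a' only matches direct children of the element returned by fromstring. The sketch below assumes a hypothetical full document (the html/body/div wrapper is invented for illustration) and compares the direct-child path with the descendant path './/a'.

```python
import lxml.html

# Hypothetical full page: the anchors are nested inside <html>/<body>/<div>,
# and the line breaks are still written as <br/> without a space.
page = """<html><body><div>
<a href="example.com/page1.html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>
<a href="example.com/page3.html">Site3</a><br/>
</div></body></html>"""

tree = lxml.html.fromstring(page)

# 'a' only looks at direct children of the root element (<html> here).
direct = [a.get('href') for a in tree.iterfind('a')]

# './/a' walks all descendants, so nesting depth does not matter.
nested = [a.get('href') for a in tree.iterfind('.//a')]

print(direct)
print(nested)
```

On a full page the first list comes back empty while the second holds all three hrefs, so a descendant path is probably the safer default when parsing whole responses.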

So if you want to use this method in a CrawlSpider, you just need to write a simple (or a complex) link extractor.

For example:

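import lxml.html as lxml
from urllib.parse import urljoin
from scrapy.link import Link

class SimpleLinkExtractor:
    def extract_links(self, response):
        tree = lxml.fromstring(response.body)
        # '//a/@href' matches anchors at any depth, not only children of the root
        hrefs = tree.xpath('//a/@href')
        # CrawlSpider rules expect Link objects with absolute URLs
        return [Link(url=urljoin(response.url, href)) for href in hrefs]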

And then use that in your spider:

from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SimpleLinkExtractor(), callback='parse_item'),
    )

    # etc ...
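One detail worth checking before following the extracted links: Scrapy requests need absolute URLs, and the hrefs in the question carry no scheme, so they resolve as paths relative to the page they were found on. Here is a minimal sketch using the standard library's urljoin; the base URL is borrowed from start_urls above, and absolutize is a hypothetical helper name, not part of Scrapy or lxml.

```python
from urllib.parse import urljoin

# Base URL taken from start_urls in the spider example.
base_url = 'http://www.example.com'

# hrefs exactly as they appear in the question's HTML (no scheme).
hrefs = ['example.com/page1.html', 'example.com/page2.html']


def absolutize(base, href):
    """Hypothetical helper: resolve an href against the page it was found on."""
    return urljoin(base, href)


for href in hrefs:
    print(absolutize(base_url, href))
```

Because the hrefs lack a scheme, the first one resolves to http://www.example.com/example.com/page1.html, which matches how a browser treats such links. An href that already starts with http:// passes through unchanged.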
