I have an HTML file with URLs separated by br tags, e.g.:
<a href="example.com/page1.html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>
<a href="example.com/page3.html">Site3</a><br/>
Note the line break tag is <br/> instead of <br />. Scrapy is able to parse and extract the first URL but fails to extract anything after that. If I put a space before the slash, it works fine. The HTML is malformed, but I’ve seen this error on multiple sites, and since the browser is able to display it correctly, I’m hoping Scrapy (or the underlying lxml / libxml2 / BeautifulSoup) can also parse it correctly.
lxml.html parses it fine. Just use that instead of the bundled HtmlXPathSelector.
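The snippet that originally accompanied this answer is not preserved, so the following is only a minimal sketch of the idea, assuming lxml is installed. The function name `extract_hrefs` and the `HTML` constant are made up for illustration; the markup is the sample from the question.

```python
import lxml.html

# Sample markup from the question, using <br/> with no space before the slash.
HTML = (
    '<a href="example.com/page1.html">Site1</a><br/>\n'
    '<a href="example.com/page2.html">Site2</a><br/>\n'
    '<a href="example.com/page3.html">Site3</a><br/>\n'
)


def extract_hrefs(html):
    """Return the href attribute of every <a> element in the markup."""
    doc = lxml.html.fromstring(html)
    # "//a/@href" selects the href attribute of every anchor in the tree.
    return doc.xpath("//a/@href")


if __name__ == "__main__":
    for href in extract_hrefs(HTML):
        print(href)
```

Parsed this way, all three anchors should be found rather than only the first one.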
So if you want to use this method in a CrawlSpider, you just need to write a simple (or a complex) link extractor.
E.g.:
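The original extractor code is missing here, so this is a hedged sketch rather than the answerer's implementation. It assumes that a CrawlSpider rule only needs an object with an `extract_links(response)` method returning `Link`-like objects with a `url` and `text`. The class name `LxmlAnchorLinkExtractor` is hypothetical, and the `namedtuple` fallback exists only so the sketch can be tried without Scrapy installed.

```python
from urllib.parse import urljoin

import lxml.html

try:
    # Scrapy's own link container, used when Scrapy is available.
    from scrapy.link import Link
except ImportError:
    # Stand-in with the same two fields, for trying the sketch in isolation.
    from collections import namedtuple
    Link = namedtuple("Link", ["url", "text"])


class LxmlAnchorLinkExtractor:
    """Hypothetical link extractor that parses the page with lxml.html."""

    def extract_links(self, response):
        # response.body holds the raw page bytes; lxml.html accepts them directly.
        doc = lxml.html.fromstring(response.body)
        links = []
        for anchor in doc.xpath("//a[@href]"):
            # Resolve relative hrefs against the URL of the page they came from.
            url = urljoin(response.url, anchor.get("href"))
            links.append(Link(url=url, text=anchor.text_content().strip()))
        return links
```

A more complex extractor could add filtering (allowed domains, URL patterns, de-duplication) inside the loop; this one deliberately returns every anchor it finds.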
And then use that in your spider.