I’m using lxml in Python to parse some HTML and I want to extract all links to images. The way I do it right now is:
//a[contains(@href,'.jpg') or contains(@href,'.jpeg') or ... (etc)]
There are a couple of problems with this approach:
- you have to list all possible image extensions in every case variant (both “jpg” and “JPG”), which is not elegant
- in weird situations, the href may contain .jpg somewhere in the middle, not at the end of the string
I wanted to use regexp, but I failed:
//a[regx:match(@href,'.*\.(?:png|jpg|jpeg)')]
This returned all the links every time …
Does anyone know the right, elegant way to do this, or what is wrong with my regexp approach?
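Instead of:
//a[contains(@href, '.jpg')]
Use:
//a[substring(@href, string-length(@href) - 3) = '.jpg']
(and the same expression pattern for the other possible endings, with the offset set to the length of the ending minus one).
The above expression is the XPath 1.0 equivalent to the following XPath 2.0 expression:
//a[ends-with(@href, '.jpg')]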
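Note that lxml evaluates XPath 1.0 only, so the XPath 2.0 ends-with() function is not available there. Here is a minimal sketch of how both routes might look in lxml: the substring()/string-length() idiom, and the EXSLT re:test() function for the regexp route. The sample markup, the helper names (suffix_test, image_links_xpath1, image_links_regex) and the translate() call for case-insensitivity are my own assumptions for illustration, not part of the original question or answer.

```python
from lxml import html

# Namespace URI for EXSLT regular expressions; the "re" prefix is an arbitrary choice.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical sample markup, for illustration only.
SAMPLE = """
<html><body>
  <a href="photo.jpg">one</a>
  <a href="PHOTO.JPG">two</a>
  <a href="scan.jpeg">three</a>
  <a href="logo.png">four</a>
  <a href="notes.jpg.txt">five</a>
  <a href="page.html">six</a>
</body></html>
"""


def suffix_test(ext):
    """Build an XPath 1.0 test that is true when @href ends with ext, ignoring case."""
    lower, upper = ext.lower(), ext.upper()
    # translate() lowercases only the characters that occur in the extension;
    # substring() keeps the last len(ext) characters of the attribute value.
    return (
        "substring(translate(@href, '{u}', '{l}'), "
        "string-length(@href) - {n}) = '{l}'"
    ).format(u=upper, l=lower, n=len(ext) - 1)


def image_links_xpath1(doc, extensions=(".jpg", ".jpeg", ".png")):
    """Pure XPath 1.0: one suffix test per extension, joined with 'or'."""
    predicate = " or ".join(suffix_test(e) for e in extensions)
    return doc.xpath("//a[{}]/@href".format(predicate))


def image_links_regex(doc):
    """EXSLT route: re:test() returns a boolean, '$' anchors at the end, 'i' ignores case."""
    return doc.xpath(
        r"//a[re:test(@href, '\.(?:png|jpe?g)$', 'i')]/@href",
        namespaces=EXSLT_RE,
    )


doc = html.fromstring(SAMPLE)
print(image_links_xpath1(doc))
print(image_links_regex(doc))
```

On this sample both functions should pick up the four image links and skip notes.jpg.txt and page.html. The regexp variant is shorter, but it relies on lxml's EXSLT support, whereas the substring() variant should work in any XPath 1.0 engine.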