I’m using lxml in Python to parse some HTML and I want to extract all links to images

Question

Asked: May 18, 2026 (2026-05-18T10:23:32+00:00)

I’m using lxml in Python to parse some HTML and I want to extract all links to images. Right now I do it like this:

//a[contains(@href,'.jpg') or contains(@href,'.jpeg') or ... (etc)]

There are a couple of problems with this approach:

You have to list every possible image extension in every case variant (both “jpg” and “JPG”), which is not elegant.
In unusual cases, the href may contain .jpg somewhere in the middle rather than at the end of the string.

I wanted to use a regular expression, but my attempt failed:

//a[regx:match(@href,'.*\.(?:png|jpg|jpeg)')]

This returned all links every time.

Does anyone know the right, elegant way to do this, or what is wrong with my regexp approach?

Editorial Team · Answer 1 · 2026-05-18T10:23:33+00:00

Instead of:

a[contains(@href,'.jpg')]

Use:

a[substring(@href, string-length(@href)-3)='.jpg']

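(Use the same pattern for the other possible endings. The number subtracted from string-length(@href) is the suffix length minus one, so '.jpeg' needs string-length(@href)-4.)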
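As a rough sketch of how the substring pattern could be applied from lxml to several endings at once, the helper below builds one predicate per suffix and wraps @href in translate() so that “JPG” and “jpg” are treated alike. The sample markup and the names ends_with_predicate and image_links are made up for illustration; they are not part of lxml.

```python
from lxml import html

# Hypothetical sample markup, used only for illustration.
SAMPLE = """
<html><body>
  <a href="photo.jpg">a</a>
  <a href="PHOTO.JPG">b</a>
  <a href="pic.jpeg">c</a>
  <a href="logo.png">d</a>
  <a href="archive.jpg.zip">e</a>
  <a href="page.html">f</a>
</body></html>
"""

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"


def ends_with_predicate(suffix):
    """Build an XPath 1.0 test that @href ends with `suffix` (ASCII case ignored).

    `suffix` is assumed to be lowercase and to contain no quote characters.
    """
    offset = len(suffix) - 1  # '.jpg' -> 3, '.jpeg' -> 4
    lowered = f"translate(@href, '{UPPER}', '{LOWER}')"
    return f"substring({lowered}, string-length(@href) - {offset}) = '{suffix}'"


def image_links(tree, suffixes=(".jpg", ".jpeg", ".png")):
    """Return the href of every <a> whose href ends with one of `suffixes`."""
    predicate = " or ".join(ends_with_predicate(s) for s in suffixes)
    return tree.xpath(f"//a[{predicate}]/@href")


tree = html.fromstring(SAMPLE)
links = image_links(tree)
print(links)
```

Because the comparison looks only at the last characters of the href, a link such as archive.jpg.zip is not selected, which addresses the “.jpg somewhere in the middle” concern from the question.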

The substring() expression above is the XPath 1.0 equivalent of the following XPath 2.0 expression:

a[ends-with(@href, '.jpg')]
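Note that lxml evaluates XPath 1.0 (via libxml2), so ends-with() is not available there. For the regexp route from the question, lxml does support the EXSLT regular-expression functions once their namespace is bound to a prefix. The sketch below uses re:test(), which returns a boolean and so fits a predicate, anchors the pattern with $ so that only the end of the href is checked, and passes the 'i' flag for case-insensitive matching. The sample markup is invented for illustration, and the prefix name re is an arbitrary choice; what matters is the namespace URI.

```python
from lxml import html

# Namespace of the EXSLT regular-expression functions supported by lxml.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical sample markup, used only for illustration.
SAMPLE = """
<html><body>
  <a href="photo.jpg">a</a>
  <a href="PHOTO.JPG">b</a>
  <a href="pic.jpeg">c</a>
  <a href="logo.png">d</a>
  <a href="archive.jpg.zip">e</a>
  <a href="page.html">f</a>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# The raw string keeps the backslash intact; XPath string literals do no
# escape processing, so the regex engine receives '\.' as written.
regex_links = tree.xpath(
    r"//a[re:test(@href, '\.(?:png|jpe?g)$', 'i')]/@href",
    namespaces=EXSLT_RE,
)
print(regex_links)
```

Compared with the substring pattern, this keeps all extensions in a single expression, at the cost of relying on an lxml extension rather than plain XPath 1.0.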