Afternoon all,
I am trying to write a script that will extract the first image from an article via its <img src=""/> tags. So if an article has:
<p>Lorem ipsum dolor sit amet, labore et dolore magna aliqua.<img src="example.jpg"/> Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.</p>
I would like to extract the whole image tag, <img src="example.jpg"/>.
I found this regex which extracts just the location of the image:
content_to_extract_from[/img.*?src="(.*?)"/i,1]
which produces "example.jpg".
Does anyone know a regex that will capture the tags as well?
Thanks in advance, Andy
Using regexes to parse markup is asking for trouble. You can probably write something that mostly works but breaks on cases you didn’t foresee. For example, attribute values can be enclosed in single quotes instead of double quotes, which your regex won’t handle.
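To make that concrete, here is a rough sketch of a regex that returns the whole tag rather than just the src value, along with the single-quote case where it comes up empty. The pattern and the sample strings are my own illustration (not something from your code), and the pattern assumes no ">" appears inside an attribute value.

```ruby
# Hypothetical sample inputs, made up for illustration.
double_quoted = '<p>Lorem ipsum <img src="example.jpg"/> dolor sit amet.</p>'
single_quoted = "<p>Lorem ipsum <img src='example.jpg'/> dolor sit amet.</p>"

# Match from "<img" through the closing ">" so the whole tag is returned.
# Like the original regex, this only recognises a double-quoted src.
img_tag_pattern = /<img\b[^>]*?\bsrc="[^"]*"[^>]*>/i

whole_tag = double_quoted[img_tag_pattern]  # the full <img .../> tag
missed    = single_quoted[img_tag_pattern]  # nil, because of the single quotes

puts whole_tag.inspect
puts missed.inspect
```

You could extend the pattern to accept single quotes too, but unquoted values, odd whitespace and commented-out markup would each need yet another special case.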
Much more reliable is to use a real parser, such as Nokogiri.