I have various HTML documents from which I'm trying to extract links to (1) other HTML documents and (2) image files such as .jpg, .png, and .bmp. I need a regular expression to do this and cannot seem to figure it out.
Each of the HTML pages will have code similar to the following:
<IMG style="MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px" align=right src="images/sample001.jpg">
<IMG style="MARGIN-BOTTOM: 25px; MARGIN-LEFT: 25px" align=right src="images/sample002.png">
<IMG style="MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px" align=right src="images/sample003.bmp">
<A href="javascript:parent.POPUP({url:'testDoc001.htm',type:'shared',width:600,height:645})">
<A href="javascript:parent.POPUP({url:'testDoc002.html',type:'shared',width:700,height:712})">
As an example, the regular expression would operate on the above HTML and produce the following array:
images/sample001.jpg
images/sample002.png
images/sample003.bmp
testDoc001.htm
testDoc002.html
Can someone help me out? Thanks so much.
Save yourself the frustration and bugs that you’ll encounter trying to parse HTML with regular expressions. Use an HTML parser like HTML Agility Pack.
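HTML Agility Pack is a .NET library, and the question does not say which language is in use, so here is a rough sketch of the same parser-first approach using Python's standard html.parser module instead. The names LinkExtractor, extract_links, IMAGE_EXTS, PAGE_EXTS and POPUP_URL are made up for illustration, and the sketch assumes the document links sit in the href of an A tag, written as a url:'...' argument exactly as in the sample. The parser finds the tags and attributes; a small regular expression is only used on the attribute value to pull the file name out of the javascript: call.

```python
import re
from html.parser import HTMLParser

# Extensions named in the question; adjust as needed (illustrative constants).
IMAGE_EXTS = (".jpg", ".png", ".bmp")
PAGE_EXTS = (".htm", ".html")

# Matches the url:'...' argument inside a javascript:parent.POPUP({...}) href.
# Assumes the argument is written as in the sample markup.
POPUP_URL = re.compile(r"""url\s*:\s*['"]([^'"]+)['"]""")


class LinkExtractor(HTMLParser):
    """Collects image sources and linked HTML documents in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <IMG> arrives as "img".
        attrs = dict(attrs)
        if tag == "img":
            src = attrs.get("src") or ""
            if src.lower().endswith(IMAGE_EXTS):
                self.links.append(src)
        elif tag == "a":
            href = attrs.get("href") or ""
            if href.lower().startswith("javascript:"):
                # Unwrap the file name from the POPUP call, if present.
                match = POPUP_URL.search(href)
                href = match.group(1) if match else ""
            if href.lower().endswith(PAGE_EXTS):
                self.links.append(href)


def extract_links(html):
    """Return the matching links found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling extract_links on a page's text returns a plain list of strings, which can then be split into images and documents by extension if the two groups need to be kept apart.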