I have a string containing HTML code that has been encoded with htmlentities.
What I want to do is find all the paths in the document, i.e. the values between the quotes in
href="XXX" and src="XXX".
I already have a regular expression that finds all the links starting with http, https, ftp or file, and lets me iterate over them:
"/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i"
Any idea?
Update: Doing it with a regex isn't reliable. A src=".." or href=".." statement can be part of a comment or a JavaScript statement. To obtain the links reliably, I would suggest using XPath.
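As a rough illustration of the parser-based approach, here is a minimal sketch in Python. The language is my assumption, since the question does not name one. Python's standard library has no HTML-aware XPath engine, so the sketch swaps in html.parser.HTMLParser rather than XPath itself. The point is the same: a real parser only reports attributes on actual tags, so text inside comments is skipped. The names LinkCollector, extract_links and sample are made up for this example. With an XPath-capable library, the equivalent query would be something like //@href | //@src.

```python
import html
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href and src attribute values from real tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name in ("href", "src") and value is not None:
                self.links.append(value)


def extract_links(encoded):
    # The input is entity-encoded, so turn it back into markup first.
    markup = html.unescape(encoded)
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


# Hypothetical input: one commented-out link, one real link, one image.
sample = (
    "&lt;!-- &lt;a href=&quot;/hidden&quot;&gt; --&gt;"
    "&lt;a href=&quot;/page.html&quot;&gt;x&lt;/a&gt;"
    "&lt;img src=&quot;img/logo.png&quot;&gt;"
)
print(extract_links(sample))
```

The commented-out link is not reported, because the parser hands comments to a separate callback instead of treating them as tags.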
If you use a regex anyway, I would try to grab the content between the =" and the closing " of the href or src attribute. Here is an example of how to get the links from this page using a regex: