I’m working on parsing an input, which is HTML. I need to find all href or src attributes that DON’T have a protocol (http://, https://, ftp://, etc.) on them and, when they don’t, prepend a variable that contains a protocol and domain.
So for example I want
<a href="/_mylink/goes/here">Link 1</a>
<a href="http://site.com/_myotherlink/goes/here">Link 2</a>
to return:
<a href="http://mydomain.com/_mylink/goes/here">Link 1</a>
<a href="http://site.com/_myotherlink/goes/here">Link 2</a>
I can get the whole href attribute, but I can’t seem to work out how to match and replace only IF it’s missing a protocol. I found that [^0-9] would work in an inverse/not way, but I couldn’t get it to work when trying it with http:// etc.
Edit:
Just to mention it, as it’s become obvious to me that it’s part of the ‘scope’ of this question: I want to avoid URL encoding as a result of the replacement. I use things like {} in some of these URLs, and I don’t want them to end up with things like %7B and %7D in them.
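For what it’s worth, a character class like [^0-9] only negates single characters, so it can’t express "does not start with http://". The usual tool for that is a negative lookahead. Here is a minimal sketch in Python (the question doesn’t say which language is in use, so this is an assumption). BASE and absolutize are made-up names for illustration. It treats anything that starts with a scheme (letters followed by a colon), with "//", or with "#" as already fine, and it only prepends text, so {} is left untouched.

```python
import re

BASE = "http://mydomain.com"  # hypothetical protocol + domain variable

# href/src attribute whose quoted value does NOT begin with a scheme
# (e.g. "http:", "ftp:", "mailto:"), with "//", or with "#".
PATTERN = re.compile(
    r'''\b(href|src)\s*=\s*(["'])(?![a-zA-Z][a-zA-Z0-9+.\-]*:|//|#)(.*?)\2''',
    re.IGNORECASE,
)

def absolutize(html, base=BASE):
    """Prefix protocol-less href/src values with base; nothing is URL-encoded."""
    def repl(match):
        attr, quote, path = match.group(1), match.group(2), match.group(3)
        # Add a slash only when the relative path does not already start with one.
        sep = "" if path.startswith("/") else "/"
        return f"{attr}={quote}{base}{sep}{path}{quote}"
    return PATTERN.sub(repl, html)

sample = (
    '<a href="/_mylink/goes/here">Link 1</a>\n'
    '<a href="http://site.com/_myotherlink/goes/here">Link 2</a>'
)
print(absolutize(sample))
```

Regexes on HTML are fragile (unquoted attribute values, for instance, are not handled here), so treat this as a starting point rather than a complete solution.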
Why not use the DOM to easily replace these attributes? For example
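A rough sketch of the DOM idea, in Python with the standard library’s xml.dom.minidom. This assumes the input is a well-formed (XHTML-style) fragment; real-world HTML usually needs a more lenient parser. BASE, HAS_SCHEME and absolutize_dom are names invented for this example, not anything from the question.

```python
import re
from xml.dom import minidom

BASE = "http://mydomain.com"  # hypothetical protocol + domain variable

# True when a value already starts with a scheme ("http:", "ftp:", ...) or "//".
HAS_SCHEME = re.compile(r"^(?:[a-zA-Z][a-zA-Z0-9+.\-]*:|//)")

def absolutize_dom(fragment, base=BASE):
    """Rewrite protocol-less href/src attributes via the DOM.

    Assumes `fragment` is well-formed XML; it is wrapped in a temporary
    <root> element so several top-level tags can be parsed together.
    """
    doc = minidom.parseString(f"<root>{fragment}</root>")
    for element in doc.getElementsByTagName("*"):
        for name in ("href", "src"):
            if not element.hasAttribute(name):
                continue
            value = element.getAttribute(name)
            if value and not HAS_SCHEME.match(value):
                sep = "" if value.startswith("/") else "/"
                element.setAttribute(name, base + sep + value)
    # Serialize only the children, dropping the temporary wrapper.
    return "".join(child.toxml() for child in doc.documentElement.childNodes)

sample = (
    '<a href="/_mylink/goes/here">Link 1</a>\n'
    '<a href="http://site.com/_myotherlink/goes/here">Link 2</a>'
)
print(absolutize_dom(sample))
```

With this serializer, attribute values are XML-escaped (e.g. & becomes &amp;amp;) but not percent-encoded, so {} placeholders should come through unchanged. Other DOM libraries may behave differently when writing the markup back out, so it is worth checking the output of whichever one you use.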