console.log( html.match( /<a href="(.*?)">[^<]+<\/a>/g ));
Instead of returning just the urls like:
http://google.com, http://yahoo.com
It’s returning the entire tag:
<a href="http://google.com">Google.com</a>, <a href="http://yahoo.com">Yahoo.com</a>
Why is that the case?
You want `RegExp#exec` and a loop accessing the element at the match result's `1` index, rather than `String.match`. `String.match` doesn't return the capture groups when there's a `g` flag, just an array of the elements at index `0` of each match, which is the whole matching string. (See Section 15.5.4.10 of the spec.) So in essence:
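A minimal sketch of that loop, assuming `html` holds markup like that in the question. The sample string and the names `rex`, `urls`, and `match` are illustrative, not taken from the original post:

```javascript
// Sample input mirroring the question (an assumption for illustration).
const html =
  '<a href="http://google.com">Google.com</a>' +
  '<a href="http://yahoo.com">Yahoo.com</a>';

// With the g flag, each exec() call resumes from rex.lastIndex,
// so calling it in a loop walks through every match in turn.
const rex = /<a href="(.*?)">[^<]+<\/a>/g;
const urls = [];
let match;
while ((match = rex.exec(html)) !== null) {
  // match[0] is the whole tag; match[1] is the first capture group.
  urls.push(match[1]);
}

console.log(urls); // [ 'http://google.com', 'http://yahoo.com' ]
```

In engines that support ES2020, `String.prototype.matchAll` with the same `g`-flagged pattern also yields match arrays that include the capture groups.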
But this is parsing HTML with regular expressions. Here There Be Dragons.
Update re dragons: here's a quick list of things that will defeat this regexp, off the top of my head:

- Anything other than exactly one space between `a` and `href`, such as two spaces rather than one, a line break, `class='foo'`, etc., etc.
- Single quotes rather than double quotes around the `href` attribute.
- No quotes around the `href` attribute at all.
- Anything after the `href` attribute that also uses double quotes.

This is not to be down on your regexp, it's just to highlight that regular expressions can't be reliably used on their own to parse HTML. They can form part of the solution, helping you scan for tokens, but they can't reliably do the whole job.
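To make those failure cases concrete, here is a small sketch that runs the question's pattern against a few variants. The helper name `extractHrefs` and the sample strings are hypothetical, made up for this illustration:

```javascript
// Hypothetical helper: collects capture group 1 for every match
// of the pattern from the question.
function extractHrefs(html) {
  const rex = /<a href="(.*?)">[^<]+<\/a>/g;
  const urls = [];
  let match;
  while ((match = rex.exec(html)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

// Made-up inputs, one per failure case.
const samples = {
  twoSpaces: '<a  href="http://google.com">Google</a>',
  singleQuotes: "<a href='http://google.com'>Google</a>",
  noQuotes: '<a href=http://google.com>Google</a>',
  trailingAttr: '<a href="http://google.com" class="foo">Google</a>',
};

for (const [name, html] of Object.entries(samples)) {
  console.log(name, extractHrefs(html));
}
// twoSpaces, singleQuotes, and noQuotes produce no matches at all.
// trailingAttr matches, but captures: http://google.com" class="foo
```

The last case is arguably the worst of the four, since the pattern still matches and quietly returns a wrong value instead of nothing.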