I'm looking to scrape rapidshare.com links from websites. I have the following regular expressions to find links:
<a href=\"(http://rapidshare.com/files/(\\d+)/(.+)\\.(\\w{3,4}))\"
http://rapidshare.com/files/(\\d+)/(.+)\\.(\\w{3,4})
How can I write a regex that excludes the text embedded in an <a href="..."> tag and captures only the text in >here</a>?
I also have to bear in mind that not all links are embedded in href attributes. Some are just displayed as plain text.
Basically, is there a way to exclude patterns in a regex?
Thanks.
To capture the inner text of an anchor tag, while ignoring all attribute text of the tag, you’d use the pattern:
The [^>]* part matches everything else in your tag up until the end of the start tag.
The (.*?) performs a non-greedy capture of the inner text.
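As a rough illustration of the two pieces described above, here is a minimal Python sketch. The full pattern `<a\b[^>]*>(.*?)</a>` is an assumed reconstruction built from those two pieces, not necessarily the exact pattern intended, and the names `ANCHOR_TEXT` and `anchor_texts` are hypothetical. It will misbehave if an attribute value contains a `>` character, so treat it as a sketch rather than a robust HTML parser.

```python
import re

# Assumed reconstruction: [^>]* skips every attribute up to the end of the
# start tag, and (.*?) lazily captures the inner text up to the first </a>.
ANCHOR_TEXT = re.compile(r'<a\b[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def anchor_texts(html):
    """Return the inner text of every anchor tag found in html."""
    return ANCHOR_TEXT.findall(html)
```

Calling `anchor_texts(page_source)` returns only the text between the tags, so whatever appears in `href` or other attributes never ends up in the result.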
If you want to capture anchor tag links and non-anchor tag links, then those are really two separate problems. There’s probably a regex for it, but it would be terribly complicated. You’re better off simply looking for non-anchor-tag links separately with the simple regex:
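One possible way to do the separate search is sketched below in Python. It is based on the second pattern from the question, with two assumed changes: the dot in `rapidshare.com` is escaped, and `(.+)` is replaced by `[^\s"<>]+` so that a match stops at whitespace, quotes, or tag boundaries. The names `LINK` and `find_links` are hypothetical. Because the pattern does not depend on tags at all, running it over the raw page text picks up links in plain text as well as inside anchors, and duplicates are dropped.

```python
import re

# Variant of the question's second pattern (an assumption, not the original):
# [^\s"<>]+ replaces (.+) so the file name cannot run past the end of the URL.
LINK = re.compile(r'http://rapidshare\.com/files/(\d+)/([^\s"<>]+)\.(\w{3,4})')


def find_links(text):
    """Return unique rapidshare URLs in order of first appearance."""
    seen = []
    for match in LINK.finditer(text):
        url = match.group(0)
        if url not in seen:  # the same URL often appears in href and inner text
            seen.append(url)
    return seen
```

If the file ID, name, and extension are needed separately, `match.groups()` exposes them in the same order as the capture groups in the question's pattern.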