I am having a problem with a regular expression.
I want to get all URLs from a given string, but I don't want URLs that end with .jpg, .css, .js, .gif, etc.
Here is my ASP.NET C# code:
    using (var client = new WebClient())
    {
        client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13";
        string result = client.DownloadString(strBasicUrl);
        Regex MyRegex = new Regex("http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);
        MatchCollection matches = MyRegex.Matches(result);
        foreach (var item in matches)
        {
            litResult.Text += item.ToString() + "<br>";
        }
    }
I want to change this regular expression.
If strBasicUrl is "http://www.Microsoft.com", the result should not include URLs such as these:
http://i.microsoft.com/en-us/homepage/shared/templates/components/hpSearch/images/searchSprite.ltr.gif
http://i.microsoft.com/global/ImageStore/PublishingImages/Asset/Header/logo_skype.png
Can anybody help me with this? It would be much appreciated.
Thanks in advance,
Amit Prajapati
I think Mike has already answered your question, but I have been thinking about this ever since you asked, and thanks to your question I learned about lookaheads, lookbehinds, and negative lookbehinds in regular expressions.
So here is one alternative, if you don't want to run a regular expression in a loop.
For readability, here is the regex (without escape sequence):
Assuming you are developing a crawler, note that your regex does not match relative links. Once relative links are matched, you should exclude links that start with javascript: or # (anchors).
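The pattern itself is not reproduced above, so what follows is only a minimal sketch of the idea, not the original regex. It assumes links appear in quoted href attributes; the class name LinkFilter, the method ExtractLinks, the extension list, and the sample HTML are all hypothetical. A negative lookahead skips javascript: links and # anchors, and a negative lookbehind rejects values that end in a static-file extension.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkFilter
{
    // Hypothetical sketch pattern (an assumption, not the answer's original regex):
    //   (?<URL> ... )            named group holding the link value
    //   (?!javascript:|#)        negative lookahead: skip script links and anchors
    //   [^"'\s>]+                the link text, up to the closing quote
    //   (?<!\.(?:jpe?g|...))     negative lookbehind: reject static-file extensions
    private static readonly Regex LinkRegex = new Regex(
        @"href\s*=\s*[""'](?<URL>(?!javascript:|#)[^""'\s>]+(?<!\.(?:jpe?g|png|gif|css|js|ico)))[""']",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns every href value (absolute or relative) that passes both checks.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match match in LinkRegex.Matches(html))
        {
            links.Add(match.Groups["URL"].Value);
        }
        return links;
    }

    public static void Main()
    {
        // Made-up sample markup for illustration only.
        string html = string.Join("\n", new[]
        {
            "<a href=\"http://www.microsoft.com/en-us/default.aspx\">Home</a>",
            "<a href='/windows/'>Windows</a>",
            "<a href=\"javascript:void(0)\">Menu</a>",
            "<a href=\"#top\">Top</a>",
            "<link href=\"/styles/site.css\" rel=\"stylesheet\">",
            "<a href=\"http://i.microsoft.com/global/ImageStore/PublishingImages/Asset/Header/logo_skype.png\">Logo</a>"
        });

        // Prints the .aspx link and the relative /windows/ link only.
        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

One limitation of this sketch: the lookbehind only checks the very end of the value, so a link such as /styles/site.css?v=2 would still be returned. If that matters, the query string would need to be handled separately.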
Here you can see that we capture a named group called “URL”. To get the URL part, you need to use the following (you may already be aware of this):
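As a small illustration of reading a named group in .NET, the sketch below contrasts the whole match with the value of the "URL" group. The simplified pattern, the class name NamedGroupDemo, the helper FirstUrl, and the sample markup are assumptions made for this example, not the answer's original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Simplified, hypothetical pattern: captures a double-quoted href value
    // in a group named URL.
    private const string Pattern = @"href=""(?<URL>[^""]+)""";

    // Returns the URL group of the first match, or an empty string if none.
    public static string FirstUrl(string html)
    {
        Match match = Regex.Match(html, Pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups["URL"].Value : string.Empty;
    }

    public static void Main()
    {
        string html = "<a href=\"/windows/\">Windows</a>";
        Match match = Regex.Match(html, Pattern, RegexOptions.IgnoreCase);

        Console.WriteLine(match.Value);               // whole match: href="/windows/"
        Console.WriteLine(match.Groups["URL"].Value); // named group only: /windows/
    }
}
```

In the question's loop, this would mean appending the group's Value to litResult.Text rather than item.ToString(), which returns the whole match.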
Here is the explanation of the regex:
This way you don't need to run a second regular expression in the loop, and you get both absolute and relative URLs.
Hope it helps…