I am writing a program that will help me find out which sites my competitors are linking to.
To do that, the program will parse an HTML file and produce two lists: internal links and external links.
I will use the internal links to further crawl the website, and the external links are actually what I am looking for.
How, using .NET Regex, do I parse an HTML file and find (1) the external links and (2) the internal links?
Thanks in advance,
Eytan Levit.
Edit: In response to the question: no, I am not bound to regex; I can use any other approach.
Don’t use a regular expression for this.
Use something like the HTML Agility Pack which is specifically designed for parsing HTML. (There’s even an example on their CodePlex homepage which finds all links in a page.)
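
To illustrate the parser-based approach in general terms, here is a minimal sketch. It is written in Python with only the standard library, not with the HTML Agility Pack, so treat it as an outline of the idea rather than the .NET code you would ship. The names `LinkCollector` and `split_links` are made up for this example, and "internal" is assumed to mean "same host as the page being parsed"; you may want a looser rule (for example, treating subdomains as internal).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects <a href> targets and sorts them into internal and external."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        # Assumption: a link is internal when its host equals the page's host.
        self.base_host = urlparse(base_url).netloc.lower()
        self.internal = []
        self.external = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the URL of the page.
        absolute = urljoin(self.base_url, href)
        parsed = urlparse(absolute)
        # Skip mailto:, javascript:, and other non-web schemes.
        if parsed.scheme not in ("http", "https"):
            return
        if parsed.netloc.lower() == self.base_host:
            self.internal.append(absolute)
        else:
            self.external.append(absolute)


def split_links(html, base_url):
    """Return (internal_links, external_links) found in the given HTML."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.internal, collector.external
```

The same three steps carry over to any HTML parser: select every anchor that has an href, resolve it against the page URL, then compare hosts. The internal list feeds the crawler, and the external list is the result you are after.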