I’m trying to build an ASP.NET page that can crawl web pages and display

Question

Asked: May 27, 2026 (2026-05-27T23:43:02+00:00)

I’m trying to build an ASP.NET page that can crawl web pages and display them correctly, with all relevant HTML elements edited to use absolute URLs where appropriate.

This question has been partially answered here: https://stackoverflow.com/a/2719712/696638

Using a combination of the answer above and this blog post, http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/, I have built the following:

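using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public partial class Crawler : System.Web.UI.Page {
    protected void Page_Load(object sender, EventArgs e) {
        Response.Clear();

        string url = Request.QueryString["path"];

        // Download the raw bytes and decode them as UTF-8
        // (pages served in another encoding will not decode correctly)
        byte[] requestHTML;
        using (WebClient client = new WebClient()) {
            requestHTML = client.DownloadData(url);
        }
        string sourceHTML = Encoding.UTF8.GetString(requestHTML);

        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(sourceHTML);

        Uri baseUri = new Uri(url);

        // By default SelectNodes returns null, not an empty collection, when nothing matches
        HtmlNodeCollection links = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null) {
            foreach (HtmlNode link in links) {
                HtmlAttribute att = link.Attributes["href"];
                string href = att.Value;
                if (string.IsNullOrEmpty(href)) continue;

                // ignore javascript on buttons using a tags
                if (href.StartsWith("javascript", StringComparison.OrdinalIgnoreCase)) continue;

                // TryCreate skips malformed hrefs instead of throwing UriFormatException
                Uri urlNext;
                if (!Uri.TryCreate(href, UriKind.RelativeOrAbsolute, out urlNext)) continue;

                // resolve relative links against the URL of the crawled page
                if (!urlNext.IsAbsoluteUri) {
                    att.Value = new Uri(baseUri, urlNext).AbsoluteUri;
                }
            }
        }

        Response.Write(htmlDoc.DocumentNode.OuterHtml);
    }
}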

This only rewrites the href attribute of <a> elements. I’d like to know the most efficient way to extend it to cover:

href attribute for <a> elements
href attribute for <link> elements
src attribute for <script> elements
src attribute for <img> elements
action attribute for <form> elements

Are there any others people can think of?

Could these all be found with a single call to SelectNodes using one large XPath expression, or would it be more efficient to call SelectNodes several times and iterate through each collection?


1 Answer

Editorial Team · Answer 1 · 2026-05-27T23:43:03+00:00


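A single XPath expression that matches any element carrying one of those attributes should work:

SelectNodes("//*[@href or @src or @action]")

You would then have to adapt the if statement inside the loop so that it checks whichever of the three attributes is present on each node, rather than only href.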
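As a rough illustration of that adaptation, the loop might look something like the sketch below. The names `UrlRewriter`, `ToAbsolute` and `MakeUrlsAbsolute` are hypothetical and not part of HtmlAgilityPack; the sketch assumes the HtmlAgilityPack package is referenced and only handles the three attributes named in the XPath (it does not touch `srcset`, inline CSS `url(...)` values, or a `<base>` element).

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class UrlRewriter
{
    // The attributes matched by the XPath expression above.
    private static readonly string[] UrlAttributes = { "href", "src", "action" };

    // Returns the absolute form of a relative value, or null when the value
    // should be left alone (empty, javascript:, fragment-only, unparsable,
    // or already absolute).
    public static string ToAbsolute(Uri baseUri, string value)
    {
        if (string.IsNullOrWhiteSpace(value)) return null;
        value = value.Trim();
        if (value.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase)) return null;
        if (value.StartsWith("#")) return null;

        Uri parsed;
        if (!Uri.TryCreate(value, UriKind.RelativeOrAbsolute, out parsed)) return null;
        if (parsed.IsAbsoluteUri) return null;

        Uri resolved;
        return Uri.TryCreate(baseUri, parsed, out resolved) ? resolved.AbsoluteUri : null;
    }

    // One SelectNodes call, then one pass over the matched nodes.
    public static void MakeUrlsAbsolute(HtmlDocument doc, Uri baseUri)
    {
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//*[@href or @src or @action]");
        if (nodes == null) return; // nothing matched

        foreach (HtmlNode node in nodes)
        {
            foreach (string name in UrlAttributes)
            {
                HtmlAttribute att = node.Attributes[name];
                if (att == null) continue; // this node lacks that attribute

                string absolute = ToAbsolute(baseUri, att.Value);
                if (absolute != null) att.Value = absolute;
            }
        }
    }
}
```

In the question's Page_Load, this would presumably be called as `UrlRewriter.MakeUrlsAbsolute(htmlDoc, new Uri(url));` just before `Response.Write`. Whether one broad query is actually faster than several narrow ones would need measuring on real pages; the single pass mainly keeps the rewriting logic in one place.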
