I am working on a simple web crawler to get a URL, Crawl first

Question

0

Asked: May 28, 20262026-05-28T04:02:48+00:00 2026-05-28T04:02:48+00:00

I am working on a simple web crawler to get a URL, Crawl first

0

I am working on a simple web crawler to get a URL, Crawl first level links on the site and extract mails from all pages using RegEx…

I know it’s kinda sloppy and it’s just the beginning, but i always get “operation Timed Out” after 2 minutes of running the script..

 private void button1_Click(object sender, System.EventArgs e)
    {

        string url = textBox1.Text;
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        StreamReader sr = new StreamReader(response.GetResponseStream());
        string code = sr.ReadToEnd();
        string re = "href=\"(.*?)\"";
        MatchCollection href = Regex.Matches(code, @re, RegexOptions.Singleline);
        foreach (Match h in href)
        {

            string link = h.Groups[1].Value;
            if (!link.Contains("http://"))
            {
                HttpWebRequest request2 = (HttpWebRequest)WebRequest.Create(url + link);
                HttpWebResponse response2 = (HttpWebResponse)request2.GetResponse();
                StreamReader sr2 = new StreamReader(response.GetResponseStream());
                string innerlink = sr.ReadToEnd();


                MatchCollection m2 = Regex.Matches(code, @"([\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)", RegexOptions.Singleline);


                foreach (Match m in m2)
                {
                    string email = m.Groups[1].Value;

                    if (!listBox1.Items.Contains(email))
                    {
                        listBox1.Items.Add(email);
                    }
                }
            }
        }

         sr.Close();
        }

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-28T04:02:48+00:00

Don’t parse Html using Regex. Use the Html Agility Pack for that.

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don’t HAVE to understand XPATH nor XSLT to use it, don’t worry…). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am working on a simple web crawler to get a URL, Crawl first

Leave an answerCancel reply

1 Answer

More Information

Leave an answer
Cancel reply