The Archive Base

Editorial Team
Asked: May 18, 2026

I am building a web crawler in PHP, meant for intranet use (we’re dealing with a huge intranet). I managed to download a web page using the cURL functions, but now I want to scan the content for links. I am trying to find all obvious links and split them into their scheme, authority, path, query, and fragment components so I can index them properly.

Is there a known regular expression that matches all the links, including ones like <img src="../images/header/logo.png" />, background-image: url(..) and <a href="?query#lonely-fragment">?

What are all the plain-text link representations that I can find using regular expressions in PHP?

1 Answer

  1. Editorial Team
    Added an answer on May 18, 2026 at 12:49 am

    You will be better off parsing the documents with a proper HTML parser. Regular expressions are not well suited to extracting links from HTML.

    Once you have done that, it is fairly simple to use XPath expressions such as //img/@src or //a/@href to find all of the content links in the document itself.
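
    A minimal sketch of the XPath approach described above, using PHP's bundled DOM extension. The sample markup and the helper name extractLinks() are made up for illustration; a real crawler would pass in the page body it fetched with cURL. parse_url() is used here only to show how each raw link could be split into components.

    ```php
    <?php
    // Hypothetical sample page (assumption); replace with the body fetched via cURL.
    $html = '<html><head>'
          . '<link rel="stylesheet" type="text/css" href="/css/site.css">'
          . '</head><body>'
          . '<img src="../images/header/logo.png" />'
          . '<a href="?query#lonely-fragment">next</a>'
          . '</body></html>';

    /**
     * Collect raw href/src attribute values from an HTML document.
     * extractLinks() is an illustrative name, not a library function.
     */
    function extractLinks(string $html): array
    {
        $doc = new DOMDocument();
        // Real-world markup is rarely valid, so keep parser warnings out of the output.
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($doc);
        $query = '//a/@href | //img/@src | //link[@rel="stylesheet"]/@href';

        $links = [];
        foreach ($xpath->query($query) as $attribute) {
            $links[] = $attribute->nodeValue;
        }
        return $links;
    }

    $links = extractLinks($html);

    foreach ($links as $link) {
        // parse_url() returns only the components that are present, or false
        // for URLs it cannot parse at all.
        $parts = parse_url($link);
        if ($parts === false) {
            continue;
        }
        print_r(['raw' => $link] + $parts);
    }
    ```

    The values collected this way are still relative references (for example ../images/header/logo.png), so before indexing they would need to be resolved against the URL of the page they were found on.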

    If you also want to scan CSS, look for //style[@type='text/css'] and //link[@rel='stylesheet'][@type='text/css']/@href (the type attribute is optional in HTML5, so you may need to drop that predicate), then use a proper CSS parser to extract the URLs. (Or, if you want to be lazy, you could probably get away with the regex /url\((.*?)\)/, bearing in mind that the captured group will include any quotes around the URL.)
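
    As a rough sketch of the lazy regex route for CSS, the pattern below is a slight variation on /url\((.*?)\)/ that also skips optional whitespace and surrounding quotes. The stylesheet text and the helper name extractCssUrls() are assumptions for illustration. This is a shortcut rather than a CSS parser, so it will also pick up url(...) occurrences inside comments.

    ```php
    <?php
    // Hypothetical stylesheet text (assumption); in practice this would come from
    // a <style> element or from a downloaded stylesheet.
    $css = 'body { background-image: url("../images/bg.png"); }' . "\n"
         . '.logo { background: url(  /images/logo.png ) no-repeat; }' . "\n"
         . ".icon { background: url('icons/star.svg'); }";

    /**
     * Extract url(...) targets from CSS text with a regular expression.
     * extractCssUrls() is an illustrative name, not a library function.
     */
    function extractCssUrls(string $css): array
    {
        // Group 1 captures an optional opening quote, group 2 the URL itself;
        // \1 requires the same quote (or nothing) before the closing parenthesis.
        preg_match_all('/url\(\s*([\'"]?)(.*?)\1\s*\)/i', $css, $matches);
        return $matches[2];
    }

    $cssUrls = extractCssUrls($css);
    print_r($cssUrls);
    ```

    With the sample input, the three captured values are the bare URLs without quotes or padding. Like links found in the HTML, they are relative references, and in an external stylesheet they resolve against the stylesheet's URL rather than the page's.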
