The Archive Base

Editorial Team
Asked: May 15, 2026 (2026-05-15T09:40:15+00:00)

For the past few days I have been trying to develop a regex that fetches all the external links from the web pages given to it, using grep.

Here is my grep command:

grep -h -o -e "\(\(mailto:\|\(\(ht\|f\)tp\(s\?\)\)\)\://\)\{1\}\(.*\?\)" "/mnt/websites_folder/folder_to_search" -r 

However, grep seems to return everything that follows the first external link on a given line.

Example:

If an HTML file contains something like this on a single line

<a href="http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>

then the grep command returns the following result:

http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>

The idea is that if an HTML file contains more than one link on the same line (whether in a, img, or other tags), the regex should fetch only the links and not the rest of the line.

I managed to develop a working regex on rubular.com. It is as follows:

("|')(\b((ht|f)tps?:\/\/)(.*?)\b)("|')

It works with the above input, but I am not able to replicate it in grep. Can anyone help?

I can't modify the HTML files, so please don't ask me to do that. I also can't look for each specific tag and check its attributes to get the external links, as that adds processing time and my application doesn't call for it.

Thank You


1 Answer

  1. Editorial Team
    Added an answer on May 15, 2026 at 9:40 am

    Try this:

    cat /path/to/file | grep -E -o "(mailto:|(ht|f)tps?://)[^'\"]+"

    Or, without the extra cat:

    grep -E -o "(mailto:|(ht|f)tps?://)[^'\"]+" /path/to/file
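To check the behaviour on the sample line from the question, here is a minimal sketch. It assumes a grep that supports -o and -E (GNU or BSD grep). The temporary file and the function name extract_links are made up for illustration, and the pattern is a variant that requires the colon after mailto and the :// after the other schemes.

```shell
#!/bin/sh
# Write the sample line from the question to a temporary file (illustrative only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href="http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>
EOF

# Hypothetical helper: print every link found in the given file, one per line.
# [^'\"]+ consumes characters up to, but not including, the next quote,
# so each match ends where the quoted attribute value ends.
extract_links() {
    grep -E -o "(mailto:|(ht|f)tps?://)[^'\"]+" "$1"
}

extract_links "$sample"
# prints:
# http://www.google.com
# https://yahoo.com

rm -f "$sample"
```

Both links come out separately even though they sit on the same line, because the negated character class can never run past a closing quote.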
    

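    This outputs one link per line. It assumes every link is enclosed in single or double quotes: the [^'\"]+ part stops at the first quote character, which is what the lazy .*? in the Rubular pattern was meant to do. grep's POSIX regular expressions have no lazy quantifiers, so .*\? in the original command still matches greedily to the end of the line. To exclude links from a certain domain, filter the output with -v: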

    grep -E -o "(mailto:|(ht|f)tps?://)[^'\"]+" /path/to/file | grep -v -F "yahoo.com"
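Since the question searches a whole folder with -r and -h, the extraction and the exclusion can be combined. The following is a rough sketch, not a definitive implementation. The function name list_links_excluding is hypothetical, the recursive -r option is assumed to be available (GNU and BSD grep), and the sort -u de-duplication step is an addition that was not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper: list the unique links found under a directory,
# skipping every link that contains the given literal string.
list_links_excluding() {
    dir=$1
    excluded=$2
    # -r recurses into the directory, -h suppresses file names,
    # -o prints only the matched text, -E enables extended regexes.
    grep -r -h -o -E "(mailto:|(ht|f)tps?://)[^'\"]+" "$dir" |
        # -F treats the excluded string literally, so "." is not a wildcard.
        grep -v -F -- "$excluded" |
        sort -u
}

# Example call, using the directory from the question:
# list_links_excluding /mnt/websites_folder/folder_to_search yahoo.com
```

The -F flag matters because a plain grep -v "yahoo.com" treats the dot as "any character" and would also drop a link containing, say, yahooXcom.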
    
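If the lazy quantifier from the Rubular pattern is preferred, GNU grep accepts Perl-compatible syntax through -P when it was built with PCRE support. This sketch assumes such a build: -P is not part of POSIX and is missing from some grep implementations, so the example checks for it first. The function name extract_links_lazy is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: extract links with a lazy quantifier (needs grep -P).
# .*? expands as little as possible, and the lookahead (?=["']) requires a
# closing quote without including it in the match.
extract_links_lazy() {
    grep -o -P "(mailto:|(ht|f)tps?://).*?(?=[\"'])" "$@"
}

# Only run the demonstration when this grep actually supports -P.
if printf 'x\n' | grep -q -P 'x' 2>/dev/null; then
    printf '%s\n' "<a href='https://yahoo.com'>Yahoo</a>" | extract_links_lazy
else
    echo "this grep has no -P support" >&2
fi
```

Where -P is unavailable, the negated character class form does the same job with plain -E.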