Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8007467
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 4, 20262026-06-04T17:50:05+00:00 2026-06-04T17:50:05+00:00

I’m extracting portions of URLs from text using a regular expression in Python. The

  • 0

I’m extracting portions of URLs from text using a regular expression in Python. The URLs I’m looking for are from a limited set of patterns so it feels like I should just able to handle them in a regex. What I’m trying to extract is the first portion of the file name (“some.file.name” in all the examples below), which can include dots, letters and digits.

These are the sorts of forms the URL can take:

http://www.example.com/some.file.name.html
http://www.example.com/some.file.name_foo.html
http://www.example.com/some.file.name(123).html
http://www.example.com/some.file.name_foo(123).html
http://www.example.com/some.file.name
http://www.example.com/some.file.name_foo
http://www.example.com/some.file.name(123)
http://www.example.com/some.file.name_foo(123)

I think I’m pretty much there with this regex:

http://www\.example\.com/([a-zA-Z0-9\.]+)(_[a-z]+)?(\(\d+\))?(\.html)?

But it includes the “.html” in the match when the URL is like the first one in the list. Is there any way of stopping this or is it a fundamental limitation of regular expressions?

I’m quite happy to remove the extension in code as it will always be the same and will never be valid as part of the file name, but it would be cleaner to do it as part of the regex match.

Edit:

I should emphasise that these URLs are in bodies of text. I can’t make any guarantees about whether there are characters before or after them or what those characters might be. I think it’s safe to assume that they won’t be numbers, letters, underscores or dots.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-04T17:50:09+00:00Added an answer on June 4, 2026 at 5:50 pm

    Regular expressions are matched greedy by default.

    Try this regexp:

    ^http://www\.example\.com/([a-zA-Z0-9\.]+?)(_[a-z]+)?(\(\d+\))?(\.html)?$
    

    Notice the extra ? added to not capture the .html in the first part. It makes the first group capture as little as neccessary to match, instead of as much as possible to match. Without the ?, the .html will be included in the first group, as the other groups are optional, and greedy matching tries to match as “early” as possible.

    P.S. Also note that I anchored the regexp using ^ and $ to always match the full line.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

For some reason, after submitting a string like this Jack’s Spindle from a text
I have a text area in my form which accepts all possible characters from
I have a bunch of posts stored in text files formatted in yaml/textile (from
link Im having trouble converting the html entites into html characters, (&# 8217;) i
That's pretty much it. I'm using Nokogiri to scrape a web page what has
I have a jquery bug and I've been looking for hours now, I can't
I have a string like this: La Torre Eiffel paragonata all’Everest What PHP function
I am reading a book about Javascript and jQuery and using one of the
I'm using v2.0 of ClassTextile.php, with the following call: $testimonial_text = $textile->TextileRestricted($_POST['testimonial']); ... and
I'm parsing an RSS feed that has an ’ in it. SimpleXML turns this

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.