The Archive Base

Editorial Team
Asked: May 21, 2026

I’ve been working on a regular expression to pick apart a bunch of text


I’ve been working on a regular expression to pick apart a bunch of text files that I need to parse into a database. My files are in the following format:

Lorem ipsum dolor         sit amet, consectetur adipiscing elit.

Fusce lacinia sollicitudin lectus id eleifend. Phasellus.

massa sapien, scelerisque in tincidunt et, porttitor eget ante.
In iaculis justo vel quam rhoncus volutpat. Curabitur eros est,
ultrices in elementum eget, venenatis eget mauris. Sed sollicitudin,
nibh sed varius aliquet, neque odio porttitor risus, at sollicitudin

lectus neque sit amet diam.
Aliquam condimentum sapien eu
tellus condimentum suscipit.
Pellentesque in accumsan nunc.

I’m trying to come up with the following capture groups:

  • Lorem ipsum dolor
  • sit amet, consectetur adipiscing elit.
  • Fusce lacinia sollicitudin lectus id eleifend. Phasellus.
  • massa sapien, scelerisque in tincidunt et, porttitor eget ante.
    In iaculis justo vel quam rhoncus volutpat. Curabitur eros est,
    ultrices in elementum eget, venenatis eget mauris. Sed sollicitudin,
    nibh sed varius aliquet, neque odio porttitor risus, at sollicitudin

Notes:
Everything after the multiline paragraph can be ignored. All of the groups can include letters, numbers, spaces and punctuation. I’m going to be doing some additional post-processing on the text using PHP.

My latest attempt to capture the first two parts, which came closer than my other attempts but still didn’t work as intended, was:

^((?:[a-zA-Z0-9!-~](?: (?! ))?)+?)(?: {2,})((?:[a-zA-Z0-9!-~](?: (?! ))?)+?)

I thought that this would start at the beginning of the file, capture everything up to the point where it encountered multiple spaces then grab the rest of the line.


1 Answer

  1. Editorial Team
    Added an answer on May 21, 2026 at 11:37 pm

    Try this:

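    // ~ is the delimiter; the s flag lets . match line breaks as well.
    $pattern='~\A(.+?) {2,}(.+?)\R{2,}(.+?)\R{2,}(.+?)(?:\R{2,}|\Z)~s';

    // $subject holds the whole file; the captures land in $match[1] to $match[4].
    preg_match($pattern, $subject, $match);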

    See it in action on ideone.com

    I’m assuming all those &nbsp;’s in your sample text represent regular spaces, and you only used them so we could see that there was more than one space. If you’d been using SO’s code formatting from the beginning, that wouldn’t have been necessary. That’s the indentation style of formatting; in text formatted with backticks, whitespace still gets collapsed.

    I’m also assuming you’re reading the whole file into memory, not processing it line-by-line. The regex is pretty straightforward. Starting at the beginning of the text (\A), it reluctantly matches and captures everything it sees ((.+?), in single-line mode, so the dot also matches line breaks) until it sees two or more consecutive spaces ( {2,}).

    After that, it reluctantly matches and captures until it sees two or more newlines in a row ((.+?)\R{2,}). Then it does the same thing twice more to capture the second and third paragraphs. The final (?:\R{2,}|\Z) is there in case there’s no more text after the third paragraph.
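To make the walkthrough concrete, here is a minimal sketch that runs the pattern against the sample text from the question. The input is built as an inline string with \n line endings purely for illustration (an assumption; a real script would read the file, for example with file_get_contents()), and the printing loop is not part of the original answer.

```php
<?php
// Sample input copied from the question. Assumption: plain \n line endings;
// a real run would load the file contents into $subject instead.
$subject = "Lorem ipsum dolor         sit amet, consectetur adipiscing elit.\n"
    . "\n"
    . "Fusce lacinia sollicitudin lectus id eleifend. Phasellus.\n"
    . "\n"
    . "massa sapien, scelerisque in tincidunt et, porttitor eget ante.\n"
    . "In iaculis justo vel quam rhoncus volutpat. Curabitur eros est,\n"
    . "ultrices in elementum eget, venenatis eget mauris. Sed sollicitudin,\n"
    . "nibh sed varius aliquet, neque odio porttitor risus, at sollicitudin\n"
    . "\n"
    . "lectus neque sit amet diam.\n";

$pattern = '~\A(.+?) {2,}(.+?)\R{2,}(.+?)\R{2,}(.+?)(?:\R{2,}|\Z)~s';

if (preg_match($pattern, $subject, $match) === 1) {
    // $match[0] is the whole match; entries 1 to 4 are the capture groups.
    foreach (array_slice($match, 1) as $i => $group) {
        printf("Group %d: %s\n", $i + 1, $group);
    }
} else {
    echo "No match\n";
}
```

The fourth group keeps its internal line breaks, since only a run of two or more line separators ends a capture; any trimming or joining of those lines would be part of the later post-processing.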

    \R, if you’re not familiar with it, is a shorthand for any kind of line separator: \n, \r, \r\n and a few other, less common ones. It’s supported by Perl, PHP (PCRE), Ruby 1.9+ (Oniguruma), Java 8+ and a few other flavors, but not (so far) by JavaScript, Python’s re module or .NET.
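As a small illustration of why \R is convenient here, the sketch below splits the same two-paragraph text written with three different line-ending styles. The sample strings and the $variants array are made up for this example and are not from the question.

```php
<?php
// Hypothetical sample: the same two paragraphs with three line-ending styles.
$variants = [
    'LF'   => "first\n\nsecond",
    'CRLF' => "first\r\n\r\nsecond",
    'CR'   => "first\r\rsecond",
];

foreach ($variants as $name => $text) {
    // \R treats a \r\n pair as one line break, so \R{2,} means
    // "a blank line" regardless of which convention the file uses.
    $parts = preg_split('~\R{2,}~', $text);
    printf("%-4s => %s\n", $name, json_encode($parts));
}
```

Each variant should split into the same two pieces, which is what lets the main pattern work on files saved on different platforms without normalizing line endings first.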
