I have been trying to write in PHP using a series of regular expressions

Editorial Team
Asked: May 26, 2026 (2026-05-26T13:34:09+00:00)

I have been trying to write in PHP using a series of regular expressions and the PHP function preg_replace.

My main aim is to tidy up the content with things like making sure the beginning of a sentence has an uppercase letter; there is a space after a comma; etc.

Some examples of the tidying I am trying to achieve:

// Remove any spaces around slashes
$content_replacements_from[] = "/\s*\/\s*/";
$content_replacements_to[] = "/";

// Remove any new lines or tabs
$content_replacements_from[] = "/[\r\n\t]/";
$content_replacements_to[] = " ";

// Remove any extra spaces
$content_replacements_from[] = "/\s{2,}/";
$content_replacements_to[] = " ";

// Tidy up joined full stops
$content_replacements_from[] = "/([a-zA-Z]{1})\s*[\.]{1}\s*([^(jpeg|jpg|png|pdf|gif|doc|xls|docx|xlsx|ppt|pptx|html|php|htm)]{1})/";
$content_replacements_to[] = "$1. $2";

// Tidy up joined commas
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\,]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1, $2";

// Tidy up joined exclamation marks
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\!]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1! $2";

// Tidy up joined question marks
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\?]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1? $2";

// Tidy up joined semi colons
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\;]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1; $2";

// Tidy up joined colons
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\:]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1: $2";

// Tidy up fluid ounces
$content_replacements_from[] = "/[Ff]{1}[Ll]{1}.?\s?[Oo]{1}[Zz]{1}/";
$content_replacements_to[] = "fl oz";

// Tidy up rpm
$content_replacements_from[] = "/[Rr]{1}[Pp]{1}[Mm]{1}/";
$content_replacements_to[] = "rpm";

// Tidy up UK
$content_replacements_from[] = "/[Uu]{1}[Kk]{1}/";
$content_replacements_to[] = "UK";

// Tidy up Maxi-sense
$content_replacements_from[] = "/[Mm]{1}axi[\s\-]?[Ss]{1}ense/";
$content_replacements_to[] = "maxi-sense";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Mm]{1}axi[\s\-]?[Ss]{1}ense/";
$content_replacements_to[] = ". Maxi-sense";
$content_replacements_from[] = "/^[Mm]{1}axi[\s\-]?[Ss]{1}ense/";
$content_replacements_to[] = "Maxi-sense";

// Tidy up Side-by-side
$content_replacements_from[] = "/[Ss]{1}ide[\s\-]?[Bb]{1}y[\s\-]?[Ss]{1}ide/";
$content_replacements_to[] = "side-by-side";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ss]{1}ide[\s\-]?[Bb]{1}y[\s\-]?[Ss]{1}ide/";
$content_replacements_to[] = ". Side-by-side";
$content_replacements_from[] = "/^[Ss]{1}ide[\s\-]?[Bb]{1}y[\s\-]?[Ss]{1}ide/";
$content_replacements_to[] = "Side-by-side";

// Tidy up extra large
$content_replacements_from[] = "/[Xx]{1}[Ll]{1}/";
$content_replacements_to[] = "extra large";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Xx]{1}[Ll]{1}/";
$content_replacements_to[] = "Extra large";
$content_replacements_from[] = "/^[Xx]{1}[Ll]{1}/";
$content_replacements_to[] = "Extra large";

// Tidy up D-radius
$content_replacements_from[] = "/[Dd]{1}[\s\-]?[Rr]{1}adius/";
$content_replacements_to[] = "D-radius";

// Tidy up A-rate
$content_replacements_from[] = "/[Aa]{1}[\s\-]?[Rr]{1}ate/";
$content_replacements_to[] = "A-rate";

// Tidy up In-column
$content_replacements_from[] = "/[Ii]{1}n[\s\-]?[Cc]{1}olum[n]?/";
$content_replacements_to[] = "in-column";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ii]{1}n[\s\-]?[Cc]{1}olum[n]?/";
$content_replacements_to[] = "In-column";
$content_replacements_from[] = "/^[Ii]{1}n[\s\-]?[Cc]{1}olum[n]?/";
$content_replacements_to[] = "In-column";

// Tidy up kW
$content_replacements_from[] = "/[Kk]{1}[Ww]{1}/";
$content_replacements_to[] = "kW";

// Tidy up Built-in
$content_replacements_from[] = "/[Bb]{1}uilt[\s\-]?[Ii]{1}n/";
$content_replacements_to[] = "built-in";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Bb]{1}uilt[\s\-]?[Ii]{1}n/";
$content_replacements_to[] = "Built-in";
$content_replacements_from[] = "/^[Bb]{1}uilt[\s\-]?[Ii]{1}n/";
$content_replacements_to[] = "Built-in";

// Tidy up Built-under
$content_replacements_from[] = "/[Bb]{1}uilt[\s\-]?[Uu]{1}nder/";
$content_replacements_to[] = "built-under";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Bb]{1}uilt[\s\-]?[Uu]{1}nder/";
$content_replacements_to[] = "Built-under";
$content_replacements_from[] = "/^[Bb]{1}uilt[\s\-]?[Uu]{1}nder/";
$content_replacements_to[] = "Built-under";

// Tidy up Under-counter
$content_replacements_from[] = "/[Uu]{1}nder[\s\-]?[Cc]{1}ounter/";
$content_replacements_to[] = "under-counter";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Uu]{1}nder[\s\-]?[Cc]{1}ounter/";
$content_replacements_to[] = "Under-counter";
$content_replacements_from[] = "/^[Uu]{1}nder[\s\-]?[Cc]{1}ounter/";
$content_replacements_to[] = "Under-counter";

// Tidy up Under-cabinet
$content_replacements_from[] = "/[Uu]{1}nder[\s\-]?[Cc]{1}abinet/";
$content_replacements_to[] = "under-cabinet";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Uu]{1}nder[\s\-]?[Cc]{1}abinet/";
$content_replacements_to[] = "Under-cabinet";
$content_replacements_from[] = "/^[Uu]{1}nder[\s\-]?[Cc]{1}abinet/";
$content_replacements_to[] = "Under-cabinet";

// Tidy up integrated
$content_replacements_from[] = "/([a-zA-Z0-9]{1})[\s]{1}[\-]{1}[Ii]{1}ntegrated/";
$content_replacements_to[] = "$1-integrated";

// Tidy up Semi-integrated
$content_replacements_from[] = "/[Ss]{1}emi[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "semi-integrated";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ss]{1}emi[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Semi-integrated";
$content_replacements_from[] = "/^[Ss]{1}emi[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Semi-integrated";

// Tidy up Fully-integrated
$content_replacements_from[] = "/[Ff]{1}ully[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "fully-integrated";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ff]{1}ully[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Fully-integrated";
$content_replacements_from[] = "/^[Ff]{1}ully[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Fully-integrated";

// Tidy up Semi-automatic
$content_replacements_from[] = "/[Ss]{1}emi[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "semi-automatic";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ss]{1}emi[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Semi-automatic";
$content_replacements_from[] = "/^[Ss]{1}emi[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Semi-automatic";

// Tidy up Fully-automatic
$content_replacements_from[] = "/[Ff]{1}ully[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "fully-automatic";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ff]{1}ully[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Fully-automatic";
$content_replacements_from[] = "/^[Ff]{1}ully[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Fully-automatic";

// Tidy up Pull-out
$content_replacements_from[] = "/[Pp]{1}ull[\s\-]?[Oo]{1}ut/";
$content_replacements_to[] = "pull-out";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Pp]{1}ull[\s\-]?[Oo]{1}ut/";
$content_replacements_to[] = "Pull-out";
$content_replacements_from[] = "/^[Pp]{1}ull[\s\-]?[Oo]{1}ut/";
$content_replacements_to[] = "Pull-out";

// Tidy up including
$content_replacements_from[] = "/\s[Ii]{1}nc[l]?[\.]?\s/";
$content_replacements_to[] = " including ";

// Tidy up use
$content_replacements_from[] = "/\s[Uu]{1}se\s/";
$content_replacements_to[] = " use ";

// Tidy up ?-piece
$content_replacements_from[] = "/([2345TtYy]{1})[\s\-]?[Pp]{1}iece/";
$content_replacements_to[] = "$1-piece";

// Tidy up ?-spout
$content_replacements_from[] = "/([Cc]{1})[\s\-]?[Ss]{1}pout/";
$content_replacements_to[] = "$1-spout";

// Tidy up ?-end
$content_replacements_from[] = "/([Cc]{1})[\s\-]?[Ee]{1}nd/";
$content_replacements_to[] = "$1-end";

// Tidy up Brushed Steel
$content_replacements_from[] = "/[Bb]{1}[\-\/]{1}[Ss]{1}teel/";
$content_replacements_to[] = "brushed steel";

// Tidy up Stainless Steel
$content_replacements_from[] = "/[Ss]{1}[\-\/]{1}[Ss]{1}teel/";
$content_replacements_to[] = "stainless steel";

// Tidy up Silk Steel
$content_replacements_from[] = "/[Ss]{1}ilk[\s]?[Ss]{1}teel/";
$content_replacements_to[] = "silk steel";

// Remove trade marks
$content_replacements_from[] = "/™/";
$content_replacements_to[] = "";

// Replace long dashes
$content_replacements_from[] = "/–/";
$content_replacements_to[] = "-";

// Replace single quotes
$content_replacements_from[] = "/’/";
$content_replacements_to[] = "'";
$content_replacements_from[] = "/`/";
$content_replacements_to[] = "'";

// Tidy up m
$content_replacements_from[] = "/[\s]?[Mm]{1}etre/";
$content_replacements_to[] = "m";

// Tidy up m3
$content_replacements_from[] = "/([0-9]{1})[\s]?[Mm]{1}3/";
$content_replacements_to[] = "$1m³";
$content_replacements_from[] = "/\&sup3\;/";
$content_replacements_to[] = html_entity_decode("&sup3;");

// Tidy up to in between numbers
$content_replacements_from[] = "/([0-9]{1})[\s]?to[\s]?([0-9]{1})/";
$content_replacements_to[] = "$1 - $2";

// Tidy up per hour
$content_replacements_from[] = "/\s[Aa]{1}nd\s[Hh]{1}[Rr]?$/";
$content_replacements_to[] = "ph";

// Tidy up l
$content_replacements_from[] = "/[\s]?[Ll]{1}itre/";
$content_replacements_to[] = "l";

// Tidy up -in
$content_replacements_from[] = "/\-[Ii]{1}n/";
$content_replacements_to[] = "-in";

// Tidy up plus
$content_replacements_from[] = "/\s[Pp]{1}lus\s/";
$content_replacements_to[] = " plus ";

// Tidy up including
$content_replacements_from[] = "/\s[Ii]{1}ncluding\s/";
$content_replacements_to[] = " including ";

// Tidy up including
$content_replacements_from[] = "/[Ii]{1}nc\s/";
$content_replacements_to[] = "Including "; 

// Tidy up Push/pull
$content_replacements_from[] = "/[Pp]{1}ush\/[Pp]{1}ull/";
$content_replacements_to[] = "push/pull";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Pp]{1}ush\/[Pp]{1}ull/";
$content_replacements_to[] = "Push/pull";
$content_replacements_from[] = "/^[Pp]{1}ush\/[Pp]{1}ull/";
$content_replacements_to[] = "Push/pull";

// Tidy up +
$content_replacements_from[] = "/\s\+\s/";
$content_replacements_to[] = " and ";

// Tidy up *
$content_replacements_from[] = "/\*/";
$content_replacements_to[] = "";

// Tidy up with
$content_replacements_from[] = "/\s[Ww]{1}ith\s/";
$content_replacements_to[] = " with ";

// Tidy up without
$content_replacements_from[] = "/\s[Ww]{1}ithout\s/";
$content_replacements_to[] = " without ";

// Tidy up in
$content_replacements_from[] = "/\s[Ii]{1}n\s/";
$content_replacements_to[] = " in ";

// Tidy up of
$content_replacements_from[] = "/\s[Oo]{1}f\s/";
$content_replacements_to[] = " of ";

// Tidy up for
$content_replacements_from[] = "/\s[Ff]{1}or\s/";
$content_replacements_to[] = " for ";

// Tidy up or
$content_replacements_from[] = "/\s[Oo]{1}r\s/";
$content_replacements_to[] = " or ";

// Tidy up and
$content_replacements_from[] = "/\s[Aa]{1}nd\s/";
$content_replacements_to[] = " and ";

// Tidy up to
$content_replacements_from[] = "/\s[Tt]{1}o\s/";
$content_replacements_to[] = " to ";

// Tidy up too
$content_replacements_from[] = "/\s[Tt]{1}oo\s/";
$content_replacements_to[] = " too ";

// Tidy up &
$content_replacements_from[] = "/\s&\s/";
$content_replacements_to[] = " and ";

// Tidy up &
$content_replacements_from[] = "/\s&\s/";
$content_replacements_to[] = " and ";

// Tidy up mm
$content_replacements_from[] = "/M[Mm]{1}/";
$content_replacements_to[] = "mm";

// Tidy up ize to ise
$content_replacements_from[] = "/([a-zA-Z]{2})ize{1}/";
$content_replacements_to[] = "$1ise";

// Tidy up izer to iser
$content_replacements_from[] = "/([a-zA-Z]{2})izer{1}/";
$content_replacements_to[] = "$1iser";

// Tidy up yze to yse
$content_replacements_from[] = "/([a-zA-Z]{2})yze{1}/";
$content_replacements_to[] = "$1yse";

// Tidy up ization to isation
$content_replacements_from[] = "/([a-zA-Z]{2})ization{1}/";
$content_replacements_to[] = "$1isation";

// Tidy up times symbol
$content_replacements_from[] = "/([0-9]{1})\s*[Xx]\s*([0-9A-Za-z]{1})/";
$content_replacements_to[] = "$1 × $2";

// Tidy up times symbol
$content_replacements_from[] = "/\&times\;/";
$content_replacements_to[] = html_entity_decode("&times;");

// Tidy up inches
$content_replacements_from[] = "/([0-9]{1})\s*[Ii]{1}nches/";
$content_replacements_to[] = "$1\"";

// Tidy up inch
$content_replacements_from[] = "/([0-9]{1})\s*[Ii]{1}nch/";
$content_replacements_to[] = "$1\"";

// Make the replacements
$content = preg_replace($content_replacements_from, $content_replacements_to, $content);

This is obviously complicated and lengthy.

Does anyone know a better way of doing it or know of a class that is out there that can do this?

I would then also want to apply this to content within HTML if possible.

1 Answer

Editorial Team
Added an answer on May 26, 2026 at 1:34 pm

    Regular expressions are perfectly fine for text search and replace. The ones you’ve been given show that there is room for improvement. But my answer is not about optimizing those. Instead, I suggest you start building your own set of StringCleaner classes that each do different things, but all through the same interface:

    interface StringCleaner
    {
        public function clean($string);
    }
    

    Next to that, for the HTML, an idea I had is to create a FilterIterator that offers access to all text nodes, so they can be changed more easily with any standard cleaner.
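The FilterIterator idea could be sketched roughly as follows. The class name DOMTextWhiteSpaceFilter and its WHITESPACE constant come from the example further down; everything else (the ALL constant, the constructor signature, the definition of "whitespace-only" via trim) is an assumption, since the answer does not show the class itself.

```php
<?php
// Hypothetical sketch: the answer names this class but does not show its definition.
class DOMTextWhiteSpaceFilter extends FilterIterator
{
    const ALL = 0;        // assumed default: accept every text node
    const WHITESPACE = 1; // accept only whitespace-only text nodes

    private $mode;

    public function __construct(DOMNodeList $nodes, $mode = self::ALL)
    {
        // DOMNodeList is Traversable but not an Iterator, so wrap it first.
        parent::__construct(new IteratorIterator($nodes));
        $this->mode = $mode;
    }

    public function accept(): bool
    {
        $node = $this->getInnerIterator()->current();
        if (!$node instanceof DOMText) {
            return false; // skip anything that is not a text node
        }
        if ($this->mode === self::WHITESPACE) {
            // Assumption: "whitespace-only" means trim() leaves nothing behind.
            return trim($node->data) === '';
        }
        return true;
    }
}
```

With this shape, the loop that applies the cleaner sees every text node, while the second pass with the WHITESPACE mode sees only the nodes that are empty or blank.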

    To apply multiple StringCleaner objects at once (and to create sets of them), I used the Composite Pattern (by extending SplObjectStorage), so the composite is a StringCleaner in its own right as well.
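A minimal sketch of such a composite might look like this. The name CleanerComposite is taken from the example further down; the body is an assumption about how it could work, not the answer's actual class. The interface is repeated only to keep the snippet self-contained.

```php
<?php
interface StringCleaner
{
    public function clean($string);
}

// Hypothetical sketch: a composite that is itself a StringCleaner.
class CleanerComposite extends SplObjectStorage implements StringCleaner
{
    public function clean($string)
    {
        // Run every attached cleaner in the order it was attached,
        // feeding each one the output of the previous one.
        foreach ($this as $cleaner) {
            $string = $cleaner->clean($string);
        }
        return $string;
    }
}
```

Because the composite implements the same interface, a composite can be attached to another composite, which is what makes it possible to build reusable sets of rules.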

    The example, without the class definitions:

    $cleanerTrim = new TrimCleaner();
    
    $cleanerBasics = new RegexCleaner();
    
    // Remove any spaces around slashes
    $cleanerBasics->addRule('\s*\/\s*', '/');
    
    // Remove any new lines or tabs
    $cleanerBasics->addRule('[\r\n\t]', ' ');
    
    // Tidy up joined full stops
    $cleanerBasics->addRule('(\w+)\.(?!jpeg|jpg|png|pdf|gif|doc|xls|docx|xlsx|ppt|pptx|html|php|htm)(\w+)', '$1. $2');
    
    // Remove any extra spaces
    $cleanerBasics->addRule('\s{2,}', ' ');
    
    // Remove single spaces
    $cleanerBasics->addRule('^\s$', '');
    
    $cleanerInches = new RegexCleaner();
    
    // Tidy up inches
    $cleanerInches->addRule('([0-9])\s*[Ii]nches', '$1"');
    
    
    $cleaner = new CleanerComposite();
    $cleaner->attach($cleanerBasics);
    $cleaner->attach($cleanerInches);
    $cleaner->attach($cleanerTrim);
    
    
    $htmlString = <<<HTML
    <html>
      <head>
        <title>
            hello world.hello earth.
        </title>
      </head>
      <body>
    <table><tr><td>test. 
    </td></tr></table>
         <h1>Get it 1 more time.</h1>
         <p>When 12 inches were not enough;      hickup.</p>
    
      </body>
    </html>
    HTML;
    
    
    // load HTML
    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = FALSE;
    $dom->loadHTML($htmlString);
    
    // create XPath
    $xpath = new DOMXPath($dom);
    
    $it = new DOMTextWhiteSpaceFilter($xpath->query('//text()'));
    foreach($it as $node)
    {
        $node->data = $cleaner->clean($node->data);
    }
    
    // remove whitespace only nodes
    $it = new DOMTextWhiteSpaceFilter($xpath->query('//text()'), DOMTextWhiteSpaceFilter::WHITESPACE);
    foreach($it as $node)
    {
        $node->parentNode->removeChild($node);
    }
    
    $dom->formatOutput = true;
    echo $dom->saveHTML();
    

    As the example already shows, when you hide the complexity away in concrete StringCleaner objects, you can start to create more dynamic rules. This can be extended by adding more StringCleaner types that operate on something other than regular expressions; a very simple example using trim is given in TrimCleaner.
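As a rough illustration of cleaners that do not use regular expressions, here is a guess at what TrimCleaner could look like, plus a second, purely hypothetical MapCleaner (not part of the answer) that does fixed string substitutions with strtr, such as the trade mark and quote replacements from the question.

```php
<?php
interface StringCleaner
{
    public function clean($string);
}

// Assumed shape of TrimCleaner: the answer names it but does not show it.
class TrimCleaner implements StringCleaner
{
    public function clean($string)
    {
        return trim($string);
    }
}

// Hypothetical extra cleaner: literal replacements, no regex involved.
class MapCleaner implements StringCleaner
{
    private $map;

    public function __construct(array $map)
    {
        $this->map = $map; // keys are the strings to find, values the replacements
    }

    public function clean($string)
    {
        return strtr($string, $this->map);
    }
}
```

Literal replacements like these avoid regex escaping altogether, which may be easier to maintain for the simple character swaps in the question.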

    But sure, regular expressions are very powerful, too. As you can see with the RegexCleaner, I’ve moved each regular expression’s delimiters into the class itself, so when you define the rules, you don’t need to type them over and over again. That’s just another simple example of how you can simplify things when you encapsulate the replacement in a class of its own with a defined interface for the action.
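One way the RegexCleaner described above could be written is sketched below. The addRule method name matches the example; the choice of "/" as the delimiter and the internal arrays are assumptions (the rules in the example escape slashes, which suggests "/" but does not confirm it).

```php
<?php
interface StringCleaner
{
    public function clean($string);
}

// Hypothetical sketch: the answer shows only how this class is used.
class RegexCleaner implements StringCleaner
{
    private $patterns = [];
    private $replacements = [];

    public function addRule($pattern, $replacement)
    {
        // The delimiters are added here, so rules are written without them.
        // Assumption: "/" is the delimiter, so a literal slash must be escaped.
        $this->patterns[] = '/' . $pattern . '/';
        $this->replacements[] = $replacement;
    }

    public function clean($string)
    {
        // preg_replace applies the pattern array in order, each rule
        // working on the result of the previous one.
        return preg_replace($this->patterns, $this->replacements, $string);
    }
}
```

Keeping the delimiter in one place also means modifiers (for example a case-insensitive flag) could later be added for all rules at once, if that turns out to be useful.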

    Full Example.
