I have been trying to write in PHP using a series of regular expressions and the PHP function preg_replace.
My main aim is to tidy up the content with things like making sure the beginning of a sentence has an uppercase letter; there is a space after a comma; etc.
Some examples of the tidying I am trying to achieve:
// Remove any spaces around slashes
$content_replacements_from[] = "/\s*\/\s*/";
$content_replacements_to[] = "/";
// Remove any new lines or tabs
$content_replacements_from[] = "/[\r\n\t]/";
$content_replacements_to[] = " ";
// Remove any extra spaces
$content_replacements_from[] = "/\s{2,}/";
$content_replacements_to[] = " ";
// Tidy up joined full stops
$content_replacements_from[] = "/([a-zA-Z]{1})\s*[\.]{1}\s*([^(jpeg|jpg|png|pdf|gif|doc|xls|docx|xlsx|ppt|pptx|html|php|htm)]{1})/";
$content_replacements_to[] = "$1. $2";
// Tidy up joined commas
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\,]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1, $2";
// Tidy up joined exclamation marks
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\!]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1! $2";
// Tidy up joined question marks
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\?]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1? $2";
// Tidy up joined semi colons
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\;]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1; $2";
// Tidy up joined colons
$content_replacements_from[] = "/([a-zA-Z0-9]{1})\s*[\:]{1}\s*([a-zA-Z0-9]{1})/";
$content_replacements_to[] = "$1: $2";
// Tidy up fluid ounces
$content_replacements_from[] = "/[Ff]{1}[Ll]{1}.?\s?[Oo]{1}[Zz]{1}/";
$content_replacements_to[] = "fl oz";
// Tidy up rpm
$content_replacements_from[] = "/[Rr]{1}[Pp]{1}[Mm]{1}/";
$content_replacements_to[] = "rpm";
// Tidy up UK
$content_replacements_from[] = "/[Uu]{1}[Kk]{1}/";
$content_replacements_to[] = "UK";
// Tidy up Maxi-sense
$content_replacements_from[] = "/[Mm]{1}axi[\s\-]?[Ss]{1}ense/";
$content_replacements_to[] = "maxi-sense";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Mm]{1}axi[\s\-]?[Ss]{1}ense/";
$content_replacements_to[] = ". Maxi-sense";
$content_replacements_from[] = "/^[Mm]{1}axi[\s\-]?[Ss]{1}ense/";
$content_replacements_to[] = "Maxi-sense";
// Tidy up Side-by-side
$content_replacements_from[] = "/[Ss]{1}ide[\s\-]?[Bb]{1}y[\s\-]?[Ss]{1}ide/";
$content_replacements_to[] = "side-by-side";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ss]{1}ide[\s\-]?[Bb]{1}y[\s\-]?[Ss]{1}ide/";
$content_replacements_to[] = ". Side-by-side";
$content_replacements_from[] = "/^[Ss]{1}ide[\s\-]?[Bb]{1}y[\s\-]?[Ss]{1}ide/";
$content_replacements_to[] = "Side-by-side";
// Tidy up extra large
$content_replacements_from[] = "/[Xx]{1}[Ll]{l}/";
$content_replacements_to[] = "extra large";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Xx]{1}[Ll]{l}/";
$content_replacements_to[] = "Extra large";
$content_replacements_from[] = "/^[Xx]{1}[Ll]{l}/";
$content_replacements_to[] = "Extra large";
// Tidy up D-radius
$content_replacements_from[] = "/[Dd]{1}[\s\-]?[Rr]{1}adius/";
$content_replacements_to[] = "D-radius";
// Tidy up A-rate
$content_replacements_from[] = "/[Aa]{1}[\s\-]?[Rr]{1}ate/";
$content_replacements_to[] = "A-rate";
// Tidy up In-column
$content_replacements_from[] = "/[Ii]{1}n[\s\-]?[Cc]{1}olum[n]?/";
$content_replacements_to[] = "in-column";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ii]{1}n[\s\-]?[Cc]{1}olum[n]?/";
$content_replacements_to[] = "In-column";
$content_replacements_from[] = "/^[Ii]{1}n[\s\-]?[Cc]{1}olum[n]?/";
$content_replacements_to[] = "In-column";
// Tidy up kW
$content_replacements_from[] = "/[Kk]{1}[Ww]{1}/";
$content_replacements_to[] = "kW";
// Tidy up Built-in
$content_replacements_from[] = "/[Bb]{1}uilt[\s\-]?[Ii]{1}n/";
$content_replacements_to[] = "built-in";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Bb]{1}uilt[\s\-]?[Ii]{1}n/";
$content_replacements_to[] = "Built-in";
$content_replacements_from[] = "/^[Bb]{1}uilt[\s\-]?[Ii]{1}n/";
$content_replacements_to[] = "Built-in";
// Tidy up Built-under
$content_replacements_from[] = "/[Bb]{1}uilt[\s\-]?[Uu]{1}nder/";
$content_replacements_to[] = "built-under";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Bb]{1}uilt[\s\-]?[Uu]{1}nder/";
$content_replacements_to[] = "Built-under";
$content_replacements_from[] = "/^[Bb]{1}uilt[\s\-]?[Uu]{1}nder/";
$content_replacements_to[] = "Built-under";
// Tidy up Under-counter
$content_replacements_from[] = "/[Uu]{1}nder[\s\-]?[Cc]{1}ounter/";
$content_replacements_to[] = "under-counter";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Uu]{1}nder[\s\-]?[Cc]{1}ounter/";
$content_replacements_to[] = "Under-counter";
$content_replacements_from[] = "/^[Uu]{1}nder[\s\-]?[Cc]{1}ounter/";
$content_replacements_to[] = "Under-counter";
// Tidy up Under-cabinet
$content_replacements_from[] = "/[Uu]{1}nder[\s\-]?[Cc]{1}abinet/";
$content_replacements_to[] = "under-cabinet";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Uu]{1}nder[\s\-]?[Cc]{1}abinet/";
$content_replacements_to[] = "Under-cabinet";
$content_replacements_from[] = "/^[Uu]{1}nder[\s\-]?[Cc]{1}abinet/";
$content_replacements_to[] = "Under-cabinet";
// Tidy up integrated
$content_replacements_from[] = "/([a-zA-Z0-9]{1})[\s]{1}[\-]{1}[Ii]{1}ntegrated/";
$content_replacements_to[] = "$1-integrated";
// Tidy up Semi-integrated
$content_replacements_from[] = "/[Ss]{1}emi[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "semi-integrated";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ss]{1}emi[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Semi-integrated";
$content_replacements_from[] = "/^[Ss]{1}emi[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Semi-integrated";
// Tidy up Fully-integrated
$content_replacements_from[] = "/[Ff]{1}ully[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "fully-integrated";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ff]{1}ully[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Fully-integrated";
$content_replacements_from[] = "/^[Ff]{1}ully[\s\-]?[Ii]{1}ntegrated/";
$content_replacements_to[] = "Fully-integrated";
// Tidy up Semi-automatic
$content_replacements_from[] = "/[Ss]{1}emi[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "semi-automatic";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ss]{1}emi[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Semi-automatic";
$content_replacements_from[] = "/^[Ss]{1}emi[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Semi-automatic";
// Tidy up Fully-automatic
$content_replacements_from[] = "/[Ff]{1}ully[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "fully-automatic";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Ff]{1}ully[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Fully-automatic";
$content_replacements_from[] = "/^[Ff]{1}ully[\s\-]?[Aa]{1}utomatic/";
$content_replacements_to[] = "Fully-automatic";
// Tidy up Pull-out
$content_replacements_from[] = "/[Pp]{1}ull[\s\-]?[Oo]{1}ut/";
$content_replacements_to[] = "pull-out";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Pp]{1}ull[\s\-]?[Oo]{1}ut/";
$content_replacements_to[] = "Pull-out";
$content_replacements_from[] = "/^[Pp]{1}ull[\s\-]?[Oo]{1}ut/";
$content_replacements_to[] = "Pull-out";
// Tidy up including
$content_replacements_from[] = "/\s[Ii]{1}nc[l]?[\.]?\s/";
$content_replacements_to[] = " including ";
// Tidy up use
$content_replacements_from[] = "/\s[Uu]{1}se\s/";
$content_replacements_to[] = " use ";
// Tidy up ?-piece
$content_replacements_from[] = "/([2345TtYy]{1})[\s\-]?[Pp]{1}iece/";
$content_replacements_to[] = "$1-piece";
// Tidy up ?-spout
$content_replacements_from[] = "/([Cc]{1})[\s\-]?[Ss]{1}pout/";
$content_replacements_to[] = "$1-spout";
// Tidy up ?-end
$content_replacements_from[] = "/([Cc]{1})[\s\-]?[Ee]{1}nd/";
$content_replacements_to[] = "$1-end";
// Tidy up Brushed Steel
$content_replacements_from[] = "/[Bb]{1}[\-\/]{1}[Ss]{1}teel/";
$content_replacements_to[] = "brushed steel";
// Tidy up Stainless Steel
$content_replacements_from[] = "/[Ss]{1}[\-\/]{1}[Ss]{1}teel/";
$content_replacements_to[] = "stainless steel";
// Tidy up Silk Steel
$content_replacements_from[] = "/[Ss]{1}ilk[\s]?[Ss]{1}teel/";
$content_replacements_to[] = "silk steel";
// Remove trade marks
$content_replacements_from[] = "/™/";
$content_replacements_to[] = "";
// Replace long dashes
$content_replacements_from[] = "/–/";
$content_replacements_to[] = "-";
// Replace single quotes
$content_replacements_from[] = "/’/";
$content_replacements_to[] = "'";
$content_replacements_from[] = "/`/";
$content_replacements_to[] = "'";
// Tidy up m
$content_replacements_from[] = "/[\s]?[Mm]{1}etre/";
$content_replacements_to[] = "m";
// Tidy up m3
$content_replacements_from[] = "/([0-9]{1})[\s]?[Mm]{1}3/";
$content_replacements_to[] = "$1m³";
$content_replacements_from[] = "/\³\;/";
$content_replacements_to[] = html_entity_decode("³");
// Tidy up to in between numbers
$content_replacements_from[] = "/([0-9]{1})[\s]?to[\s]?([0-9]{1})/";
$content_replacements_to[] = "$1 - $2";
// Tidy up per hour
$content_replacements_from[] = "/\s[Aa]{1}nd\s[Hh]{1}[Rr]?$/";
$content_replacements_to[] = "ph";
// Tidy up l
$content_replacements_from[] = "/[\s]?[Ll]{1}itre/";
$content_replacements_to[] = "l";
// Tidy up -in
$content_replacements_from[] = "/\-[Ii]{1}n/";
$content_replacements_to[] = "-in";
// Tidy up plus
$content_replacements_from[] = "/\s[Pp]{1}lus\s/";
$content_replacements_to[] = " plus ";
// Tidy up including
$content_replacements_from[] = "/\s[Ii]{1}ncluding\s/";
$content_replacements_to[] = " including ";
// Tidy up including
$content_replacements_from[] = "/[Ii]{1}nc\s/";
$content_replacements_to[] = "Including ";
// Tidy up Push/pull
$content_replacements_from[] = "/[Pp]{1}ush\/[Pp]{1}ull/";
$content_replacements_to[] = "push/pull";
$content_replacements_from[] = "/[\.|\!|\?]{1}\s{1}[Pp]{1}ush\/[Pp]{1}ull/";
$content_replacements_to[] = "Push/pull";
$content_replacements_from[] = "/^[Pp]{1}ush\/[Pp]{1}ull/";
$content_replacements_to[] = "Push/pull";
// Tidy up +
$content_replacements_from[] = "/\s\+\s/";
$content_replacements_to[] = " and ";
// Tidy up *
$content_replacements_from[] = "/\*/";
$content_replacements_to[] = "";
// Tidy up with
$content_replacements_from[] = "/\s[Ww]{1}ith\s/";
$content_replacements_to[] = " with ";
// Tidy up without
$content_replacements_from[] = "/\s[Ww]{1}ithout\s/";
$content_replacements_to[] = " without ";
// Tidy up in
$content_replacements_from[] = "/\s[Ii]{1}n\s/";
$content_replacements_to[] = " in ";
// Tidy up of
$content_replacements_from[] = "/\s[Oo]{1}f\s/";
$content_replacements_to[] = " of ";
// Tidy up for
$content_replacements_from[] = "/\s[Ff]{1}or\s/";
$content_replacements_to[] = " for ";
// Tidy up or
$content_replacements_from[] = "/\s[Oo]{1}r\s/";
$content_replacements_to[] = " or ";
// Tidy up and
$content_replacements_from[] = "/\s[Aa]{1}nd\s/";
$content_replacements_to[] = " and ";
// Tidy up to
$content_replacements_from[] = "/\s[Tt]{1}o\s/";
$content_replacements_to[] = " to ";
// Tidy up too
$content_replacements_from[] = "/\s[Tt]{1}oo\s/";
$content_replacements_to[] = " too ";
// Tidy up &
$content_replacements_from[] = "/\s&\s/";
$content_replacements_to[] = " and ";
// Tidy up &
$content_replacements_from[] = "/\s&\s/";
$content_replacements_to[] = " and ";
// Tidy up mm
$content_replacements_from[] = "/M[Mm]{1}/";
$content_replacements_to[] = "mm";
// Tidy up ize to ise
$content_replacements_from[] = "/([a-zA-Z]{2})ize{1}/";
$content_replacements_to[] = "$1ise";
// Tidy up izer to iser
$content_replacements_from[] = "/([a-zA-Z]{2})izer{1}/";
$content_replacements_to[] = "$1iser";
// Tidy up yze to yse
$content_replacements_from[] = "/([a-zA-Z]{2})yze{1}/";
$content_replacements_to[] = "$1yse";
// Tidy up ization to isation
$content_replacements_from[] = "/([a-zA-Z]{2})ization{1}/";
$content_replacements_to[] = "$1isation";
// Tidy up times symbol
$content_replacements_from[] = "/([0-9]{1})\s*[Xx]\s*([0-9A-Za-z]{1})/";
$content_replacements_to[] = "$1 × $2";
// Tidy up times symbol
$content_replacements_from[] = "/\×\;/";
$content_replacements_to[] = html_entity_decode("×");
// Tidy up inches
$content_replacements_from[] = "/([0-9]{1})\s*[Ii]{1}nches/";
$content_replacements_to[] = "$1\"";
// Tidy up inch
$content_replacements_from[] = "/([0-9]{1})\s*[Ii]{1}nch/";
$content_replacements_to[] = "$1\"";
// Make the replacements
$content = preg_replace($content_replacements_from, $content_replacements_to, $content);
This is obviously complicated and lengthy.
Does anyone know a better way of doing it or know of a class that is out there that can do this?
I would then also want to apply this to content within HTML if possible.
Regular expressions are quite fine for text search and replace. The one’s you’ve been given show that there is room for improvements. But my answer is not about optimizing those, instead I suggest to start building your own set of
StringCleanerthat can do different stuff, but all with the same interface:Next to that, for the HTML, an idea I had is to create an
FilterIteratorthat offers access to all text-nodes, so they can be changed with any standard cleaner more easily.To apply multiple
StringCleanerat once (and to create sets of those), I used the Composite Pattern (by extending fromSplObjectStore) that is aStringCleaneron it’s own as well.The exmaple w/o the class defintions:
As the example already shows, when you hide away the complexity into concrete
StringCleanerobjects, you can start to create more dynamical rules. This can be extended by adding moreStringCleanertypes that operate on something different than regular expression, a very simple example withtrimis given inTrimCleaner.But sure, the regular expressions are very powerful, too. As you can see with the
RegexCleaner, I’ve moved each regular expressions delimiters into the class itself, so when you define the rules, you don’t need to type them over and over again. That’s just another simple example where you can simplify things when you encapsulate the replacement into a class of it’s own with a defined interface for the action.Full Example.