First off, if it’s not clear from the tag, I’m doing this in PHP – but that probably doesn’t matter much.
I have this code:
$inputStr = strip_tags($inputStr);
$inputStr = preg_replace("/[^a-zA-Z\s]/", " ", $inputStr);
Which seems to remove all HTML tags and virtually all special and non-alphabetic characters perfectly. The one problem is that, for some reason, it doesn’t filter out carriage return/line feed pairs (just that combination).
If I add this line:
$inputStr = preg_replace("/\s+/", " ", $inputStr);
at the end, however, it works great. Can someone tell me:
- Why doesn’t the first preg_replace filter out the CR/LFs?
- What is this second preg_replace actually doing? I understand the first one for the most part, but the second one is confusing me – it works but I don’t know why.
- Can I combine them into 1 line somehow?
Your first regex replaces every character that is not a letter or whitespace with a space. CR and LF are whitespace (\s matches both), so the negated character class leaves them alone.
The second one replaces each run of one or more whitespace characters (\s+) with a single space character. Because the + quantifier is greedy, it consumes the whole run, so a CR, an LF, and any adjacent spaces are condensed into one space.
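To see what each step does, here is a minimal sketch that runs both patterns on a made-up input (the sample string and the $step1/$step2 names are assumptions for illustration, not from the question):

```php
<?php
// Hypothetical sample input: two paragraphs separated by a Windows line break.
$inputStr = "<p>Hello,</p>\r\n<p>world!</p>";

// strip_tags() drops the markup but keeps the CR/LF between the paragraphs.
$inputStr = strip_tags($inputStr);                       // "Hello,\r\nworld!"

// Step 1: \s is inside the negated class, so whitespace (CR and LF included)
// is kept; only the comma and the exclamation mark become spaces.
$step1 = preg_replace("/[^a-zA-Z\s]/", " ", $inputStr);  // "Hello \r\nworld "

// Step 2: the run " \r\n" is one match of \s+ and is replaced by one space.
$step2 = preg_replace("/\s+/", " ", $step1);             // "Hello world "
```

The trailing space comes from the replaced exclamation mark; calling trim() on the result would remove it if that matters.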
I suggest removing the \s from the first regex and seeing if that works.
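As a rough sketch of the one-line version asked about: dropping \s on its own turns the CR and the LF into one space each, so runs are not condensed. Adding a + quantifier to the negated class should cover both steps in a single call. The sample input and variable names below are assumptions, and this only treats ASCII letters as letters:

```php
<?php
// Hypothetical sample input, not taken from the question.
$inputStr = strip_tags("<p>Hello,</p>\r\n<p>world!</p>");  // "Hello,\r\nworld!"

// \s removed, no quantifier: ",", CR and LF each become their own space.
$noS = preg_replace("/[^a-zA-Z]/", " ", $inputStr);        // "Hello   world "

// \s removed, with +: each run of non-letters becomes a single space.
$combined = preg_replace("/[^a-zA-Z]+/", " ", $inputStr);  // "Hello world "
```

On this input the combined pattern gives the same result as the original two-step version.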