I have a similar problem to the one that is answered in this post.
When I tested the regex provided as the answer in that post, it worked as expected:
$str = 'Days - £9.20 to £11.20 Sat - £11.80 Sun - £13.30';
preg_match_all("/£\s*\d+(?:\.\d+)?/", $str, $matches);
print_r($matches);
// Produces
Array
(
[0] => Array
(
[0] => £9.20
[1] => £11.20
[2] => £11.80
[3] => £13.30
)
)
The problem comes when I try to use this in a foreach loop to process data from a CSV that I’ve converted to an array:
foreach($arrJobs as $job)
{
$str = $job['payDetails1'] . ' ' . $job['payDetails2'];
// Try to find salary from string
preg_match_all("/£\s*\d+(?:\.\d+)?/", $str, $matches);
print_r($matches);
}
// In this example the output from every item is an empty array:
Array
(
[0] => Array
(
)
)
The string I used to test the function in the first example was obtained by echoing the value of $str in the second example and copying and pasting it.
I don’t understand why the same string returns different results. Why does it work when I paste the string into a variable, but find no matches when the string is retrieved from the CSV?
[Answer derived from comments and feedback above]
The problem
The problem here is that your source file and your CSV input are not saved with the same character encoding.
All built-in string functions in PHP (including the PCRE functions when the /u flag is not used) operate blindly on sequences of bytes and do not recognize characters as such. This means that for scripts that contain characters outside the ASCII range the runtime behavior will change depending on which encoding the script is saved in, since these characters will be converted to bytes differently for each and every encoding used in practice. Your script contains one such character: the pound sign.
A quick solution
Assuming that the possible encodings in play here are ISO 8859-1 (Western European) and UTF-8, all the remaining characters matched by your regular expression have the same representation in both encodings so they will present no problem. So let’s see what we can do about the pound sign.
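To make the mismatch concrete, here is a minimal sketch of what the pound sign looks like as raw bytes in each of the two encodings, and why a pattern built from one form does not match input in the other. The sample strings are made up for illustration, and the bytes are written as escapes so the snippet behaves the same however it is saved.

```php
<?php
// The same character, "£", as raw bytes in each encoding.
$latin1 = "\xa3";      // ISO 8859-1: one byte
$utf8   = "\xc2\xa3";  // UTF-8: two bytes

echo bin2hex($latin1), "\n"; // a3
echo bin2hex($utf8), "\n";   // c2a3

// A pattern containing the UTF-8 bytes, as if the script were saved as UTF-8.
$pattern = "/" . $utf8 . "\s*\d+(?:\.\d+)?/";

// Hypothetical inputs: the same text in each encoding.
$count_latin1 = preg_match_all($pattern, "Sat - " . $latin1 . "11.80", $m1);
$count_utf8   = preg_match_all($pattern, "Sat - " . $utf8 . "11.80", $m2);

echo $count_latin1, "\n"; // 0: the input has no \xc2 byte in front of \xa3
echo $count_utf8, "\n";   // 1
```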
Typically you would solve this problem by replacing the literal £ with an alternation group that covers all of the character’s encodings. That would be (\xa3|\xc2\xa3): the first part covers ISO 8859-1 and the second UTF-8. However, since both parts end in \xa3, the same result can also be had with \xc2?\xa3 (making the \xc2 prefix optional).
Therefore you can solve your problem in a somewhat quick and dirty manner by replacing the literal £ in your pattern with \xc2?\xa3.
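A minimal sketch of that quick fix, assuming the only two encodings involved are ISO 8859-1 and UTF-8. The $rows array is a hypothetical stand-in for the CSV data, with one sample string per encoding.

```php
<?php
$poundLatin1 = "\xa3";      // ISO 8859-1 bytes for the pound sign
$poundUtf8   = "\xc2\xa3";  // UTF-8 bytes for the pound sign

// Hypothetical sample rows standing in for the CSV data.
$rows = [
    'utf8'   => "Days - {$poundUtf8}9.20 to {$poundUtf8}11.20",
    'latin1' => "Sat - {$poundLatin1}11.80 Sun - {$poundLatin1}13.30",
];

// \xc2? makes the UTF-8 lead byte optional, so both byte forms match.
// No /u flag: the pattern is applied to raw bytes.
$pattern = '/\xc2?\xa3\s*\d+(?:\.\d+)?/';

$counts = [];
foreach ($rows as $label => $str) {
    $counts[$label] = preg_match_all($pattern, $str, $matches);
    echo $label, ': ', $counts[$label], " matches\n";
}
```

Note that the matched substrings keep whatever encoding the input had, so they may still need converting before being displayed or stored.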
A better solution
The best solution, however, would be to always work in UTF-8. To do this, you would need to save your script as UTF-8 and convert your CSV input to UTF-8 (you can use iconv to do this).
This way you can go back to saving a literal pound sign in your script, and still be safe in the knowledge that it will work correctly no matter what the input encoding of your CSV data is.
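One way this could look, as a rough sketch: the helper name toUtf8 is hypothetical, and it assumes that any input which is not already valid UTF-8 is ISO 8859-1. The pattern uses \x{a3} with the /u flag instead of a literal pound sign only so that the snippet does not depend on how it is saved; in a script saved as UTF-8, a literal £ with /u should behave the same way.

```php
<?php
// Hypothetical helper: return $s as UTF-8, assuming anything that is not
// already valid UTF-8 is ISO 8859-1.
function toUtf8(string $s): string
{
    // With /u, preg_match() returns false when the subject is invalid UTF-8.
    if (preg_match('//u', $s) === 1) {
        return $s;
    }
    $converted = iconv('ISO-8859-1', 'UTF-8', $s);
    if ($converted === false) {
        throw new RuntimeException('Could not convert input to UTF-8');
    }
    return $converted;
}

// Hypothetical CSV value in ISO 8859-1.
$latin1Row = "Sat - \xa3" . "11.80";
$utf8Row   = toUtf8($latin1Row);

// With /u, pattern and subject are treated as UTF-8; \x{a3} is U+00A3.
$count = preg_match_all('/\x{a3}\s*\d+(?:\.\d+)?/u', $utf8Row, $matches);

echo $count, "\n";                  // 1
echo bin2hex($matches[0][0]), "\n"; // c2a331312e3830
```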