I’m having some trouble with this.
I am reading in some text and trying to extract prices from it. That I am fine with, but I am trying to write some code to determine the name of the currency from the symbol in the text with if statements similar to these
if ($curr eq "\$") {
    print CURRENCY "Currency: Dollars($curr)\n";
}
elsif ($curr eq "£") {
    print CURRENCY "Currency: Pounds($curr)\n";
}
elsif ($curr eq "€") {
    print CURRENCY "Currency: Euros($curr)\n";
}
This works for $ (which obviously has to be escaped), but not for the pound symbol or the euro symbol. From what my attempts to google the issue turned up, I assume it has something to do with Unicode encoding, but nothing I found was much help. Can anyone help me here?
How to talk about Unicode characters
It sounds like you are having a problem with encodings. You seem to have Unicode characters in your Perl program’s source code. You need the utf8 pragma, written use utf8; (a pragma is a fancy way of saying a lowercase module name that acts like a compiler directive).
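As a minimal sketch of what that looks like, here is the comparison from the question with the pragma in place. It assumes the file itself is saved as UTF-8. The currency_name helper is a hypothetical name used only for illustration, and the STDOUT layer is there just so the symbols print cleanly.

```perl
use strict;
use warnings;
use utf8;                                  # the literals below are UTF-8 in this source file

binmode(STDOUT, ":encoding(UTF-8)");       # encode characters on the way out

# Hypothetical helper: map a single currency symbol to a currency name.
sub currency_name {
    my ($curr) = @_;
    return "Dollars" if $curr eq "\$";
    return "Pounds"  if $curr eq "£";
    return "Euros"   if $curr eq "€";
    return "Unknown";
}

for my $curr ("\$", "£", "€") {
    printf "Currency: %s(%s)\n", currency_name($curr), $curr;
}
```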
Put that at the top of your program, and then make sure that you are actually editing it with an editor that knows to save it as UTF-8 text. You may be able to use the file command, if you have it, to verify that the file is in UTF-8.
An alternative that doesn’t require your Perl source to be in UTF-8 is to use code point numbers or Unicode character names instead of literals. To get named Unicode characters, load the charnames pragma with use charnames ":full"; at the top of your program.
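A small sketch of the named-character approach, assuming the charnames pragma with the ":full" argument. The %name_for table is a hypothetical lookup invented for this example; the point is only that the source file stays pure ASCII.

```perl
use strict;
use warnings;
use charnames ":full";                     # enables "\N{...}" with full Unicode names

binmode(STDOUT, ":encoding(UTF-8)");       # encode characters on the way out

# Hypothetical lookup table keyed by currency symbol.
my %name_for = (
    "\$"             => "Dollars",
    "\N{POUND SIGN}" => "Pounds",
    "\N{EURO SIGN}"  => "Euros",
);

for my $curr ("\$", "\N{POUND SIGN}", "\N{EURO SIGN}") {
    printf "Currency: %s(%s)\n", $name_for{$curr} // "Unknown", $curr;
}
```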
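Now you can use the "\N{…}" notation to talk about named characters, such as "\N{POUND SIGN}" or "\N{EURO SIGN}".
Another way is to use the numeric code point, if you know it, for example by comparing ord($curr) with 0xA3 for the pound sign.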
You can use the exact number in strings and patterns if you want, too, written as "\x{A3}" or "\x{20AC}".
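Both numeric forms can be sketched together. This assumes $curr already holds one decoded character; the helper name currency_from_code_point and the three-symbol character class are illustrative choices, not anything prescribed above.

```perl
use strict;
use warnings;

# Hypothetical helper: name a currency from the code point of its symbol.
sub currency_from_code_point {
    my ($curr) = @_;
    my $cp = ord($curr);                   # code point of the first character
    return "Dollars" if $cp == 0x24;       # U+0024 DOLLAR SIGN
    return "Pounds"  if $cp == 0xA3;       # U+00A3 POUND SIGN
    return "Euros"   if $cp == 0x20AC;     # U+20AC EURO SIGN
    return "Unknown";
}

my $curr = chr(0xA3);                      # stand-in for a decoded character from your input

# The same number inside a string literal ...
my $is_pound = $curr eq "\x{A3}";

# ... and inside a pattern, here a class of the three symbols from the question.
my $is_known = $curr =~ /^[\x{24}\x{A3}\x{20AC}]$/;

print currency_from_code_point($curr), "\n";
print "matches the string literal\n"  if $is_pound;
print "matches the character class\n" if $is_known;
```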
That will free you from having to put non-ASCII in your Perl source, which is probably a good idea, even though literal magic numbers like that probably aren’t. However, you still have to account for your data source being in some encoding or another. I’m going to assume it’s in some Unicode encoding, probably UTF-8. I hope it’s not CESU-8 from Oracle or Java’s “modified UTF-8”.
The Unicode ‘Currency_Symbol’ Property
The only right way to detect any arbitrary currency symbol that is represented in text by a single Unicode character is by detecting the Unicode currency symbol property, \p{Sc} or \p{Currency_Symbol}. Those are Unicode properties, which are character classes you can use in regexes.
You’ll want to say something like $curr =~ /\p{Sc}/ in your test.
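Here is a hedged sketch of such a test. It goes one step further than the property check and asks charnames::viacode for the character's Unicode name, which is one possible way to get a name from a symbol; the helper currency_symbol_name is hypothetical.

```perl
use strict;
use warnings;
use charnames ":full";                     # "\N{...}" plus charnames::viacode

binmode(STDOUT, ":encoding(UTF-8)");       # encode characters on the way out

# Hypothetical helper: return the Unicode name of a currency symbol,
# or nothing if the character does not have the Sc property.
sub currency_symbol_name {
    my ($curr) = @_;
    return unless $curr =~ /^\p{Sc}$/;
    return charnames::viacode(ord $curr);
}

for my $curr ("\$", "\N{POUND SIGN}", "\N{EURO SIGN}", "x") {
    my $name = currency_symbol_name($curr);
    print defined($name) ? "Currency symbol: $name ($curr)\n"
                         : "Not a currency symbol: $curr\n";
}
```

Note that the name you get back is the character name (for instance EURO SIGN), not a currency name, so you would still need your own table to turn it into "Euros".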
But for that to work, you have to have read in $curr from an input source in the :utf8 encoding. In your own source, you’d say use utf8; and for a file you open, you’d either name the encoding layer in the open call or set it afterwards with binmode on the handle.
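A self-contained sketch of both file-handle forms follows. It writes a small temporary file first so that there is something to read; the file contents and variable names are made up for the example. It uses the :encoding(UTF-8) spelling, which is the strict variant of the layer, rather than plain :utf8.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small UTF-8 file so the sketch has something to read.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ":encoding(UTF-8)");         # form 1: layer set on an existing handle
print {$out} "\x{20AC}42\n";               # EURO SIGN followed by an amount
close($out) or die "close failed: $!";

# Form 2: layer named directly in the open call.
open(my $in, "<:encoding(UTF-8)", $path) or die "cannot open $path: $!";
my $line = <$in>;
close($in);

# The line is now decoded, so the symbol is a single character.
my ($curr) = $line =~ /(\p{Sc})/;
printf "found U+%04X\n", ord($curr) if defined $curr;
```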
Technically, you should probably use :encoding(utf8) except for the use utf8; in your own source file, so that you can’t get spoofed. Don’t ask. ☹
If you’re using a module like CGI.pm or XML::Simple, it should just work, but it depends.
Properties of Currency Symbol characters
Here’s the full deal:
Finding all \p{Sc} characters
And here are all 46 of the Unicode characters with the Sc a.k.a. Currency_Symbol property, current as of Unicode 5.2: (sorry for the formatting issues; I believe it’s due to directionality)
Whereas here are the ones in the BMP that weren’t in Unicode 4.1 yet; notice how you can combine properties and negations to pull sets of Unicode characters.
If you don’t have unichars and uniprops on your system, just send me mail, and I’ll send them to you. They’re little tiny utility programs in pure Perl, no extra modules required.
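In the meantime, a rough stand-in (not the unichars program itself) is to walk every code point and keep the ones that match \p{Sc}. This is only a sketch: the count it reports depends on the Unicode version bundled with your perl, so it need not match the 46 quoted above for Unicode 5.2. It prints code points and names only, which sidesteps the directionality problem.

```perl
use strict;
use warnings;
use charnames ();                          # for charnames::viacode

# Collect every code point whose character has the Sc property.
my @currency_code_points;
{
    no warnings "utf8";                    # older perls may warn about non-characters
    for my $cp (0 .. 0x10FFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        push @currency_code_points, $cp if chr($cp) =~ /\p{Sc}/;
    }
}

for my $cp (@currency_code_points) {
    printf "U+%04X %s\n", $cp, charnames::viacode($cp) // "(unnamed)";
}
printf "%d characters with \\p{Sc}\n", scalar @currency_code_points;
```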