I’m writing a simple regular expression parser for the output of the sensors utility on Ubuntu. Here’s an example of a line of text I’m parsing:
temp1: +31.0°C (crit = +107.0°C)
And here’s the regex I’m using to match that (in Python):
temp_re = re.compile(r'(temp1:)\s+(\+|-)(\d+\.\d+)\W\WC\s+'
r'\(crit\s+=\s+(\+|-)(\d+\.\d+)\W\WC\).*')
This code works as expected and matches the example text I’ve given above. The only bits I’m really interested in are the numbers, so this bit:
(\+|-)(\d+\.\d+)\W\WC
which starts by matching the + or - sign and ends by matching the °C.
My question is, why does it take two \W (non-alphanumeric) characters to match ° rather than one? Will the code break on systems where Unicode is represented differently to mine? If so, how can I make it portable?
Possible portable solution:
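Convert the input data to unicode, and use the re.UNICODE flag in your regular expressions. On a byte string, \W matches one byte at a time, and ° is two bytes in UTF-8 (0xC2 0xB0), which is why your pattern needs two \W. On decoded text, ° is a single character.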
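As a rough illustration of the difference, here is a minimal sketch comparing a bytes pattern with a text pattern. It assumes the sensors output is UTF-8 encoded, uses a shortened pattern covering just the sign, number and °C, and the variable names are made up for the example. It is written to run on Python 3, where str patterns are already Unicode-aware and re.UNICODE is redundant but harmless.

```python
import re

# The sample line as raw bytes, assuming sensors emits UTF-8.
raw = u'temp1: +31.0\u00b0C (crit = +107.0\u00b0C)'.encode('utf-8')

# On bytes, the degree sign is two bytes (0xC2 0xB0), so two \W are needed.
bytes_re = re.compile(br'([+-])(\d+\.\d+)\W\WC')

# On decoded text, the degree sign is one character, so one \W is enough.
text_re = re.compile(r'([+-])(\d+\.\d+)\WC', re.UNICODE)

text = raw.decode('utf-8')

byte_matches = bytes_re.findall(raw)
text_matches = text_re.findall(text)

print(byte_matches)
print(text_matches)
```

Both patterns find the same two temperatures. Only the text version is independent of how many bytes the encoding uses for °.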
EDIT
@netvope already pointed this out in the comments on the question.
Update
Notes from J.F. Sebastian's comments about input encoding:
So, to decode the input data to unicode, you should basically* use the encoding from the system locale, obtained with locale.getpreferredencoding(), so that the data is decoded correctly.
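A minimal sketch of that decoding step might look like this. The helper name decode_output and its optional encoding parameter are hypothetical, added only for illustration. locale.getpreferredencoding() is the standard library call mentioned above.

```python
import locale


def decode_output(raw, encoding=None):
    """Decode raw bytes to text (hypothetical helper, not a library API).

    Falls back to the system locale's preferred encoding when no
    explicit encoding is given.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding()
    return raw.decode(encoding)


# Explicit encoding, for when the producer's encoding is known.
sample = decode_output(u'+31.0\u00b0C'.encode('utf-8'), 'utf-8')
print(len(sample))
```

In practice the raw bytes would come from something like subprocess.check_output(['sensors']), and the decoded text would then be passed to the regular expression.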
Why basically? Because on Russian Win7, with cp1251 as the preferred encoding, suppose we have, for example, a script.py that encodes its output as utf-8, and we need to parse that output. Decoding it with the preferred encoding will produce wrong results: 'В°' instead of '°'. So in some cases you need to know the encoding of the input data.
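The mismatch can be reproduced on any system by decoding UTF-8 bytes with the cp1251 codec explicitly. This is a small sketch with illustrative variable names:

```python
# The degree sign encoded as UTF-8 is the two bytes 0xC2 0xB0.
data = u'\u00b0'.encode('utf-8')

# Decoding with the wrong codec maps each byte to its own character:
# 0xC2 is the Cyrillic letter Ve and 0xB0 is the degree sign in cp1251.
wrong = data.decode('cp1251')

# Decoding with the codec that was actually used gives one character.
right = data.decode('utf-8')

print(repr(wrong), repr(right))
```

With the wrong codec the text contains an extra letter before °, so a pattern expecting a single \W before the C no longer matches.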