I’m writing a simple regular expression parser for the output of the sensors utility

Asked: May 28, 2026

I’m writing a simple regular expression parser for the output of the sensors utility on Ubuntu. Here’s an example of a line of text I’m parsing:

temp1:        +31.0°C  (crit = +107.0°C)

And here’s the regex I’m using to match that (in Python):

temp_re = re.compile(r'(temp1:)\s+(\+|-)(\d+\.\d+)\W\WC\s+' 
                     r'\(crit\s+=\s+(\+|-)(\d+\.\d+)\W\WC\).*')

This code works as expected and matches the example text I’ve given above. The only bits I’m really interested in are the numbers, so this bit:

(\+|-)(\d+\.\d+)\W\WC

which starts by matching the + or - sign and ends by matching the °C.

My question is, why does it take two \W (non-alphanumeric) characters to match ° rather than one? Will the code break on systems where Unicode is represented differently to mine? If so, how can I make it portable?

1 Answer

Editorial Team · Answer 1 · 2026-05-28T16:19:57+00:00

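Your input is most likely a UTF-8 byte string, in which the degree sign is encoded as two bytes (0xC2 0xB0). A byte-oriented pattern therefore needs two \W to consume it, and it would break under an encoding that uses a different number of bytes for °.

Possible portable solution:

Convert the input data to Unicode, and use the re.UNICODE flag in the regular expression.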

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re


data = u'temp1:        +31.0°C  (crit = +107.0°C)'
temp_re = re.compile(ur'(temp1:)\s+(\+|-)(\d+\.\d+)°C\s+' 
                     ur'\(crit\s+=\s+(\+|-)(\d+\.\d+)°C\).*', flags=re.UNICODE)

print temp_re.findall(data)

Output

[(u'temp1:', u'+', u'31.0', u'+', u'107.0')]
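
The ur'' prefix and the print statement above are Python 2 syntax. As a rough sketch of the same idea for Python 3, where str is already Unicode, something like the following should work. The name parse_temp and the merged sign-and-number group are illustrative choices, not part of the original answer:

```python
import re

# Python 3 strings are Unicode, so the degree sign can appear literally in the pattern.
TEMP_RE = re.compile(
    r'(temp1:)\s+([+-]\d+\.\d+)°C\s+'
    r'\(crit\s+=\s+([+-]\d+\.\d+)°C\)'
)


def parse_temp(line):
    """Return (label, current, critical) with temperatures as floats, or None."""
    match = TEMP_RE.search(line)
    if match is None:
        return None
    label, current, critical = match.groups()
    return label, float(current), float(critical)


print(parse_temp('temp1:        +31.0°C  (crit = +107.0°C)'))
# ('temp1:', 31.0, 107.0)
```

Capturing the sign together with the digits lets float() handle the conversion directly.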

EDIT

@netvope already pointed this out in the comments on the question.

Update

Notes from J.F. Sebastian's comments about input encoding:

check_output() returns binary data that is sometimes text (in which case it should have a known character encoding, and you can convert it to Unicode). Note that ord(u'°') == 176, so the degree sign cannot be encoded as ASCII.
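
To make the byte-versus-character difference concrete, here is a small Python 3 sketch. It assumes the sensors output is UTF-8, which is typical on Ubuntu but not guaranteed, and the variable names are illustrative:

```python
import re

text = 'temp1:        +31.0\u00b0C'   # \u00b0 is the degree sign
raw = text.encode('utf-8')            # stand-in for the bytes a subprocess returns

# One character, but two bytes once encoded as UTF-8.
print(len('\u00b0'), len('\u00b0'.encode('utf-8')))  # 1 2

# On bytes, \W consumes one byte at a time, so a single \W cannot span the sign.
one_w_bytes = re.search(rb'\d\WC', raw)
two_w_bytes = re.search(rb'\d\W\WC', raw)

# On decoded text, the degree sign is a single non-word character.
one_w_text = re.search(r'\d\WC', text)

print(one_w_bytes is None, two_w_bytes is not None, one_w_text is not None)
# True True True
```

Under a single-byte encoding such as Latin-1 the two-\W byte pattern would stop matching, which is why decoding first is the more portable route.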

So, to decode the input data to Unicode, you should basically* use the encoding from the system locale, via locale.getpreferredencoding(), e.g.:

data = subprocess.check_output(...).decode(locale.getpreferredencoding())

With the data decoded correctly, you'll get the same output even without re.UNICODE in this case.
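
A minimal Python 3 sketch of that decode-then-match flow follows. The helper decode_output and its optional encoding parameter are hypothetical, and the raw bytes stand in for what subprocess.check_output(...) would return on a UTF-8 system:

```python
import locale
import re


def decode_output(raw, encoding=None):
    """Decode subprocess bytes; fall back to the locale's preferred encoding."""
    if encoding is None:
        encoding = locale.getpreferredencoding()
    return raw.decode(encoding)


# Stand-in for subprocess.check_output(...); assumed to be UTF-8 here.
raw = 'temp1:        +31.0°C  (crit = +107.0°C)'.encode('utf-8')
data = decode_output(raw, encoding='utf-8')

temp_re = re.compile(r'([+-]\d+\.\d+)°C')
print(temp_re.findall(data))
# ['+31.0', '+107.0']
```

Leaving the encoding as an explicit parameter keeps the locale as a default rather than a hard assumption.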

Why basically? Because on a Russian Windows 7 system with cp1251 as the preferred encoding, suppose we have, for example, a script.py that encodes its output as UTF-8:

#!/usr/bin/env python
# -*- coding: utf8 -*-

print u'temp1: +31.0°C  (crit = +107.0°C)'.encode('utf-8')

And we need to parse its output:

subprocess.check_output(['python', 
                         'script.py']).decode(locale.getpreferredencoding())

will produce wrong results: 'В°' instead of '°'.

So, in some cases, you need to know the encoding of the input data.
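
The mismatch can be reproduced on any platform without a cp1251 locale. This is a small Python 3 sketch with illustrative variable names:

```python
# The degree sign encoded as UTF-8 is the two bytes 0xC2 0xB0.
utf8_bytes = '\u00b0'.encode('utf-8')

# Decoding those bytes as cp1251 maps each byte to its own character.
wrong = utf8_bytes.decode('cp1251')
right = utf8_bytes.decode('utf-8')

print(len(wrong), len(right))  # 2 1
print(wrong == '\u0412\u00b0')  # True: Cyrillic capital Ve followed by the degree sign
```

The bytes never change; only the codec chosen for decoding decides whether one or two characters come out.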
