Here’s a simple scanner that tokenizes text according to certain rules and labels the tokens.
- What is the best way to handle unknown characters and label them as unknown?
- Is there a recommended way or library to speed things up while accomplishing similar results and remaining relatively simple?
Example:
import re

def alpha(scanner, token):
    return token, 'a'

def numeric(scanner, token):
    return token, 'rn'

def punctuation(scanner, token):
    return token, 'p'

def superscript(scanner, token):
    return token, 'sn'

scanner = re.Scanner([
    (u"[a-zA-Z]+", alpha),
    (u"[.,:;!?]", punctuation),
    (u"[0-9]+", numeric),
    (u"[\xb9\u2070\xb3\xb2\u2075\u2074\u2077\u2076\u2079\u2078]", superscript),
    (r"[\s\n]+", None),  # whitespace, newline
])

tokens, _ = scanner.scan("This is a little test? With 7,9 and 6.")
print tokens
out:
[('This', 'a'), ('is', 'a'), ('a', 'a'), ('little', 'a'), ('test', 'a'),
('?', 'p'), ('With', 'a'), ('7', 'rn'), (',', 'p'), ('9', 'rn'),
('and', 'a'), ('6', 'rn'), ('.', 'p')]
P.S. The functions defined above will probably go on to categorize the tokens further.
re.Scanner matches patterns in the order provided, so you can add a very general pattern at the end to catch “unknown” characters.
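A minimal sketch of that idea, written for Python3 and trimmed to a few of the question's patterns. The `unknown` handler and its `'u'` label are hypothetical names chosen for illustration, not part of the original code:

```python
import re

def alpha(scanner, token):
    return token, 'a'

def numeric(scanner, token):
    return token, 'rn'

def punctuation(scanner, token):
    return token, 'p'

def unknown(scanner, token):
    # Hypothetical handler: labels whatever no earlier pattern matched.
    return token, 'u'

scanner = re.Scanner([
    (r"[a-zA-Z]+", alpha),
    (r"[.,:;!?]", punctuation),
    (r"[0-9]+", numeric),
    (r"\s+", None),    # skip whitespace (including newlines)
    (r".", unknown),   # catch-all: must be listed last
])

tokens, remainder = scanner.scan("Test 7 # ok.")
print(tokens)
```

Because the catch-all is tried only after every other pattern has failed, it picks up one unrecognized character at a time (here the `#`), and `remainder` comes back empty instead of holding the unscanned tail of the input.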
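Some of your patterns are unicode, and one is a str. It is true that in Python2 the pattern and the strings to be matched can be either unicode or str. However, in Python3 the pattern and the string must be of the same type: a bytes pattern cannot be used on a str, and vice versa.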
It is good practice, therefore, not to mix them, even in Python2.
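To see the Python3 behaviour for yourself, here is a small sketch. `try_match` is a hypothetical helper written only for this demonstration:

```python
import re

def try_match(pattern, text):
    # Report whether the pattern matched, or the TypeError raised
    # when the pattern and text types are mixed.
    try:
        return re.match(pattern, text) is not None
    except TypeError as exc:
        return "TypeError: {}".format(exc)

same_type = try_match(r"[0-9]+", "42")   # str pattern, str text
mixed = try_match(b"[0-9]+", "42")       # bytes pattern, str text

print(same_type)
print(mixed)
```

The first call matches normally; the second is rejected by the re module in Python3 before any matching happens.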
I think your code is wonderfully simple (except for the superscript regex. Eek!). I don’t know of a library which would make it any simpler.
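If the superscript regex bothers you too, one possible tidy-up (my own suggestion, so treat it as a sketch) is to list the code points in numeric order and use a range for ⁴ to ⁹, which are contiguous in Unicode. The names `original`, `readable` and `same` are just for this comparison:

```python
import re

# The character class from the question, in its original order.
original = re.compile(u"[\xb9\u2070\xb3\xb2\u2075\u2074\u2077\u2076\u2079\u2078]")

# Same ten characters (superscript 0-9), sorted, with a range for 4-9.
readable = re.compile(u"[\u2070\xb9\xb2\xb3\u2074-\u2079]")

# Check both classes agree on the superscript digits and a few other characters.
samples = u"\u2070\xb9\xb2\xb3\u2074\u2075\u2076\u2077\u2078\u2079a1 "
same = all(bool(original.match(c)) == bool(readable.match(c)) for c in samples)
print(same)
```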