Here’s a simple scanner, that tokenizes text according to certain rules, and labels the

Editorial Team

Asked: June 17, 2026 (2026-06-17T16:23:48+00:00)

Here’s a simple scanner that tokenizes text according to certain rules and labels the tokens.

What is the best way to handle unknown characters and label them as
unknown?
Is there a recommended way or library to speed things up while
producing similar results and staying relatively simple?

Example:

import re

def alpha(scanner,token):
    return token, 'a'

def numeric(scanner,token):
    return token,'rn'

def punctuation(scanner,token):
    return token, 'p'

def superscript(scanner,token):
    return token, 'sn'

scanner = re.Scanner([
    (u"[a-zA-Z]+", alpha),
    (u"[.,:;!?]", punctuation),
    (u"[0-9]+", numeric),
    (u"[\xb9\u2070\xb3\xb2\u2075\u2074\u2077\u2076\u2079\u2078]", superscript),
    (r"[\s\n]+", None), # whitespace, newline
    ])

tokens, _ = scanner.scan("This is a little test? With 7,9 and 6.")
print tokens

out:

[('This', 'a'), ('is', 'a'), ('a', 'a'), ('little', 'a'), ('test', 'a'),
 ('?', 'p'), ('With', 'a'), ('7', 'rn'), (',', 'p'), ('9', 'rn'), 
 ('and', 'a'), ('6', 'rn'), ('.', 'p')]

P.S. The functions defined above will probably go on to categorize the tokens further.


1 Answer

Editorial Team · Answer 1 · 2026-06-17T16:23:49+00:00

re.Scanner tries the patterns in the order provided, so you can add a very general pattern at the end to catch “unknown” characters:

(r".", unknown)

import re

def alpha(scanner,token):
    return token, 'a'

def numeric(scanner,token):
    return token,'rn'

def punctuation(scanner,token):
    return token, 'p'

def superscript(scanner,token):
    return token, 'sn'

def unknown(scanner,token):
    return token, 'uk'

scanner = re.Scanner([
    (r"[a-zA-Z]+", alpha),
    (r"[.,:;!?]", punctuation),
    (r"[0-9]+", numeric),
    (r"[\xb9\u2070\xb3\xb2\u2075\u2074\u2077\u2076\u2079\u2078]", superscript),
    (r"[\s\n]+", None), # whitespace, newline
    (r".", unknown)
    ])

tokens, _ = scanner.scan("This is a little test? With 7,9 and 6. \xa0-\xaf")
print tokens

yields

[('This', 'a'), ('is', 'a'), ('a', 'a'), ('little', 'a'), 
('test', 'a'), ('?', 'p'), ('With', 'a'), ('7', 'rn'), (',', 'p'), 
('9', 'rn'), ('and', 'a'), ('6', 'rn'), ('.', 'p'), ('\xa0', 'uk'), 
('-', 'uk'), ('\xaf', 'uk')]
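
For readers on Python 3, the same idea can be sketched as below. This is a minimal sketch rather than the original code: the `label` and `make_scanner` helpers and the sample text are made up for illustration, and `re.Scanner` itself is an undocumented class in the standard library. The sketch also shows what the second value returned by `scan` holds when there is no catch-all rule: the unmatched remainder of the input.

```python
import re

def label(tag):
    # Build a callback that pairs each matched token with a fixed label.
    return lambda scanner, token: (token, tag)

def make_scanner(catch_all):
    # Hypothetical helper: same rule order as above, catch-all optional.
    rules = [
        (r"[a-zA-Z]+", label("a")),
        (r"[.,:;!?]", label("p")),
        (r"[0-9]+", label("rn")),
        (r"\s+", None),  # skip whitespace
    ]
    if catch_all:
        rules.append((r".", label("uk")))  # must come last
    return re.Scanner(rules)

text = "abc 12 @ xyz"

# Without the catch-all, scanning stops at the first unmatched character.
tokens, rest = make_scanner(catch_all=False).scan(text)
print(tokens, repr(rest))

# With the catch-all, every character is consumed and labelled.
tokens, rest = make_scanner(catch_all=True).scan(text)
print(tokens, repr(rest))
```

Without the catch-all, scanning stops at `@` and `rest` holds `'@ xyz'`; with it, `@` is labelled `'uk'` and `rest` is empty. Checking `rest` instead of discarding it is another way to notice input the rules did not cover.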

Some of your patterns are unicode, and one is a str. In Python 2, the pattern and the string to be matched can each be either unicode or str.

However, in Python 3:

Unicode strings and 8-bit strings cannot be mixed: that is, you cannot
match an Unicode string with a byte pattern or vice-versa

It is therefore good practice not to mix them, even in Python 2.
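
The Python 3 restriction is easy to see in a minimal sketch (assuming Python 3; the `try_match` helper and the sample strings are made up for illustration):

```python
import re

# A str pattern against a str: fine.
assert re.match(r"[a-z]+", "abc").group() == "abc"

# A bytes pattern against bytes: also fine.
assert re.match(rb"[a-z]+", b"abc").group() == b"abc"

def try_match(pattern, text):
    # Hypothetical helper: report whether the pattern/text types are compatible.
    try:
        re.match(pattern, text)
    except TypeError as exc:
        return "TypeError: %s" % exc
    return "ok"

# Mixing the two kinds raises TypeError in Python 3.
print(try_match(rb"[a-z]+", "abc"))
print(try_match(r"[a-z]+", b"abc"))
```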

I think your code is wonderfully simple (except for the superscript regex. Eek!). I don’t know of a library that would make it any simpler.
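
One possible way to tame the superscript pattern, and to avoid relying on the undocumented `re.Scanner`, is a single compiled regex with named groups, in the style of the tokenizer example in the `re` module documentation. This is a minimal sketch assuming Python 3; the names `SUPERSCRIPTS`, `TOKEN_RE`, and `tokenize` are illustrative, and it is not claimed to be faster than the scanner above.

```python
import re

# The ten superscript digits, written literally instead of as escapes.
SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹"

# Group names double as labels; alternatives are tried left to right,
# so the catch-all group must come last.
TOKEN_RE = re.compile(
    r"(?P<a>[a-zA-Z]+)"
    r"|(?P<p>[.,:;!?])"
    r"|(?P<rn>[0-9]+)"
    r"|(?P<sn>[" + SUPERSCRIPTS + r"])"
    r"|(?P<ws>\s+)"
    r"|(?P<uk>.)"
)

def tokenize(text):
    # Return (token, label) pairs, dropping whitespace.
    return [
        (m.group(), m.lastgroup)
        for m in TOKEN_RE.finditer(text)
        if m.lastgroup != "ws"
    ]

print(tokenize("x² is 4, ok#"))
```

Here `m.lastgroup` is the name of the group that matched, so the labels live in the pattern itself and no callback functions are needed unless the tokens are to be categorized further.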
