Question

Asked: May 15, 2026

I am new to Python and am using it with NLTK in my project. After word-tokenizing the raw data obtained from a webpage, I got a list containing '\xe2', '\xe3', '\x98', etc. I do not need these and want to delete them.

I simply tried

if '\x' in a

and

if a.startswith('\xe')

and it gives me an error saying "invalid \x escape".

But when I try a regular expression

re.search('^\\x',a)

I get

Traceback (most recent call last):
  File "<pyshell#83>", line 1, in <module>
    print re.search('^\\x',a)
  File "C:\Python26\lib\re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "C:\Python26\lib\re.py", line 245, in _compile
    raise error, v # invalid expression
error: bogus escape: '\\x'

Even re.search('^\\x',a) is not identifying it.

I am confused by this, and even googling didn't help (I might be missing something). Please suggest a simple way to remove such strings from the list, and explain what was wrong with the above.

Thanks in advance!

1 Answer

Editorial Team · Answer 1 · 2026-05-15T21:51:06+00:00

It helps here to understand the difference between a string literal and a string.

A string literal is a sequence of characters in your source code. When parsed and compiled by the Python interpreter, it produces a string, which is a sequence of characters in memory.

For example, the string literal "a" produces the string a.

String literals can take a number of forms. All of these produce the same string a:

"a"
'a'
r"a"
"""a"""
r'''a'''
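
As a quick check of the claim above, the following sketch compares the five spellings. The names forms and all_equal are made up for this illustration.

```python
# Five different literal spellings; each one is parsed into the same
# one-character string.
forms = ["a", 'a', r"a", """a""", r'''a''']

# Every entry compares equal to the plain string "a".
all_equal = all(form == "a" for form in forms)
print(all_equal)
```

The quoting style and the r prefix only affect how the parser reads the source text. Once the string exists in memory, nothing records which spelling produced it.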

Source code is traditionally ASCII-only, but we'd like it to contain string literals that can produce characters beyond ASCII. To do this, escapes can be used. For example, the string literal "\xe2" produces a single-character string whose one character has the integer value E2 hexadecimal, or 226 decimal.
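
A small sketch of what that escape produces, next to the raw-string spelling of the same source characters. It is written to run on Python 3 (the question uses Python 2.6, where "\xe2" is a one-byte string rather than a one-character text string; the length and integer value are the same either way). The variable names are illustrative only.

```python
# The escape \xe2 is processed by the parser: the result is ONE character.
escaped = "\xe2"
print(len(escaped))   # 1
print(ord(escaped))   # 226, i.e. 0xE2

# A raw literal switches escape processing off: the result is the FOUR
# characters backslash, x, e, 2.
raw = r"\xe2"
print(len(raw))       # 4
print(raw[0] == "\\") # True
```

This is why looking for a backslash in the tokens cannot succeed: the tokens hold the single character with value 0xE2, and '\xe2' is only how the interpreter displays it.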

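This explains the error about "\x" being an invalid escape: after \x the parser expects exactly two hexadecimal digits giving the value of a character, and '\x' supplies none while '\xe' supplies only one. The regular expression fails for the same reason one level down: the literal '^\\x' produces the pattern ^\x, and the regex engine also expects hexadecimal digits after \x.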
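
Both failures can be reproduced without crashing the session. The sketch below feeds literal source text to the built-in compile() and patterns to re.compile(). The helpers literal_compiles and pattern_compiles are hypothetical names for this example, and the exact error messages differ between Python versions, so only success or failure is checked.

```python
import re

def literal_compiles(source):
    """Return True if the Python parser accepts `source` as an expression."""
    try:
        compile(source, "<demo>", "eval")
        return True
    except (SyntaxError, ValueError):
        # Python 3 reports a SyntaxError here; Python 2 raised ValueError.
        return False

def pattern_compiles(pattern):
    """Return True if the regex engine accepts `pattern`."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Raw strings are used so the source text reaches compile() untouched.
no_digits_ok = literal_compiles(r"'\x'")     # \x with no hex digits
one_digit_ok = literal_compiles(r"'\xe'")    # \x with one hex digit
two_digits_ok = literal_compiles(r"'\xe2'")  # \x with two hex digits

# '^\\x' is a valid Python literal, but the pattern it produces is ^\x,
# which the regex engine rejects as an incomplete escape.
bare_x_pattern_ok = pattern_compiles("^\\x")
range_pattern_ok = pattern_compiles(r"[\x90-\xff]")

print(no_digits_ok, one_digit_ok, two_digits_ok)
print(bare_x_pattern_ok, range_pattern_ok)
```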

To detect if a string has any characters in a certain range, you can use a regex with a character class specifying the lower and upper bounds of the characters you don’t want:

if re.search(r"[\x90-\xff]", a):
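
Applied to the original problem, that test can be used to filter the token list. This is a minimal sketch under a few assumptions: the sample tokens are invented, drop_high_tokens is a hypothetical helper name, and the code is written for Python 3, where the pattern matches characters with code points 0x90 to 0xFF.

```python
import re

# Same character class as in the answer, compiled once for reuse.
HIGH_CHARS = re.compile(r"[\x90-\xff]")

def drop_high_tokens(tokens):
    """Keep only the tokens that contain no character in \\x90-\\xff."""
    return [tok for tok in tokens if not HIGH_CHARS.search(tok)]

# Invented sample data resembling the list described in the question.
tokens = ["hello", "\xe2", "world", "\xe3", "\x98", "caf\xe9"]
cleaned = drop_high_tokens(tokens)
print(cleaned)  # ['hello', 'world']
```

Note that this drops any token containing such a character, including "caf\xe9". If the lower bound should cover every non-ASCII value, the class could start at \x80 instead of \x90.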
