I have a string in unicode and I need to return the first N characters.
I am doing this:
result = unistring[:5]
but of course the length of unicode strings != length of characters.
Any ideas? Is using re the only solution?
Edit: More info
unistring = "Μεταλλικα" #Metallica written in Greek letters
result = unistring[:1]
returns-> ?
I think that unicode strings use two bytes per character, which is why this happens. If I do:
result = unistring[:2]
I get
M
which is correct,
So, should I always slice*2 or should I convert to something?
Unfortunately, for historical reasons, Python versions prior to 3.0 have two string types: byte strings (str) and Unicode strings (unicode).
Prior to the unification in Python 3.0 there are two ways to declare a string literal:
unistring = "Μεταλλικα"
which is a byte string, and
unistring = u"Μεταλλικα"
which is a Unicode string.
The reason you see ? when you do result = unistring[:1] is that your string is a byte string, not a Unicode string. Each Greek letter occupies more than one byte in it (two, if your source is encoded as UTF-8, which matches what you see with [:2]), so slicing off a single byte leaves an incomplete character that cannot be displayed. You have probably seen this kind of problem if you ever used a really old email client and received emails from friends in countries like Greece, for example.
So in Python 2.x, if you need to handle Unicode, you have to do it explicitly. Take a look at this introduction to dealing with Unicode in Python: Unicode HOWTO
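To make the difference concrete, here is a minimal sketch, assuming the text is encoded as UTF-8 (the question does not say which encoding is in use). The helper name first_chars is hypothetical and only there for illustration. The code is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Assumption: UTF-8 is the encoding in play; other encodings give other byte counts.

text = u"Μεταλλικα"          # Unicode string: one item per character
raw = text.encode("utf-8")   # byte string: each Greek letter takes two bytes here

# Slicing the Unicode string counts characters.
first_char = text[:1]

# Slicing the byte string counts bytes, so one byte is only half a letter.
# Decoding with "replace" turns the broken fragment into U+FFFD.
half_letter = raw[:1].decode("utf-8", "replace")

# Two bytes happen to make up one complete Greek letter.
two_bytes = raw[:2].decode("utf-8")


def first_chars(value, n, encoding="utf-8"):
    """Return the first n characters; decode byte strings first (hypothetical helper)."""
    if isinstance(value, bytes):
        value = value.decode(encoding)
    return value[:n]
```

Decoding once at the boundary and slicing the Unicode string afterwards avoids the "slice * 2" workaround, which would break as soon as the text contained a character with a different byte length (an ASCII letter, for instance).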