Possible Duplicate:
In Python, how to list all characters matched by POSIX extended regex `[:space:]`?
How can I get a list of all whitespace characters in UTF-8 in Python, including non-breaking space etc.? I’m using Python 2.7.
`unicodedata.category` will tell you the category code for any given character; the characters you want have code `Zs`. There doesn’t appear to be any way to extract a list of the characters within a category except by iterating over all of them.

(Note: if you do this test using Python 3.4 or later, MONGOLIAN VOWEL SEPARATOR will not appear in the list. Python 2.7 shipped with data from Unicode 5.2; this character was reclassified as general category Cf (“formatting control”) in Unicode 6.3, which is the version that Python 3.4 used for its data. See https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/ and https://www.unicode.org/L2/L2013/13004-vowel-sep-change.pdf for more detail than you probably require.)
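The iteration over all code points described above can be sketched as follows. This is a minimal sketch written for Python 3 (an assumption, since the question targets 2.7; there you would use `unichr` and `xrange` instead of `chr` and `range`). The helper name `chars_in_category` is hypothetical, not part of `unicodedata`.

```python
import sys
import unicodedata


def chars_in_category(category):
    """Return every character whose general category equals `category`.

    Hypothetical helper: unicodedata has no built-in reverse lookup,
    so this walks every code point from 0 to sys.maxunicode.
    """
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    ]


if __name__ == "__main__":
    for ch in chars_in_category("Zs"):
        # The second argument is a fallback for characters without a name.
        print("U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<no name>")))
```

The exact list depends on the Unicode version bundled with the interpreter, so the output can differ between Python releases.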
You may also want to include categories `Zl` and `Zp`, which add U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR. And you almost certainly do want to include all of the ASCII control characters that are normally considered whitespace (U+0009 through U+000D: tab, line feed, vertical tab, form feed, and carriage return); for historical reasons (I presume), these are in category `Cc`.

The other 60-odd `Cc` characters should not be considered whitespace, even if their official name makes it sound like they are whitespace. For instance, U+0085 NEXT LINE is almost never encountered in the wild with its official meaning; it’s far more likely to be the result of an erroneous conversion from Windows-1252 to UTF-8 of U+2026 HORIZONTAL ELLIPSIS (byte 0x85 in Windows-1252).

A closely related question is “what does `\s` match in a Python regular expression?” Again the best available way to answer this question is to iterate over all characters.

(I don’t know why `unicodedata.name` doesn’t know the control characters’ names. Again, if you do this test using Python 3.4 or later, MONGOLIAN VOWEL SEPARATOR will not appear in the list.)

The result is all of the `Z*` characters, all of the `Cc` characters that are generally agreed to be whitespace, and five extra characters that are not generally agreed to be whitespace: U+001C, U+001D, U+001E, U+001F, and U+0085. Inclusion of the last group is a bug, but a largely harmless one, since using those characters for anything is also a bug.
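For the `\s` question, the same brute-force approach can be sketched like this. It is written for Python 3, where `str` patterns are Unicode-aware by default (an assumption relative to the question; on Python 2.7 you would need a unicode pattern compiled with `re.UNICODE`, plus `unichr` and `xrange`). The function name `regex_whitespace` is hypothetical.

```python
import re
import sys
import unicodedata

# In Python 3, str patterns match Unicode whitespace without extra flags.
WHITESPACE = re.compile(r"\s")


def regex_whitespace():
    """Return (code point, category, name) for every character the pattern matches."""
    matches = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if WHITESPACE.match(ch):
            # Control characters have no name, so supply a fallback
            # instead of letting unicodedata.name raise ValueError.
            name = unicodedata.name(ch, "<no name>")
            matches.append((cp, unicodedata.category(ch), name))
    return matches


if __name__ == "__main__":
    for cp, category, name in regex_whitespace():
        print("U+{:04X} {} {}".format(cp, category, name))
```

Printing the category next to each code point makes it easy to see which matches come from `Zs`, `Zl`, and `Zp` and which are `Cc` control characters.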