The programmer who wrote the following line probably uses a python package called regex.
UNIT = regex.compile("(?:{A}(?:'{A})?)++|-+|\S".format(A='\p{Word_Break=ALetter}'))
Can someone help explain what A='\p{Word_Break=ALetter}' and -+ mean?
The \p{property=value} operator matches on Unicode codepoint properties, and is documented on the package index page you linked to. The entry matches any Unicode character whose codepoint has a Word_Break property with the value ALetter (there are currently 24941 matches in the Unicode codepoint database; see the Word Boundaries chapter of the Unicode Text Segmentation specification for details).

The example you gave also uses standard Python string formatting to interpolate a partial expression into the regular expression being compiled. The "{A}" part is just a placeholder for the .format(A='...') part to fill. The end result is:

(?:\p{Word_Break=ALetter}(?:'\p{Word_Break=ALetter})?)++|-+|\S

The -+ sequence just matches 1 or more - dashes, just like in the Python re module expressions; it is not anything special, really.

Now, the ++ before that is more interesting. It's a possessive quantifier: once the quantified group has matched, the regex matcher will not backtrack into it to try shorter matches. It's a performance optimization, one that prevents catastrophic backtracking issues.
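To make this concrete, here is a minimal sketch that rebuilds the pattern from the question and pokes at each piece. The sample sentence and the a+a / a++a patterns are made up for illustration, and the code assumes the third-party regex package is installed (pip install regex); without it, only the interpolation step runs.

```python
# Build the pattern as in the question, but with raw strings so the
# backslashes reach the regex engine untouched.
A = r"\p{Word_Break=ALetter}"
PATTERN = r"(?:{A}(?:'{A})?)++|-+|\S".format(A=A)
print(PATTERN)  # the fully interpolated expression

try:
    import regex  # third-party package, not the stdlib re module
except ImportError:
    regex = None  # stdlib re does not understand \p{...}

if regex is not None:
    UNIT = regex.compile(PATTERN)

    # Hypothetical sample text: runs of letters (with inner apostrophes),
    # runs of dashes, or any other single non-whitespace character.
    print(UNIT.findall("don't stop -- now!"))

    # Greedy a+ gives one 'a' back so the trailing 'a' can match.
    print(regex.fullmatch(r"a+a", "aaa") is not None)   # True
    # Possessive a++ keeps everything it took, so the trailing 'a' fails.
    print(regex.fullmatch(r"a++a", "aaa") is not None)  # False
```

The last two lines show the trade-off: a possessive quantifier can make a pattern fail where the greedy version would succeed, which is exactly what saves the matcher from exploring all those alternatives.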