I’m trying to build a regexp in Ruby to match alphabetic characters in UTF-8 such as ñ, í, ó, ú, ü, etc. I know /\p{Alpha}/i works, and /\p{L}/i works too, but what’s the difference?
They seem to be equivalent. (Edit: sometimes, see the end of this answer)
It seems that Ruby has supported \p{Alpha} since version 1.9. In POSIX, \p{Alpha} is equal to \p{L&} (for regular expressions with Unicode support; see here). This matches all characters that have an upper- and lower-case variant (see here). Unicase letters would not be matched (while they would be matched by \p{L}).

This does not seem to be true for Ruby (I picked a random Arabic character, since Arabic has a unicase alphabet):

\p{L} (any letter) matches.
\p{Lu}, \p{Ll}, \p{Lt} don’t match. As expected.
\p{L&} doesn’t match. As expected.
\p{Alpha} matches.

Which seems to be a very good indication that \p{Alpha} is just an alias for \p{L} in Ruby. On Rubular you can also see that \p{Alpha} was not available in Ruby 1.8.7.

Note that the i modifier is irrelevant in any case, because both \p{Alpha} and \p{L} match both upper- and lower-case characters anyway.

EDIT:
Aha, there is a difference! I just found this PDF about Ruby’s new regex engine (in use as of Ruby 1.9 as stated above).

\p{Alpha} is available regardless of encoding (and will probably just match [A-Za-z] if there is no Unicode support), while \p{L} is specifically a Unicode property. That means \p{Alpha} behaves exactly as in POSIX regexes, with the difference that here it corresponds to \p{L}, but in POSIX it corresponds to \p{L&}.
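If you want to reproduce these checks outside Rubular, here is a rough sketch. It assumes Ruby 2.0 or later (where source files default to UTF-8). The sample characters, the constant names and the compiles? helper are my own choices for illustration, not something taken from the PDF. I left \p{L&} out because I am not sure Ruby accepts that spelling. The last part builds the patterns from US-ASCII strings to imitate "no Unicode support"; I would expect \p{L} to be rejected there, but treat that as an assumption to verify on your Ruby version.

```ruby
# Sample characters (illustrative picks):
AIN   = "\u0639"  # ARABIC LETTER AIN, a unicase letter (category Lo)
ENYE  = "\u00D1"  # LATIN CAPITAL LETTER N WITH TILDE

# Which properties match the unicase Arabic letter?
RESULTS = {
  "L"     => !!(AIN =~ /\p{L}/),
  "Lu"    => !!(AIN =~ /\p{Lu}/),
  "Ll"    => !!(AIN =~ /\p{Ll}/),
  "Lt"    => !!(AIN =~ /\p{Lt}/),
  "Alpha" => !!(AIN =~ /\p{Alpha}/)
}
RESULTS.each { |name, matched| puts "\\p{#{name}}: #{matched}" }

# The i modifier is not needed: an upper-case letter matches without it.
UPPER_WITHOUT_I = !!(ENYE =~ /\p{L}/) && !!(ENYE =~ /\p{Alpha}/)
puts "upper-case matches without /i: #{UPPER_WITHOUT_I}"

# Hypothetical helper: does this pattern source compile at all?
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

# Patterns whose source string is US-ASCII, so the regexp is not a Unicode one.
ASCII_ALPHA_OK = compiles?('\p{Alpha}'.encode("US-ASCII"))
ASCII_L_OK     = compiles?('\p{L}'.encode("US-ASCII"))
puts "US-ASCII \\p{Alpha} compiles: #{ASCII_ALPHA_OK}"
puts "US-ASCII \\p{L} compiles: #{ASCII_L_OK}"

# In the US-ASCII case \p{Alpha} still works on plain ASCII input.
ASCII_ALPHA = Regexp.new('\p{Alpha}'.encode("US-ASCII"))
puts "matches 'a': #{!!("a" =~ ASCII_ALPHA)}, matches '1': #{!!("1" =~ ASCII_ALPHA)}"
```

The first block mirrors the Rubular test above; the last block is only meant to illustrate the encoding point from the edit, so do not read it as a statement about how the engine is implemented.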