According to the Oniguruma documentation, the \d character type matches:
decimal digit char
Unicode: General_Category — Decimal_Number
However, scanning for \d in a string containing all the Decimal_Number characters matches only the Latin digits 0-9:
#encoding: utf-8
require 'open-uri'
html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')
puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…
p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
Am I misreading the documentation? Why doesn’t \d match other Unicode numerals, and/or is there a way to make it do so?
Noted by Brian Candler on ruby-talk:
\w only matches ASCII letters and digits, while [[:alpha:]] matches the full set of Unicode letters. \d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode numbers.
The behavior is thus ‘consistent’, and we have a simple workaround for Unicode numbers. Reading up on \w in the same Oniguruma doc we see the text:
In light of the real behavior of Ruby and the “Not Unicode” text above, it would appear that the documentation is describing two modes (a Unicode mode and a Not Unicode mode) and that Ruby is operating in the Not Unicode mode.
This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as “Unicode” is not to be expected.
It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)
Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:
Better Reference: Here is official documentation on Ruby 1.9’s regexp syntax:
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc
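To make the [[:digit:]] workaround concrete, here is a minimal sketch that compares \d against the POSIX bracket and against the Unicode property \p{Nd}. The helper name digit_matches and the sample string are made up for illustration; they are not part of Ruby or Oniguruma. It assumes Ruby 1.9 or later and a UTF-8 source file.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of Ruby or Oniguruma): returns what each
# pattern finds in +text+ so the three can be compared side by side.
def digit_matches(text)
  {
    :backslash_d => text.scan(/\d/),          # ASCII 0-9 only in Ruby 1.9
    :posix       => text.scan(/[[:digit:]]/), # POSIX bracket, Unicode-aware
    :property    => text.scan(/\p{Nd}/)       # General_Category Decimal_Number
  }
end

# Sample text: ASCII, Arabic-Indic, and Devanagari digits.
sample = "0123٠١٢٣०१२३"
digit_matches(sample).each do |name, found|
  puts "#{name}: #{found.join}"
end
```

With this sample, \d should pick up only the four ASCII digits, while [[:digit:]] and \p{Nd} should each pick up all twelve characters. Spelling out \p{Nd} is arguably the clearer choice when the intent is specifically "any Unicode decimal digit".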