For some reason, I’m getting unexpected results in range comparisons of Unicode characters.
To summarize, in my minimized test code, ("\u1000".."\u1200") === "\u1100" is false, where I would expect it to be true, while the same test against "\u1001" is true as expected. I find this utterly incomprehensible. The results of the < and > operators are also interesting: they contradict ===.
The following code is a good minimal illustration:
# encoding: utf-8
require 'pp'
a = "\u1000"
b = "\u1200"
r = (a..b)
x = "\u1001"
y = "\u1100"
pp a, b, r, x, y
puts "a < x = #{a < x}"
puts "b > x = #{b > x}"
puts "a < y = #{a < y}"
puts "b > y = #{b > y}"
puts "r === x = #{r === x}"
puts "r === y = #{r === y}"
I would naively expect that both of the === operations would produce “true” here. However, the actual output of running this program is:
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.3.0]
"\u1000"
"\u1200"
"\u1000".."\u1200"
"\u1001"
"\u1100"
a < x = true
b > x = true
a < y = true
b > y = true
r === x = true
r === y = false
Could someone enlighten me?
(Note I’m on 1.9.3 on Mac OS X, and I’m explicitly setting the encoding to utf-8.)
ACTION:
I’ve submitted this behavior as bug #6258 to ruby-lang.
There’s something odd about the collation order in that range of characters.
The min and max for the range are the expected values from your input, but if we turn the range into an array, the last element is “\u1036” and its successor is “\u1000”. Under the covers, Range#=== must be enumerating the String#succ sequence rather than doing a simple bounds check against min and max.
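For contrast, this is roughly what a plain bounds check looks like. It is only a sketch: within_bounds? is a hypothetical helper name, not part of Ruby, while Range#cover? is the built-in that compares against the endpoints without enumerating.

```ruby
# encoding: utf-8

# Hypothetical helper: a plain bounds check built from the comparison
# operators, with no enumeration of the range.
def within_bounds?(range, value)
  range.begin <= value && value <= range.end
end

r = ("\u1000".."\u1200")
y = "\u1100"

puts within_bounds?(r, y)  # prints true
puts r.cover?(y)           # prints true (built-in endpoint comparison)
```

Both agree with the < and > results in the question, which is what you would expect if === were only comparing against the endpoints.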
If we look at the source for Range#===, we see it dispatches to Range#include?. The Range#include? source shows special handling for strings: if the answer can be determined by string length alone, or all the involved strings are ASCII, we get simple bounds checks. Otherwise we dispatch to super, which means the question gets answered by Enumerable#include?. That enumerates using Range#each, which again has special handling for strings and dispatches to String#upto, which in turn enumerates with String#succ.
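To make that chain concrete, here is a rough sketch that answers membership the slow way, by walking String#upto and comparing each element. The helper name reached_by_succ? is made up for illustration, and this only approximates what the interpreter does internally.

```ruby
# encoding: utf-8

# Hypothetical helper: answers "is target in first..last?" by walking
# String#upto (which steps with String#succ) rather than comparing bounds.
def reached_by_succ?(first, last, target)
  first.upto(last) { |s| return true if s == target }
  false
end

puts reached_by_succ?("a", "e", "c")  # prints true: "c" is visited
puts reached_by_succ?("a", "e", "z")  # prints false: "z" is never visited

# The case from the question. The result depends on how String#succ steps
# through this block of characters in the Ruby version being run, so no
# expected output is given here.
puts reached_by_succ?("\u1000", "\u1200", "\u1100")
```

If the walk stops early or skips a value, the helper reports false even though the value lies between the endpoints, which is the symptom in the question.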
String#succ has a bunch of special handling when the string contains characters that are is_alpha or is_digit (which should not be true for U+1036); otherwise it increments the final char using enc_succ_char. At this point I lose the trail, but presumably this calculates a successor using the encoding and collation information associated with the string.
BTW, as a workaround, you could use a range of integer ordinals and test against ordinals if you only care about single chars, e.g.:
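A minimal sketch of that workaround, assuming single-character strings (String#ord only looks at the first character). The helper name codepoint_in_range? is hypothetical.

```ruby
# encoding: utf-8

# Hypothetical helper: compare integer code points instead of strings, so
# the test is a numeric bounds check and String#succ is never involved.
def codepoint_in_range?(char, first, last)
  (first.ord..last.ord) === char.ord
end

puts codepoint_in_range?("\u1001", "\u1000", "\u1200")  # prints true
puts codepoint_in_range?("\u1100", "\u1000", "\u1200")  # prints true
puts codepoint_in_range?("\u1300", "\u1000", "\u1200")  # prints false
```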