For some reason, I’m getting unexpected results in the range comparisons of unicode characters.

Asked: June 1, 2026 (2026-06-01T08:49:33+00:00)

To summarize, in my minimized test code, ("\u1000".."\u1200") === "\u1100" is false, where I would expect it to be true — while the same test against "\u1001" is true as expected. I find this utterly incomprehensible. The results of the < operator are also interesting — they contradict ===.

The following code is a good minimal illustration:

# encoding: utf-8

require 'pp'

a = "\u1000"
b = "\u1200"

r = (a..b)

x = "\u1001"
y = "\u1100"

pp a, b, r, x, y

puts "a < x = #{a < x}"
puts "b > x = #{b > x}"

puts "a < y = #{a < y}"
puts "b > y = #{b > y}"

puts "r === x = #{r === x}"
puts "r === y = #{r === y}"

I would naively expect that both of the === operations would produce “true” here. However, the actual output of running this program is:

ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.3.0]
"\u1000"
"\u1200"
"\u1000".."\u1200"
"\u1001"
"\u1100"
a < x = true
b > x = true
a < y = true
b > y = true
r === x = true
r === y = false

Could someone enlighten me?

(Note I’m on 1.9.3 on Mac OS X, and I’m explicitly setting the encoding to utf-8.)

1 Answer

Editorial Team · Answer 1 · 2026-06-01T08:49:34+00:00

ACTION:
I’ve submitted this behavior as bug #6258 to ruby-lang.

There’s something odd about the collation order in that range of characters:

irb(main):081:0> r.to_a.last.ord.to_s(16)
=> "1036"
irb(main):082:0> r.to_a.last.succ.ord.to_s(16)
=> "1000"
irb(main):083:0> r.min.ord.to_s(16)
=> "1000"
irb(main):084:0> r.max.ord.to_s(16)
=> "1200"

The min and max for the range are the expected values from your input, but if we turn the range into an array, the last element is “\u1036” and its successor is “\u1000”. Under the covers, Range#=== must be enumerating the String#succ sequence rather than doing a simple bounds check against min and max.

If we look at the source for Range#===, we see that it dispatches to Range#include?. The Range#include? source shows special handling for strings: if the answer can be determined by string length alone, or if all the strings involved are ASCII, we get simple bounds checks. Otherwise it dispatches to super, which means the call is answered by Enumerable#include?. That enumerates using Range#each, which again has special handling for strings and dispatches to String#upto, which in turn enumerates with String#succ.
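To make the difference between enumerating and bounds checking concrete, here is a rough pure-Ruby sketch of an enumeration-based membership test. The helper name enumerated_include? and the max_steps guard are hypothetical, and the loop only approximates what String#upto does (it compares character lengths, for instance), so treat it as an illustration rather than the actual implementation.

```ruby
# encoding: utf-8

# Hypothetical helper: walks the String#succ sequence from first towards last
# and reports whether value shows up along the way. max_steps is a safety
# guard added for this sketch so that a cycling sequence cannot loop forever.
def enumerated_include?(first, last, value, max_steps = 10_000)
  current = first
  max_steps.times do
    return true if current == value    # found by enumeration
    return false if current == last    # reached the upper bound without a match
    current = current.succ
    # Assumption: like String#upto, give up once the successor grows longer
    # than the upper bound.
    return false if current.length > last.length
  end
  false
end

puts enumerated_include?("a", "e", "c")   # true: "c" is reached via succ
puts enumerated_include?("a", "e", "z")   # false: enumeration stops at "e"
puts enumerated_include?("\u1000", "\u1200", "\u1001")
puts enumerated_include?("\u1000", "\u1200", "\u1100")
```

With a test like this, membership depends entirely on which strings the succ sequence visits, not on how the value compares with the endpoints.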

String#succ has a bunch of special handling when the string contains is_alpha or is_digit characters (which should not be true for U+1036); otherwise it increments the final char using enc_succ_char. At this point I lose the trail, but presumably this calculates a successor using the encoding and collation information associated with the string.
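For reference, the special handling of letters and digits is easy to see with plain ASCII strings. The examples below follow the documented behavior of String#succ and say nothing about what happens around U+1036.

```ruby
# Letters and digits roll over like an odometer, carrying to the left.
puts "az".succ     # "ba"
puts "a9".succ     # "b0"
puts "zz".succ     # "aaa" (the string grows when the leftmost letter wraps)
puts "Zz".succ     # "AAa"

# With no letters or digits present, the rightmost character is incremented.
puts "***".succ    # "**+"
```

The "zz" case shows that a successor can wrap back to the start of the alphabet while the string grows by one character.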

BTW, as a workaround, you could use a range of integer ordinals and test against ordinals if you only care about single chars, e.g.:

r = (a.ord..b.ord)
r === x.ord
r === y.ord
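Another option, if a plain bounds check is what you want, is Range#cover?, which compares the value against the endpoints with <=> instead of enumerating. The sketch below reuses the strings from the question and puts both approaches side by side; the variable names are only for illustration.

```ruby
# encoding: utf-8

a = "\u1000"
b = "\u1200"
x = "\u1001"
y = "\u1100"

# Bounds check on the strings themselves (no String#succ involved).
chars = (a..b)
puts chars.cover?(x)   # true
puts chars.cover?(y)   # true

# Bounds check on integer code points, as in the workaround above.
ords = (a.ord..b.ord)
puts ords.include?(x.ord)   # true
puts ords.include?(y.ord)   # true

# Caveat: cover? accepts anything that sorts between the endpoints,
# including multi-character strings.
puts ("a".."z").cover?("cc")   # true
```

The ordinal version is stricter about single characters, while cover? is convenient when the range already holds strings.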
