I’m trying to extract all the unique characters from my utf-8 French dictionary file

Question

0

Asked: June 5, 20262026-06-05T18:06:52+00:00 2026-06-05T18:06:52+00:00

I’m trying to extract all the unique characters from my utf-8 French dictionary file

0

I’m trying to extract all the unique characters from my utf-8 French dictionary file using this Ruby code. The dictionary is 3.7 MB. For some reason it takes my decent computer about half an hour to execute. Any ideas?

c = Set.new
f = open "dict"
s = f.read
f.close

for i in 0..s.length-1
    c << s[i]
end

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-05T18:06:55+00:00

Reading in the entire file in one go before performing any computation on it prevents IO from being interleaved with computation. Furthermore, it increases memory pressure (potentially important if you’re running close to the limit of your memory) and drastically reduces cache coherency.

I wrote the following little script which executes in .3 seconds on my /usr/share/dict/words file — less than a megabyte, but still large enough to be slightly interesting:

$ cat /tmp/set.rb 
#!/usr/bin/ruby

require 'set'

c = Set.new
f = open "/usr/share/dict/words"

f.each_char do |char|
    c << char
end

p c
$ time /tmp/set.rb 
#<Set: {"A", "\n", "'", "s", "B", "M", "C", "T", "H", "I", "D", "S", "O", "L", "P", "W", "Z", "a", "c", "h", "e", "n", "l", "i", "y", "r", "o", "b", "d", "t", "u", "j", "g", "m", "p", "v", "x", "f", "k", "z", "w", "q", "ó", "ü", "á", "ö", "ñ", "E", "F", "R", "U", "N", "G", "K", "é", "ä", "Q", "è", "V", "J", "X", "ç", "ô", "í", "Y", "â", "û", "ê", "å", "Å"}>

real    0m0.341s
user    0m0.340s
sys 0m0.000s

Your program was still executing one minute later, and I gave up.

The main difference is that mine uses the built-in iterators to read into a buffer a smaller amount of the file (probably 4k-16k) and hand me a specific character each iteration through. This will re-use the same small amounts of memory over and over again and allow the CPU’s relatively small cache lines to store the entirety of the data.

Edit

With a small test case I was able to isolate the speed difference mostly to the each_char vs string sub-scripting. Jörg points out that string subscripting is an O(N) operation — because UTF-8 strings cannot be simply indexed by multiplication as one might expect, finding the Nth character means starting from the beginning. Thus, your approach is O(N^2) and mine was just O(N), and that goes much much further to explaining the performance difference. I’m finally content that we figured out the core cause.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m trying to extract all the unique characters from my utf-8 French dictionary file

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply