In a Ruby project that I have been spending some time on lately, I have been computing the intersection of two large sets of strings.
From what I thought I understood, it would make a lot of sense to compare integers instead of strings (all of these strings are held in a database, and I could easily swap them out for IDs).
When I actually did the benchmarking, I found the complete opposite.
First I generated sets of 850 strings, and sets of ~850 large integers:
require 'set'

r = Random.new
# Two sets of short random strings
w1 = (1..850).collect { (0..3).collect { (rand*26 + 10).to_i.to_s(35) }.join }.to_set
w2 = (1..850).collect { (0..3).collect { (rand*26 + 10).to_i.to_s(35) }.join }.to_set
# Two sets of squared integers: 2000 draws from only 1000 possible values
i1 = (1..2000).collect { (r.rand*1000).to_i**2 }.to_set
i2 = (1..2000).collect { (r.rand*1000).to_i**2 }.to_set
And then I timed the comparisons:
t=Time.now;(0..1000).each {|i| w1 & w2};Time.now-t
=> 0.301727
t=Time.now;(0..1000).each {|i| i1 & i2};Time.now-t
=> 0.70151
Which I thought was crazy! I always thought integer comparison was much faster.
So I was wondering if anybody in the world of stacks knows why the string comparison is so much faster in Ruby. I would really appreciate hearing your thoughts.
The speed of the set intersection operation appears to be affected by the number of intersecting elements.
Your integer creation code is creating a substantially larger number of intersecting elements, probably because it draws 2000 values from a pool of only 1000 possible values, so the two integer sets end up sharing most of their entries.
In one test, for example, 755 of the 857 entries in i1 were duplicated in i2, but only 2 of the 849 entries in w1 were duplicated in w2.
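The size of that overlap can be estimated and then checked directly. The following is a minimal sketch, assuming the same integer generator as in the question (2000 draws from 1000 possible squares); the names `draws`, `values`, and `p_present` are mine, not from the original code.

```ruby
require 'set'

draws  = 2000   # values drawn per set, as in the question
values = 1000   # distinct values the generator can produce

# Probability that a given value shows up at least once in one set.
p_present = 1 - (1 - 1.0 / values)**draws

expected_size    = values * p_present      # expected distinct entries per set
expected_overlap = values * p_present**2   # expected entries common to both sets

r  = Random.new
i1 = (1..draws).map { (r.rand * values).to_i**2 }.to_set
i2 = (1..draws).map { (r.rand * values).to_i**2 }.to_set

puts "expected size ~#{expected_size.round}, actual #{i1.size}"
puts "expected overlap ~#{expected_overlap.round}, actual #{(i1 & i2).size}"
```

The estimates come out at roughly 865 entries per set and roughly 748 shared entries, which is in line with the 857 and 755 observed above.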
When I ran a simple alteration (dumping 755 items into w2 that are known to be in w1), the results on my system showed the string set operation to be much closer to the equivalent integer operation.
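One way to force such an overlap could look like the sketch below. The helpers `random_word` and `with_overlap` are hypothetical names introduced for illustration, and this is not necessarily the snippet that was originally run.

```ruby
require 'set'

# Hypothetical helper: a random 4-letter lowercase string.
def random_word(rng)
  Array.new(4) { (rng.rand(26) + 97).chr }.join
end

# Hypothetical helper: build a set the same size as `source` that shares
# exactly `shared` elements with it.
def with_overlap(source, shared, rng)
  result = source.to_a.sample(shared, random: rng).to_set
  until result.size == source.size
    word = random_word(rng)
    # Only add words that are NOT in source, so the overlap stays fixed.
    result << word unless source.include?(word)
  end
  result
end

rng = Random.new
w1 = Set.new
w1 << random_word(rng) until w1.size == 850
w2 = with_overlap(w1, 755, rng)

puts((w1 & w2).size)  # 755 by construction
```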
My original results were:
My results after making the two sets of sets more alike in terms of intersecting elements, via:
were:
I hope that helps some. The two timings are close enough that I would attribute the remaining difference to noise from other activity on the system. They are, essentially, equal for strings of this length.
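For anyone who wants to repeat the comparison with the overlap held constant, here is a rough sketch using the standard `Benchmark` module. It assumes deterministic sets rather than random ones, and `pair_with_overlap` is a hypothetical helper, so treat the timings it prints as indicative only.

```ruby
require 'set'
require 'benchmark'

# Hypothetical helper: build two sets of `size` elements that share exactly
# `shared` of them. The block must map distinct integers to distinct elements.
def pair_with_overlap(size, shared)
  a = (0...size).map { |n| yield n }.to_set
  b = ((size - shared)...(2 * size - shared)).map { |n| yield n }.to_set
  [a, b]
end

# Same sizes and same overlap for both element types.
words_a, words_b = pair_with_overlap(850, 755) { |n| n.to_s(36).rjust(4, "0") }
ints_a,  ints_b  = pair_with_overlap(850, 755) { |n| n**2 }

string_time  = Benchmark.realtime { 1000.times { words_a & words_b } }
integer_time = Benchmark.realtime { 1000.times { ints_a & ints_b } }

puts "strings:  #{string_time.round(3)}s"
puts "integers: #{integer_time.round(3)}s"
```

Because both pairs have the same size and the same number of common elements, any remaining gap reflects the cost of hashing and comparing the elements themselves rather than the amount of overlap.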