I am processing a bunch of data and I haven’t coded a duplicate checker

Asked: May 26, 2026 (2026-05-26T02:47:08+00:00)

I am processing a bunch of data and I haven’t coded a duplicate checker into the data processor yet, so I expected duplicates to occur. I ran the following SQL query:

SELECT     body, COUNT(body) AS dup_count 
FROM         comments
GROUP BY body
HAVING     (COUNT(body) > 1)

This returns a list of duplicates. Looking into it, I find that these duplicates have multiple hashes. The shortest comment string is "[deleted]", so let’s use that as an example. In my database there are nine instances of the comment "[deleted]", and they are stored with two different hashes: 1169143752200809218 and 1738115474508091027. The value starting with 116 occurs 6 times and the one starting with 173 occurs 3 times. But when I run it in IRB, I get the following:

a = '[deleted]'.hash # => 811866697208321010

Here is the code I’m using to produce the hash:

def comment_and_hash(chunk)     
  comment = chunk.at_xpath('*/span[@class="comment"]').text ##Get Comment##
  hash = comment.hash
  return comment,hash
end

I’ve confirmed that I don’t touch comment anywhere else in my code. Here is my datamapper class.

class Comment

    include DataMapper::Resource

    property :uid       , Serial
    property :author    , String
    property :date      , Date
    property :body      , Text
    property :arank     , Float 
    property :srank     , Float 
    property :parent    , Integer #Should Be UID of another comment or blank if parent
    property :value     , Integer #Hash to prevent duplicates from occurring

end

Am I correct in assuming that .hash on a string will return the same value each time it is called on the same string?

Which value is the correct value assuming my string consists of "[deleted]"?

Is there a way I could have different strings inside ruby, but SQL would see them as the same string? That seems to be the most plausible explanation for why this is occurring, but I’m really shooting in the dark.


1 Answer

Editorial Team · Answer 1 · 2026-05-26T02:47:08+00:00

If you run

ruby -e "puts '[deleted]'.hash"

several times, you will notice that the value is different on each run. The hash value stays constant only for as long as your Ruby process is alive, because String#hash is seeded with a random value. rb_str_hash (the C function that implements it) uses rb_hash_start, which mixes in this random seed, and the seed is initialized every time Ruby starts. That is presumably why your database holds two different values for the same string: the rows were written by separate runs of your processor.
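To see the per-process behaviour from a single script, here is a minimal sketch. The helper name hash_in_new_process is made up for illustration, and it assumes the running interpreter can spawn child processes through RbConfig.ruby:

```ruby
require 'rbconfig'

# Within one process, equal strings always hash to the same value.
same_process = '[deleted]'.hash == '[deleted]'.dup.hash

# Hypothetical helper: start a fresh interpreter and return the hash it reports.
def hash_in_new_process(text)
  IO.popen([RbConfig.ruby, '-e', 'print ARGV[0].hash', text], &:read).to_i
end

first  = hash_in_new_process('[deleted]')
second = hash_in_new_process('[deleted]')

puts "equal inside one process: #{same_process}"
puts "process 1: #{first}"
puts "process 2: #{second}"
```

The two child processes should almost always print different numbers, since each one starts with its own seed.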

You could use a CRC such as Zlib.crc32 for your purposes, or one of the message digests in OpenSSL::Digest, although the latter is overkill: for detecting duplicates you probably won’t need the security properties of a cryptographic digest.
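As a rough sketch of both options (the helper names stable_crc and stable_digest are made up for illustration, and SHA256 is just one possible digest choice):

```ruby
require 'zlib'
require 'openssl'

# Hypothetical helper: 32-bit checksum that does not depend on the Ruby process.
def stable_crc(text)
  Zlib.crc32(text)
end

# Hypothetical helper: hex-encoded SHA-256 digest, also stable across runs.
def stable_digest(text)
  OpenSSL::Digest.new('SHA256').hexdigest(text)
end

comment = '[deleted]'
puts stable_crc(comment)
puts stable_digest(comment)
```

In comment_and_hash you could then store stable_crc(comment) instead of comment.hash. Keep in mind that a CRC32 has only 32 bits, so two different comments can occasionally share a checksum; comparing the bodies when the checksums match would guard against that. Both functions work on the string's bytes, so the same text in a different encoding produces a different value.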
