Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6666029
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 26, 20262026-05-26T02:47:08+00:00 2026-05-26T02:47:08+00:00

I am processing a bunch of data and I haven’t coded a duplicate checker

  • 0

I am processing a bunch of data and I haven’t coded a duplicate checker into the data processor yet, so I expected duplicates to occur. I ran the following SQL query:

SELECT     body, COUNT(body) AS dup_count 
FROM         comments
GROUP BY body
HAVING     (COUNT(body) > 1) 

And get back a list of duplicates. Looking into this I find that these duplicates have multiple hashes. The shortest string of a comment is "[deleted]". So let’s use that as an example. In my database there are nine instances of a comment being "[deleted]" and in my database this produces a hash of both 1169143752200809218 and 1738115474508091027. The 116 is found 6 times and 173 is found 3 times. But, when I run it in IRB, I get the following:

a = '[deleted]'.hash # => 811866697208321010

Here is the code I’m using to produce the hash:

def comment_and_hash(chunk)     
  comment = chunk.at_xpath('*/span[@class="comment"]').text ##Get Comment##
  hash = comment.hash
  return comment,hash
end

I’ve confirmed that I don’t touch comment anywhere else in my code. Here is my datamapper class.

class Comment

    include DataMapper::Resource

    property :uid       , Serial
    property :author    , String
    property :date      , Date
    property :body      , Text
    property :arank     , Float 
    property :srank     , Float 
    property :parent    , Integer #Should Be UID of another comment or blank if parent
    property :value     , Integer #Hash to prevent duplicates from occurring

end

Am I correct in assuming that .hash on a string will return the same value each time it is called on the same string?

Which value is the correct value assuming my string consists of "[deleted]"?

Is there a way I could have different strings inside ruby, but SQL would see them as the same string? That seems to be the most plausible explanation for why this is occurring, but I’m really shooting in the dark.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-26T02:47:08+00:00Added an answer on May 26, 2026 at 2:47 am

    If you run

    ruby -e "puts '[deleted]'.hash"

    several times, you will notice that the value is different. In fact, the hash value stays only constant as long as your Ruby process is alive. The reason for this is that String#hash is seeded with a random value. rb_str_hash (the C implementing function) uses rb_hash_start which uses this random seed which gets initialized every time Ruby is spawned.

    You could use a CRC such as Zlib#crc32 for your purposes or you may want to use one of the message digests of OpenSSL::Digest, although the latter is overkill since for detection of duplicates you probably won’t need the security properties.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I'm working a program that will have a bunch of threads processing data. Each
I have a bunch of flows and data processing applications that I occasionally need
I have a simple sketch (in Processing ), basically a bunch of dots wander
In data processing, I frequently need to create a lookup data structure to map
I'm using Quartz.NET to schedule a job which loads a bunch of data from
I want to do identical processing to a bunch of arguments of a function.
We have a Mono application under Linux that does image processing on a bunch
I have an NSMutableArray that is loaded into a table view. The data in
I have a bunch of data that i want to insert and i have
I have an application that takes a bunch of data and processes it in

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.