The Archive Base

Editorial Team
Asked: May 28, 2026 (2026-05-28T02:34:04+00:00)


I’m looking for an algorithm that can generate a short hash code/digest (e.g. 16 characters; the exact length is not important) from a longer string.

The main requirement is that strings which are almost identical should result in the same digest.

For example, two almost identical mails:

Hi Martin. Here are some … spam for you. Regards XYZ.
=> AAAA AAAA AAAA AAAA

Hi Bo. Here are some … spam for you. Regards EFG.
=> AAAA AAAA AAAA AAAA

return the same digest (or almost the same), whereas a different mail:

Hello Finn. This is a test mail.
=> CCCC CCCC CCCC CCCC

will return a different digest.

This algorithm would be part of a spam filter. The filter will remember the digests of mails that it is certain are spam. If the same digest shows up for a mail it is in doubt about, the match will cause the filter to increase that mail's spam score.

I know about Levenshtein distance, but it requires me to know both strings up front, and in this situation I do not have that information. I could have it, but that would require the filter to store every spam e-mail and check each new mail against all of them, which would be a very slow process.

Maybe some loose compression algorithm coupled with a calculation of the Levenshtein distance between the two results could work.

Any pointers appreciated.

1 Answer

  1. Editorial Team
    Added an answer on May 28, 2026 at 2:34 am

    It looks like you want locality-sensitive hashing. Consider using MinHash together with shingling. There’s a great explanation of both in Rajaraman & Ullman’s book, Mining of Massive Datasets. You’ll find numerous short Python implementations by searching blogs for the keywords above.

    There are other approaches (which I don’t know much about) that may be of interest to you because they are tailored specifically to spam messages, in particular the nilsimsa hash:
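To make the shingling and MinHash idea concrete, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not a reference implementation: the function names, the 5-character shingle size, the 128 hash functions, and the use of salted BLAKE2b as the hash family are all choices made for this example.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of k-character shingles of the normalised text."""
    text = " ".join(text.lower().split())  # lower-case and collapse whitespace
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def seeded_hash(seed, shingle):
    """One member of a family of 64-bit hash functions, selected by `seed`."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),  # assumption: the salt picks the function
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_hashes=128, k=5):
    """For each hash function, keep the smallest hash value over all shingles."""
    items = shingles(text, k)
    return [min(seeded_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions; this estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two mails that share most of their shingles agree in most signature positions, while unrelated mails agree in almost none, so the signature behaves like an "almost the same" digest rather than an exactly equal one.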

    • explained in that paper
    • which has a Python port on PyPI
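Coming back to the spam filter in the question: the sketch below shows one way a filter could remember spam without storing the mails or comparing against each one. It is not nilsimsa; it uses the banding technique for MinHash signatures described in Mining of Massive Datasets. The class name `SpamDigestIndex`, the helper names, and the parameters (128 hashes split into 32 bands of 4 rows) are hypothetical choices for this example.

```python
import hashlib


def shingles(text, k=5):
    """Set of k-character shingles of the lower-cased, whitespace-collapsed text."""
    text = " ".join(text.lower().split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(text, num_hashes=128, k=5):
    """MinHash signature; salted BLAKE2b stands in for a family of hash functions."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
            for s in items
        ))
    return signature


def band_keys(signature, bands=32):
    """Split the signature into bands and hash each band to a short bucket key."""
    rows = len(signature) // bands
    keys = []
    for band in range(bands):
        chunk = signature[band * rows:(band + 1) * rows]
        raw = ",".join(map(str, chunk)).encode("ascii")
        keys.append(f"{band}:{hashlib.blake2b(raw, digest_size=8).hexdigest()}")
    return keys


class SpamDigestIndex:
    """Remembers only band keys of known spam, never the mails themselves."""

    def __init__(self):
        self._seen = set()

    def remember(self, text):
        self._seen.update(band_keys(minhash_signature(text)))

    def looks_like_known_spam(self, text):
        # A single shared band key marks the mail as a near-duplicate candidate.
        return any(key in self._seen for key in band_keys(minhash_signature(text)))
```

Each lookup costs a fixed number of set membership checks regardless of how much spam has been remembered. More rows per band makes matching stricter and more bands makes it looser, so the split is the knob to tune against false positives.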
