The Archive Base


Editorial Team
Asked: May 15, 2026 (2026-05-15T17:09:38+00:00)


This is a somewhat open-ended question, as I am mostly looking for opinions. I am grabbing data from Craigslist apartment ads in my area since I am looking to move. My goal is to compare ads and detect duplicates so that I don’t spend all day looking at the same 3 ads. The problem is that posters change things around a little to get past CL’s filters.

I already have some regexes that look for addresses and phone numbers to compare, but that isn’t the most reliable approach. Is anyone familiar with an easy-ish method to compare whole documents and show something simple like “80% similar”? I can’t think of anything offhand, so I suspect I’ll have to start from scratch on my own solution, but figured it would be worth asking the collective genius of Stack Overflow 🙂

Preferred languages/methods would be Python/PHP/Perl, but if it’s a great solution I’m pretty open.

Update: one thing worth noting is that since I will be storing the scraped data from the RSS feed for apartments in my area (Los Angeles) in a local DB, the preferred method would include a way to compare each new post against everything I already have stored. This could be a bit of a showstopper, since that could become a very long process as the post count grows.


1 Answer

  1. Editorial Team
    Added an answer on May 15, 2026 at 5:09 pm (2026-05-15T17:09:39+00:00)

    You could calculate the Levenshtein distance between the two strings, after some sane normalization such as collapsing duplicate whitespace. After you run through enough known “duplicates” you should get an idea of what your threshold is. Then you can run Levenshtein on all new incoming data, and if the distance is less than or equal to your threshold, you can consider the ad a duplicate.
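
A rough sketch of that approach, assuming Python (one of the asker's preferred languages) and only the standard library. The function names `normalize`, `levenshtein`, `similarity`, and `is_duplicate` are made up for illustration, and the 0.8 threshold is a placeholder to be tuned on known duplicates. The answer thresholds on the raw distance; converting it to a ratio here is an added assumption, so that the result reads like the "80% similar" figure asked for.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means identical after normalization."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty documents are treated as identical
    return 1.0 - levenshtein(a, b) / longest


def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical cutoff: tune the threshold on ads you know are duplicates."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    ad_1 = "Sunny 2BR apartment   near downtown, call today"
    ad_2 = "Sunny 2BR apartment near downtown - call today!"
    print(f"{similarity(ad_1, ad_2):.0%} similar")
```

Each call compares one pair of ads, so checking a new ad against everything already stored means one comparison per stored post, which is the scaling concern raised in the question's update.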

