The Archive Base


Editorial Team
Asked: May 23, 2026 (2026-05-23T17:01:42+00:00)

I am looking to optimize a fairly simple algorithm that is currently O(n^2)


I am looking to optimize a fairly simple algorithm that is currently
O(n^2). I have a file of records, where each one needs to
be compared to every other record in the same file. If the two are the
'same' (the comparator function is pretty complicated), the matched
records are output. Note that several records may match
each other, and there is no notion of ordering: a comparison only yields True or False.

Pseudo code:


For (outRec in sourceFile) {
  Get new filePointer for targetFile //starting from the top of the file for inner loop
  For (inRec in targetFile) {
    if (compare(outRec, inRec) == TRUE ) {
      write outRec
      write inRec
    }
    increment some counters
  }
  increment some other counters
}

The data is not sorted in any way, and there is no preprocessing
possible to order the data.

Any ideas on how this could become something less than
O(n^2)? I am thinking of applying the MapReduce paradigm
to the code, breaking up the outer AND inner loops, possibly using a
chained Map function. I am pretty sure I have the code figured out on
Hadoop, but wanted to check alternatives before I spend time coding
it.

Suggestions appreciated!

Added: Record types. Basically, I need to match names/strings. The
types of matching are shown in the example below.


1,Joe Smith,Daniel Foster
2,Nate Johnson,Drew Logan
3,Nate Johnson, Jack Crank
4,Joey Smyth,Daniel Jack Foster
5,Joe Morgan Smith,Daniel Foster

Expected output: records 1, 4, and 5 form a match set.

Added: these files will be quite large. The largest file is
expected to be around 200 million records.
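
For a sense of scale, the number of pairwise comparisons at that size can be worked out directly. This is only a rough sketch: it assumes each unordered pair is compared once, and the rate of one million comparisons per second is a purely hypothetical figure, not a measurement of the actual comparator.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def pair_count(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def years_needed(n: int, comparisons_per_second: int) -> float:
    """Wall-clock years to compare every pair at a fixed, assumed rate."""
    return pair_count(n) / comparisons_per_second / SECONDS_PER_YEAR


n = 200_000_000  # largest expected file, from the question
print(pair_count(n))                      # 19999999900000000
print(round(years_needed(n, 1_000_000)))  # 634 (hypothetical rate)
```

At this size the quadratic term dominates no matter how the loops are distributed, so the work per record has to shrink, not just be spread across more machines.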


1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 5:01 pm

    Assuming the files aren't ridiculously large, I'd go through the file in its entirety, compute a hash for each row, and keep track of each hash together with its line number (or file pointer position). Then sort the list of hashes and identify those that appear more than once.
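
The hash-and-sort idea above could be sketched in Python roughly as follows. The names `row_key` and `find_duplicate_groups` are hypothetical, and the normalization (dropping the leading record id, lowercasing, collapsing whitespace) is an assumption made for illustration; it is not the comparator from the question. Hashing only groups rows whose normalized key is identical, so fuzzy matches such as "Joe Smith" versus "Joey Smyth" would still need the real comparator.

```python
import hashlib
from itertools import groupby


def row_key(row: str) -> str:
    """Hypothetical normalization: drop the leading id, lowercase, collapse spaces."""
    _, _, rest = row.partition(",")
    fields = [" ".join(field.split()).lower() for field in rest.split(",")]
    return ",".join(fields)


def find_duplicate_groups(lines):
    """Return lists of 1-based line numbers whose normalized rows hash identically."""
    hashed = []
    for line_no, row in enumerate(lines, start=1):
        digest = hashlib.sha1(row_key(row).encode("utf-8")).hexdigest()
        hashed.append((digest, line_no))

    # Sorting brings equal hashes next to each other: O(n log n) instead of O(n^2).
    hashed.sort()

    groups = []
    for _, members in groupby(hashed, key=lambda pair: pair[0]):
        line_numbers = [line_no for _, line_no in members]
        if len(line_numbers) > 1:
            groups.append(line_numbers)
    return groups


# Made-up sample rows (not the data from the question).
sample = [
    "1,Joe Smith,Daniel Foster",
    "2,Nate Johnson,Drew Logan",
    "3,joe  smith,Daniel Foster",
    "4,Nate Johnson, Jack Crank",
]
print(find_duplicate_groups(sample))  # [[1, 3]]
```

For a file too large to hold in memory, the same (hash, position) pairs could be written to disk and sorted externally; the grouping step only needs a single pass over the sorted output.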
