The Archive Base Latest Questions

Editorial Team
Asked: June 15, 2026 (2026-06-15T11:44:58+00:00)

I was asked this question in an Amazon interview.

You have a file with many lines, but two of the lines are the same. Find those two lines. I gave the obvious answer, which runs in O(N^2) time. I then came up with an answer that used a hash table, but they didn’t like that either, because they said it wouldn’t work if the file was gigabytes in size. Another answer I came up with was, instead of storing the hash results in memory, to create a file named after each hash value and store the lines with the same hash value in that file. Either they couldn’t understand my solution or they didn’t like it.

Any thoughts?

Thanks


1 Answer

  1. Editorial Team
    Added an answer on June 15, 2026 at 11:44 am

    I can think of two essential classes of solutions to this problem:

    1. Probabilistic in-memory solutions. Store a compact summary of the file's lines in main memory, use that summary to identify possible duplicates, then confirm each candidate by going back to disk. These solutions are probably the best, since they use little memory, run efficiently, and minimize disk accesses. Solutions in this category include:

      1. Compute a hash of each line of the file, then store only the hashes. Any two lines whose hashes match are a candidate pair of duplicates, and only those lines need to be re-read from disk and compared.
      2. Insert each line into a Bloom filter, and flag any line that the filter reports as already present. Only the flagged lines need to be verified against the file. This is essentially a variation of (1) that is more space-efficient, at the cost of some false positives.
    2. Deterministic on-disk solutions. Do the computation with the entire data set on disk, using main memory only as temporary scratch space. This gives exact answers without having to hold the whole file in memory, but it would probably be slower unless you also plan some later processing that benefits from the restructured data. Solutions in this category include:

      1. Use an external sorting algorithm (external merge sort, external quicksort, external radix sort, etc.) to sort the file, then scan it linearly for two adjacent identical lines.
      2. Build an on-disk data structure like a B-tree holding all of the strings, then query the B-tree. This takes a lot of preprocessing time, but makes future operations on the file a lot faster.
      3. Put everything in a database and query the database.
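
    As a rough illustration of option 1.1, here is a minimal two-pass sketch in Python. The function name `find_duplicate_lines`, the choice of BLAKE2b, and the 8-byte digest size are my own assumptions for the example, not something the interviewers specified. It still keeps one small digest per line in memory, so for extremely large files you would need a Bloom filter or one of the on-disk options instead.

```python
import hashlib


def find_duplicate_lines(path):
    """Return (first_line_no, second_line_no) of two identical lines, or None."""
    # Pass 1: keep only a short digest per line, never the line itself.
    seen = set()
    candidates = set()  # digests that occurred more than once
    with open(path, "rb") as f:
        for line in f:
            text = line.rstrip(b"\r\n")
            digest = hashlib.blake2b(text, digest_size=8).digest()
            if digest in seen:
                candidates.add(digest)
            else:
                seen.add(digest)
    if not candidates:
        return None
    del seen  # free the large set before the second pass

    # Pass 2: re-read the file and keep the full text only for lines whose
    # digest is a candidate. Comparing the text rules out hash collisions.
    first_seen = {}  # line content -> line number of its first occurrence
    with open(path, "rb") as f:
        for number, line in enumerate(f, start=1):
            text = line.rstrip(b"\r\n")
            digest = hashlib.blake2b(text, digest_size=8).digest()
            if digest not in candidates:
                continue
            if text in first_seen:
                return first_seen[text], number
            first_seen[text] = number
    return None
```

    The second pass is what makes the result exact: a matching digest only nominates a line, and the full-text comparison confirms it.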

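    Option 2.1 can be sketched the same way. This is a simplified external merge sort, assuming the name `find_duplicate_by_sorting` and the `max_lines_in_memory` parameter, both of which are made up for the example. It writes sorted runs to temporary files and merges them with `heapq.merge`, so identical lines come out next to each other.

```python
import heapq
import os
import tempfile


def _write_sorted_run(lines, directory):
    """Sort one in-memory chunk and write it to a temporary run file."""
    lines.sort()
    fd, run_path = tempfile.mkstemp(dir=directory, suffix=".run")
    with os.fdopen(fd, "wb") as run:
        run.writelines(lines)
    return run_path


def find_duplicate_by_sorting(path, max_lines_in_memory=100_000):
    """Return the content of a line that occurs at least twice, or None."""
    with tempfile.TemporaryDirectory() as directory:
        # Phase 1: split the input into sorted runs that each fit in memory.
        # Every stored line ends in exactly one b"\n", so the order used for
        # sorting matches the order used when merging.
        run_paths, chunk = [], []
        with open(path, "rb") as f:
            for line in f:
                chunk.append(line.rstrip(b"\r\n") + b"\n")
                if len(chunk) >= max_lines_in_memory:
                    run_paths.append(_write_sorted_run(chunk, directory))
                    chunk = []
        if chunk:
            run_paths.append(_write_sorted_run(chunk, directory))

        # Phase 2: k-way merge of the runs; equal lines become adjacent.
        # Note: this opens every run at once, which assumes the number of
        # runs stays below the operating system's open-file limit.
        runs = [open(p, "rb") for p in run_paths]
        try:
            previous = None
            for line in heapq.merge(*runs):
                if line == previous:
                    return line.rstrip(b"\n")
                previous = line
        finally:
            for run in runs:
                run.close()
    return None
```

    Unlike the hashing sketch, memory use here is bounded by the chunk size rather than by the number of lines, at the price of writing the whole file to disk a second time.
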
    Hope this helps!
