The Archive Base


Editorial Team
Asked: June 17, 2026


I was reading about algorithmic problems, and one was the following:

Having a file with millions of lines of data, there are 2 lines which
are identical. The lines are so long that they may not fit in memory.
Find the 2 identical lines.

The suggested solution was to read each line in parts and build a hash for each line.
E.g. you build the hash for line 1 by hashing part 1 of line 1 (which fits in memory), then part 2, and so on up to part N of line 1.
Store the hashes in a file or hash table. For any equal hash values, compare the lines themselves. If the lines are the same, the problem is solved.

Although I understand this solution at a high level, I have no idea how it could be implemented. How can we associate a hash with a specific line in the file? Is this a language implementation detail?
E.g. how would we address this in Java?
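
As a rough illustration of the "hash a line in parts" idea, here is a minimal Java sketch. It assumes lines end with `\n`, and it uses FNV-1a as the 64-bit hash purely for illustration, since the problem statement does not name a hash function. The class and method names (`ChunkedLineHasher`, `hashLines`) are hypothetical. The hash is associated with its line simply by position: entry i of the returned list belongs to line i.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ChunkedLineHasher {
    // FNV-1a 64-bit constants; the choice of hash is an assumption of this sketch.
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    /**
     * Returns one hash per line; index i holds the hash of line i (0-based).
     * Only chunkSize bytes of the input are held in memory at a time.
     */
    static List<Long> hashLines(InputStream in, int chunkSize) throws IOException {
        List<Long> hashes = new ArrayList<>();
        byte[] chunk = new byte[chunkSize];
        long hash = FNV_OFFSET_BASIS;
        boolean lineHasBytes = false; // true while an unterminated line is pending
        int n;
        while ((n = in.read(chunk)) != -1) {
            for (int i = 0; i < n; i++) {
                byte b = chunk[i];
                if (b == '\n') {
                    // End of line: emit its hash and start a fresh one.
                    hashes.add(hash);
                    hash = FNV_OFFSET_BASIS;
                    lineHasBytes = false;
                } else {
                    // Fold the next byte of the current line into the running hash.
                    hash ^= (b & 0xff);
                    hash *= FNV_PRIME;
                    lineHasBytes = true;
                }
            }
        }
        if (lineHasBytes) {
            hashes.add(hash); // last line without a trailing '\n'
        }
        return hashes;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "first\nsecond\nfirst\n".getBytes(StandardCharsets.UTF_8);
        List<Long> hashes = hashLines(new ByteArrayInputStream(data), 4);
        // Lines 0 and 2 have identical content, so their hashes match.
        System.out.println(hashes.get(0).equals(hashes.get(2))); // prints true
    }
}
```

Because the running hash is carried across chunk boundaries, the result does not depend on the chunk size, only on the bytes of each line.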


1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 3:55 pm

    The real answer is to buy more memory. The longest string you can have in Java is about 2 GB, and that will fit in memory on machines these days. You can buy 32 GB for less than $200.


    But to solve the problem, I suggest you:

    • find the offset of each line.
    • find the lines that have the same length (using the difference between consecutive offsets).
    • calculate 64-bit or longer hashes of the lines with the same length.
    • for the lines with the same hash, do a byte-by-byte comparison.
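
The four steps above could be sketched in Java roughly as follows. This is a sketch under assumptions, not a definitive implementation: it assumes `\n` line terminators, assumes the per-line offsets (not the lines themselves) fit in memory, and uses FNV-1a as an illustrative 64-bit hash. The names `DuplicateLineFinder` and `findDuplicate` are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DuplicateLineFinder {
    private static final int CHUNK = 64 * 1024; // bytes held in memory per read

    /** Returns the start offsets of two identical lines, or null if none exist. */
    static long[] findDuplicate(Path file) throws IOException {
        // Steps 1 and 2: record {start, length} per line and group lines by length.
        Map<Long, List<long[]>> byLength = new HashMap<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file), CHUNK)) {
            long offset = 0, start = 0;
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    addLine(byLength, start, offset - start);
                    start = offset + 1;
                }
                offset++;
            }
            if (offset > start) addLine(byLength, start, offset - start); // no final '\n'
        }
        byte[] bufA = new byte[CHUNK], bufB = new byte[CHUNK];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            for (List<long[]> group : byLength.values()) {
                if (group.size() < 2) continue; // a unique length cannot have a twin
                // Step 3: hash only the lines that share a length.
                Map<Long, List<long[]>> byHash = new HashMap<>();
                for (long[] line : group) {
                    List<long[]> candidates =
                        byHash.computeIfAbsent(hash(raf, line, bufA), k -> new ArrayList<>());
                    // Step 4: confirm each hash match byte by byte.
                    for (long[] other : candidates) {
                        if (sameBytes(raf, other, line, bufA, bufB)) {
                            return new long[] {other[0], line[0]};
                        }
                    }
                    candidates.add(line);
                }
            }
        }
        return null;
    }

    private static void addLine(Map<Long, List<long[]>> byLength, long start, long length) {
        byLength.computeIfAbsent(length, k -> new ArrayList<>()).add(new long[] {start, length});
    }

    /** FNV-1a 64-bit hash of one line, read in chunks. */
    private static long hash(RandomAccessFile raf, long[] line, byte[] buf) throws IOException {
        long h = 0xcbf29ce484222325L;
        raf.seek(line[0]);
        for (long remaining = line[1]; remaining > 0; ) {
            int n = raf.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) throw new EOFException("file shrank while reading");
            for (int i = 0; i < n; i++) {
                h ^= (buf[i] & 0xff);
                h *= 0x100000001b3L;
            }
            remaining -= n;
        }
        return h;
    }

    /** Compares two lines of equal length chunk by chunk. */
    private static boolean sameBytes(RandomAccessFile raf, long[] a, long[] b,
                                     byte[] bufA, byte[] bufB) throws IOException {
        for (long done = 0; done < a[1]; ) {
            int n = (int) Math.min(bufA.length, a[1] - done);
            raf.seek(a[0] + done);
            raf.readFully(bufA, 0, n);
            raf.seek(b[0] + done);
            raf.readFully(bufB, 0, n);
            for (int i = 0; i < n; i++) {
                if (bufA[i] != bufB[i]) return false;
            }
            done += n;
        }
        return true;
    }
}
```

Calling `DuplicateLineFinder.findDuplicate(path)` returns the byte offsets where the two matching lines start. Grouping by length first means most lines are never hashed at all, and the byte-by-byte check guards against hash collisions.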

    Note: if you don’t have enough memory to cache the entire file, this will take a very long time. If you have a 32 GB machine and a 64 GB file, each pass will take about 20 minutes, and this approach needs multiple passes.


    1) Which API do I use to find the offset?

    You count the number of bytes you have read, and that is the offset.
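
Counting bytes as you read might look like the following minimal sketch. It assumes `\n` line terminators and counts bytes, not characters; the names `LineOffsets` and `lineStartOffsets` are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

final class LineOffsets {
    /** Returns the byte offset at which each line of the stream starts. */
    static List<Long> lineStartOffsets(InputStream in) throws IOException {
        List<Long> starts = new ArrayList<>();
        byte[] buf = new byte[8192];
        long offset = 0;            // number of bytes consumed so far
        boolean atLineStart = true; // the next byte read begins a new line
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                if (atLineStart) {
                    starts.add(offset);
                    atLineStart = false;
                }
                if (buf[i] == '\n') {
                    atLineStart = true;
                }
                offset++;
            }
        }
        return starts;
    }
}
```

With the start offsets in hand, a line can be revisited later by seeking to its offset, for example with `RandomAccessFile.seek`.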

    2) "The real answer is buy more memory": project managers don’t agree with this for real products. Do you have a different experience?

    I point out to them that I could spend a day, which could cost them more than $1000 (even if that is not what I get paid), saving $100 of reusable memory, if they think that is a good use of resources. I let them decide 😉

    My 8-year-old son has 8 GB in a PC he built, as the memory cost me £24. Yet you are right that there are project managers who think 8 GB is too much for a professional who is costing them that much per hour!? I have 16 GB in a PC which I don’t use to run anything serious, because I do my work on a machine with 256 GB. You can buy machines with 2 TB these days, which is overkill for most applications. 😉

© 2021 The Archive Base. All Rights Reserved