I was reading about algorithmic problem and one was the following: Having a file

Question

Asked: June 17, 2026 (2026-06-17T15:55:01+00:00)

I was reading about algorithmic problems, and one was the following:

Having a file with millions of lines of data, there are 2 lines which
are identical. The lines are so long that they may not fit in memory. Find
the 2 identical lines.

The suggested solution was to read each line in parts and build a hash for the line incrementally.
E.g. you build the hash for line 1 by hashing part 1 of line 1 (which fits in memory), then part 2 of line 1, and so on up to part N of line 1.
Store the hashes in a file or a hash table. For any lines with the same hash value, compare the lines themselves. If the lines are the same, we have solved it.

Although I understand this solution at a high level, I have no idea how it could be implemented. How can we associate a hash with a specific line in the file? Is this a language implementation detail?
E.g. in Java, how would we address this?

1 Answer

Editorial Team · Answer 1 · 2026-06-17T15:55:02+00:00

The real answer is to buy more memory. The longest string you can have in Java is 2 GB, and that will fit in memory on machines these days. You can buy 32 GB for less than $200.

But to solve the problem, I suggest you:

1. Find the offset of each line.
2. Find the lines which are the same length (using the difference between offsets).
3. Calculate 64-bit or longer hashes of the lines with the same length.
4. For the lines with the same hash, do a byte-by-byte comparison.
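
The last three steps above could be sketched in Java roughly as follows. This is only an illustration under some assumptions of mine, not a definitive implementation: the class name `DuplicateLineFinder`, the 64 KB chunk size, and the choice of 64-bit FNV-1a as the hash are all my own picks (the answer only asks for a 64-bit or longer hash). It also assumes the line offsets were already collected into a `long[]` holding the start of each line plus one final entry marking the end of the last line.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DuplicateLineFinder {
    // Assumed chunk size: the most bytes of any one line held in memory at once.
    private static final int CHUNK = 1 << 16;

    /**
     * offsets[i] is the byte offset where line i starts; the last entry is the
     * end of the final line, so n lines need n + 1 entries.
     * Returns the indices of two identical lines, or null if there are none.
     */
    static int[] find(RandomAccessFile file, long[] offsets) throws IOException {
        // Group line indices by length (difference between neighbouring offsets).
        Map<Long, List<Integer>> byLength = new HashMap<>();
        for (int i = 0; i + 1 < offsets.length; i++) {
            long length = offsets[i + 1] - offsets[i];
            byLength.computeIfAbsent(length, k -> new ArrayList<>()).add(i);
        }
        for (List<Integer> sameLength : byLength.values()) {
            if (sameLength.size() < 2) {
                continue; // a line with a unique length cannot have a twin
            }
            // Hash only the lines that share a length.
            Map<Long, List<Integer>> byHash = new HashMap<>();
            for (int line : sameLength) {
                long length = offsets[line + 1] - offsets[line];
                long h = hash(file, offsets[line], length);
                List<Integer> candidates = byHash.computeIfAbsent(h, k -> new ArrayList<>());
                // Equal hashes may still be a collision, so compare the bytes.
                for (int other : candidates) {
                    if (sameBytes(file, offsets[other], offsets[line], length)) {
                        return new int[] {other, line};
                    }
                }
                candidates.add(line);
            }
        }
        return null;
    }

    /** 64-bit FNV-1a over length bytes starting at start, read one chunk at a time. */
    static long hash(RandomAccessFile file, long start, long length) throws IOException {
        long h = 0xcbf29ce484222325L;
        byte[] buf = new byte[CHUNK];
        file.seek(start);
        for (long done = 0; done < length; ) {
            int n = (int) Math.min(CHUNK, length - done);
            file.readFully(buf, 0, n);
            for (int i = 0; i < n; i++) {
                h ^= (buf[i] & 0xff);
                h *= 0x100000001b3L;
            }
            done += n;
        }
        return h;
    }

    /** Compares two regions of the same length chunk by chunk. */
    static boolean sameBytes(RandomAccessFile file, long a, long b, long length)
            throws IOException {
        byte[] bufA = new byte[CHUNK];
        byte[] bufB = new byte[CHUNK];
        for (long done = 0; done < length; ) {
            int n = (int) Math.min(CHUNK, length - done);
            file.seek(a + done);
            file.readFully(bufA, 0, n);
            file.seek(b + done);
            file.readFully(bufB, 0, n);
            for (int i = 0; i < n; i++) {
                if (bufA[i] != bufB[i]) {
                    return false;
                }
            }
            done += n;
        }
        return true;
    }
}
```

This also answers how a hash gets associated with a line: the hash is stored against the line's index, and the index leads back to the offset in the file. Note that in this sketch a line's length includes its terminator, so a final line without a trailing newline would need special handling before it could match an earlier line.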

Note: if you don’t have enough memory to cache the entire file, this will take a very long time. If you have a 32 GB machine and a 64 GB file, each pass will take about 20 minutes, and this approach needs multiple passes.

1) Which API do I use to find the offset?

You count the number of bytes you have read, and that is the offset.
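
As a rough sketch of that counting, assuming lines end with `'\n'` (the class name `LineOffsets` and the method `scan` are made up for illustration): the stream is read byte by byte through a buffer, and only the starting position of each line is kept, never the line itself.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class LineOffsets {
    /**
     * Returns the byte offset at which each line starts, assuming '\n'
     * terminates a line. A trailing newline does not start a new line.
     */
    static List<Long> scan(Path file) throws IOException {
        List<Long> starts = new ArrayList<>();
        long position = 0;           // number of bytes read so far
        boolean atLineStart = true;  // true until the first byte of a line is seen
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    starts.add(position);
                    atLineStart = false;
                }
                position++;
                if (b == '\n') {
                    atLineStart = true;
                }
            }
        }
        return starts;
    }
}
```

The length of a line is then the difference between its offset and the next one, with the file size standing in for the end of the last line.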

2) "The real answer is to buy more memory": project managers don’t agree with this for real products. Do you have a different experience?

I point out to them that I could spend a day, which could cost them > $1000 (even if that is not what I get paid), saving $100 of reusable memory, if they think that is a good use of resources. I let them decide 😉

My 8-year-old son has 8 GB in a PC he built, as the memory cost me £24. Yet you are right that there are project managers who think 8 GB is too much for a professional who is costing them that much per hour!? I have 16 GB in a PC which I don’t use to run anything serious, because I do my work on a machine with 256 GB. You can buy machines with 2 TB these days, which is overkill for most applications. 😉
