I have a general question on your opinion about my technique. There are 2

Asked: May 24, 2026 (2026-05-24T18:52:30+00:00)

I have a general question on your opinion about my “technique”.

There are 2 text files (file_1 and file_2) that need to be compared to each other. Both are huge (3-4 gigabytes, from 30,000,000 to 45,000,000 lines each).
My idea is to read several lines (as many as possible) of file_1 into memory, then compare those to all lines of file_2. If there’s a match, the matching lines from both files are written to a new file. Then I go on with the next 1000 lines of file_1 and compare those to all lines of file_2 as well, until I have gone through file_1 completely.

But this sounds actually really, really time consuming and complicated to me.
Can you think of any other method to compare those two files?

How long do you think the comparison could take?
For my program, time does not matter that much. I have no experience in working with such huge files, therefore I have no idea how long this might take. It shouldn’t take more than a day though. 😉 But I am afraid my technique could take forever…

Another question that just came to my mind: how many lines would you read into memory? As many as possible? Is there a way to determine the number of possible lines before actually trying it?
I want to read as many as possible (because I think that’s faster) but I’ve run out of memory quite often.

Thanks in advance.

EDIT
I think I have to explain my problem a bit more.

The purpose is not to see if the two files in general are identical (they are not).
There are some lines in each file that share the same “characteristic”.
Here’s an example:
file_1 looks somewhat like this:

mat1 1000 2000 TEXT      //this means the range is from 1000 - 2000
mat1 2040 2050 TEXT
mat3 10000 10010 TEXT
mat2 20 500 TEXT

file_2 looks like this:

mat3 10009 TEXT
mat3 200 TEXT
mat1 999 TEXT

TEXT refers to characters and digits that are of no interest to me. mat can go from mat1 to mat50 and the lines are in no particular order; there can also be 1000 lines with mat2 (but with different numbers in the next column). I need to find the fitting lines such that matX is the same in both compared lines and the number mentioned in file_2 fits into the range mentioned in file_1.
So in my example I would find one match: line 3 of file_1 and line 1 of file_2 (because both are mat3 and 10009 is between 10000 and 10010).
I hope this makes it clear to you!
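In Java, the matching rule described above might be written as a small predicate. This is only a sketch under assumptions: columns are separated by whitespace, the range bounds are treated as inclusive (the example does not say), and the names `LineMatch` and `matches` are made up for illustration.

```java
class LineMatch {

    // Returns true if the two lines share the same matX key and the
    // number from the file_2 line lies inside the file_1 range.
    // Assumption: both bounds are inclusive.
    static boolean matches(String rangeLine, String pointLine) {
        // file_1 layout: key, lower bound, upper bound, rest of the line
        String[] r = rangeLine.trim().split("\\s+", 4);
        // file_2 layout: key, number, rest of the line
        String[] p = pointLine.trim().split("\\s+", 3);

        if (!r[0].equals(p[0])) {
            return false; // different matX, never a match
        }
        long low = Long.parseLong(r[1]);
        long high = Long.parseLong(r[2]);
        long value = Long.parseLong(p[1]);
        return low <= value && value <= high;
    }
}
```

With the sample data, `matches("mat3 10000 10010 TEXT", "mat3 10009 TEXT")` is true, while `mat1 999` fails against `mat1 1000 2000` because 999 is below the lower bound.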

So my question is: how would you search for the matching lines?

Yes, I use Java as my programming language.

EDIT
I now divided the huge files first so that I have no problems with being out of memory. I also think it is faster to compare (many) smaller files to each other than those two huge files. After that I can compare them the way I mentioned above. It may not be the perfect way, but I am still learning 😉
Nonetheless, all your approaches were very helpful to me. Thank you for your replies!

1 Answer

Editorial Team · Answer 1 · 2026-05-24T18:52:31+00:00

Now that you’ve given us more specifics, the approach I would take relies upon pre-partitioning, and optionally, sorting before searching for matches.

This should eliminate a substantial number of comparisons that would never match anyway in the naive, brute-force approach. For the sake of argument, let’s peg both files at 40 million lines each.

Partitioning: Read through file_1 and send all lines starting with mat1 to file_1_mat1, and so on. Do the same for file_2. This is trivial with a little grep, or should you wish to do it programmatically in Java it’s a beginner’s exercise.
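As a rough illustration of the partitioning pass in Java, the sketch below streams the input once and appends each line to a file named after its first column. The class and method names (`Partitioner`, `keyOf`, `partition`) and the `prefix` parameter are hypothetical, and it assumes UTF-8 text and whitespace-separated columns. With at most 50 distinct mat values, keeping one open writer per key should be manageable.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class Partitioner {

    // The partition key is the first whitespace-delimited token, e.g. "mat3".
    static String keyOf(String line) {
        String trimmed = line.trim();
        int end = 0;
        while (end < trimmed.length() && !Character.isWhitespace(trimmed.charAt(end))) {
            end++;
        }
        return trimmed.substring(0, end);
    }

    // Streams 'input' once and writes each line to outDir/<prefix>_<key>,
    // for example file_1_mat1. Only one line is held in memory at a time.
    static void partition(Path input, Path outDir, String prefix) throws IOException {
        Files.createDirectories(outDir);
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = keyOf(line);
                if (key.isEmpty()) {
                    continue; // skip blank lines
                }
                BufferedWriter writer = writers.get(key);
                if (writer == null) {
                    writer = Files.newBufferedWriter(
                            outDir.resolve(prefix + "_" + key), StandardCharsets.UTF_8);
                    writers.put(key, writer);
                }
                writer.write(line);
                writer.newLine();
            }
        } finally {
            // Close (and thereby flush) every partition file.
            for (BufferedWriter writer : writers.values()) {
                writer.close();
            }
        }
    }
}
```

Calling `partition` once for file_1 with prefix `file_1` and once for file_2 with prefix `file_2` would produce the two sets of partition files.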

That’s one pass through each of the two files, for a total of 80 million lines read, yielding two sets of 50 files of 800,000 lines each on average.

Sorting: For each partition, sort according to the numeric value in the second column only (the lower bound from file_1 and the actual number from file_2). Even if 800,000 lines can’t fit into memory I suppose we can adapt 2-way external merge sort and perform this faster (fewer overall reads) than a sort of the entire unpartitioned space.
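For a partition that does fit into memory, the sort could look roughly like the sketch below. It is not the external merge sort mentioned above, just the simple in-memory case, and `PartitionSorter` and its methods are illustrative names. The point it shows is that the second column must be compared as a number: a plain string sort would put 10009 before 200.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PartitionSorter {

    // Parses the second whitespace-separated column as a number.
    // For file_1 this is the lower bound, for file_2 the actual number.
    static long secondColumn(String line) {
        String[] cols = line.trim().split("\\s+", 3);
        return Long.parseLong(cols[1]);
    }

    // Returns a new list sorted ascending by the numeric second column.
    // The column is re-parsed on every comparison, which keeps the sketch
    // short; caching the parsed value would be faster on large inputs.
    static List<String> sortBySecondColumn(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingLong(PartitionSorter::secondColumn));
        return sorted;
    }
}
```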

Comparison: Now you just have to iterate once through each pair of partition files (file_1_mat1 with file_2_mat1, and so on), without needing to keep anything in memory, writing matches to your output file. Repeat for the rest of the partitions in turn. No need for a final ‘merge’ step (unless you’re processing partitions in parallel).
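A minimal sketch of that single pass over one sorted pair of partitions follows. It assumes the ranges are sorted by lower bound, the numbers are sorted ascending, and the bounds are inclusive; `SweepMatcher`, `Range` and `Point` are hypothetical names. One departure from the description above: ranges within a partition may overlap, so the sketch keeps the currently open ranges in a small priority queue rather than holding nothing at all. It takes lists for brevity, but the same loop would work when reading the two files line by line.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class SweepMatcher {

    static final class Range {
        final long low;
        final long high;
        final String line;
        Range(long low, long high, String line) {
            this.low = low;
            this.high = high;
            this.line = line;
        }
    }

    static final class Point {
        final long value;
        final String line;
        Point(long value, String line) {
            this.value = value;
            this.line = line;
        }
    }

    // ranges: sorted ascending by low; points: sorted ascending by value.
    // Returns {file_1 line, file_2 line} pairs with low <= value <= high.
    static List<String[]> match(List<Range> ranges, List<Point> points) {
        List<String[]> matches = new ArrayList<>();
        // Open ranges, ordered by upper bound so expired ones are cheap to drop.
        PriorityQueue<Range> open =
                new PriorityQueue<>(Comparator.comparingLong((Range r) -> r.high));
        int next = 0;
        for (Point p : points) {
            // Open every range that starts at or before this point.
            while (next < ranges.size() && ranges.get(next).low <= p.value) {
                open.add(ranges.get(next++));
            }
            // Drop ranges that end before this point; later points are larger still.
            while (!open.isEmpty() && open.peek().high < p.value) {
                open.poll();
            }
            // Every range still open contains the point.
            for (Range r : open) {
                matches.add(new String[] {r.line, p.line});
            }
        }
        return matches;
    }
}
```

If the ranges never overlap, the queue holds at most one range at a time, which is close to the "nothing in memory" case.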

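Even without the sorting stage, the naive comparison you’re already doing should run faster across 50 pairs of files with 800,000 lines each than on two files with 40 million lines each. Assuming the lines are spread evenly across the mat values, that is roughly 50 × 800,000² ≈ 3.2 × 10^13 line comparisons instead of 40,000,000² = 1.6 × 10^15, about 50 times fewer.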
