I have a 2GB big text file, it has 5 columns delimited by tab.

Question

0

Asked: June 1, 20262026-06-01T15:56:17+00:00 2026-06-01T15:56:17+00:00

I have a 2GB big text file, it has 5 columns delimited by tab.

0

I have a 2GB big text file, it has 5 columns delimited by tab.
A row will be called duplicate only if 4 out of 5 columns matches.

Right now, I am doing dduping by first loading each coloumn in separate List
, then iterating through lists, deleting the duplicate rows as it encountered and aggregating.

The problem: it is taking more than 20 hours to process one file.
I have 25 such files to process.

Can anyone please share their experience, how they would go about doing such dduping?

This dduping will be a throw away code. So, I was looking for some quick/dirty solution, to get job done as soon as possible.

Here is my pseudo code (roughly)

Iterate over the rows
  i=current_row_no.    
    Iterate over the row no. i+1 to last_row
                    if(col1 matches  //find duplicate
                        && col2 matches
                        && col3 matches  
                        && col4 matches)
                        { 
                           col5List.set(i,get col5); //aggregate 
                        }

Duplicate example

A and B will be duplicate A=(1,1,1,1,1), B=(1,1,1,1,2), C=(2,1,1,1,1) and output would be A=(1,1,1,1,1+2) C=(2,1,1,1,1) [notice that B has been kicked out]

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-01T15:56:18+00:00

A HashMap will be your best bet. In a single, constant time operation, you can both check for duplication and fetch the appropriate aggregation structure (a Set in my code). This means that you can traverse the entire file in O(n). Here’s some example code:

public void aggregate() throws Exception
  {
    BufferedReader bigFile = new BufferedReader(new FileReader("path/to/file.csv"));

    // Notice the paramter for initial capacity. Use something that is large enough to prevent rehashings.
    Map<String, HashSet<String>> map = new HashMap<String, HashSet<String>>(500000);

    while (bigFile.ready())
    {
      String line = bigFile.readLine();
      int lastTab = line.lastIndexOf('\t');
      String firstFourColumns = line.substring(0, lastTab);

      // See if the map already contains an entry for the first 4 columns
      HashSet<String> set = map.get(firstFourColumns);

      // If set is null, then the map hasn't seen these columns before
      if (set==null)
      {
        // Make a new Set (for aggregation), and add it to the map
        set = new HashSet<String>();
        map.put(firstFourColumns, set);
      }

      // At this point we either found set or created it ourselves
      String lastColumn = line.substring(lastTab+1);
      set.add(lastColumn);
    }
    bigFile.close();

    // A demo that shows how to iterate over the map and set structures
    for (Map.Entry<String, HashSet<String>> entry : map.entrySet())
    {
      String firstFourColumns = entry.getKey();
      System.out.print(firstFourColumns + "=");

      HashSet<String> aggregatedLastColumns = entry.getValue();
      for (String column : aggregatedLastColumns)
      {
        System.out.print(column + ",");
      }
      System.out.println("");
    }
  }

A few points:

The initialCapaticy parameter for the HashMap is important. If the number of entries gets bigger than the capacity, then the structure is re-hashed, which is very slow. The default initial capacity is 16, which will cause many rehashes for you. Pick a value that you know is greater than the number of unique sets of the first four columns.
If ordered output in the aggregation is important, you can switch the HashSet for a TreeSet.
This implementation will use a lot of memory. If your text file is 2GB, then you’ll probably need a lot of RAM in the jvm. You can add the jvm arg -Xmx4096m to increase the maximum heap size to 4GB. If you don’t have at least 4GB this probably won’t work for you.
This is also a parallelizable problem, so if you’re desperate you could thread it. That would be a lot of effort for throw-away code, though. [Edit: This point is likely not true, as pointed out in the comments]

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have a 2GB big text file, it has 5 columns delimited by tab.

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply