I have a log file which has single strings on each line. I am

Question


Asked: June 13, 2026 (2026-06-13T17:27:10+00:00)


I have a log file which has single strings on each line. I am trying to remove duplicate data from the file and save the file out as a new file. I had first thought of reading data into a HashSet and then saving the contents of the hashset out, however I get an “OutOfMemory” exception when attempting to do this (on the line that adds the string to the hashset).

There are around 32,000,000 lines in the file. It’s not practical to re-read the entire file for each comparison.

Any ideas? My other thought was to load the entire contents into a SQLite database and select DISTINCT values, but I’m not sure that’d work either with that many values.

Thanks for any input!


1 Answer

Editorial Team · Answer 1 · 2026-06-13T17:27:11+00:00

The first thing to decide is whether high memory consumption is actually a problem.

If your application will always run on a server with plenty of RAM, or you otherwise know you’ll have enough memory, you can do a lot of things that aren’t possible in a low-memory or unknown environment. If memory isn’t the constraint, make sure your application runs as a 64-bit process (on a 64-bit OS, of course); otherwise it will be limited to 2GB of address space (up to 4GB with the LARGEADDRESSAWARE flag on a 64-bit OS). My guess is that this is your problem. If so, all you have to do is switch to 64-bit and it will work fine, assuming the machine has enough memory.
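The question looks like a .NET program, but the same check is easy to illustrate in Python. The sketch below only shows how a running process can report its own pointer width; the function names `pointer_bits` and `is_64bit_process` are made up for this example.

```python
import struct
import sys


def pointer_bits():
    """Return the pointer width of the current process in bits."""
    # "P" is the struct format code for a native pointer (void *).
    return struct.calcsize("P") * 8


def is_64bit_process():
    """Return True if this interpreter runs as a 64-bit process."""
    # sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
    return sys.maxsize > 2**32


if __name__ == "__main__":
    print(f"{pointer_bits()}-bit process, 64-bit: {is_64bit_process()}")
```

A 32-bit result on a 64-bit machine would suggest the process is hitting the address-space ceiling before it runs out of physical RAM.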

If memory is a problem and you need to keep usage low, you can, as you suggested, load all the data into a database (I’m more familiar with databases like SQL Server, but I guess SQLite will do), make sure the column has the right index, and then select the distinct values.
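A minimal sketch of the database route, using Python's built-in `sqlite3` module. It is a variation on the description above: instead of running `SELECT DISTINCT` afterwards, it declares the column as a primary key (which creates a unique index) and lets `INSERT OR IGNORE` drop duplicates as they arrive. The function name, table name, and file paths are assumptions for illustration.

```python
import sqlite3


def dedupe_with_sqlite(src_path, dst_path, db_path):
    """Copy the unique lines of src_path to dst_path via an on-disk SQLite table.

    db_path should point to a fresh file on disk so the data does not
    have to fit in memory.
    """
    conn = sqlite3.connect(db_path)
    try:
        # PRIMARY KEY on a TEXT column creates a unique index on it.
        conn.execute("CREATE TABLE lines (value TEXT PRIMARY KEY)")
        with open(src_path, "r", encoding="utf-8") as src:
            # The generator streams the file; rows that violate the
            # unique index are silently skipped by OR IGNORE.
            conn.executemany(
                "INSERT OR IGNORE INTO lines (value) VALUES (?)",
                ((line.rstrip("\n"),) for line in src),
            )
        conn.commit()
        with open(dst_path, "w", encoding="utf-8") as dst:
            # rowid grows with each successful insert, so ordering by it
            # keeps first-occurrence order.
            for (value,) in conn.execute("SELECT value FROM lines ORDER BY rowid"):
                dst.write(value + "\n")
    finally:
        conn.close()
```

With tens of millions of rows the index lookups make this slower than an in-memory set, but memory use stays bounded by SQLite's page cache rather than by the size of the file.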

Another option is to read the file as a stream, line by line. For each line, calculate a hash and check it against the hashes kept in memory. If the hash is new, store it and write the line to the output file; if it already exists, move on to the next line (and, if you wish, increment a counter of removed lines). That way you keep less data in memory: only one hash per unique line. Note that two different lines that happen to share a hash would be treated as duplicates, so use a hash wide enough to make collisions unlikely.
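The streaming approach can be sketched roughly as follows. This is an illustration in Python, not the original poster's code; the function name `dedupe_by_hash` and the choice of a 16-byte BLAKE2b digest are assumptions made for the example.

```python
import hashlib


def dedupe_by_hash(src_path, dst_path):
    """Stream src_path and write each line to dst_path the first time it is seen.

    Only a fixed-size digest per unique line is kept in memory.
    Returns the number of duplicate lines that were skipped.
    """
    seen = set()   # digests of the lines written so far
    removed = 0    # count of skipped duplicates
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            content = line.rstrip(b"\r\n")
            # 16-byte digest: far smaller than a long log line, and wide
            # enough that accidental collisions are extremely unlikely.
            digest = hashlib.blake2b(content, digest_size=16).digest()
            if digest in seen:
                removed += 1
                continue
            seen.add(digest)
            dst.write(content + b"\n")
    return removed
```

At 16 bytes per digest, 32,000,000 unique lines would need about 512 MB for the raw digests alone, and the set adds per-entry overhead on top of that. The saving over storing the strings themselves is only significant when the lines are much longer than the digest.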

Best of luck.
