The Archive Base

Editorial Team
Asked: June 13, 2026

I have a log file with a single string on each line. I am trying to remove the duplicate lines and save the result as a new file. My first thought was to read the data into a HashSet and then write the contents of the HashSet out, but I get an “OutOfMemory” exception when I try this (on the line that adds the string to the HashSet).

There are around 32,000,000 lines in the file. It’s not practical to re-read the entire file for each comparison.

Any ideas? My other thought was to load the entire contents into a SQLite database and select DISTINCT values, but I’m not sure that would work either with that many values.

Thanks for any input!


1 Answer

  1. Editorial Team
    Added an answer on June 13, 2026 at 5:27 pm

    The first thing you need to decide: is high memory consumption a problem?

    If your application will always run on a server with plenty of RAM, or you otherwise know you’ll have enough memory, you can do many things that you can’t do in a low-memory or unknown environment. If memory isn’t the problem, make sure your application runs as a 64-bit process (on a 64-bit OS, of course); otherwise it is limited to 2GB of memory (4GB if you set the LARGEADDRESSAWARE flag). My guess is that this is your problem, and all you have to do is switch to 64-bit, after which it should work fine (assuming you have enough memory).

    If memory is a problem and you need to keep usage low, you can, as you suggested, add all the data to a database (I’m more familiar with databases like SQL Server, but I guess SQLite will do), make sure you have the right index on the column, and then select the distinct values.
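
A minimal sketch of the database route, written in Python with its standard `sqlite3` module purely for illustration (the question itself appears to be about .NET). The function name `dedupe_with_sqlite`, the table name `lines`, and the index name `idx_lines_line` are hypothetical, and the file is assumed to be UTF-8 text:

```python
import sqlite3


def dedupe_with_sqlite(src_path, dst_path, db_path):
    """Write the distinct lines of src_path to dst_path via an on-disk SQLite table."""
    # Passing a file path (not ":memory:") keeps the table on disk.
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("DROP TABLE IF EXISTS lines")
        conn.execute("CREATE TABLE lines (line TEXT NOT NULL)")
        with open(src_path, encoding="utf-8") as src:
            # executemany pulls from the generator one line at a time,
            # so the whole file is never held in memory at once.
            conn.executemany(
                "INSERT INTO lines (line) VALUES (?)",
                ((raw.rstrip("\n"),) for raw in src),
            )
        # Index the column, as the answer suggests, before querying it.
        conn.execute("CREATE INDEX idx_lines_line ON lines (line)")
        conn.commit()
        with open(dst_path, "w", encoding="utf-8") as dst:
            # Rows are fetched lazily while iterating the cursor.
            for (line,) in conn.execute("SELECT DISTINCT line FROM lines"):
                dst.write(line + "\n")
    finally:
        conn.close()
```

The output order of `SELECT DISTINCT` is not guaranteed to match the input order, so this sketch only suits cases where the order of the log lines does not matter.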

    Another option is to read the file as a stream, line by line. For each line, calculate a hash; if the hash is new, write the line to the output file and keep the hash in memory. If the hash already exists, skip to the next line (and, if you wish, increment a counter of removed lines). That way you hold less data in memory: only one hash per distinct line. Keep in mind that two different lines with the same hash would be treated as duplicates, so pick a hash wide enough that collisions are practically impossible.
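
The streaming approach could look roughly like this, again sketched in Python for illustration. The function name `dedupe_with_hashes` is hypothetical, and the choice of a 128-bit BLAKE2b digest is an assumption, not something the answer specifies; any sufficiently wide hash would do:

```python
import hashlib


def dedupe_with_hashes(src_path, dst_path):
    """Stream src_path to dst_path, skipping lines whose hash was already seen.

    Returns the number of duplicate lines removed.
    """
    seen = set()  # one 16-byte digest per distinct line, not the line itself
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for raw in src:
            line = raw.rstrip(b"\r\n")
            # Assumption: 128-bit BLAKE2b. A collision would wrongly drop a
            # unique line, but is extremely unlikely at this digest size.
            digest = hashlib.blake2b(line, digest_size=16).digest()
            if digest in seen:
                removed += 1
                continue
            seen.add(digest)
            dst.write(line + b"\n")
    return removed
```

This keeps the first occurrence of each line and preserves the original order. It normalizes line endings to `\n` in the output, and memory use grows with the number of distinct lines rather than with the total file size.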

    Best of luck.
