The Archive Base

The Archive Base Latest Questions

Editorial Team
Asked: June 2, 2026

Following up on the question "Detecting duplicate lines on file using c": I can detect duplicate lines, but how can we remove these lines from our file?

Thanks.

Edit: to add my code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct somehash {
    struct somehash *next;
    unsigned hash;
    char *mem;
};

#define THE_SIZE 100000

struct somehash *table[THE_SIZE] = { NULL,};

struct somehash **some_find(char *str, unsigned len);
static unsigned some_hash(char *str, unsigned len);

int main (void)
{
    char buffer[100];
    struct somehash **pp;
    size_t len;
    FILE * pFileIn;
    FILE * pFileOut;

    pFileIn = fopen("in.csv", "r");
    if (pFileIn == NULL) {
        perror("Error opening input file");
        return EXIT_FAILURE;
    }
    pFileOut = fopen("out.csv", "w+");
    if (pFileOut == NULL) {
        perror("Error opening output file");
        fclose(pFileIn);
        return EXIT_FAILURE;
    }

    while (fgets(buffer, sizeof buffer, pFileIn)) {
        len = strlen(buffer);
        pp = some_find(buffer, len);
        if (*pp) { /* found */
            fprintf(stderr, "Duplicate:%s\n", buffer);
        } else { /* not found: create one */
            fprintf(stdout, "%s", buffer);
            fprintf(pFileOut, "%s", buffer);
            *pp = malloc(sizeof **pp);
            if (*pp == NULL) {
                perror("Out of memory");
                return EXIT_FAILURE;
            }
            (*pp)->next = NULL;
            (*pp)->hash = some_hash(buffer, len);
            (*pp)->mem = malloc(1 + len);
            if ((*pp)->mem == NULL) {
                perror("Out of memory");
                return EXIT_FAILURE;
            }
            memcpy((*pp)->mem, buffer, 1 + len);
        }
    }

    fclose(pFileIn);
    fclose(pFileOut);
    return 0;
}

struct somehash **some_find(char *str, unsigned len)
{
    unsigned hash;
    unsigned slot; /* unsigned short would truncate slots above 65535 */
    struct somehash **hnd;

    hash = some_hash(str, len);
    slot = hash % THE_SIZE;
    for (hnd = &table[slot]; *hnd; hnd = &(*hnd)->next) {
        if ((*hnd)->hash != hash) continue;
        if (strcmp((*hnd)->mem, str)) continue;
        break;
    }

    return hnd;
}

static unsigned some_hash(char *str, unsigned len)
{
    unsigned val;
    unsigned idx;

    if (!len) len = strlen(str);

    val = 0;
    for (idx = 0; idx < len; idx++) {
        val ^= (val >> 2) ^ (val << 5) ^ (val << 13) ^ str[idx] ^ 0x80001801;
    }

    return val;
}

But the output file still always contains the first occurrence of each duplicated line!

Edit 2: To clarify: the intent is to find all duplicates in an input file. When there is more than one instance of a line in the input, that line should not appear in the output at all. The intent is not just to remove duplicates of that line so each occurs only once, but to remove all instances of a line if that line is duplicated in the input.

1 Answer

  1. Editorial Team
    Added an answer on June 2, 2026 at 11:14 am

    Essentially, the only way to remove lines from a text file is to copy the file, leaving those lines out of the copy. The usual approach looks something like this:

    while (fgets(buffer, size, infile))
        if (search(your_hashtable, buffer) == NOT_FOUND) {
            fputs(buffer, outfile);   /* first time we see this line: copy it */
            insert(your_hashtable, buffer);
        }
    

    If you want to save some storage space, you might store hashes instead of complete lines. In theory that could fail due to a hash collision, but if you use a cryptographic hash like SHA-256, the chances of a collision are probably lower than the chances of a string comparison coming out wrong due to a CPU error. Besides: if you find a collision with SHA-256, you can probably get at least a little fame (if not fortune) from that alone.
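As a rough illustration of storing hashes rather than whole lines, here is a minimal sketch. It swaps in 64-bit FNV-1a as a stand-in for SHA-256, purely to stay within the standard library, so it does not have the collision resistance discussed above. The names `fnv1a64`, `struct hashset`, `hashset_init` and `hashset_test_and_add` are hypothetical helpers, and the table has a fixed capacity that never grows.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* 64-bit FNV-1a: a fast non-cryptographic hash, used here only as a
   stand-in for the cryptographic digest the text recommends. */
static uint64_t fnv1a64(const char *s, size_t len)
{
    uint64_t h = 14695981039346656037ULL;    /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 1099511628211ULL;               /* FNV prime */
    }
    return h;
}

/* Open-addressing set that stores only hashes, never the lines. */
struct hashset {
    uint64_t *slots;    /* 0 marks an empty slot */
    size_t capacity;    /* must be a power of two */
    size_t used;
};

/* Returns 0 on success, -1 if the allocation failed. */
static int hashset_init(struct hashset *set, size_t capacity_pow2)
{
    set->slots = calloc(capacity_pow2, sizeof *set->slots);
    set->capacity = capacity_pow2;
    set->used = 0;
    return set->slots ? 0 : -1;
}

/* Returns 1 if this line's hash was already stored, 0 if it was added
   just now, and -1 if the table is full. */
static int hashset_test_and_add(struct hashset *set, const char *line)
{
    uint64_t h = fnv1a64(line, strlen(line));
    size_t mask = set->capacity - 1;
    size_t i;

    if (h == 0)
        h = 1;                               /* 0 is reserved for "empty" */
    i = (size_t)h & mask;
    for (size_t probes = 0; probes < set->capacity; probes++) {
        if (set->slots[i] == h)
            return 1;
        if (set->slots[i] == 0) {
            set->slots[i] = h;
            set->used++;
            return 0;
        }
        i = (i + 1) & mask;                  /* linear probing */
    }
    return -1;
}
```

In the copy loop, a line would be written to the output only when `hashset_test_and_add` returns 0. Each stored line then costs 8 bytes regardless of its length.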

    Edit: As @Zack alluded to, the situation with hash size is basically a matter of deciding what chance of a collision you’re willing to accept. With a cryptographic 256-bit hash, the chances are so remote it’s hardly worth considering. If you reduce that to, say, a 128-bit hash, the chances go up quite a bit, but they’re still small enough for most practical purposes. On the other hand, if you were to reduce it to, say, a 32-bit CRC, chances of a collision are probably higher than I’d be happy accepting if the data mattered much.
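To put rough numbers on that trade-off, the standard birthday approximation can be used. This assumes the hash behaves like a uniformly random b-bit function, and the figure of one million distinct lines is only an illustrative assumption, not something taken from the question.

```latex
\[
  p_{\text{collision}}
  \;\approx\; 1 - \exp\!\left(-\frac{n(n-1)}{2^{\,b+1}}\right)
  \;\approx\; \frac{n^{2}}{2^{\,b+1}}
  \quad \text{(second form valid when the result is small)}
\]

\[
  n = 10^{6}:\qquad
  \begin{aligned}
    b = 32  &: \quad \frac{n^{2}}{2^{33}} \approx 116
               \;\Rightarrow\; p \approx 1 \\
    b = 128 &: \quad p \approx 1.5 \times 10^{-27} \\
    b = 256 &: \quad p \approx 4.3 \times 10^{-66}
  \end{aligned}
\]
```

Under these assumptions, a 32-bit value is almost guaranteed to collide somewhere in a file of that size, while the 128-bit and 256-bit cases are negligible.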

    I should probably mention one more possibility: a bit of a hybrid. Store something like a 32-bit CRC (which is really fast to compute) along with the offset where that line starts in the file. If your file never exceeds 4 GB, you can store both in only 8 bytes.

    In this case, you work just a little differently: you start by computing the CRC of the current line. The vast majority of the time it’s not yet in the table, so you copy the line to the output and insert its CRC and offset into the hash table. When the CRC is already in the table, you seek back to the possibly-identical line, read it back in, and compare it to the current line. If they match, you go back to where you were and advance to the next line. If they don’t match, you copy the current line to the output and add its CRC and offset to the hash table.
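The hybrid scheme might be sketched in C roughly as follows. The names `copy_unique_lines`, `same_line_at`, `crc32_of`, `MAX_LINE` and `N_BUCKETS` are all hypothetical, lines are assumed to fit in `MAX_LINE` bytes, and the input is assumed to be a seekable file under 4 GB opened separately from the output (binary mode is the safer choice for `ftell`/`fseek`). Like the description above, it keeps the first occurrence of each line.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE 4096       /* assumption: no line is longer than this */
#define N_BUCKETS 65536u    /* assumption: arbitrary table size */

struct entry {              /* 8 bytes of payload, plus the chain link */
    uint32_t crc;
    uint32_t offset;        /* where the line starts in the input file */
    struct entry *next;
};

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320): short, not fast. */
static uint32_t crc32_of(const char *s, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (unsigned char)s[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Re-read the line stored at `offset`, compare it with `line`, then
   restore the read position. Returns 1 when the two lines are equal. */
static int same_line_at(FILE *in, uint32_t offset, const char *line)
{
    char old[MAX_LINE];
    long here = ftell(in);
    int same = 0;

    if (here < 0 || fseek(in, (long)offset, SEEK_SET) != 0)
        return 0;
    if (fgets(old, sizeof old, in))
        same = strcmp(old, line) == 0;
    fseek(in, here, SEEK_SET);
    return same;
}

/* Copies each distinct line of `in` to `out` once. Returns 0 on success. */
static int copy_unique_lines(FILE *in, FILE *out)
{
    struct entry **buckets = calloc(N_BUCKETS, sizeof *buckets);
    char line[MAX_LINE];
    int status = 0;

    if (!buckets)
        return -1;
    for (;;) {
        long start = ftell(in);
        if (start < 0) { status = -1; break; }
        if (!fgets(line, sizeof line, in))
            break;

        uint32_t crc = crc32_of(line, strlen(line));
        struct entry **slot = &buckets[crc % N_BUCKETS];
        int duplicate = 0;

        /* Only seek back when the cheap CRC check says "maybe". */
        for (struct entry *e = *slot; e && !duplicate; e = e->next)
            if (e->crc == crc && same_line_at(in, e->offset, line))
                duplicate = 1;
        if (duplicate)
            continue;

        struct entry *fresh = malloc(sizeof *fresh);
        if (!fresh) { status = -1; break; }
        fresh->crc = crc;
        fresh->offset = (uint32_t)start;
        fresh->next = *slot;
        *slot = fresh;
        fputs(line, out);
    }
    for (size_t i = 0; i < N_BUCKETS; i++)
        while (buckets[i]) {
            struct entry *next = buckets[i]->next;
            free(buckets[i]);
            buckets[i] = next;
        }
    free(buckets);
    return status;
}
```

A CRC collision here costs only an extra seek and comparison, never a wrong result, which is the point of keeping the offset. Note that a final line without a trailing newline compares as different from the same text followed by one.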

    Edit 2: Let’s assume for the moment that the file is small enough that you can reasonably fit the whole thing in memory. In that case, you can store a line, and a line number where it occurred. If a line is already stored, you can change its line number to -1, to indicate that it was duplicated and shouldn’t appear in the output.

    In C++ (since it defines the relevant data structures), it could look something like this:

    #include <fstream>
    #include <map>
    #include <string>
    #include <utility>

    // infile is assumed to be an already-open std::ifstream
    std::string line;

    typedef std::map<std::string, int> line_record;

    line_record lines;
    int line_number = 1;

    while (std::getline(infile, line)) {
        line_record::iterator existing = lines.find(line);
        if (existing != lines.end()) // if it was already in the map
            existing->second = -1;    // indicate that it's duplicated
        else
            lines.insert(std::make_pair(line, line_number)); // otherwise, add it to map
        ++line_number;
    }
    

    Okay, that reads in the lines, and for each line, it checks whether it’s already in the map. If it is, it sets the line_number to -1, to indicate that it won’t appear in the output. If it wasn’t, it inserts it into the map along with its line number.

    line_record::iterator pos;
    
    std::vector<line_record::iterator> sortable_lines;
    
    for (pos=lines.begin(); pos != lines.end(); ++pos)
        if (pos->second != -1)
            sortable_lines.push_back(pos);
    

    This sets up sortable_lines as a vector of iterators into the map, so instead of copying entire lines, we’ll just copy iterators (essentially like pointers) to those lines. It then copies the iterators into there, but only for lines where the line number isn’t -1.

    // The comparator has to be declared before the call that uses it.
    struct by_line_number {
         bool operator()(line_record::iterator a, line_record::iterator b) const {
             return a->second < b->second;
         }
    };

    std::sort(sortable_lines.begin(), sortable_lines.end(), by_line_number());
    

    Then we sort those iterators by the line number.

    for (std::size_t i = 0; i < sortable_lines.size(); i++)
         outfile << sortable_lines[i]->first << "\n";
    

    Finally, we copy each line to the output file, in order by their original line numbers.
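Since the question asks for C, the same idea might be sketched there with two `qsort` passes instead of a map. This is an assumption-laden illustration rather than a translation of the code above: `copy_unrepeated_lines`, `struct record`, `by_text`, `by_number` and `MAX_LINE` are hypothetical names, the whole file is assumed to fit in memory, and lines are assumed to fit in `MAX_LINE` bytes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE 4096   /* assumption: no line is longer than this */

struct record {
    char *text;         /* heap copy of the line, newline included */
    long number;        /* 1-based input position; -1 once repeated */
};

static int by_text(const void *a, const void *b)
{
    const struct record *x = a, *y = b;
    return strcmp(x->text, y->text);
}

static int by_number(const void *a, const void *b)
{
    const struct record *x = a, *y = b;
    return (x->number > y->number) - (x->number < y->number);
}

/* Writes to `out` only the lines of `in` that occur exactly once,
   in their original order. Returns 0 on success, -1 on allocation failure. */
static int copy_unrepeated_lines(FILE *in, FILE *out)
{
    struct record *recs = NULL;
    size_t count = 0, cap = 0;
    char buf[MAX_LINE];
    int status = 0;

    while (fgets(buf, sizeof buf, in)) {
        if (count == cap) {                      /* grow the array */
            size_t new_cap = cap ? cap * 2 : 64;
            struct record *grown = realloc(recs, new_cap * sizeof *grown);
            if (!grown) { status = -1; break; }
            recs = grown;
            cap = new_cap;
        }
        size_t len = strlen(buf) + 1;
        recs[count].text = malloc(len);
        if (!recs[count].text) { status = -1; break; }
        memcpy(recs[count].text, buf, len);
        recs[count].number = (long)count + 1;
        count++;
    }

    if (status == 0 && count > 0) {
        /* Sorting by content puts identical lines next to each other. */
        qsort(recs, count, sizeof *recs, by_text);
        for (size_t i = 1; i < count; i++)
            if (strcmp(recs[i - 1].text, recs[i].text) == 0)
                recs[i - 1].number = recs[i].number = -1;

        /* Sorting by line number restores the original order;
           the -1 entries collect at the front and are skipped. */
        qsort(recs, count, sizeof *recs, by_number);
        for (size_t i = 0; i < count; i++)
            if (recs[i].number != -1)
                fputs(recs[i].text, out);
    }

    for (size_t i = 0; i < count; i++)
        free(recs[i].text);
    free(recs);
    return status;
}
```

For an input of `a`, `b`, `a`, `c`, the output would contain only `b` and `c`, which matches the behaviour asked for in the question's second edit.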

© 2021 The Archive Base. All Rights Reserved