Editorial Team
Asked: May 20, 2026


This question has been asked here in one form or another, but not quite in the way I need. This is my situation: I already have one file, named file_a, and I’m creating another file, file_b. file_a is always bigger than file_b. There will be a number of duplicate lines in file_b (and hence in file_a as well), but both files will also have some unique lines. What I want to do is copy/merge only the unique lines from file_a into file_b and then sort the lines, so that file_b becomes the most up-to-date file, with all the unique entries. Neither of the original files should be more than 10MB in size. What’s the most efficient (and fastest) way to do that?

I was thinking of something like this, which does the merging all right:

#!/usr/bin/env python

import os, time, sys

# Convert Date/time to epoch
def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

# input files
o_file = "file_a"
c_file = "file_b"
n_file = [o_file,c_file]

m_file = "merged.file"

for x in range(len(n_file)):
    P = open(n_file[x],"r")
    output = P.readlines()
    P.close()

    # Sort the output, order by 2nd last field
    #sp_lines = [ line.split('\t') for line in output ]
    #sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )

    F = open(m_file,'w') 
    #for line in sp_lines:
    for line in output:
        if "group_" in line:
            F.write(line)
    F.close()

But the result:

  • doesn’t contain only the unique lines
  • isn’t sorted (by the next-to-last field)
  • and introduces a third file, i.e. m_file

Just a side note (long story short): I can’t use sorted() here as I’m using v2.3, unfortunately. The input files look like this:

On 23/03/11 00:40:03
JobID   Group.User          Ctime   Wtime   Status  QDate               CDate
===================================================================================
430792  group_atlas.pltatl16    0   32  4   02/03/11 21:52:38   02/03/11 22:02:15
430793  group_atlas.atlas084    30  472 4   02/03/11 21:57:43   02/03/11 22:09:35
430794  group_atlas.atlas084    12  181 4   02/03/11 22:02:37   02/03/11 22:05:42
430796  group_atlas.atlas084    8   185 4   02/03/11 22:02:38   02/03/11 22:05:46

I tried to use cmp() to sort by the 2nd last field but, I think, it doesn’t work just because of the first 3 lines of the input files.
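A small sketch of why the header is a plausible culprit: a header line has no parseable date in its next-to-last field, so any date-based comparison fails on it. The code below is modern Python 3 (so it will not run on v2.3 as written), the sample text is made up in the layout shown above and assumes tab-separated fields, and the helper names `queue_date` and `split_header` are invented for illustration.

```python
import time

DATE_FORMAT = '%d/%m/%y %H:%M:%S'

# Made-up sample in the same layout as the input files (tab-separated).
SAMPLE = (
    "On 23/03/11 00:40:03\n"
    "JobID\tGroup.User\tCtime\tWtime\tStatus\tQDate\tCDate\n"
    "=====\n"
    "430793\tgroup_atlas.atlas084\t30\t472\t4\t02/03/11 21:57:43\t02/03/11 22:09:35\n"
    "430792\tgroup_atlas.pltatl16\t0\t32\t4\t02/03/11 21:52:38\t02/03/11 22:02:15\n"
)

def queue_date(line):
    """Parse the next-to-last tab-separated field as a struct_time."""
    return time.strptime(line.rstrip('\n').split('\t')[-2], DATE_FORMAT)

def split_header(lines, header_size=3):
    """Return (header_lines, data_lines), assuming a fixed-size header."""
    return lines[:header_size], lines[header_size:]

lines = SAMPLE.splitlines(True)
header, data = split_header(lines)

# Sorting only the data rows works.
data.sort(key=queue_date)

# A header line, by contrast, has no date in that position.
try:
    queue_date(header[1])
    header_parses = True
except (ValueError, IndexError):
    header_parses = False
```

Here the data rows sort by their queue date once the three header lines are set aside, while the column-title line cannot be parsed at all.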

Can anyone please help? Cheers!!!


Update 1:

For future reference, here is the complete script, as suggested by Jakob. It worked just fine.

#!/usr/bin/env python

import os, time, sys
from sets import Set as set

def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    #I want to discard the headers
    for i in xrange(3):
        fileobj.readline()
    #
    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

# Input files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

print time.strftime('%H:%M:%S', time.localtime())

# Sorting the output, order by 2nd last field
sp_lines = [ line.split('\t') for line in app(o_file, c_file) ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )

F = open(m_file,'w')
print "No. of lines: ",len(sp_lines)

for line in sp_lines:

    MF = '\t'.join(line)
    F.write(MF)
F.close()

It took about 2m:47s to finish for 145244 lines.

[testac1@serv07 ~]$ ./uniq-merge.py 
17:19:21
No. of lines:  145244
17:22:08

thanks!!


Update 2:

Hi eyquem, this is the Error message I get when I run your script(s).

From the first script:

[testac1@serv07 ~]$ ./uniq-merge_2.py 
  File "./uniq-merge_2.py", line 44
    fm.writelines( '\n'.join(v)+'\n' for k,v in output )
                                       ^
SyntaxError: invalid syntax

From the second script:

[testac1@serv07 ~]$ ./uniq-merge_3.py 
  File "./uniq-merge_3.py", line 24
    output = sett(line.rstrip() for line in fa)
                                  ^
SyntaxError: invalid syntax
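Both tracebacks point at a generator expression passed directly as a function argument. Generator expressions were added in Python 2.4, so a v2.3 interpreter rejects them at parse time. A minimal sketch of the usual workaround is to add square brackets so the expression becomes a list comprehension. The data and the names `pairs` and `raw_lines` are invented for illustration, and the block is written for modern Python 3 (`io.StringIO` stands in for the output file; on v2.3 the built-in `set` would be `Set` from the `sets` module).

```python
import io

# Invented stand-ins for the data handled by the two scripts.
pairs = [(1, ['a', 'b']), (2, ['c'])]
raw_lines = ['x \n', 'y\n', 'x \n']

fm = io.StringIO()  # stands in for the output file

# Instead of: fm.writelines( '\n'.join(v)+'\n' for k,v in output )
fm.writelines(['\n'.join(v) + '\n' for k, v in pairs])

# Instead of: output = sett(line.rstrip() for line in fa)
unique = set([line.rstrip() for line in raw_lines])

written = fm.getvalue()
```

The list comprehension builds the whole list in memory first, which should be acceptable for input files of around 10MB.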

Cheers!!


Update 3:

The previous one wasn’t sorting the list at all. Thanks to eyquem for pointing that out. Well, it does now. This is a further modification of Jakob’s version: I converted the set returned by app(path1, path2) to a list, myList, and then applied sort( lambda … ) to myList to sort the merged file by the next-to-last field. This is the final script.

#!/usr/bin/env python

import os, time, sys
from sets import Set as set

def toEpoch(dt):
    # Convert date/time to epoch
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    # Discard the headers (1st 3 lines)
    for i in xrange(3):
        fileobj.readline()

    for line in fileobj:
        yield line

def app(path1, path2):
    # Remove duplicate lines
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

print time.strftime('%H:%M:%S', time.localtime())

# I/O files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

# Convert set into to list
myList = list(app(o_file, c_file))

# Sort the list by the date
sp_lines = [ line.split('\t') for line in myList ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )

F = open(m_file,'w')
print "No. of lines: ",len(sp_lines)

# Finally write to the outFile
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()

There is no speed boost at all: it took 2m:50s to process the same 145244 lines. If anyone sees any scope for improvement, please let me know. Thanks to Jakob and eyquem for their time. Cheers!!


Update 4:

Just for future reference, this is a modified version of eyquem’s script, which works much better and faster than the previous ones.

#!/usr/bin/env python

import os, sys, re
from sets import Set as sett
from time import mktime, strptime, strftime

def sorting_merge(o_file, c_file, m_file ):

    # RegEx for Date/time field
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')

    def kl(lines,pat = pat):
        # match only the next to last field
        line = lines.split('\t')
        line = line[-2]
        return mktime(strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

    output = sett()
    head = []

    # Separate the header & remove the duplicates
    def rmHead(f_n):
        f_n.readline()
        for line1 in f_n:
            if pat.search(line1):  break
            else:  head.append(line1) # line of the header
        for line in f_n:
            output.add(line.rstrip())
        output.add(line1.rstrip())
        f_n.close()

    fa = open(o_file, 'r')
    rmHead(fa)

    fb = open(c_file, 'r')
    rmHead(fb)

    # Sorting date-wise
    output = [ (kl(line),line.rstrip()) for line in output if line.rstrip() ]
    output.sort()

    fm = open(m_file,'w')
    # Write to the file & add the header
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head[0]+head[1])))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()


c_f = "03_a"
o_f = "03_b"

sorting_merge(o_f, c_f, 'outfile.txt')

This version is much faster than the previous one using lambda a, b: cmp(): 6.99 sec. for 145244 lines, compared to 2m:47s. Thanks to eyquem for all his support. Cheers!!

1 Answer

  1. Editorial Team
    Added an answer on May 20, 2026 at 9:53 pm

    EDIT 2

    My previous versions of the code have problems with output = sett(line.rstrip() for line in fa) and output.sort(key=kl): generator expressions and the key argument of sort() both require Python 2.4 or later.

    Moreover, they were needlessly complicated.

    So I examined the choice Jakob Bowyer made in his code: reading the files directly with set().

    Congratulations Jakob! (and Michal Chruszcz, by the way): set() is unbeatable; it’s faster than reading one line at a time.

    So I abandoned my idea of reading the files line by line.

    .

    But I kept my idea of avoiding a sort based on a cmp() function because, as described in the docs:

    s.sort([cmpfunc=None])

    The sort() method takes an optional
    argument specifying a comparison
    function of two arguments (list items)
    (…) Note that this slows the sorting
    process down considerably

    http://docs.python.org/release/2.3/lib/typesseq-mutable.html
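One way to picture the slowdown the docs mention is to count how often the date parser runs under each approach. The sketch below is modern Python 3, not v2.3: `functools.cmp_to_key` is used only to emulate an old-style comparison function, and the timestamps and the names `to_epoch` and `compare` are made up for illustration.

```python
import functools
import random
import time

DATE_FORMAT = '%d/%m/%y %H:%M:%S'
parse_calls = 0

def to_epoch(text):
    """Parse a date string, counting how often the parser runs."""
    global parse_calls
    parse_calls += 1
    return time.mktime(time.strptime(text, DATE_FORMAT))

# Made-up data: 200 distinct timestamps on one day, in shuffled order.
random.seed(0)
dates = ['02/03/11 21:%02d:%02d' % (m, s) for m in range(20) for s in range(10)]
random.shuffle(dates)

def compare(a, b):
    """Comparison-function style: both sides are parsed on every call."""
    ka, kb = to_epoch(a), to_epoch(b)
    return (ka > kb) - (ka < kb)

parse_calls = 0
by_cmp = sorted(dates, key=functools.cmp_to_key(compare))
calls_with_cmp = parse_calls

# Decorate-sort-undecorate: parse once per line, then sort plain tuples.
parse_calls = 0
decorated = [(to_epoch(d), d) for d in dates]
decorated.sort()
by_key = [d for _, d in decorated]
calls_with_decorate = parse_calls
```

Both approaches give the same order, but the decorated version parses each line exactly once, whereas the comparison function parses two dates on every comparison the sort makes.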

    Then, I managed to obtain a list of tuples (t,line) in which t is

    time.mktime(time.strptime(<1st date-and-hour in line>, '%d/%m/%y %H:%M:%S'))
    

    by the instruction

    output = [ (kl(line),line.rstrip()) for line in output]
    

    .

    I tested two versions. In the first one, the 1st date-and-hour in the line is extracted with a regex:

    def kl(line,pat = pat):
        return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))
    
    output = [ (kl(line),line.rstrip()) for line in output if line.rstrip()]
    
    output.sort()
    

    And a second version in which kl() is:

    def kl(line,pat = pat):
        return time.mktime(time.strptime(line.split('\t')[-2],'%d/%m/%y %H:%M:%S'))
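As a quick check that the two variants of kl() agree, the sketch below computes both keys for one row. It is written for modern Python 3 with raw-string patterns, the row is modelled on the question's sample data and assumes tab-separated fields, and the names `kl_regex` and `kl_split` are invented to keep the two variants apart.

```python
import re
import time

DATE_FORMAT = '%d/%m/%y %H:%M:%S'
DATE = r'[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
# A date that is followed by whitespace and a second date.
pat = re.compile(DATE + r'(?=[ \t]+' + DATE + r')')

def kl_regex(line):
    """Key from the first date matched by the regex."""
    return time.mktime(time.strptime(pat.search(line).group(), DATE_FORMAT))

def kl_split(line):
    """Key from the next-to-last tab-separated field."""
    return time.mktime(time.strptime(line.split('\t')[-2], DATE_FORMAT))

# One tab-separated row modelled on the question's sample data.
row = '430792\tgroup_atlas.pltatl16\t0\t32\t4\t02/03/11 21:52:38\t02/03/11 22:02:15'
matched_text = pat.search(row).group()
same_key = kl_regex(row) == kl_split(row)
```

The split variant depends on the separator being a tab, while the regex variant only depends on the two dates being adjacent, which may matter if the real files mix spaces and tabs.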
    

    .

    The results are

    Times of execution:

    0.03598 seconds for the first code with regex

    0.03580 seconds for the second code with split('\t')

    that is to say, essentially the same.

    This algorithm is faster than code using a cmp() function:

    that is, code in which the set of lines, output, isn’t transformed into a list of tuples by

    output = [ (kl(line),line.rstrip()) for line in output]
    

    but is only transformed into a list of lines (without duplicates, then) and sorted with a function mycmp() (see the doc):

    def mycmp(a,b):
        return cmp(time.mktime(time.strptime(a.split('\t')[-2],'%d/%m/%y %H:%M:%S')),
                   time.mktime(time.strptime(b.split('\t')[-2],'%d/%m/%y %H:%M:%S')))
    
    output = [ line.rstrip() for line in output] # not list(output) , to avoid the problem of newline of the last line of each file
    output.sort(mycmp)
    
    for line in output:
        fm.write(line+'\n')
    

    has an execution time of

    0.11574 seconds

    .

    The code:

    #!/usr/bin/env python
    
    import os, time, sys, re
    from sets import Set as sett
    
    def sorting_merge(o_file , c_file, m_file ):
    
        pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                         '(?=[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)') 
    
        def kl(line,pat = pat):
            return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))
    
        output = sett()
        head = []
    
        fa = open(o_file)
        fa.readline() # first line is skipped
        while True:
            line1 = fa.readline()
            mat1  = pat.search(line1)
            if not mat1: head.append(line1) # line1 is here a line of the header
            else: break # the loop ends on the first line1 not being a line of the heading
        output = sett( fa )
        output.add(line1) # the first data line, which ended the header loop
        fa.close()
    
        fb = open(c_file)
        while True:
            line1 = fb.readline()
            if pat.search(line1):  break
        output = output.union(sett( fb ))
        output.add(line1) # the first data line of the second file
        fb.close()
    
        output = [ (kl(line),line.rstrip()) for line in output]
        output.sort()
    
        fm = open(m_file,'w')
        fm.write(time.strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
        for t,line in output:
            fm.write(line + '\n')
        fm.close()
    
    
    te = time.clock()
    sorting_merge('ytre.txt','tataye.txt','merged.file.txt')
    print time.clock()-te
    

    This time, I hope it will run correctly, and that the only thing left to do is to wait for the execution times on real files much bigger than the ones on which I tested the code.

    .

    EDIT 3

    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+'
                     '[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '|'
                     '[ \t]+aborted/deleted)')
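A short sketch of what the extended lookahead appears to be for: it accepts a date that is followed either by a second date or by the text aborted/deleted. The rows below are hypothetical; in particular, the second one assumes that aborted/deleted takes the place of the last date field, which the thread does not show. The pattern is assembled from a `DATE` helper (an invented name) but is meant to be equivalent to the one above. Modern Python 3.

```python
import re

DATE = r'[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
pat = re.compile(DATE + r'(?=[ \t]+' + DATE + r'|[ \t]+aborted/deleted)')

# Hypothetical rows: the second assumes "aborted/deleted" replaces CDate.
finished = '430792\tgroup_atlas.pltatl16\t0\t32\t4\t02/03/11 21:52:38\t02/03/11 22:02:15'
aborted = '430795\tgroup_atlas.atlas084\t0\t0\t4\t02/03/11 22:02:37\taborted/deleted'
header = 'On 23/03/11 00:40:03'

finished_match = pat.search(finished).group()
aborted_match = pat.search(aborted).group()
header_match = pat.search(header)  # a lone date has nothing after it
```

In both data rows the match is the next-to-last field, and the first header line does not match because its date is followed by neither alternative.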
    

    .

    EDIT 4

    #!/usr/bin/env python
    
    import os, time, sys, re
    from sets import Set
    
    def sorting_merge(o_file , c_file, m_file ):
    
        pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                         '(?=[ \t]+'
                         '[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                         '|'
                         '[ \t]+aborted/deleted)')
    
        def kl(line,pat = pat):
            return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))
    
        head = []
        output = Set()
    
        fa = open(o_file)
        fa.readline() # first line is skipped
        for line1 in fa:
            if pat.search(line1):  break # first line after the heading
            else:  head.append(line1) # line of the header
        for line in fa:
            output.add(line.rstrip())
        output.add(line1.rstrip())
        fa.close()
    
        fb = open(c_file)
        for line1 in fb:
            if pat.search(line1):  break
        for line in fb:
            output.add(line.rstrip())
        output.add(line1.rstrip())
        fb.close()
    
        if '' in output:  output.remove('')
        output = [ (kl(line),line) for line in output]
        output.sort()
    
        fm = open(m_file,'w')
        fm.write(time.strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
        for t,line in output:
            fm.write(line+'\n')
        fm.close()
    
    te = time.clock()
    sorting_merge('A.txt','B.txt','C.txt')
    print time.clock()-te
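For readers on a current interpreter, here is a rough sketch of the same approach in modern Python 3. It is not the code that was timed above, and it differs in one respect that should be treated as an assumption: lines after the header that do not match the pattern are skipped rather than passed to the key function. The helper names `sort_key` and `read_rows` are invented.

```python
import re
import time

DATE_FORMAT = '%d/%m/%y %H:%M:%S'
DATE = r'[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
PAT = re.compile(DATE + r'(?=[ \t]+' + DATE + r'|[ \t]+aborted/deleted)')

def sort_key(line):
    """Epoch seconds of the date matched by PAT (the next-to-last field)."""
    return time.mktime(time.strptime(PAT.search(line).group(), DATE_FORMAT))

def read_rows(path):
    """Return (header_lines, set_of_data_rows) for one input file."""
    header, rows = [], set()
    with open(path) as handle:
        handle.readline()  # skip the "On <date>" line, as the answer does
        for line in handle:
            stripped = line.rstrip()
            if PAT.search(stripped):
                rows.add(stripped)   # the set removes duplicate rows
            elif not rows:
                header.append(line)  # still inside the heading
    return header, rows

def sorting_merge(o_file, c_file, m_file):
    """Merge the unique rows of two files, sorted by date, into m_file."""
    header, rows = read_rows(o_file)
    rows |= read_rows(c_file)[1]
    with open(m_file, 'w') as out:
        out.write(time.strftime('On ' + DATE_FORMAT + '\n') + ''.join(header))
        # Sort on (date, text) so rows with equal dates have a stable order.
        for line in sorted(rows, key=lambda row: (sort_key(row), row)):
            out.write(line + '\n')
```

A call such as sorting_merge('A.txt', 'B.txt', 'C.txt') mirrors the usage above; the key is still computed once per row, which is the point of the tuple-based sort in the answer.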
    