This question has been asked here in one form or another but not quite

Question

0

Asked: May 20, 20262026-05-20T21:53:30+00:00 2026-05-20T21:53:30+00:00

This question has been asked here in one form or another but not quite

0

This question has been asked here in one form or another but not quite the thing I’m looking for. So, this is the situation I shall be having: I already have one file, named file_a and I’m creating another file – file_b. file_a is always bigger than file_b in size. There will be a number of duplicate lines in file_b (hence, in file_a as well) but both the files will have some unique lines. What I want to do is: to copy/merge only the unique lines from file_a to file_b and then sort the line order, so that the file_b becomes the most up-to-date one with all the unique entries. Either of the original files shouldn’t be more than 10MB in size. What’s the most efficient (and fastest) way I can do that?

I was thinking something like that, which does the merging alright.

#!/usr/bin/env python

import os, time, sys

# Convert Date/time to epoch
def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

# input files
o_file = "file_a"
c_file = "file_b"
n_file = [o_file,c_file]

m_file = "merged.file"

for x in range(len(n_file)):
    P = open(n_file[x],"r")
    output = P.readlines()
    P.close()

    # Sort the output, order by 2nd last field
    #sp_lines = [ line.split('\t') for line in output ]
    #sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )

    F = open(m_file,'w') 
    #for line in sp_lines:
    for line in output:
        if "group_" in line:
            F.write(line)
    F.close()

But, it’s:

not with only the unique lines
not sorted (by next to last field)
and introduces the 3rd file i.e. m_file

Just a side note (long story short): I can’t use sorted() here as I’m using v2.3, unfortunately. The input files look like this:

On 23/03/11 00:40:03
JobID   Group.User          Ctime   Wtime   Status  QDate               CDate
===================================================================================
430792  group_atlas.pltatl16    0   32  4   02/03/11 21:52:38   02/03/11 22:02:15
430793  group_atlas.atlas084    30  472 4   02/03/11 21:57:43   02/03/11 22:09:35
430794  group_atlas.atlas084    12  181 4   02/03/11 22:02:37   02/03/11 22:05:42
430796  group_atlas.atlas084    8   185 4   02/03/11 22:02:38   02/03/11 22:05:46

I tried to use cmp() to sort by the 2nd last field but, I think, it doesn’t work just because of the first 3 lines of the input files.

Can anyone please help? Cheers!!!

Update 1:

For the future reference, as suggested by Jakob, here is the complete script. It worked just fine.

#!/usr/bin/env python

import os, time, sys
from sets import Set as set

def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    #I want to discard the headers
    for i in xrange(3):
        fileobj.readline()
    #
    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

# Input files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

print time.strftime('%H:%M:%S', time.localtime())

# Sorting the output, order by 2nd last field
sp_lines = [ line.split('\t') for line in app(o_file, c_file) ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )

F = open(m_file,'w')
print "No. of lines: ",len(sp_lines)

for line in sp_lines:

    MF = '\t'.join(line)
    F.write(MF)
F.close()

It took about 2m:47s to finish for 145244 lines.

[testac1@serv07 ~]$ ./uniq-merge.py 
17:19:21
No. of lines:  145244
17:22:08

thanks!!

Update 2:

Hi eyquem, this is the Error message I get when I run your script(s).

From the first script:

[testac1@serv07 ~]$ ./uniq-merge_2.py 
  File "./uniq-merge_2.py", line 44
    fm.writelines( '\n'.join(v)+'\n' for k,v in output )
                                       ^
SyntaxError: invalid syntax

From the second script:

[testac1@serv07 ~]$ ./uniq-merge_3.py 
  File "./uniq-merge_3.py", line 24
    output = sett(line.rstrip() for line in fa)
                                  ^
SyntaxError: invalid syntax

Cheers!!

Update 3:

The previous one wasn’t sorting the list at all. Thanks to eyquem to pointing that out. Well, it does now. This is a further modification to Jakob’s version – I converted the set:app(path1, path2) to a list:myList() and then applied the sort( lambda … ) to the myList to sort the merged file by the nest to last field. This is the final script.

#!/usr/bin/env python

import os, time, sys
from sets import Set as set

def toEpoch(dt):
    # Convert date/time to epoch
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    # Discard the headers (1st 3 lines)
    for i in xrange(3):
        fileobj.readline()

    for line in fileobj:
        yield line

def app(path1, path2):
    # Remove duplicate lines
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

print time.strftime('%H:%M:%S', time.localtime())

# I/O files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

# Convert set into to list
myList = list(app(o_file, c_file))

# Sort the list by the date
sp_lines = [ line.split('\t') for line in myList ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]),toEpoch(b[-2])) )

F = open(m_file,'w')
print "No. of lines: ",len(sp_lines)

# Finally write to the outFile
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()

There is no speed boost at all, it took 2m:50s to process the same 145244 lines. Is anyone see any scope of improvement, please let me know. Thanks to Jakob and eyquem for their time. Cheers!!

Update 4:

Just for future reference, this is a modified version of eyguem, which works much better and faster then the previous ones.

#!/usr/bin/env python

import os, sys, re
from sets import Set as sett
from time import mktime, strptime, strftime

def sorting_merge(o_file, c_file, m_file ):

    # RegEx for Date/time filed
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')

    def kl(lines,pat = pat):
        # match only the next to last field
        line = lines.split('\t')
        line = line[-2]
        return mktime(strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

    output = sett()
    head = []

    # Separate the header & remove the duplicates
    def rmHead(f_n):
        f_n.readline()
        for line1 in f_n:
            if pat.search(line1):  break
            else:  head.append(line1) # line of the header
        for line in f_n:
            output.add(line.rstrip())
        output.add(line1.rstrip())
        f_n.close()

    fa = open(o_file, 'r')
    rmHead(fa)

    fb = open(c_file, 'r')
    rmHead(fb)

    # Sorting date-wise
    output = [ (kl(line),line.rstrip()) for line in output if line.rstrip() ]
    output.sort()

    fm = open(m_file,'w')
    # Write to the file & add the header
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head[0]+head[1])))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()


c_f = "03_a"
o_f = "03_b"

sorting_merge(o_f, c_f, 'outfile.txt')

This version is much faster – 6.99 sec. for 145244 lines compare to the 2m:47s – then the previous one using lambda a, b: cmp(). Thanks to eyquem for all his support. Cheers!!

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-20T21:53:30+00:00

EDIT 2

My previous codes have problems with output = sett(line.rstrip() for line in fa) and output.sort(key=kl)

Moreover, they have some complications.

So I examined the choice of reading the files directly with a set() function taken by Jakob Bowyer in his code.

Congratulations Jakob ! (and Michal Chruszcz by the way) : set() is unbeatable, it’s faster than a reading one line at a time.

Then , I abandonned my idea to read the files line after line.

.

But I kept my idea to avoid a sorting with the help of cmp() function because, as it is described in the doc:

s.sort([cmpfunc=None])

The sort() method takes an optional
argument specifying a comparison
function of two arguments (list items)
(…) Note that this slows the sorting
process down considerably

http://docs.python.org/release/2.3/lib/typesseq-mutable.html

Then, I managed to obtain a list of tuples (t,line) in which the t is

time.mktime(time.strptime(( 1st date-and-hour in line ,'%d/%m/%y %H:%M:%S'))

by the instruction

output = [ (kl(line),line.rstrip()) for line in output]

.

I tested 2 codes. The following one in which 1st date-and-hour in line is computed thanks to a regex:

def kl(line,pat = pat):
    return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

output = [ (kl(line),line.rstrip()) for line in output if line.rstrip()]

output.sort()

And a second code in which kl() is:

def kl(line,pat = pat):
    return time.mktime(time.strptime(line.split('\t')[-2],'%d/%m/%y %H:%M:%S'))

.

The results are

Times of execution:

0.03598 seconds for the first code with regex

0.03580 seconds for the second code with split(‘\t’)

that is to say the same

This algorithm is faster than a code using a function cmp() :

a code in which the set of lines output isn’t transformed in a list of tuples by

output = [ (kl(line),line.rstrip()) for line in output]

but is only transformed in a list of the lines (without duplicates, then) and sorted with a function mycmp() (see the doc):

def mycmp(a,b):
    return cmp(time.mktime(time.strptime(a.split('\t')[-2],'%d/%m/%y %H:%M:%S')),
               time.mktime(time.strptime(b.split('\t')[-2],'%d/%m/%y %H:%M:%S')))

output = [ line.rstrip() for line in output] # not list(output) , to avoid the problem of newline of the last line of each file
output.sort(mycmp)

for line in output:
    fm.write(line+'\n')

has an execution time of

0.11574 seconds

.

The code:

#!/usr/bin/env python

import os, time, sys, re
from sets import Set as sett

def sorting_merge(o_file , c_file, m_file ):

    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d)') 

    def kl(line,pat = pat):
        return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

    output = sett()
    head = []

    fa = open(o_file)
    fa.readline() # first line is skipped
    while True:
        line1 = fa.readline()
        mat1  = pat.search(line1)
        if not mat1: head.append(line1) # line1 is here a line of the header
        else: break # the loop ends on the first line1 not being a line of the heading
    output = sett( fa )
    fa.close()

    fb = open(c_file)
    while True:
        line1 = fb.readline()
        if pat.search(line1):  break
    output = output.union(sett( fb ))
    fb.close()

    output = [ (kl(line),line.rstrip()) for line in output]
    output.sort()

    fm = open(m_file,'w')
    fm.write(time.strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line + '\n')
    fm.close()


te = time.clock()
sorting_merge('ytre.txt','tataye.txt','merged.file.txt')
print time.clock()-te

This time, I hope it will run correctly, and that the only thing to do is to wait the times of execution on real files much bigger than the ones on which I tested the codes

.

EDIT 3

pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                 '(?=[ \t]+'
                 '[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                 '|'
                 '[ \t]+aborted/deleted)')

.

EDIT 4

#!/usr/bin/env python

import os, time, sys, re
from sets import Set

def sorting_merge(o_file , c_file, m_file ):

    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '(?=[ \t]+'
                     '[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d'
                     '|'
                     '[ \t]+aborted/deleted)')

    def kl(line,pat = pat):
        return time.mktime(time.strptime((pat.search(line).group()),'%d/%m/%y %H:%M:%S'))

    head = []
    output = Set()

    fa = open(o_file)
    fa.readline() # first line is skipped
    for line1 in fa:
        if pat.search(line1):  break # first line after the heading
        else:  head.append(line1) # line of the header
    for line in fa:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fa.close()

    fb = open(c_file)
    for line1 in fb:
        if pat.search(line1):  break
    for line in fb:
        output.add(line.rstrip())
    output.add(line1.rstrip())
    fb.close()

    if '' in output:  output.remove('')
    output = [ (kl(line),line) for line in output]
    output.sort()

    fm = open(m_file,'w')
    fm.write(time.strftime('On %d/%m/%y %H:%M:%S\n')+(''.join(head)))
    for t,line in output:
        fm.write(line+'\n')
    fm.close()

te = time.clock()
sorting_merge('A.txt','B.txt','C.txt')
print time.clock()-te

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

This question has been asked here in one form or another but not quite

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply