I have two files with the same number of columns, but a different number of rows. One file is a list of timestamps and a list of words; the second file is a list of timestamps with a list of the sounds in each of the words, e.g.:
9640 12783 she
12783 17103 had
...
and:
9640 11240 sh
11240 12783 iy
12783 14078 hv
14078 16157 ae
16157 16880 dcl
16880 17103 d
...
I want to merge these two files and create a list of entries with the word as one value, and the phonetic transcription as the other, i.e.,:
[['she', 'sh iy'],
 ['had', 'hv ae dcl d'],
 ...]
I’m a complete Python (and programming) noob, but my original idea was to do this by searching the second file for the second field in the first file, and then appending them into a list. I tried doing it this way:
word = open('SA1.WRD','r')
phone = open('SA1.PHN','r')
word_phone = []
for line in word.readlines():
    words = line.split()
    word = words[2]
    word_phone.append(word)
for line in phone.readlines():
    phones = line.split()
    phone = phones[2]
    if int(phones[1]) <= int(words[1]):
        word_phone.append(phone)
print word_phone
This is the output:
['she', 'had', 'your', 'dark', 'suit', 'in', 'greasy', 'wash', 'water', 'all', 'year', 'sh', 'iy', 'hv', 'ae', 'dcl', 'd', 'y', 'er', 'dcl', 'd', 'aa', 'r', 'kcl', 'k', 's', 'uw', 'dx', 'ih', 'ng', 'gcl', 'g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'er', 'q', 'ao', 'l', 'y', 'iy', 'axr']
As I said, I’m a total noob, and some suggestions would be very helpful.
Update:
I’d like to revisit this question if possible. I’ve modified Lattyware’s code to operate on a directory:
phns = []
wrds = []
for root, dir, files in os.walk(sys.argv[1]):
    wrds = wrds + [ os.path.join( root, f ) for f in files if f.endswith( '.WRD' ) ]
    phns = phns + [ os.path.join( root, f ) for f in files if f.endswith( '.PHN' ) ]
phns.sort()
wrds.sort()
files = (zip(wrds,phns))
#OPEN THE WORD AND PHONE FILES, COMPARE THEM
output = []
for file in files:
    with open( file[0] ) as unsplit_words, open( file[1] ) as unsplit_sounds:
        sounds = (line.split() for line in unsplit_sounds)
        words = (line.split() for line in unsplit_words)
        output = output + [
            (word, " ".join(sound for _, _, sound in
                            takeuntil(sounds, stop)))
            for start, stop, word in words
        ]
There is some information I would like to retain in the filepaths of these files. I was wondering how I might go about appending the split file path to the tuples in the list this code returns, e.g.,
[('she', 'sh iy', 'directory', 'subdirectory'), ('had', 'hv ae dcl d', 'directory', 'subdirectory')]
I figured I could split the paths and then zip the lists together, but there are 53,000 total items in the list the code above outputs and only 6300 file pairs being processed.
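One possible way to keep the path information, sketched under assumptions: the helper names path_parts and tag_rows are hypothetical, and the "corpus" base directory stands in for sys.argv[1]. The idea is to attach the directory components inside the per-file loop, while the path is still known, rather than zipping afterwards (which cannot work, since each file pair contributes many rows).

```python
import os


def path_parts(path, base):
    """Return the directory components of `path` relative to `base`."""
    rel_dir = os.path.dirname(os.path.relpath(path, base))
    return tuple(part for part in rel_dir.split(os.sep) if part)


def tag_rows(rows, path, base):
    """Append the directory components of `path` to every (word, sounds) row."""
    parts = path_parts(path, base)
    return [row + parts for row in rows]


# Illustrative data: two rows that came from one (hypothetical) word file.
rows = [("she", "sh iy"), ("had", "hv ae dcl d")]
path = os.path.join("corpus", "directory", "subdirectory", "SA1.WRD")
tagged = tag_rows(rows, path, "corpus")
print(tagged)
# [('she', 'sh iy', 'directory', 'subdirectory'),
#  ('had', 'hv ae dcl d', 'directory', 'subdirectory')]
```

In the loop above, this would mean wrapping the list comprehension for each file, along the lines of output = output + tag_rows([...], file[0], sys.argv[1]).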
This is a task where the main issue is matching the sounds with the words. Fortunately, this is easy to do, as we can simply take all the sounds until they match the word’s end time.
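To do this, we must construct a takeuntil() function. itertools.takewhile() (my original solution) unfortunately consumes one extra value from the iterator, so this is the best solution. Applied to the example files, this approach gives us tuples such as ('she', 'sh iy') and ('had', 'hv ae dcl d').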
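A minimal sketch of what such a takeuntil() helper could look like. It assumes each item is a [start, stop, label] list of strings, as produced by line.split(), and that a word's last sound ends exactly at the word's end time (which holds for the sample data in the question). The helper body and the in-memory sample lists are illustrative. The second half shows why itertools.takewhile() falls short: it has to read one item past the boundary to see its predicate fail, and that item is then lost to the next word.

```python
from itertools import takewhile


def takeuntil(iterable, stop):
    """Yield items up to and including the one whose end time equals `stop`."""
    for item in iterable:
        yield item
        if item[1] == stop:
            break


# Sample rows from the question, already split into [start, stop, label].
words = [["9640", "12783", "she"], ["12783", "17103", "had"]]
sounds = [
    ["9640", "11240", "sh"], ["11240", "12783", "iy"],
    ["12783", "14078", "hv"], ["14078", "16157", "ae"],
    ["16157", "16880", "dcl"], ["16880", "17103", "d"],
]

sound_iter = iter(sounds)  # one shared iterator, consumed word by word
pairs = [
    (word, " ".join(s for _, _, s in takeuntil(sound_iter, stop)))
    for _, stop, word in words
]
print(pairs)  # [('she', 'sh iy'), ('had', 'hv ae dcl d')]

# takewhile() reads one sound past the word boundary to see the predicate
# fail, and that sound ("hv") is dropped from the shared iterator.
sound_iter = iter(sounds)
first = [s for _, _, s in takewhile(lambda row: int(row[1]) <= 12783, sound_iter)]
leftover = [s for _, _, s in sound_iter]
print(first)     # ['sh', 'iy']
print(leftover)  # ['ae', 'dcl', 'd']
```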
This code uses the with statement for readability and for closing the files (even on exceptions). It also makes a lot of use of list comprehensions and generator expressions.
There are some bad patterns in your code. Your use of open() without the with statement is a bad idea, and using readlines() isn’t needed: loop directly over the file instead. It’s lazy and therefore far more efficient in most cases, not to mention nicer to read and less to type.
So how does this work? Let’s run through it:
First we open both our files to read from, and throw in quick generator expressions to split the lines in the files.
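To illustrate this step in isolation, here is a small sketch. The file names SA1.WRD and SA1.PHN come from the question; the temporary directory and the shortened file contents are stand-ins so the example can run anywhere.

```python
import os
import tempfile

# Stand-in files written to a temporary directory (an assumption made
# only so this example is self-contained).
tmp = tempfile.mkdtemp()
word_path = os.path.join(tmp, "SA1.WRD")
sound_path = os.path.join(tmp, "SA1.PHN")
with open(word_path, "w") as f:
    f.write("9640 12783 she\n12783 17103 had\n")
with open(sound_path, "w") as f:
    f.write("9640 11240 sh\n11240 12783 iy\n")

# Both files are closed when the block exits, even if an exception is raised.
with open(word_path) as unsplit_words, open(sound_path) as unsplit_sounds:
    # Generator expressions: no line is read or split until we iterate.
    words = (line.split() for line in unsplit_words)
    sounds = (line.split() for line in unsplit_sounds)
    first_word = next(words)
    first_sound = next(sounds)

print(first_word)            # ['9640', '12783', 'she']
print(first_sound)           # ['9640', '11240', 'sh']
print(unsplit_words.closed)  # True
```

Because the generators read lazily, they have to be consumed inside the with block, before the files are closed.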
Next comes a bit of a monster list comprehension. What we do in this is take sounds from our sounds iterable until we reach the last sound belonging to the word we are on, then move onto the next word, returning the word and the list of associated sounds. We then use str.join() to join the sounds into a single string.
If you have trouble understanding the thought process, then here is an expanded version that works the same way, albeit much less efficiently due to the Python-side loops (generators and list comprehensions make the above far quicker):
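One way such an expanded version could look, as a sketch with plain loops. The function name pair_words_with_sounds and the in-memory sample lines are assumptions for illustration; open file objects could be passed in place of the lists.

```python
def pair_words_with_sounds(word_lines, sound_lines):
    """Pair each word with its sounds using explicit loops."""
    sounds = iter(sound_lines)  # shared across words, so we never rewind
    output = []
    for word_line in word_lines:
        _, stop, word = word_line.split()
        word_sounds = []
        for sound_line in sounds:
            _, sound_stop, sound = sound_line.split()
            word_sounds.append(sound)
            if sound_stop == stop:
                break  # last sound of this word; continue with the next word
        output.append((word, " ".join(word_sounds)))
    return output


# Sample lines from the question.
word_lines = ["9640 12783 she", "12783 17103 had"]
sound_lines = [
    "9640 11240 sh", "11240 12783 iy",
    "12783 14078 hv", "14078 16157 ae",
    "16157 16880 dcl", "16880 17103 d",
]
result = pair_words_with_sounds(word_lines, sound_lines)
print(result)  # [('she', 'sh iy'), ('had', 'hv ae dcl d')]
```

The inner loop resumes where the previous word left off because both loops draw from the same iterator, which is the same trick the list comprehension relies on.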