I have two files with the same number of columns, but a different


Asked: June 3, 2026 (2026-06-03T20:21:46+00:00)

I have two files with the same number of columns but a different number of rows. The first file is a list of timestamps and words; the second is a list of timestamps and the sounds that make up each of those words, i.e.:

9640 12783 she
12783 17103 had
...

and:

9640 11240 sh
11240 12783 iy
12783 14078 hv
14078 16157 ae
16157 16880 dcl
16880 17103 d
...

I want to merge these two files and create a list of entries with the word as one value, and the phonetic transcription as the other, i.e.,:

[['she', 'sh iy']
 ['had', 'hv ae dcl d']
  ...

I’m a complete Python (and programming) noob, but my original idea was to do this by searching the second file for the second field in the first file, and then appending them into a list. I tried doing it this way:

word = open('SA1.WRD','r')
phone = open('SA1.PHN','r')
word_phone = []

for line in word.readlines():
    words = line.split()
    word = words[2]
    word_phone.append(word)

for line in phone.readlines():
    phones = line.split()
    phone = phones[2]
    if int(phones[1]) <= int(words[1]):
        word_phone.append(phone)

print word_phone

This is the output:

['she', 'had', 'your', 'dark', 'suit', 'in', 'greasy', 'wash', 'water', 'all', 'year', 'sh', 'iy', 'hv', 'ae', 'dcl', 'd', 'y', 'er', 'dcl', 'd', 'aa', 'r', 'kcl', 'k', 's', 'uw', 'dx', 'ih', 'ng', 'gcl', 'g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'er', 'q', 'ao', 'l', 'y', 'iy', 'axr']

As I said, I’m a total noob, and some suggestions would be very helpful.

Update:
I’d like to revisit this question if possible. I’ve modified Lattyware’s code to operate on a directory:

phns = []
wrds = []
for root, dir, files in os.walk(sys.argv[1]):
    wrds = wrds + [ os.path.join( root, f ) for f in files if f.endswith( '.WRD' ) ]
    phns = phns + [ os.path.join( root, f ) for f in files if f.endswith( '.PHN' ) ]
phns.sort()
wrds.sort()
files = (zip(wrds,phns))

#OPEN THE WORD AND PHONE FILES, COMPARE THEM
output = []
for file in files:
    with open( file[0] ) as unsplit_words, open( file[1] ) as unsplit_sounds:
        sounds = (line.split() for line in unsplit_sounds)
        words = (line.split() for line in unsplit_words)
        output = output +  [
          (word, " ".join(sound for _, _, sound in
                    takeuntil(sounds, stop)))
                for start, stop, word in words
            ]

There is some information I would like to retain in the filepaths of these files. I was wondering how I might go about appending the split file path to the tuples in the list this code returns, e.g.,

[('she', 'sh iy', 'directory', 'subdirectory'), ('had', 'hv ae dcl d', 'directory', 'subdirectory')]

I figured I could split the paths and then zip the lists together, but the code above outputs a list of 53,000 items in total, while only 6,300 file pairs are being processed.

1 Answer

Editorial Team · Answer 1 · 2026-06-03T20:21:49+00:00

The main issue in this task is matching the sounds with the words. Fortunately, this is easy to do: we can simply keep taking sounds until one of them ends at the word's end time.

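To do this, we need a takeuntil() function. itertools.takewhile() (my original solution) unfortunately consumes one extra value from the iterator: the first item that fails the predicate is read and then discarded, so the first sound of the next word would be lost. A small custom generator avoids that.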

def takeuntil(iterable, stop):
    for x in iterable:
        yield x
        if x[1] == stop:
            break

with open("SA1.WRD") as unsplit_words, open("SA1.PHN") as unsplit_sounds:
    sounds = (line.split() for line in unsplit_sounds)
    words = (line.split() for line in unsplit_words)
    output = [
        (word, " ".join(sound for _, _, sound in takeuntil(sounds, stop)))
        for start, stop, word in words
    ]

print(output)

Gives us:

[('she', 'sh iy'), ('had', 'hv ae dcl d')]
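To make the takewhile() problem concrete, here is a minimal sketch using a few of the sample rows from the question as an in-memory list (the list `rows` and the lambda predicate are illustrative assumptions, not part of the solution above). It shows that takewhile() swallows 'hv', while takeuntil() leaves it for the next word:

```python
from itertools import takewhile

def takeuntil(iterable, stop):
    # Yield items up to and including the one whose end time equals `stop`.
    for x in iterable:
        yield x
        if x[1] == stop:
            break

# Sample rows from the question, already split into fields.
rows = [
    ("9640", "11240", "sh"),
    ("11240", "12783", "iy"),
    ("12783", "14078", "hv"),
    ("14078", "16157", "ae"),
]

# takewhile: stops when the predicate first fails, but the failing row
# ('hv') has already been pulled from the iterator and is thrown away.
sounds = iter(rows)
first_tw = [s for _, _, s in takewhile(lambda r: int(r[1]) <= 12783, sounds)]
leftover_tw = next(sounds)

# takeuntil: stops right after the matching row, so nothing is lost.
sounds = iter(rows)
first_tu = [s for _, _, s in takeuntil(sounds, "12783")]
leftover_tu = next(sounds)

print(first_tw, leftover_tw)  # the next available row is 'ae'; 'hv' is gone
print(first_tu, leftover_tu)  # the next available row is 'hv'
```

Both calls collect 'sh' and 'iy' for the first word; the difference only shows up in what is left on the shared iterator afterwards.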

This code uses the with statement for readability and closing the files (even on exceptions). It also makes a lot of use of list comprehensions and generator expressions.

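There are some bad patterns in your code. Using open() without the with statement is a bad idea, because the files are never explicitly closed. readlines() isn't needed either: loop directly over the file instead. That reads one line at a time rather than loading the whole file into a list, so it uses less memory, and it is nicer to read and less to type. Also note that your second loop only ever compares against words[1] from the last line of the word file (the value left over from the first loop), which is why the sounds all end up appended after the words instead of being grouped with them.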

So how does this work? Let’s run through it:

First we open both our files to read from, and throw in quick generator expressions to split the lines in the files.

Next comes a bit of a monster list comprehension. What we do in this is take sounds from our sounds iterable until we reach the last sound belonging to the word we are on, then move onto the next word, returning the word and the list of associated sounds. We then use str.join() to join the sounds into a single string.
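If you want to try this without the corpus files, here is a small sketch that feeds the sample lines from the question in through io.StringIO (the in-memory strings `wrd_text` and `phn_text` are stand-ins I made up for SA1.WRD and SA1.PHN). It uses a plain loop instead of the comprehension purely so that each group can be printed as it is built:

```python
import io

def takeuntil(iterable, stop):
    # Yield items up to and including the one whose end time equals `stop`.
    for x in iterable:
        yield x
        if x[1] == stop:
            break

# In-memory stand-ins for the two files (sample data from the question).
wrd_text = "9640 12783 she\n12783 17103 had\n"
phn_text = (
    "9640 11240 sh\n11240 12783 iy\n"
    "12783 14078 hv\n14078 16157 ae\n"
    "16157 16880 dcl\n16880 17103 d\n"
)

with io.StringIO(wrd_text) as unsplit_words, io.StringIO(phn_text) as unsplit_sounds:
    sounds = (line.split() for line in unsplit_sounds)
    words = (line.split() for line in unsplit_words)
    output = []
    for start, stop, word in words:
        # `sounds` is shared between iterations, so each call to takeuntil()
        # resumes exactly where the previous word's sounds ended.
        group = [sound for _, _, sound in takeuntil(sounds, stop)]
        print(word, "->", group)
        output.append((word, " ".join(group)))

print(output)
```

The key point is that `sounds` is a single generator that is consumed once; it is never rewound between words.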

If you have trouble following the thought process, here is an expanded version that works the same way, albeit usually somewhat slower because of the explicit Python-level loops (generator expressions and list comprehensions tend to have less overhead):

with open("SA1.WRD") as words, open("SA1.PHN") as sounds:
    output = []
    current = []
    for line in words:
        start, stop, word = line.split()
        for sound_line in sounds:
            sound_start, sound_stop, sound = sound_line.split()
            current.append(sound)
            if sound_stop == stop:
                break
        output.append((word, " ".join(current)))
        current = []

print(output)
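Regarding the update in the question (keeping directory information with each entry): one option is to split the path once per file pair and append the pieces to every tuple built from that pair, so there is no need to zip lists of different lengths. This is only a sketch under assumptions: `path_parts` and `pair_entries` are hypothetical helper names, and I am assuming the two components you want are the last two directory names of the .WRD path.

```python
import os

def takeuntil(iterable, stop):
    # Yield items up to and including the one whose end time equals `stop`.
    for x in iterable:
        yield x
        if x[1] == stop:
            break

def path_parts(path, count=2):
    # Hypothetical helper: the last `count` directory names of `path`.
    # Assumes the path contains at least `count` directories.
    directory = os.path.dirname(os.path.normpath(path))
    return tuple(directory.split(os.sep)[-count:])

def pair_entries(word_lines, sound_lines, wrd_path):
    # Hypothetical helper: word_lines and sound_lines can be any iterables
    # of text lines, so open file objects work as well as lists.
    extra = path_parts(wrd_path)
    sounds = (line.split() for line in sound_lines)
    words = (line.split() for line in word_lines)
    return [
        (word, " ".join(s for _, _, s in takeuntil(sounds, stop))) + extra
        for _, stop, word in words
    ]

# Sample data from the question; the path is made up for illustration.
entries = pair_entries(
    ["9640 12783 she", "12783 17103 had"],
    ["9640 11240 sh", "11240 12783 iy", "12783 14078 hv",
     "14078 16157 ae", "16157 16880 dcl", "16880 17103 d"],
    os.path.join("root", "directory", "subdirectory", "SA1.WRD"),
)
print(entries)
```

Inside the directory loop from the update, the opened files could be passed straight in, e.g. output.extend(pair_entries(unsplit_words, unsplit_sounds, file[0])), so every entry carries the path components of the file it came from.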
