I’m trying to build a randomized dataset based on an input dataset. The input

Question

Asked: June 8, 2026 (2026-06-08T12:23:41+00:00)

I’m trying to build a randomized dataset based on an input dataset.
The input dataset consists of 856471 lines, and in each line there is a pair of values separated by a tab.
NO entry from the randomized dataset can be equal to any of those in the input dataset, this means:

If the pair in line 1 is “Protein1 Protein2”, the randomized dataset cannot contain the following pairs:

“Protein1 Protein2”
“Protein2 Protein1”

In order to achieve this I tried the following:

data = infile.readlines()
ltotal = len(data)
for line in data:
    words = string.split(line)

init = 0
while init != ltotal:
    p1 = random.choice(words)
    p2 = random.choice(words)
    words.remove(p1)
    words.remove(p2)
    if "%s\t%s\n" % (p1, p2) not in data and "%s\t%s\n" % (p2, p1) not in data:
        outfile.write("%s\t%s\n" % (p1, p2))

However, I’m getting the following error:

Traceback (most recent call last):
  File "C:\Users\eduarte\Desktop\negcreator.py", line 46, in <module>
    convert(indir, outdir)
  File "C:\Users\eduarte\Desktop\negcreator.py", line 27, in convert
    p1 = random.choice(words)
  File "C:\Python27\lib\random.py", line 274, in choice
    return seq[int(self.random() * len(seq))]  # raises IndexError if seq is empty
IndexError: list index out of range

I was pretty sure this would work. What am I doing wrong?
Thanks in advance.


1 Answer

Editorial Team · Answer 1 · 2026-06-08T12:23:44+00:00

The variable words is overwritten on every iteration of this loop, so after the loop it holds only the words from the last line:

for line in data:
    words = string.split(line)

This is most probably not what you want.
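To make the difference concrete, here is a minimal sketch comparing rebinding with accumulating. The list data stands in for infile.readlines() and the protein names are made up for illustration. It uses line.split(), which behaves like string.split(line) here and also works on Python 3.

```python
# Stand-in for infile.readlines(); the names are illustrative only.
data = ["Protein1\tProtein2\n", "Protein3\tProtein4\n", "Protein5\tProtein6\n"]

# Rebinding: each iteration throws away the words of the previous line.
last_only = []
for line in data:
    last_only = line.split()

# Accumulating: every line contributes its words to one list.
all_words = []
for line in data:
    all_words.extend(line.split())

print(last_only)  # ['Protein5', 'Protein6']
print(all_words)  # ['Protein1', 'Protein2', 'Protein3', 'Protein4', 'Protein5', 'Protein6']
```

If the intent was to draw from all words in the file, the second form is probably the one you were after.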

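Moreover, your while loop never terminates on its own: init is never changed, so the condition init != ltotal stays true. Each iteration removes two entries from words, so the list is soon empty, and the next call to random.choice() raises the IndexError shown in your traceback.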
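The failure can be reproduced in isolation. The sketch below assumes words holds just the two entries of the last line (illustrative names), and draw_pair is a hypothetical helper, not part of your script, that stops drawing once fewer than two words remain.

```python
import random

# What `words` would hold after the for loop (illustrative names).
words = ["Protein5", "Protein6"]

def draw_pair(pool):
    """Remove and return two entries from pool, or None if fewer than two remain."""
    if len(pool) < 2:
        return None
    p1, p2 = random.sample(pool, 2)  # two different positions, unlike two choice() calls
    pool.remove(p1)
    pool.remove(p2)
    return p1, p2

first = draw_pair(words)   # consumes both words
second = draw_pair(words)  # nothing left, so the helper returns None

# Calling random.choice() on the now empty list fails, as in the traceback.
try:
    random.choice(words)
    raised = False
except IndexError:
    raised = True
```

A loop written as "while draw_pair(words) is not None" would end when the pool runs out instead of crashing.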

Edit: My guess is that you have a file of tab-separated word pairs, a pair in each line, and you are trying to form random pairs from all of the words, writing only those random pairs to the output file that do not occur in the original file. Here is some code doing this:

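import itertools
import random

# Read every pair as a frozenset so "A B" and "B A" count as the same pair.
with open("infile") as infile:
    pairs = set(frozenset(line.split()) for line in infile)
# Flatten the pairs into one list of words and shuffle it once.
words = list(itertools.chain.from_iterable(pairs))
random.shuffle(words)
with open("outfile", "w") as outfile:
    # Walk the shuffled list two words at a time (itertools.izip is Python 2;
    # on Python 3 use the built-in zip instead).
    for pair in itertools.izip(*[iter(words)] * 2):
        # Keep only pairs that do not occur in the input, in either order.
        if frozenset(pair) not in pairs:
            outfile.write("%s\t%s\n" % pair)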

Notes:

A pair of words is represented by a frozenset, since order does not matter.
I use a set for all the pairs to be able to test if a pair is in the set in constant time.
Instead of using random.choice() repeatedly, I shuffle the whole list once and then iterate over it in pairs. This way, we don’t need to remove the already used words from the list, so it’s much more efficient. (This change and the previous one bring down the algorithmic complexity of the approach from O(n²) to O(n).)
The expression itertools.izip(*[iter(words)] * 2) is a common Python idiom to iterate over words in pairs, in case you did not encounter that one yet.
The code is still untested.
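For anyone on Python 3, the same idea can be sketched as a function that works on a list of lines, which makes it easy to try without files. The name random_negative_pairs and the rng parameter are my own choices for illustration. Like the code above, it only rejects pairs found in the input and does not filter out a word that happens to be paired with itself.

```python
import random

def random_negative_pairs(lines, rng=random):
    """Return shuffled word pairs whose unordered form does not occur in lines."""
    # A frozenset per input pair makes (a, b) and (b, a) compare equal.
    known = {frozenset(line.split()) for line in lines if line.strip()}
    # Each word appears once for every known pair it belongs to.
    words = [word for pair in known for word in sorted(pair)]
    rng.shuffle(words)
    it = iter(words)
    # zip(it, it) pulls two items per step from the same iterator,
    # which is the pairwise idiom from the notes in its Python 3 form.
    return [(a, b) for a, b in zip(it, it) if frozenset((a, b)) not in known]

# Usage with made-up input lines and a seeded generator for repeatable runs.
sample = ["Protein1\tProtein2\n", "Protein3\tProtein4\n", "Protein5\tProtein6\n"]
result = random_negative_pairs(sample, random.Random(0))
for a, b in result:
    print("%s\t%s" % (a, b))
```

Because rejected pairs are simply dropped, the output can be shorter than the input; if you need exactly as many lines as the original file, you would have to reshuffle and retry.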
