I’m trying to build a randomized dataset based on an input dataset.
The input dataset consists of 856471 lines, and in each line there is a pair of values separated by a tab.
No entry in the randomized dataset may be equal to any entry in the input dataset, in either order. For example,
if the pair in line 1 is “Protein1 Protein2”, the randomized dataset cannot contain either of the following pairs:
- “Protein1 Protein2”
- “Protein2 Protein1”
In order to achieve this I tried the following:
    data = infile.readlines()
    ltotal = len(data)
    for line in data:
        words = string.split(line)
    init = 0
    while init != ltotal:
        p1 = random.choice(words)
        p2 = random.choice(words)
        words.remove(p1)
        words.remove(p2)
        if "%s\t%s\n" % (p1, p2) not in data and "%s\t%s\n" % (p2, p1) not in data:
            outfile.write("%s\t%s\n" % (p1, p2))
However, I’m getting the following error:
    Traceback (most recent call last):
      File "C:\Users\eduarte\Desktop\negcreator.py", line 46, in <module>
        convert(indir, outdir)
      File "C:\Users\eduarte\Desktop\negcreator.py", line 27, in convert
        p1 = random.choice(words)
      File "C:\Python27\lib\random.py", line 274, in choice
        return seq[int(self.random() * len(seq))]  # raises IndexError if seq is empty
    IndexError: list index out of range
I was pretty sure this would work. What am I doing wrong?
Thanks in advance.
The variable `words` is overwritten for each line in the loop. This is most probably not what you want.

Moreover, your `while` loop is an infinite loop, which will consume `words` eventually, leaving no choices for `random.choice()`.

Edit: My guess is that you have a file of tab-separated word pairs, a pair in each line, and you are trying to form random pairs from all of the words, writing only those random pairs to the output file that do not occur in the original file. Here is some code doing this:
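A minimal sketch of that approach, written to match the notes below rather than taken from a tested program. The function name `randomized_pairs`, its `rng` parameter, and the in-memory list of lines are assumptions made for illustration. The `izip` fallback is only there so the same code also runs on Python 3, where the built-in `zip` is already lazy.

```python
import random

try:
    from itertools import izip  # Python 2, as used in the question
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy


def randomized_pairs(lines, rng=random):
    """Return shuffled word pairs that do not occur in `lines` in either order.

    `lines` is a list of tab-separated pairs, one pair per line.
    `rng` is anything with a shuffle() method (hypothetical parameter,
    added so the result can be made reproducible).
    """
    # Skip blank lines such as a trailing newline at the end of the file.
    lines = [line for line in lines if line.strip()]

    # Each input pair becomes a frozenset, so (a, b) and (b, a) compare equal,
    # and the surrounding set gives constant-time membership tests.
    original = set(frozenset(line.split()) for line in lines)

    # Collect every word, shuffle once, then walk the list two items at a time.
    words = [word for line in lines for word in line.split()]
    rng.shuffle(words)

    result = []
    for p1, p2 in izip(*[iter(words)] * 2):
        if frozenset((p1, p2)) not in original:
            result.append((p1, p2))
    return result
```

To use it on files, pass `infile.readlines()` and write each returned pair with `outfile.write("%s\t%s\n" % pair)`. Note that this sketch does not filter out repeated output pairs or pairs made of the same word twice; add such checks if they matter for your data.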
Notes:
- A pair of words is represented by a `frozenset`, since order does not matter.
- I use a `set` for all the pairs to be able to test if a pair is in the set in constant time.
- Instead of using `random.choice()` repeatedly, I only shuffle the whole list once, and then iterate over it in pairs. This way, we don’t need to remove the already used words from the list, so it’s much more efficient. (This change and the previous one bring down the algorithmic complexity of the approach from O(n²) to O(n).)
- The expression `itertools.izip(*[iter(words)] * 2)` is a common Python idiom to iterate over `words` in pairs, in case you did not encounter that one yet.

The code is still untested.
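The two idioms from the notes can be tried in isolation. The snippet below is only an illustration: `in_pairs` is a hypothetical helper name, and it uses the built-in `zip` instead of `itertools.izip` so that it runs on both Python 2 and Python 3.

```python
# frozenset ignores order, so both orientations of a pair compare equal.
seen = set([frozenset(["Protein1", "Protein2"])])
forward = frozenset(["Protein1", "Protein2"]) in seen
backward = frozenset(["Protein2", "Protein1"]) in seen


def in_pairs(items):
    """Return consecutive, non-overlapping pairs; a trailing odd item is dropped."""
    it = iter(items)
    # The same iterator object appears twice in the argument list,
    # so zip pulls two consecutive items from it for every tuple.
    return zip(*[it] * 2)


pairs = list(in_pairs(["a", "b", "c", "d", "e"]))
# pairs == [("a", "b"), ("c", "d")]
```

The trick works because `[it] * 2` repeats a reference to one iterator rather than copying it. With a list in place of the iterator, `zip` would pair every item with itself.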