I am writing a multithreaded Python program, in which I am firing off worker threads to process a list of input data. The data they process then requires network operations, which effectively makes them I/O bound (so the GIL isn’t a problem for me).
I’m getting an issue where multiple worker threads are apparently receiving the same input, but I can’t figure out why. As far as I can tell, I’m not sharing any thread-unsafe data between threads.
I’ve created a minimized version of what I’m trying to do. This program exhibits the problem without doing any of the I/O or anything:
#!/usr/bin/env python
import threading
import logging
import time

logging.basicConfig(level=logging.DEBUG,
                    format="%(threadName)-10s %(levelname)-7s %(message)s")

sema = threading.Semaphore(10)

# keep track of already-visited data in worker threads
seen = []
seenlock = threading.Lock()

def see(num):
    try:
        logging.info("see: look at %d", num)
        with seenlock:
            if num in seen:
                # this should be unreachable if each thread processes a unique number
                logging.error("see: already saw %d", num)
            else:
                seen.append(num)
        time.sleep(0.3)
    finally:
        sema.release()

def main():
    # start at 1, so that the input number matches the log's "Thread-#"
    for i in xrange(1, 100):
        sema.acquire()  # prevent more than 10 simultaneous threads
        logging.info("process %d", i)
        threading.Thread(target=lambda: see(i)).start()

if __name__ == '__main__': main()
And some of the output:
MainThread INFO    process 1
MainThread INFO    process 2
Thread-1   INFO    see: look at 2
Thread-2   INFO    see: look at 2
MainThread INFO    process 3
Thread-2   ERROR   see: already saw 2
MainThread INFO    process 4
Thread-3   INFO    see: look at 4
Thread-4   INFO    see: look at 4
MainThread INFO    process 5
Thread-4   ERROR   see: already saw 4
Thread-5   INFO    see: look at 5
MainThread INFO    process 6
Thread-6   INFO    see: look at 6
MainThread INFO    process 7
Thread-7   INFO    see: look at 7
MainThread INFO    process 8
Thread-8   INFO    see: look at 8
MainThread INFO    process 9
MainThread INFO    process 10
The only possibly weird thing that I feel I am doing is to acquire a semaphore permit on a thread other than where it gets released, but semaphores should be thread-safe and unconcerned with who acquires and releases the permits, as long as there are the same number of each.
Confirmed on:
- Python 2.7.3 (Windows; build from python.org)
- Python 2.6.7 (Windows; cygwin dist)
- Python 2.6.6 (Linux; Debian dist)
What am I doing to cause my threads to share data?
This has nothing to do with threading. It has to do with the behavior of closures.
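A minimal sketch of that closure behavior, with no threads involved. The names funcs and late_bound are only illustrative:

```python
# Each lambda closes over the variable x itself, not over the value
# x happened to have when the lambda was created.
funcs = [lambda: x for x in range(10)]

# The loop has already finished here, so x is 9 for every call.
late_bound = [f() for f in funcs]
print(late_bound)  # [9, 9, 9, 9, 9, 9, 9, 9, 9, 9]
```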
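When you define a function and refer to a variable in the enclosing scope, the function does not store the variable's value at definition time; it looks the variable up in the enclosing scope at the time of function invocation. If, say, ten lambdas that each return x are created in a for loop over range(10) and are all invoked after the end of the loop, x == 9 for all 10 invocations. In your program, each lambda reads i only when its thread actually runs, and by then the main thread may already have moved on to the next number.
A simple way to fix the problem is to use a default value, which is evaluated once, when the lambda is defined. In short, change target=lambda: see(i) to something like target=lambda i=i: see(i).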
Or, better yet, use the full power of the Thread constructor, which accepts the function and its arguments separately, as in target=see, args=(i,) (thanks to Joel Cornett for reminding me).
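As a rough side-by-side sketch, the three spellings can be compared as below. This see() takes an extra label argument and records into a results dict, both of which are assumptions made for the demo and not part of the program in the question. The threads are deliberately started only after the loop ends, so the late-binding case shows up every time instead of depending on timing:

```python
import threading

results = {"closure": [], "default": [], "args": []}
lock = threading.Lock()

def see(kind, num):
    # Record which number each worker actually received.
    with lock:
        results[kind].append(num)

threads = []
for i in range(1, 6):
    # Buggy: the lambda looks i up when the thread runs, not now.
    threads.append(threading.Thread(target=lambda: see("closure", i)))
    # Fix 1: the default value is evaluated now, freezing the current i.
    threads.append(threading.Thread(target=lambda i=i: see("default", i)))
    # Fix 2: Thread stores the arguments and passes them to see() itself.
    threads.append(threading.Thread(target=see, args=("args", i)))

# Start everything only after the loop has finished.
for t in threads:
    t.start()
for t in threads:
    t.join()

for kind in ("closure", "default", "args"):
    print(kind, sorted(results[kind]))
```

The closure variant reports 5 five times, while both fixes report 1 through 5. The args form also avoids creating a lambda at all.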