This is so infuriating! >_<
I’ve written a huge, complicated Haskell library. I wrote a small test program, and so far I’ve spent about 8 hours trying to figure out why the hell it keeps crashing on me. Sometimes GHC complains about a “strange closure type”. Sometimes I just get a segfault. Clearly the problem is memory corruption.
The library itself is 100% pure Haskell. However, the test program uses several unsafe GHC primitives relating to arrays. This is obviously what’s causing the problem. Indeed, if I comment out the writeArray# line, the program stops crashing. But this is utterly frying my noodle… as best as I can tell, all the array bounds I’ve used are perfectly valid. The program prints them all out, and they’re all positive and less than the array size.
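One way to double-check the bounds claim is to swap the unchecked primitive temporarily for the bounds-checked writeArray from the array package, which raises an exception on a bad index instead of silently writing outside the array. A minimal sketch, assuming a 10-element IOUArray of Word8 (the array type, the size and the name buf are placeholders, not taken from the real test program):

```haskell
import Control.Exception (SomeException, try)
import Data.Array.IO (IOUArray, newArray, writeArray)
import Data.Word (Word8)

main :: IO ()
main = do
  -- Hypothetical stand-in for the pixel buffer: valid indices are 0..9.
  buf <- newArray (0, 9) 0 :: IO (IOUArray Int Word8)

  -- writeArray checks the index against the array bounds before writing.
  ok  <- try (writeArray buf 9 1)  :: IO (Either SomeException ())
  bad <- try (writeArray buf 10 1) :: IO (Either SomeException ())

  putStrLn (either (const "in-range write failed")
                   (const "in-range write ok") ok)
  putStrLn (either (const "out-of-range write rejected")
                   (const "out-of-range write accepted") bad)
```

If the real program never trips the check, bad indices are unlikely to be what corrupts memory.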
I wrote a second program that does the same thing as the first one, but without involving the huge, complex library. I’ve tried and tried and tried, but I can’t make it crash at all. Nothing I do seems to make it crash, and yet it does almost exactly the same thing with the actual arrays.
Does anybody have any further troubleshooting tips? Is there some way I can track down the exact moment when memory is getting corrupted? (Rather than just the moment when the system notices the corruption.)
Update:
What does the program do?
Well, essentially, it creates an array representing a pixel buffer. It spawns one thread that iterates over every pixel and writes the corresponding value into it. And it spawns a second thread that reads the array, and writes the pixels to a network socket using a fairly complicated protocol. (Hence the large library I’m trying to test.)
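The overall shape can be sketched roughly as follows. This is an illustration only, not the real test program: it uses the safe IOUArray API rather than the raw primitives, the buffer is tiny, and the "network" side is replaced by simply counting the pixels read. The names width, height and pixelIndex are made up for the example.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_)
import Data.Array.IO (IOUArray, newArray, readArray, writeArray)
import Data.Word (Word8)

-- Hypothetical buffer dimensions.
width, height :: Int
width  = 4
height = 3

-- Row-major index of pixel (x, y).
pixelIndex :: Int -> Int -> Int
pixelIndex x y = y * width + x

main :: IO ()
main = do
  buf <- newArray (0, width * height - 1) 0 :: IO (IOUArray Int Word8)
  writerDone <- newEmptyMVar
  readerDone <- newEmptyMVar

  -- Writer thread: visits every pixel and stores a value in it.
  _ <- forkIO $ do
    forM_ [0 .. height - 1] $ \y ->
      forM_ [0 .. width - 1] $ \x ->
        writeArray buf (pixelIndex x y) (fromIntegral (x + y))
    putMVar writerDone ()

  -- Reader thread: reads the buffer concurrently with the writer.
  -- The real program would send the pixels over a socket here.
  _ <- forkIO $ do
    pixels <- mapM (readArray buf) [0 .. width * height - 1]
    putMVar readerDone (length pixels)

  takeMVar writerDone
  n <- takeMVar readerDone
  putStrLn ("reader saw " ++ show n ++ " pixels")
```

The pixel values the reader sees depend on thread timing, so only the count is reported.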
If I don’t spawn the writer thread, the crash goes away. If I comment out the writeArray# call in the writer thread, the crash goes away. Before writing each pixel, the writer thread prints out the pixel coordinates and the array index. Everything it prints out looks perfectly A-OK. And yet… it will not stop crashing.
I almost wonder if GHC’s array primitives aren’t thread-safe or something. (In case it makes any difference, the copy of the array that the reader thread looks at has been unsafe-frozen, while the writer thread continues to mutate it concurrently.)
However, I’ve written a program that does the exact same thing, but without sending traffic over the network. This program works perfectly in every detail. It’s only the really complicated program that won’t work. How annoying is that?!
This works: http://hpaste.org/70987
This does not: http://hpaste.org/70988
Might be fixed:
I changed something, and the code doesn’t crash any more. That might be just a fluke, or I might have “really” fixed the problem. It’s hard to say.
Hypothesis:
The problem appears to be accessing the mutable and immutable versions of the same array from concurrent threads: the writer mutates one while the reader inspects the other. (Even though the simplified test does exactly this and doesn’t crash.)
I made the network thread read from an unrelated immutable array, and the crashing stopped. I even added a loop to copy the data from one mutable array to another, fresh, mutable array, and the fresh one then gets frozen and inspected. This appears to work perfectly.
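For illustration, the copy-then-freeze idea can be sketched with the array package's freeze, which copies the contents into a new immutable array, so later writes to the mutable buffer cannot touch the snapshot. This is a stand-in for the hand-written copy loop described above, not the actual code; the array type, the size and the names buf and snap are assumptions.

```haskell
import Data.Array.IO (IOUArray, freeze, newArray, writeArray)
import Data.Array.Unboxed (UArray, (!))
import Data.Word (Word8)

main :: IO ()
main = do
  -- Hypothetical mutable pixel buffer.
  buf <- newArray (0, 9) 0 :: IO (IOUArray Int Word8)
  writeArray buf 3 42

  -- freeze copies the data, so snap shares no memory with buf.
  snap <- freeze buf :: IO (UArray Int Word8)

  -- The writer can keep mutating buf without affecting the snapshot.
  writeArray buf 3 99

  print (snap ! 3)  -- prints 42, not 99
```

The price is one full copy of the buffer per snapshot, which is what the unsafe freeze was avoiding in the first place.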
So it appears it’s just a glitch in GHC’s handling of concurrent accesses to both versions of the same array.
(Either that, or it’s a fluke. A few times I’ve changed something in the program and it’s stopped crashing, and then started crashing again…)
Update: This appears to be completely fixed. I haven’t had any more crashes since I made this alteration. Thanks to all the people who helped me out here. 🙂