I have a multithreaded .NET Windows Service that hangs intermittently, maybe once every two weeks of 24/7 operation. When a hang occurs, the ThreadPool is completely saturated because calls to our custom TraceListener start blocking for some reason. There are no locks in the offending code, and WinDbg doesn't show anything blocking, but the threads are definitely stuck somewhere. There aren't any exceptions on the stack either. There is a Thread.Sleep(1) that is occasionally hit in the BufferedStream.Write code, but my question is: what do the ReOpenMetaDataWithMemory, CreateApplicationContext, and DllCanUnloadNowInternal frames mean?
Nearly all of the 2000 hung worker threads in the ThreadPool (definitely not normal operation!) have a stack similar to the following:
    0:027> !dumpstack
    OS Thread Id: 0x1638 (27)
    Child-SP         RetAddr          Call Site
    000000001d34df58 0000000077d705d6 ntdll!ZwDelayExecution+0xa
    000000001d34df60 000006427f88901d kernel32!SleepEx+0x96
    000000001d34e000 000006427f454379 mscorwks!DllCanUnloadNowInternal+0xf53d
    000000001d34e080 000006427fa34749 mscorwks!CreateApplicationContext+0x41d
    000000001d34e0e0 0000064280184902 mscorwks!ReOpenMetaDataWithMemory+0x1ff59
    000000001d34e290 0000064280184532 Company_Common_Diagnostics!Company.Common.Diagnostics.BufferedStream.Write(Byte[], Int32, Int32)+0x1b2
    000000001d34e300 00000642801831fd Company_Common_Diagnostics!Company.Common.Diagnostics.XmlRollingTraceListener+TraceWriter.Write(System.String)+0x52
    000000001d34e350 00000642801b3304 Company_Common_Diagnostics!Company.Common.Diagnostics.XmlRollingTraceListener.InternalWrite(System.Text.StringBuilder)+0x3d
    000000001d34e390 0000064274e9d7ec Company_Common_Diagnostics!Company.Common.Diagnostics.XmlRollingTraceListener.TraceTransfer(System.Diagnostics.TraceEventCache, System.String, Int32, System.String, System.Guid)+0xc4
    000000001d34e410 00000642801b2f59 System_ni!System.Diagnostics.TraceSource.TraceTransfer(Int32, System.String, System.Guid)+0x2ec
I believe I've figured it out. I dug into the BufferedStream and saw that it was in a state where every call into the TraceListener got stuck in a Thread.Sleep(1) loop. I hope this is the fix, because I can't for the life of me reproduce the issue.
I had useGlobalLock="false" and autoflush="true" in the trace configuration. The listener's Flush method was not thread-safe, and because the listener buffers its data, a Flush running concurrently with writes would occasionally leave the TraceListener in a bad state. With autoflush enabled, Flush is called on the listeners after every write, and with useGlobalLock disabled the global trace lock doesn't serialize those calls, so flushes and writes could overlap. The fix was simply to set autoflush="false". I can't believe I didn't catch this sooner.
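For reference, here is a minimal sketch of what the changed setting might look like in app.config, assuming the settings live in the standard `<trace>` element under `<system.diagnostics>`. The listener and trace source registrations are omitted because they aren't shown here. The same flag can also be set in code through the static `Trace.AutoFlush` property.

```xml
<configuration>
  <system.diagnostics>
    <!-- autoflush="false": listeners are no longer flushed after every write,
         so trace calls stop issuing a Flush alongside each Write.
         useGlobalLock="false" is left as it was in the original configuration. -->
    <trace autoflush="false" useGlobalLock="false" />
  </system.diagnostics>
</configuration>
```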