I am trying to debug a device driver which apparently causes other
tasks to hang. It is not deterministic which task will hang or at what
time.
Basically I get an error message from the kernel saying that “task has
been blocked for more than 120 seconds”, along with a stack trace.
The hung task varies from sendmail to mkfs to pdflush (a kernel thread),
and the topmost function in the stack trace varies from “getnstimeofday”
to “bio_submit” to “mark_locks_held”.
I am having a hard time debugging this because the problem is very hard
to locate. The stack traces provided by the kernel are not very helpful
either. According to those traces, some of the hung processes
are not even trying to grab a lock (as in the getnstimeofday
function), and I have no idea why they hang.
So I am wondering if anyone has an idea of how to debug such a
problem. Would kgdb be useful here, maybe by showing me exactly where
the process hangs and what kind of lock it is waiting for?
Any suggestions are appreciated.
When frame pointers are not enabled in the kernel, stack traces are not reliable, and that is what is confusing you. Without frame pointers, the kernel resorts to scanning the entire stack for values that might be pointers into kernel code (i.e. potential return addresses). This means that functions whose calls have already returned might still be printed.
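For example, suppose one function calls a second function that does some work and returns, and then calls a third function in which the task blocks. The return address left behind by the second function can still be sitting on the stack, so the dump may list it alongside the functions that are actually on the call stack.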
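As a hypothetical illustration of that situation, here is a minimal userspace sketch. All names (func_a, func_b, func_c, helper, call_log) are invented for this example and are not taken from the driver in question. func_b calls helper and returns before func_c runs, so func_b is no longer on the call stack when func_c blocks, yet the return address that func_b pushed may not have been overwritten.

```c
#include <assert.h>
#include <string.h>

char call_log[64];  /* hypothetical log so the call order can be inspected */

static void note(const char *s)
{
    /* Refuse to overflow the log buffer. */
    assert(strlen(call_log) + strlen(s) < sizeof(call_log));
    strcat(call_log, s);
}

void helper(void) { note("helper "); }

/* Calling helper() pushes a return address that points into func_b. */
void func_b(void) { note("func_b "); helper(); }

/* Stand-in for the function in which the task ends up blocked. */
void func_c(void) { note("func_c "); }

void func_a(void)
{
    note("func_a ");
    func_b();   /* returns, but the stack words it wrote are not erased */
    func_c();   /* reuses the same stack region; imagine the hang is here */
}
```

After a call to func_a(), the log shows func_b finishing before func_c starts. A scan of raw stack memory taken while func_c is blocked could nevertheless still find an address inside func_b.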
As for how to debug your problem, I have two suggestions:
1. Knowing that some of the debug output is bad, use your own knowledge of the code to figure out the real call stack. It might help to look for the functions that are common across multiple stack dumps.
2. Recompile the kernel with frame pointers enabled (CONFIG_FRAME_POINTER).
With frame pointers, the kernel will still print every value that looks like a return address, but it will flag the unreliable addresses with a “?”, so you can tell which entries belong to the real call chain.
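To make the “?” marking concrete, here is a small userspace simulation of the idea. It is a sketch, not the kernel's actual unwinder: the symbol table, the addresses, the stack layout (a saved frame pointer followed by a return address) and the names scan_stack, sample_stack and NO_FRAME are all assumptions made for illustration. Every stack word that falls inside a known function is reported, and when frame pointers are used, only words sitting in the return-address slot of a frame on the chain are trusted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical "kernel text" layout: each function occupies 0x100 bytes. */
struct sym { unsigned long start; const char *name; };
static const struct sym symtab[] = {
    { 0x1000, "func_a" }, { 0x1100, "func_b" }, { 0x1200, "func_c" },
};
#define NSYMS     (sizeof(symtab) / sizeof(symtab[0]))
#define FUNC_SIZE 0x100UL
#define NO_FRAME  ((unsigned long)-1)   /* marks the end of the frame chain */

static const struct sym *lookup(unsigned long addr)
{
    for (size_t i = 0; i < NSYMS; i++)
        if (addr >= symtab[i].start && addr < symtab[i].start + FUNC_SIZE)
            return &symtab[i];
    return NULL;
}

/* Hypothetical snapshot of a task blocked below func_c, called from func_a. */
const unsigned long sample_stack[] = {
    3,          /* [0] saved frame pointer: next frame is at index 3      */
    0x1210,     /* [1] return address into func_c (live)                  */
    0x1140,     /* [2] stale leftover from an earlier call made by func_b */
    NO_FRAME,   /* [3] saved frame pointer: end of chain                  */
    0x1020,     /* [4] return address into func_a (live)                  */
    0x2a,       /* [5] plain data, not a code address                     */
};

/*
 * Scan every word of the simulated stack and append "name+0xoff" lines to
 * 'out'. With use_fp set, a hit is trusted only if it sits at fp + 1 for
 * the current frame; every other hit is prefixed with "? ".
 * Returns the number of entries written.
 */
int scan_stack(const unsigned long *stack, size_t nwords, unsigned long fp,
               int use_fp, char *out, size_t outlen)
{
    int count = 0;
    size_t used = 0;

    assert(outlen > 0);
    out[0] = '\0';
    for (size_t i = 0; i < nwords; i++) {
        const struct sym *s = lookup(stack[i]);
        int reliable = 1;

        if (!s)
            continue;               /* ordinary data, skip it */
        if (use_fp) {
            reliable = (fp != NO_FRAME && i == fp + 1);
            if (reliable)
                fp = stack[fp];     /* follow the chain to the caller's frame */
        }
        int n = snprintf(out + used, outlen - used, "%s%s+0x%lx\n",
                         reliable ? "" : "? ", s->name, stack[i] - s->start);
        assert(n > 0 && (size_t)n < outlen - used);
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Calling scan_stack(sample_stack, 6, 0, 0, buf, sizeof(buf)) lists func_c, func_b and func_a with nothing to tell them apart, which is the confusing case. Passing 1 for use_fp yields the same three entries, but the func_b line is prefixed with “? ” because it is not on the frame chain.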