If a program corrupts its instruction path and/or its data in memory, the OS halts it with some message, because the program runs in a ‘virtual machine’-like space provided by the OS and can no longer determine its next instruction.
The OS, in turn, is also a program. It shares the machine's resources like any other program and can halt in a similar fashion, but sometimes it’s healthy enough to display some debugging info and a blue screen. So as a programmer I’m thinking: if I can do that (emit debugging info and make the screen blue), why wouldn’t I be able to try to recover the OS altogether instead of requiring a cold reboot? After all, it’s the OS. It’s supposed to be the rock-solid foundation of all software (not talking about Windows, of course). If the space shuttle ran Windows, what would happen? It wouldn’t recover? :)
So: is it only that MS hasn’t tried everything to recover to the point where a reboot isn’t required, or is there some deeper problem that has stopped companies like MS from doing that?
You can’t recover the OS for the same reason a user-space program can’t recover: when certain types of errors are seen, it means the program is in an undefined state and therefore can’t recover. Even if the problem isn’t immediately fatal (i.e. it doesn’t cause the program to die on the spot), it’s not safe to continue, because things are, or are likely to be, corrupted.
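To make the "undefined state" point concrete, here is a minimal sketch in Python. It is a toy and not how any real kernel is written: `KernelPanic`, `check_free_list`, and the free-list layout are all made up for illustration. The idea is that once a consistency check fails, the code has no way of knowing what else is damaged, so the only safe move is to stop.

```python
class KernelPanic(Exception):
    """Hypothetical stand-in for a blue screen or kernel panic."""


def check_free_list(next_block, head, total_blocks):
    """Walk a toy free list and halt if it is inconsistent.

    next_block[i] is the index of the block after block i, or -1 at the end.
    Returns the number of blocks on the list when it is healthy.
    """
    seen = set()
    current = head
    while current != -1:
        if not 0 <= current < total_blocks:
            raise KernelPanic(f"link points outside the table: {current}")
        if current in seen:
            raise KernelPanic(f"cycle detected at block {current}")
        seen.add(current)
        current = next_block[current]
    return len(seen)


healthy = [1, 2, 3, -1]     # 0 -> 1 -> 2 -> 3 -> end
corrupted = [1, 2, 0, -1]   # a stray write made block 2 point back to 0

print(check_free_list(healthy, 0, 4))   # 4

try:
    check_free_list(corrupted, 0, 4)
except KernelPanic as err:
    print("halting:", err)              # halting: cycle detected at block 0
```

Notice what the check can't tell you: which write caused the damage, or whether block 3 is really free. Any "repair" would be a guess, and a wrong guess in an allocator means handing the same memory to two owners.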
For example, whether it’s a user-space program or the OS kernel, say a buffer overrun or a messed-up pointer corrupts the stack. How is the program supposed to recover from that? With a blown stack, when the currently executing function ends, where will it return to? The return address is likely gone. Now what?
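Here is a rough simulation of that in Python. It assumes a simplified frame layout (an 8-byte local buffer immediately followed by an 8-byte saved return address); real layouts vary by architecture and compiler, and the address and helper names below are invented for the example. Nothing here touches the real stack; it only models the bytes.

```python
import struct

BUF_SIZE = 8                 # bytes reserved for the local buffer
RETURN_ADDR = 0x401136       # made-up return address for illustration


def make_frame():
    """Build a toy frame: [ local buffer | saved return address ]."""
    return bytearray(BUF_SIZE) + bytearray(struct.pack("<Q", RETURN_ADDR))


def unchecked_copy(frame, data):
    """Copy data into the buffer with no length check, like a careless strcpy."""
    frame[0:len(data)] = data


def saved_return_address(frame):
    """Read back the 8-byte little-endian value stored after the buffer."""
    return struct.unpack_from("<Q", frame, BUF_SIZE)[0]


frame = make_frame()
print(hex(saved_return_address(frame)))   # 0x401136

# 12 bytes into an 8-byte buffer: the last 4 land on the return address.
unchecked_copy(frame, b"A" * 12)
print(hex(saved_return_address(frame)))   # 0x41414141
```

After the overrun, the original value `0x401136` is not stored anywhere else in the frame. The function has no way to work out where it was supposed to return to, which is exactly the "now what?" situation.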
And it’s not just Microsoft. Ever hear of a “kernel panic” in Unix?