Debugging hell

Linus Torvalds

So I've spent much of the last couple of days remotely debugging this insane suspend failure (or to be exact, resume failure) that happens occasionally for a couple of people.

Now, suspend/resume debugging is some of the nastiest crud around, because when you suspend a machine, you end up (obviously) having to turn all the devices off. And guess what? That also means that you have no way to then inform the user about what is going on when things go wrong, because all those nice devices (like the screen - duh) will not be available. So no screen output, no serial console traces, no network dumps, no nothing.

To make matters worse, we even know how to trigger the problem (on those particular machines, neither of which is mine), but the particular PCI resource layout that is needed seems to have nothing whatsoever to do with the actual failure itself. It seems to be just a way to trigger it, nothing more.

(And that's also why I've been debugging it personally - the whole resource allocation thing is one of the areas where very few other people know how things work. Most of the time I can try to prod others into looking at the bugs, but in this case it was one of those rare "Linus or nobody" choices).

So I'm frustrated. I'm doubly frustrated because it's a reasonably recent Intel chipset, and a simple debugging facility is the one thing I've been asking Intel to add to the core chipset for the last several years, so that we could do some kind of sane tracing over complete failures where all other devices are unavailable and you have to power off the machine to get it back.

Grr.

I want to be back under water.