Extreme System Failures

Dwayne Phillips

Revision History
Revision 1.0, 24 December 2002

Recently, while I was driving, my windshield wipers failed. I would turn them on, but they wouldn't move. After several minutes of trying, they eventually worked, but they had failed in the "off" state.

Several weeks later, I was driving in the rain with the wipers going. It stopped raining and I turned the wipers off, but they kept going. After a few minutes of fiddling with the wiper switch, the wipers stopped. This time they had failed in the "on" state.

These episodes remind me of a lesson that is pertinent to IT managers: if a system can fail at one extreme, it can -- and probably will -- fail at the opposite extreme. This applies especially to critical human systems such as IT management.

I saw this happen on a large systems project several years ago. We contracted with a firm to build a system that comprised hardware and software. The contract was to last three years and cost US $20 million. This contract had delays and cost overruns in its first year. The contractor's managers had failed by not paying adequate attention to the work. The contractor failed in the "off" state.

The buyer and the contractor held many high-level meetings. We felt that the systems were too important to our users to cancel the contract. We agreed to allow the contractor four years and paid the $33 million it requested to complete the work. To correct for failed management, our managers insisted that the contractor's senior managers pay close attention to the work. The contractor complied -- senior managers held weekly meetings with the project manager and monthly meetings with our senior managers.

The project manager was consumed with preparing materials for these meetings. He spent so much time on the meetings that he had little time for managing the project. The project suffered because of another, but opposite, management fault. The contractor failed in the "on" state.

Managers on both sides interpreted the continued dismal performance as failures by the front-line managers. The contractor replaced the project manager, and I replaced the project leader on our side. I struggled for three years to provide just enough information to both sides, while not overburdening the project manager with things that took him away from managing the project. With lots of help, I was able to avoid failures at the extremes, and the project concluded successfully.

All IT managers should work toward preventing failures at the extremes. This involves how we interpret failures and how we correct for them. When a system fails, the first step is to look for the underlying causes. Ask the question, "If the system fails in one way, how can it fail in a related but opposite way?" The answer to this question can prevent expensive faults and fixes. My windshield wipers wouldn't come on because of a fault in the electrical contacts. If the electrical contacts can become stuck in an open state (always off), they can also become stuck in a closed state (always on).

The next step is to correct the system fault. With the troubled system project I described earlier, the key was how we corrected the management problems. The project had early problems because managers paid too little attention to it. The problems continued because managers paid too much attention to it. The easy but incorrect solution would have been to let the managers return to inattention. The correct but difficult solution was to correct without overcorrecting. Consultant and writer Jerry Weinberg puts it this way: "act early, act small" [1]. One of the problems on our project was that the senior managers did not act for a year, and then they acted too much. The new project manager and I were able to make small and frequent corrections in our level of management and reporting.
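For readers who like to see the idea in numbers, here is a toy sketch of "act early, act small" versus "act late, act big." It is an illustration only, not a model of the actual project: the function name `simulate`, the parameters `delay` and `gain`, and every value used are assumptions made up for this example. Each step, the project drifts a fixed amount away from plan. Once management starts acting, it removes a fraction (`gain`) of the current deviation. A gain above 1.0 stands in for overcorrection.

```python
def simulate(delay, gain, steps=12, drift=1.0):
    """Return the deviation from plan after each step (toy model).

    delay: number of steps management waits before acting (hypothetical).
    gain:  fraction of the current deviation removed per step once
           management acts; values above 1.0 overshoot the plan.
    """
    deviation = 0.0
    history = []
    for step in range(steps):
        deviation += drift                 # the project drifts off plan
        if step >= delay:
            deviation -= gain * deviation  # management applies a correction
        history.append(deviation)
    return history


# "Off" then "on": no action for a long while, then too much of it.
late_big = simulate(delay=6, gain=1.8)

# Act early, act small: modest corrections from the first step.
early_small = simulate(delay=0, gain=0.3)

if __name__ == "__main__":
    print("late, big:   ", [round(d, 2) for d in late_big])
    print("early, small:", [round(d, 2) for d in early_small])
```

In this sketch the late, large correction lets the deviation grow unchecked and then swings it back and forth across the plan, while the early, small correction keeps the deviation modest and always on the same side. The numbers themselves mean nothing; only the contrast in shape is the point.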

Faults can teach us much about our systems. I try to examine faults to learn how a system might fail next. I also take care in correcting faults -- especially when those faults are in human systems.

Dwayne Phillips has worked as a systems and computer engineer for the US government since 1980. He has written *The Software Project Manager's Handbook, Principles that Work at Work* (http://www.amazon.com/exec/obidos/ASIN/0818683007/cutterinformatco ). He can be contacted at d.phillips@computer.org.

[1] For more about Jerry Weinberg, see http://www.geraldmweinberg.com .