Unlikely failures triggered computer shutdowns
BY WILLIAM HARWOOD
Posted: April 29, 2001 at 8:30 a.m. EDT
After five days of extensive troubleshooting, engineers now believe the culprit responsible for crippling the international space station's computer system was the unlikely, near-simultaneous failure of two critical hard drives.
The station is equipped with three command-and-control computers, called C&C MDMs, that oversee the operation of the lab's stabilizing gyroscopes, its high-speed communications links with Earth, the station's robot arm and other critical systems.
Each computer uses an internal hard drive to store programs and system software. Only one computer is in control at any given moment, with a second serving as a hot backup and the third in standby.
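The primary, hot backup and standby arrangement can be illustrated with a minimal sketch. It is not the station's flight software: the Computer class, the assign_roles function and the rule that the standby moves up to backup when the backup takes over are all assumptions made for illustration. The only behavior taken from the description above is that one machine is in control while the other two wait in reserve.

```python
from dataclasses import dataclass

# Role names follow the article; the priority order is an assumption.
ROLES = ("primary", "backup", "standby")


@dataclass
class Computer:
    """Hypothetical stand-in for one command-and-control computer."""
    name: str
    healthy: bool = True


def assign_roles(computers):
    """Give roles to healthy computers in list order.

    Roles that cannot be filled are mapped to None.
    """
    healthy = [c.name for c in computers if c.healthy]
    padded = healthy + [None] * (len(ROLES) - len(healthy))
    return dict(zip(ROLES, padded))


units = [Computer("C&C-1"), Computer("C&C-2"), Computer("C&C-3")]
before = assign_roles(units)

# Simulate the loss of the computer currently in control.
units[0].healthy = False
after = assign_roles(units)

print(before)
print(after)
```

In this sketch, losing the primary promotes the backup and leaves the standby role empty, which is why a second failure so soon after the first leaves little margin.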
Late Tuesday, C&C-1 suddenly dropped offline. C&C-2, which was the backup machine, then switched to primary mode, but it, too, experienced hard drive errors, and flight controllers decided to call up C&C-3. That computer promptly failed.
Engineers now believe the hard drive problem with C&C-2 was a known issue and not an outright failure. The other two drives apparently failed.
C&C-2's hard drive ultimately was restored and that computer is fully operational in primary mode. Flight controllers then uplinked critical C&C software into C&C-3's dynamic random access memory (DRAM) and that machine - sans hard drive - is considered operational.
C&C-1 was replaced with an identical payload computer. DRAM in the new C&C-1 computer has been loaded with command software, but engineers have not yet been able to reformat its hard drive. They hope to do so shortly.
Larry McWhorter, deputy manager of the avionics and software office at the Johnson Space Center in Houston, briefed the station astronauts today on the progress of the troubleshooting and plans to ensure similar failures do not happen in the future.
"What we found in going through the scenarios is that both C&C-1 and C&C-3 appear to have failed disk drives," he said. "Neither of the drives are spinning, so they would be of no use and no way of getting files from them.
"We know that you have R&R'd C&C-1 and we're bringing it back to the ground to have Honeywell, who is the vendor, do a complete review of the box and to identify what the problem is there. The signature of C&C-3, looking at the data, is very similar to C&C-1, so we hope to learn from 1 what's going on in 3.
"C&C-2, that disk appears to be functioning nominally," McWhorter said. "There've been some nuisance items that have come up and we have procedures to work through and C&C-2 is functioning fine at this point in time.
"We have taken C&C-3, which has the failed disk, and have loaded the necessary software into DRAM. That is functioning at this time as the backup in the case that C&C-2 had a failure. We've also loaded into C&C-1 the DRAM software. We've made one attempt so far to format the disk so that we can transfer all the files from C&C-2 to C&C-1 that we would need to make it operate nominally. We had a failure in that reformatting process.
"One comment: When we've done this process (on the ground), we normally have seen it take two to three attempts to get through this full format of the disk. So this failure was not a surprise and we're going to continue to work to get that disk formatted so C&C-1 will have full capability."
Station astronaut James Voss then asked the obvious question.
"Larry, had you all seen some degradation or something that would have led you to believe we would have had two failures like this at almost the same time?" Voss asked. "It just seems kind of strange that two pieces of hardware would fail like this in the same way at just about the same time."
"It was a total surprise to us," McWhorter said. "We've not seen anything in the behavior of either piece of hardware that would have led us to suspect the failures were coming. That's why we're very interested in getting C&C-1 down to the ground so we can understand what happened to it."
He said engineers are continuing work to "get a better handle on the long-term solution."
"The first thing, we'll continue to try to understand the root cause of what happened to these two units, and that would include the work of diagnosing the failure in C&C-1 when it gets to the ground," he said. "We're also going to have a group look at how to build a new C&C out of the spares that you have on orbit. ... There's enough hardware there to build another C&C.
"We're also working to manifest a new controller or disk drive on a downstream flight to have another one on orbit to replace the one that's coming down," he continued. "The other thing we're doing is trying to look at how we could more rapidly get the MDM testing hardware that's being built and verified over the next few months on board so that we would have it to test MDMs that fail on orbit or cause problems."
And finally, NASA is looking at ways to accelerate development and delivery of solid-state memory modules that could replace the hard drives altogether.