Reconstructing Spirit's hopeful road to recovery
Posted: January 26, 2004

NASA's Mars Exploration Rover Spirit appeared to be teetering on the brink of failure last week when ground controllers lost contact with the craft sitting in Gusev Crater, its arm extended to a rock as the scientific adventure was beginning. Now, engineers are cautiously hopeful that Spirit will soon be restored to full working order.

"Spirit is doing better. It is kind of like we have a patient in re-hab here, and we are nursing her back to health," Jennifer Trosper, rover mission manager, said Monday.

Jennifer Trosper, Spirit mission manager for surface operations. Credit: NASA/Bill Ingalls

It is now believed that the rover's flash memory had become so full of files that the craft couldn't manage all of the information stored aboard. Spirit bogged down because it didn't have enough random access memory, or RAM, to handle the number of files in flash -- including data recorded during its cruise from Earth to Mars and the 18 days of operations on the red planet's surface.

"I think we just found an issue with the number of files that eventually were on the spacecraft at this time in the mission that we were unaware of because of the accumulation that happened over the course of cruise and the 18 sols on the surface," Trosper said.

Flash memory is used in electronics, such as digital cameras, because it retains stored information even when the power is turned off. The rover also has random access memory, which doesn't keep stored data when the rover goes to sleep each night.

Controllers are preparing to delete hundreds of cruise files in hopes of lessening the burden.
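
To see why deleting files could help, consider a rough sketch in which the file system keeps a fixed-size bookkeeping record in RAM for every file stored in flash. The per-file cost, the RAM budget and the file counts below are invented for illustration; they are not figures from the mission, and the rover's actual file system is not described here.

```python
# Hypothetical model: every file in flash costs a fixed amount of RAM
# for bookkeeping. All numbers are illustrative, not mission values.

BYTES_PER_FILE_RECORD = 256       # assumed RAM cost of tracking one file
RAM_BUDGET_BYTES = 512 * 1024     # assumed RAM set aside for the file system


def ram_needed(file_count: int) -> int:
    """RAM required to track file_count files under this model."""
    return file_count * BYTES_PER_FILE_RECORD


def fits_in_budget(file_count: int) -> bool:
    """True if the bookkeeping for file_count files fits in the budget."""
    return ram_needed(file_count) <= RAM_BUDGET_BYTES


def max_files() -> int:
    """Largest file count the assumed budget can track."""
    return RAM_BUDGET_BYTES // BYTES_PER_FILE_RECORD


if __name__ == "__main__":
    for count in (1000, 2048, 3000):
        print(count, ram_needed(count), fits_in_budget(count))
```

Under these assumed numbers the budget tops out at a fixed file count, so removing files that are no longer needed is the direct way to get back under the limit.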

"We don't know yet whether Spirit will be perfect again. Our current theory is one in which software would fix the problem," Trosper said. "There are other health checks that we have to do with the flash, the high-gain antenna, the Pancam Mast Assembly and the motor control board to make sure our current theory fully checks out."

Some triggering event, not yet fully pinpointed, caused the rover's computer brain to begin a continuous series of resets until engineers on Earth were able to regain control of the craft late last week.

"You have to keep in mind that the problem we've had actually is associated with our ability to collect and maintain recorded data (on the rover). So the flash memory where we store this data that would tell us what had happened over the past days actually is part of the problem we are seeing. So we don't have a lot of information," Trosper told reporters at the daily spacecraft status briefing Monday.

"Let me go back to Sol 18 and tell you a little bit about what we think happened -- to try and reconstruct it. As we get more data, I guarantee you that some of these things will change, but let me tell you what we think today," Trosper said, launching into a detailed explanation that begins last Wednesday.

"Sol 18 we had some weather problems at the (Earth communications) station, and about 10 minutes early for the morning antenna pass we lost the signal. It wasn't clear whether that was the result of a spacecraft problem or a station problem.

"We've done some tracking of that, it's still not completely clear, but it's entirely possible that was a spacecraft problem at that time. We believe that was possibly a reset on the spacecraft that would've caused our signal to be lost when...the software would reset and come up and power off all of the loads and put itself into a safe state.

"Due to the reset, we have actually confirmed that the morning activities that we were trying to do that morning did not complete. So if you recall, we were moving the IDD (science arm), getting ready to (use the Rock Abrasion Tool). The IDD, the arm, position is actually in the same position it was on Sol 18 before we attempted to do that move.

This image from last week shows Spirit probing its first target rock, Adirondack. The rover's arm remains extended while controllers try to restore the craft to normal operations. Credit: NASA/JPL

"Some time the morning, early afternoon of Sol 18 (Wednesday) we encountered the problem. That problem, initially, was most likely a reset. We don't understand exactly where that reset came from but we have some ideas. It caused us to get into this belief that the flash system was corrupted in a way that we got into continuous reset loops.

"Then in the afternoon, we actually sent a command sequence to the vehicle with a little bleep in it to tell us that the sequence got there. We sent that sequence and got the bleep with no problems.

"Twenty minutes after that we expected to see a session from the vehicle on the high-gain antenna communicating with us. We had been on the high-gain antenna since Sol 2. We didn't see that communications session. That, in addition to the 10 minute drop out early in the morning, that was one of the early indications that there was something wrong.

"In the afternoon Odyssey pass we did not see any data from the vehicle. The early Sol 19 (Thursday) morning MGS (Mars Global Surveyor) pass, we only saw two minutes of data from vehicle and it wasn't really data from the vehicle -- it was 'the UHF radio was on and nobody was home' kind of data. And then the morning Odyssey pass we received no data.

"On Sol 20 (Friday) in the morning we attempted to command the rover at the nominal uplink rate where it should be if everything is fine, and we received no data. We have pre-loaded communications windows when the rover should attempt to communicate with us and those windows did not execute on the morning of Sol 20.

"One of the things that the vehicle will do if it encounters a system-level fault is change the rate at it will accept commands, and that is for the vehicle's protection as well as for our knowledge. And so in the afternoon we sent a command at a different rate for the vehicle to send us a beep, and we actually got that beep back. The rate we sent it at was a rate that the software would have autonomously put us in if it had some sort of system-level fault. So we knew at that point that there were about four scenarios that would put us at that rate and we started to go down that path of those four scenarios.

"Then we didn't receive data in the overnight UHF passes that night.

"On Sol 21 (Saturday) we were actually trying to establish the same commandability we had the previous day -- we now knew that there was a system-level fault, we didn't know if it was a power issue, if it was a thermal issue, if it was an X-band communications issue. So we sent, essentially, the same command to get a beep on the morning of Sol 21 and we didn't get the beep.

"Then, as we were getting ready to send the next beep command, the vehicle decided to communicate with us in one of its nominal communications windows at which point we got a little bit of data that had very little information in it. In fact, originally we started to decode it and it was from the year 2053 and we thought 'this is not good!' Eventually we found out the data was corrupted, and we were all cheering at that point because there weren't a lot of scenarios that would put us in 2053 on Mars.

"That signal actually dropped out nine minutes or 10 minutes after we got it. And that was at 10 bits per second, so there was very little data and the data we got was corrupted.

"We sent another command to the spacecraft to give us a 30-minute communications session at 120 bits per second. And that command was received and we got the signal on the ground -- we got one frame of data, which told us that it was sending us data. Then it stopped. And that session then ended about 10 minutes early.

"We tried the same thing again and we modified some of the parameters in the command to try and get a different set of data. That different set of data actually gave us a very limited state of the current state of the vehicle -- some channelized telemetry. It told us how many flight software resets happened over the course of those two nights and that's where the big 77 numbers came from, and we realized we had a reset problem, that certain tasks were failing and it was keeping us from doing the communications that we intended to do.

"As a result of that knowledge, we also realized the vehicle may not have shut down because the reset could be associated with the shutdown of the vehicle. So we attempted to shut the vehicle down, and then we send a beep after shutdown to make sure it has shut down. [The rover would not reply with a beep if it was asleep.]

"It's sort of like feast or famine -- we didn't hear from it for a day-and-a-half and then we shut it down and we send a beep and we get the beep, then we shut it down again and send a beep and we get the beep, and then we shut it down again and send a beep and we get the beep. The vehicle was clearly not able to shut itself down and the reset was causing a problem with the shutdown.

"We knew that the power system was struggling, the battery wasn't charged as much as we expected it to be or wanted it to be. So we deleted our overnight UHF passes in case the vehicle decided to do them -- or attempted to. In the same way the reset cycle had caused those commands not to get in and so we got the first Odyssey UHF pass when we had hoped not to hear from the vehicle because we did want it to be asleep and charge the batteries.

"We asked Odyssey and MGS to turn off their radio beacons so (Spirit) didn't use that energy during the night to transmit because we were getting close to entering our low-power mode. Low-power mode is the mode that will safe the vehicle, take the batteries off-line and sit there, basically, and bask in the sun until the voltage gets high enough for the vehicle to wake up.

"So we woke up the morning of Sol 21 (Saturday) on solar array wake up and saw that we had indeed entered low-power mode and the fault protection had worked exactly as designed. In the low-power mode we don't get our morning communications session until about 11 a.m. because that is when the sun is nice and high, the Earth is nice and high (in the sky) and you can get good data rates and transmit.

"And in that we realized that we had this reset problem. Based on just kind of the hunch of our lead software architect, he believed that the problem was probably associated with the mounting of flash and initialization. There is a hardware command that we can send that bypasses the software where we can actually tell the hardware to not allow us to mount flash on initialization. When we the next day actually sent the command to do that, software initialized normally and was behaving like the software that we had always known. It was a fantastic moment.

"Once we got into the mode where we could command the vehicle to get into a software state that we understood, then we were able to collect data. That is the path that we are on right now.

"Right now, our most likely candidate for the issue has been narrowed down a little bit. It is really an issue with the file system in flash. Essentially, the amount of space required in RAM to manage all of the files we have in flash is apparently more than we initially anticipated.

"We have been collecting data and collecting data thanks to (the science team) and we have lots and lots of files on the spacecraft. That's good -- we intended to have lots and lots of files on the spacecraft. This is a new problem that we encountered based on having many files.

"We are currently in a much more specific debugging activity. Today (Monday), we started to dump out some of flash. We are actually loading a script that we get kind of the task trace on the software and identify exactly where the problem was in the code so we can make sure that our hunch is correct.

"Tomorrow, we are might try to access flash and do a little bit of a health check on it. The next day we might try to delete some files to see if our hunch is correct that it's really due to the number of files that we are trying to manage on the flash file system.

"And in parallel we are trying to work a less likely scenario that something happened with the high-gain antenna and the motor control board when we were doing this engineering checkout of the Mini-TES elevation actuator (Wednesday morning). We are still working that as well to make sure that we can get back on the high-gain antenna in a very cautious way.

"In summary, I would like to say that -- as it has always been -- it's humbling to work with a team of such excellent people. I just want to tell you the folks who are working on the details of this problem are the best of the best in the world that we have. Everyday when I come into work, their innovation, their persistence, their talent and their hard work has almost overwhelmed me and certainly humbles me. But that is what has got us where we are today and that's what is going to get us to having a healthy rover on the surface shortly."
