Analysis of the Failure
The primary technical causes of the failure were an operand error that occurred during the conversion of the horizontal velocity, and the lack of protection for this conversion, which caused the SRI to shut down. On top of that, the SRI function involved was not even needed at that stage of the flight.
The SRI (inertial reference system) measures the attitude of the launcher and its movements in space. It has its own computer, which transmits the converted data over the data bus to the on-board computer (OBC), which in turn controls the nozzles of the solid boosters and of the main engine.
Redundancy is used extensively at the equipment level: the designers fitted the rocket with two identical units for every piece of safety-critical equipment. Two SRIs therefore operate in parallel, with identical hardware and software. One is active and the other is in “hot” stand-by. If the OBC detects that the active SRI has failed, control switches to the other one.
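The limits of this kind of redundancy can be illustrated with a small sketch. It is not the flight software: the class `Sri`, the function `obc_read` and the 16-bit range check are hypothetical names and assumptions made only for illustration. It shows that a stand-by unit helps when one unit fails on its own, but not when both units run the same code on the same input.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # assumed 16-bit signed target range


class OperandError(Exception):
    """Raised when a value does not fit the target integer type."""


class Sri:
    """Toy inertial reference unit (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def step(self, value):
        if not self.running:
            return None  # unit is already shut down
        converted = int(value)
        if not INT16_MIN <= converted <= INT16_MAX:
            # Unprotected conversion: the unit stops instead of carrying on.
            self.running = False
            raise OperandError(self.name)
        return converted


def obc_read(active, standby, value):
    """Ask the active unit first, then fall back to the hot stand-by."""
    for unit in (active, standby):
        try:
            result = unit.step(value)
        except OperandError:
            continue  # this unit just failed, try the other one
        if result is not None:
            return unit.name, result
    return None, None  # no working unit left


# A failure confined to one unit is covered by the stand-by.
broken = Sri("SRI 1")
broken.running = False
print(obc_read(broken, Sri("SRI 2"), 1000.0))

# An out-of-range input hits both units, because their software is identical.
print(obc_read(Sri("SRI 1"), Sri("SRI 2"), 40000.0))
```

In the second call both units raise the same exception on the same value, so the switch-over has nothing left to switch to.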
The software of the SRI used in Ariane 5 is almost identical to that used in Ariane 4. Its functions include certain alignment computations of the rocket’s attitude before launch. Normally these should stop at H0-9 seconds. However, the designers left them running for about 50 seconds after the SRI enters flight mode, which means roughly 40 seconds into the flight, to cover the event of a countdown hold-up. Without this, resetting the SRI after a hold would take several hours.
So the SRI software had been ported from the previous-generation rocket, Ariane 4. And why not? It worked fine. What the software engineers did not consider, however, was that the horizontal velocity of Ariane 5 could be up to five times higher than that of Ariane 4. Nevertheless, the SRI supplier cannot be held responsible for that. It was jointly decided not to include the actual Ariane 5 trajectory in the SRI requirements and specification. The supplier was given a specification for handling a detected exception (report the failure to the on-board computer, store the error context in EEPROM memory, and stop the processor) and met it. It was that error-handling mechanism that proved fatal.
The board of enquiry set up after the incident made a suggestion: the SRI should have continued to provide its ‘best estimate’ of the craft’s attitude after the software exception occurred. Had it done so, it would still have converted the floating-point number and, if the value was too large for the integer, cut it down to fit, losing some accuracy, before sending it to the on-board computer. This might have caused a small course error, but the alignment function in the SRI is only active for about 40 seconds after lift-off, so the error might not have been large enough to cause serious trouble. After those 40 seconds the alignment module would have switched off, and the flight could have continued normally.
In a case like this, where both the primary and the backup unit run the same software, it is paramount that the software be proved correct. The custom within the Ariane program was to address only random hardware faults, hence the addition of a backup SRI. That would have worked very well had the hardware of the primary SRI failed, but it was the software that failed.
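The difference between an unprotected conversion and a ‘best estimate’ conversion can be sketched as follows. This is a minimal illustration, not the SRI code (which was not written in Python): the function names are hypothetical, and the 16-bit signed target is taken from the inquiry report’s description of the failed conversion.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # range of a 16-bit signed integer


def to_int16_unprotected(x: float) -> int:
    """Convert and fail hard when the value does not fit (the 'stop' policy)."""
    value = int(x)  # drops the fractional part
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in 16 bits")
    return value


def to_int16_best_estimate(x: float) -> int:
    """Convert and clamp to the nearest representable value instead of failing."""
    return max(INT16_MIN, min(INT16_MAX, int(x)))


print(to_int16_best_estimate(123.9))    # in range: only the fraction is lost
print(to_int16_best_estimate(40000.7))  # out of range: clamped to INT16_MAX
```

The clamped value is wrong, but it is a bounded error in a function whose output was no longer needed, whereas the exception takes the whole unit out of service.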
As a result of this policy, there was no provision for a software failure. This was combined with another custom adopted within the Ariane program: the belief that software is correct until proven otherwise.
One of the primary suggestions made by the board of enquiry is to adopt a different attitude to software development. They suggested that in all future projects the Ariane team undertakes, all software should be assumed to be incorrect until it is proven correct. This recommended methodology is one that I find hard to disagree with, but I can’t believe that the presumably experienced development team would be so naïve as to believe that all of the software they were writing had no hidden bugs. There may be facts about the development approach that have not been released, or that I am not aware of. Perhaps this idea was adopted by higher management and forced upon the team, or it may have been enforced by time and money constraints.
The inquiry board could not find any evidence that real flight data were used in the numerous tests of the SRI; a simulator was run instead. As it happened, during analysis of the software seven variables were found to be at risk of an operand error. Protection was added to four of these variables, and three were left unprotected. The reason is that the SRI computer had a maximum workload target of 80%, and protecting every conversion would have added to that load.
Conclusion
We can see that at least five, now obvious, mistakes were made in just one part of the software:
- The idea of converting a floating-point number to a signed integer. A floating-point number can hold far larger values than an integer can.
- The SRI software was never tested with actual flight data. Otherwise the overflow would have been caught during testing.
- The function was not even needed at that point. It is used in the pre-launch phase.
- That part of the software did not have an exception handler.
- The failure report was sent over the same channel as regular data, and in a similar bit pattern.
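The last point can be made concrete with a small sketch. The frame layouts below are invented for illustration (two 16-bit words per frame, and a one-byte type tag in the second variant); they are not the real bus format. The sketch shows how a diagnostic word sent on the data channel decodes as plausible-looking flight data unless the frame says what kind of content it carries.

```python
import struct

FLIGHT_DATA, DIAGNOSTIC = 0x01, 0x02  # hypothetical frame type tags


def pack_attitude(pitch: int, yaw: int) -> bytes:
    """Hypothetical flight-data frame: two signed 16-bit words."""
    return struct.pack(">hh", pitch, yaw)


def pack_diagnostic(error_code: int, context: int) -> bytes:
    """Hypothetical diagnostic frame: two unsigned 16-bit words."""
    return struct.pack(">HH", error_code, context)


def decode_untagged(frame: bytes):
    """Receiver that assumes every frame is flight data."""
    return struct.unpack(">hh", frame)


def pack_tagged(kind: int, payload: bytes) -> bytes:
    """Prefix the payload with a one-byte type tag."""
    return bytes([kind]) + payload


def decode_tagged(frame: bytes):
    """Receiver that ignores anything not marked as flight data."""
    kind, payload = frame[0], frame[1:]
    if kind != FLIGHT_DATA:
        return None  # diagnostic or unknown frame: do not steer on it
    return struct.unpack(">hh", payload)


diag = pack_diagnostic(0x7FFF, 0x0001)
print(decode_untagged(diag))                         # read as if it were attitude data
print(decode_tagged(pack_tagged(DIAGNOSTIC, diag)))  # rejected by the tagged receiver
```

A type tag is only one possible safeguard; a separate channel or a validity flag would serve the same purpose.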
So, to conclude, everything seems to show that the software engineers are to blame. But was it just incompetence, or a management problem, or the programming language’s fault, or a design error, or a testing error? Not really. During development of the software, analysis showed that the overflow could not occur, and if it is proved that a condition cannot happen, you are entitled not to test for it. So was the analysis wrong? No, it was right, but only for the Ariane 4 flight trajectory. It is a reuse error (Jezequel, 1996). The SRI unit was reused from the 10-year-old Ariane 4.
Insufficient budget and an easy, tempting way out pushed the designers to take a shortcut and reuse a perfectly working module. The money saved – boom – the money lost.
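One way to picture a reuse error is as an assumption that lived only in the analysis and not in the module itself. The sketch below is hypothetical throughout: the class `TrajectoryEnvelope`, the function `safe_to_reuse` and the numbers are invented, with only the factor of five taken from the text above. It shows how stating the assumed input range explicitly lets a reuse check fail on the ground rather than in flight.

```python
from dataclasses import dataclass

INT16_MAX = 32767  # assumed limit of the converted value


@dataclass(frozen=True)
class TrajectoryEnvelope:
    """Largest value the module is expected to see (hypothetical units)."""
    name: str
    max_horizontal_value: float


def safe_to_reuse(envelope: TrajectoryEnvelope) -> bool:
    """The module's assumption, written down as a checkable condition."""
    return envelope.max_horizontal_value <= INT16_MAX


# Illustrative numbers only: the new launcher reaches five times the old value.
old_launcher = TrajectoryEnvelope("previous launcher", 20000.0)
new_launcher = TrajectoryEnvelope("new launcher", 5 * 20000.0)

print(safe_to_reuse(old_launcher))  # the original analysis holds
print(safe_to_reuse(new_launcher))  # the same analysis no longer holds
```

The check itself is trivial; the point is that it only exists if the assumption travels with the module when it is reused.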
References
- James Gleick Internet site (1996) “A Bug and a Crash”
web address: .
- Naval Postgraduate Schools (Systems Technology Battle Lab) Internet site (1998) “Software Bug Crashes European Rocket, Ariane 5”
web address: .
- European Space Agency Internet site (1997) “Official Report from the Inquiry Board: Ariane 5, Flight 501 Failure”
web address: .
- Jezequel, J-M. (1997) Institut de Recherche en Informatique et Systemes Aleatoires Internet site “Put it in the contract: The lesson of Ariane”
web address:
- The Radio Amateur Satellite Corporation Internet site (1996) “Ariane 501 Inquiry Board Reports”
web address: .
- CNN Inc. Internet site (1996) “Unmanned European rocket explodes on first flight”
web address:
- Mathematics Department, University of Florida Internet site (1996) “Inquiry Board Traces Ariane 5 Failure to Overflow Error”
web address: