How did the first real-time embedded system also produce the first timing bug?

It's now 50 years since one of the first real-time operating systems was designed for one of the first computers built with integrated circuits. What happened to this operating system in action is a fascinating tale.

The AGC weighed in at about 30 kg, ran at roughly 1 MHz and had only about 74 KB of ROM and 4 KB of RAM.

Despite the somewhat limited performance of the hardware, the AGC real-time operating system was capable of executing up to 8 jobs at a time using cooperative multi-tasking (i.e. each job had to periodically surrender control back to the OS).
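To make the idea of cooperative multi-tasking concrete, here is a minimal sketch in Python. It is not the AGC's actual code: the class `CooperativeScheduler`, the job names and the priorities are all hypothetical, and Python generators stand in for jobs that surrender control. The only figure taken from the text is the limit of 8 jobs.

```python
import heapq
from typing import Generator, List, Tuple

MAX_JOBS = 8  # job limit quoted in the text; everything else here is illustrative


class CooperativeScheduler:
    """Toy priority scheduler: a job runs until it voluntarily yields."""

    def __init__(self, max_jobs: int = MAX_JOBS) -> None:
        self.max_jobs = max_jobs
        # Heap entries: (-priority, sequence number, name, generator).
        # The unique sequence number breaks ties, so generators are never compared.
        self._queue: List[Tuple[int, int, str, Generator]] = []
        self._seq = 0
        self.trace: List[str] = []

    def add_job(self, name: str, priority: int, job: Generator) -> None:
        if len(self._queue) >= self.max_jobs:
            raise RuntimeError("no free job slot")
        heapq.heappush(self._queue, (-priority, self._seq, name, job))
        self._seq += 1

    def run(self) -> List[str]:
        while self._queue:
            neg_prio, _, name, job = heapq.heappop(self._queue)
            try:
                next(job)  # run the job up to its next voluntary yield
            except StopIteration:
                self.trace.append(f"{name}:done")
                continue
            self.trace.append(f"{name}:yield")
            # The job gave control back; requeue it at the same priority.
            heapq.heappush(self._queue, (neg_prio, self._seq, name, job))
            self._seq += 1
        return self.trace


def worker(steps: int) -> Generator:
    """Hypothetical job: does `steps` slices of work, yielding after each."""
    for _ in range(steps):
        yield


def demo_cooperative() -> List[str]:
    sched = CooperativeScheduler()
    sched.add_job("guidance", 3, worker(2))
    sched.add_job("display", 1, worker(1))
    return sched.run()
```

The point of the sketch is that nothing interrupts a job: the scheduler only gets to choose again when a job yields. A job that never yields would starve everything else, which is why each job has to play along.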

The system, both hardware and software, had been extensively tested for years before it was used. But just a few hours after being switched on, the system started to issue error messages indicating that a job deadline had been missed. It then rebooted … and again … and again. This made the operator of the AGC quite uncomfortable.

Understandably so when you know that 'AGC' stands for 'Apollo Guidance Computer' and that, in this case, the operator was Neil Armstrong during the descent of the Apollo 11 lunar module.

The errors reported meant the computer was running out of processing capacity (reportedly because Aldrin had decided to leave the rendezvous radar on) and was scheduling new radar jobs before the previous ones had finished. The computer finally switched to fail-safe mode (rebooting and chucking away low-priority jobs), which saved the mission.
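The overload-then-restart behaviour described above can be illustrated with a toy simulation. This is only a sketch under made-up assumptions, not a model of the real flight software: the slot count (4), the per-tick capacity (4 work units), the job costs and the names `OverloadSim` and `run_descent` are all hypothetical. Each tick schedules one high-priority guidance job and one low-priority radar job; when no slot is free, the simulated restart drops everything below a priority threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    priority: int   # higher number = more important
    remaining: int  # work units still needed


class OverloadSim:
    """Toy model: fixed job slots, restart that sheds low-priority jobs."""

    def __init__(self, slots: int, keep_priority: int) -> None:
        self.slots = slots
        self.keep_priority = keep_priority
        self.jobs: List[Job] = []
        self.restarts = 0
        self.finished: List[str] = []

    def schedule(self, job: Job) -> None:
        # No free slot: the backlog has grown too large, so restart first.
        if len(self.jobs) >= self.slots:
            self.restart()
        self.jobs.append(job)

    def restart(self) -> None:
        # Fail-safe: keep only jobs at or above the priority threshold.
        self.restarts += 1
        self.jobs = [j for j in self.jobs if j.priority >= self.keep_priority]

    def tick(self, capacity: int) -> None:
        # Spend this tick's capacity on the most important jobs first.
        for job in sorted(self.jobs, key=lambda j: -j.priority):
            work = min(capacity, job.remaining)
            job.remaining -= work
            capacity -= work
        self.finished += [j.name for j in self.jobs if j.remaining == 0]
        self.jobs = [j for j in self.jobs if j.remaining > 0]


def run_descent(ticks: int, radar_cost: int) -> OverloadSim:
    """Schedule one guidance and one radar job per tick (illustrative numbers)."""
    sim = OverloadSim(slots=4, keep_priority=2)
    for _ in range(ticks):
        sim.schedule(Job("guidance", priority=3, remaining=2))
        sim.schedule(Job("radar", priority=1, remaining=radar_cost))
        sim.tick(capacity=4)
    return sim
```

With `radar_cost=2` the load exactly fits the capacity and nothing piles up. With `radar_cost=3` unfinished radar jobs accumulate until the slots run out, a restart clears them, and the cycle repeats, while the guidance job still completes on every tick. That is the behaviour the story turns on: the overload is survivable because the restart throws away only work that can be thrown away.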

This was 50 years ago and we now have the tools (scheduling, partitioning, WCET, ...) to avoid this kind of problem and many more. But it's always good to have a fail-safe for the ones we don't yet know about … just in case.

Antoine

See also: Tales From the Lunar Module Guidance Computer.
