SONET Ring is Culprit in Air Tower’s 2-Hr. Blackout

Published Oct. 11, 2007 (Vol. 28, No. 20)

Nearly two hours without phone service ... 200 planes stranded in the air ... More than 100 flights grounded, delayed or diverted in a dozen cities ...

Before the telephone, radio and radar blackout at the Federal Aviation Administration’s Memphis air traffic control center on Sept. 25, the FAA and just about any communications technology professional who counts on uninterrupted telecom service would have confidently told you the same thing: A SONET ring is a solid, dependable way of making sure your network maintains redundancy in a disaster. Now, more than a few telecom managers aren’t so sure.

AT&T officials have thus far been cryptic about what happened in Memphis late last month. E-mailed statements to Voice Report confirm only that an OC-192 network circuit – equivalent to 5,376 DS-1s – was indeed down between 11:38 a.m. and 1:30 p.m. central time, and that “equipment failure” was to blame. But Voice Report interviews with officials at the FAA and the air traffic control union, as well as several telecom experts, point to a bigger concern: The failure occurred on a SONET ring, a service purchased by enterprises for the very purpose of avoiding such critical outages, and was compounded by provisioning mistakes. Read on to determine the steps you need to take to make sure your enterprise doesn’t confront a similar disaster.

Faulty MUX Card Blamed for Initial Failure...

It makes sense that the FAA would purchase a SONET ring from BellSouth – now part of AT&T – for its Memphis air traffic control center. The tower handled almost 3 million flights last year, ranking it ninth among the 20 centers in the continental U.S. responsible for communicating with planes above 18,000 feet. Several thousand pilots flying over portions of seven states rely on the center daily to keep them from crashing into each other. [See chart demonstrating configuration of BellSouth SONET ring.]
So, when telephone, radio and three of the Memphis center’s 13 radar stations went dark Sept. 25, it wasn’t just flights within 250 miles of Memphis that were shut down. At the Dallas-Fort Worth International Airport, for example, 50 flights reportedly were delayed.

In a desperate attempt to ensure the safety of planes already in the air, controllers at the Memphis facility breached FAA policy by using personal cell phones to communicate flight paths and radio frequencies to other centers, recounts Doug Church, spokesman for the National Air Traffic Controllers Association (NATCA). “It kind of goes without saying that controllers are going to do whatever they have to do to protect safety,” Church says. “They’ll move heaven and hell to do it [even] if it means breaking a policy.”

A communication blackout of unprecedented proportions is how Patrick Forrey, NATCA’s president, described the incident during a hearing by the U.S. House of Representatives’ Subcommittee on Aviation – coincidentally scheduled the next day to discuss the recent epidemic of flight delays. “We have never had an outage involving this much airspace for this long a period of time,” Forrey told the subcommittee, according to transcripts of the hearing.

Earlier during the hearing, Rep. Steve Cohen (D-Tenn.) had surprised Acting FAA Administrator Robert Sturgell with questions about the outage. “Does this incident, Mr. Administrator, indicate to you that there is a need for more backup systems or more security?” Cohen asked.
“This was not a security problem, but do we have security at the telephone facilities that, if they were struck, could destroy our capacity to have an air transportation system?”

Sturgell’s answer: “We are still investigating, but at this point it is a Bell South/AT&T problem, and of course we will be, you know, discussing this with them, as we have been since it occurred, to figure out what the problem was and whether our system should be routed differently at this location and at other places to ensure more redundancy or better reliability.”

FAA spokeswoman Laura Brown, reached at home while walking her dogs, gave Voice Report more details a few days later. It was the removal of a “corrupted card,” apparently referring to a MUX interface card, that caused the network to fail, Brown says. “We have backups and backups and backups and backups,” she says, describing the multiple connections the air traffic control center has to the ring. “We are dependent on the redundancies that are supposed to be inherent in a SONET ring.”

Because SONET rings are made of fiber optic lines with more bandwidth than any one customer could use individually, multiplexers (MUXes) are needed to break the bandwidth down into channels that go to enterprises, explains Gary Audin, a telecom consultant who has provided carriers with SONET ring management advice. A MUX will have one interface card connecting to the SONET ring on the “back side” and several interface cards connecting enterprises to the ring on the “front side,” says Audin, president of Delphi Inc., in Arlington, Va.

Chris Lee, managing director for Fairfax, Va.-based Source Loop and a former MCI network engineer, says roughly 10 card failures occur yearly across the various networks he helps his more than 50 clients manage. Power surges, manufacturer defects and normal wear and tear are all typical causes for card failures, Audin adds.
...But Bad Provisioning Also Likely at Fault

But the failure of a single MUX card still doesn’t explain why communication was not restored on the SONET ring’s backup channel. That’s where human error played a major role, suggests Ron Carpenter, president of NATCA’s Memphis branch. Carpenter says AT&T has been less than forthcoming with air traffic control officials trying to investigate the outage, describing the circumstances that led to the Sept. 25 blackout as “proprietary.” But in the carrier’s silence, Carpenter says he and others have determined that Melbourne, Fla.-based Harris Corp., acting as the FAA’s provisioning agent, ordered the backup channel to run on the same line as the primary channel, leaving the network without a failover when the interface card was removed.

A Harris Corp. spokesman, contacted earlier this week, said he was unable to respond in time to Voice Report’s inquiries. The publicly traded systems integrator boasts on its Web site of signing a 15-year, $3.5 billion contract with the FAA in 2002 to modernize telecom infrastructure at 5,000 FAA facilities.

It wouldn’t be the first time such a mistake was made, says Audin, the telecom consultant. A Wall Street client of Audin’s, located in the World Trade Center, found itself in just such a position years ago, he recounts. The enterprise used its SONET ring for three years without a hiccup, but found out during a primary channel outage that the backup channel wasn’t wired correctly, though the installation tech had signed off on it.

It’s also possible that a MUX on the SONET ring used a dual-interface card, which connects to both the primary and backup channels, Audin says. Pulling the dual-interface card would have disrupted connectivity on both channels, defeating the redundancy that is the whole purpose of a SONET ring. Dual-interface cards are used to save space and money, he notes.
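Both failure scenarios described above, a backup channel ordered on the same line as the primary and a dual-interface card serving both channels, come down to the two channels sharing a single physical element. For telecom managers who keep provisioning records, a check for that overlap might look something like the Python sketch below. It is only an illustration: the `Channel` structure and the element names such as `line-A` and `card-1` are hypothetical and are not drawn from the FAA, Harris or AT&T records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    """One provisioned channel and the physical elements it depends on."""
    name: str
    elements: frozenset  # hypothetical IDs for fiber lines and MUX cards


def shared_elements(primary: Channel, backup: Channel) -> frozenset:
    """Return every physical element that both channels depend on."""
    return primary.elements & backup.elements


def is_diverse(primary: Channel, backup: Channel) -> bool:
    """True only if no single line or card carries both channels."""
    return not shared_elements(primary, backup)


# Hypothetical inventory: the backup was ordered on the primary's line and card.
primary = Channel("primary", frozenset({"line-A", "card-1"}))
backup_same_line = Channel("backup", frozenset({"line-A", "card-1"}))
backup_separate = Channel("backup", frozenset({"line-B", "card-2"}))

for backup in (backup_same_line, backup_separate):
    overlap = sorted(shared_elements(primary, backup))
    status = "diverse" if is_diverse(primary, backup) else f"shares {overlap}"
    print(f"{primary.name} vs. {backup.name}: {status}")
```

A check like this is only as good as the inventory behind it. As Audin’s Wall Street example suggests, paperwork can say one thing while the wiring says another, so a records audit is a complement to failover testing, not a substitute for it.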
Are other enterprises in danger of sharing the Memphis air traffic control center’s recent experience? AT&T says it has “initiated a comprehensive investigation” and “worked directly with its equipment vendor to evaluate similar equipment platforms throughout the AT&T network to ensure the highest levels of reliability, and to develop a software update for the equipment.” Following testing, the carrier says it will install its update “in relevant platforms” throughout the network. “These actions will help to ensure that a similar equipment failure does not occur in the future,” AT&T says.