By Alex Salkever Through the long spring and then the hot summer of WorldCom's bankruptcy, a nagging question arose repeatedly: What would happen if WorldCom's financial straits forced it to shut down its UUNet subsidiary? That's the WorldCom unit responsible for running its Internet backbone operations, the big data pipes that shuttle unimaginably huge amounts of information around the country and the world every day. The ultimate answer was that UUNet would never be allowed to fail or go dark.
Why? UUNet carries from 13% to 50% of U.S. Internet traffic, depending on who's estimating. Its customers include the Pentagon and the vast majority of the world's largest companies. UUNet is simply too big and economically important to shut down.
This view got some unexpected validation on the morning of Oct. 3 when key UUNet backbone routers -- powerful computers that serve as Net traffic cops -- failed. WorldCom reported that the outages affected only 20% of its customers. But the resulting domino effect as other networks scrambled to reroute their traffic through different data connections snarled the Internet and slowed e-mail movement from coast to coast and around the globe for most of the day. Many Net watchers called it the worst outage in recent memory.
BAD TIMING. According to Net traffic monitor Matrix.Net, the problems originated in UUNet data centers in Los Angeles and Washington, D.C., and the resulting backwash was severe enough to leave some smaller Internet service providers (ISPs) completely without connectivity as their normal data routes got clogged. Initially, WorldCom blamed the problem on accidental cable cuts. But it later admitted that the problems arose from a technician's error in inputting router tables, the lookup entries that tell routers where to send traffic on the Internet.
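To see why a single mistyped entry can cause so much damage, consider how a router table works: each entry maps a block of addresses to a next hop, and the router forwards every packet along the most specific matching entry. The sketch below is a toy illustration only -- the prefixes and interface names are hypothetical, not WorldCom's actual configuration:

```python
import ipaddress

# Hypothetical routing table: address prefix -> next hop.
# (Illustrative names only; not a real UUNet configuration.)
ROUTES = {
    "0.0.0.0/0":       "transit-peer",      # default route, matches anything
    "192.0.2.0/24":    "customer-A-link",
    "198.51.100.0/24": "customer-B-link",
}

def next_hop(dst_ip, routes):
    """Return the next hop for dst_ip via longest-prefix match."""
    dst = ipaddress.ip_address(dst_ip)
    best = None
    for prefix, hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1]

print(next_hop("192.0.2.77", ROUTES))    # -> customer-A-link
print(next_hop("203.0.113.5", ROUTES))   # no specific entry -> transit-peer

# One bad entry -- the kind of slip a manual "break/fix" can introduce --
# silently misdirects the entire address block:
bad = dict(ROUTES)
bad["192.0.2.0/24"] = "wrong-interface"
print(next_hop("192.0.2.77", bad))       # -> wrong-interface
```

Because neighboring routers trust and propagate what they're told, a bad entry doesn't stay local: the misdirected traffic and the resulting route churn spill over into other networks, which is the domino effect seen on Oct. 3.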
Regardless of its cause, the whole incident raises some nagging questions about the country's ability to deal with Net catastrophes. And much of the finger-pointing that followed the UUNet debacle was deserved.
First, let's take a look at the incident's timing. UUNet technicians attempted to upgrade software on key routers during prime working hours. That defies logic. Network software upgrades on critical infrastructure systems are typically run during low-usage periods, such as weekends or holidays, to guarantee minimal business interruption if something goes wrong. Information-technology systems generally have more wiggle room to endure partial failures during off-peak hours, and technicians have more time to solve any problems that arise.
REAL PAIN. UUNet says it's sorry for the inconvenience. Its parent company even posted a prominent notice and apology on its Web site stating, "A preliminary investigation indicates there was a route table issue. We will continue to monitor the network, and we apologize to our customers who were affected."
And UUNet will feel the pain where it hurts: The company will be forced to cough up cash for breaking service agreements by letting its network dip below minimum performance requirements. UUNet's big mistake was not giving customers a chance to object to the timing of this upgrade -- an objection that savvy network operators who contract with UUNet for their Internet connectivity would likely have raised.
Of course, network maintenance is a constant undertaking, and warning everyone about every upgrade would be impractical. Alerting them about major upgrades anywhere close to prime business hours, however, would be a logical step. WorldCom says it always warns customers, but several network operators who pass data traffic to WorldCom, and who declined to be named, say they didn't get any advance information in this case.
NOT NIMBLE ENOUGH. Next, let's examine the mantra recited by Sprint, AT&T, and other big Net providers that more than enough capacity exists to cover an outage at any major carrier. Well, yes and no. The Internet surely has enough high-capacity fiber-optic cables running to enough places that it's feasible to reroute network traffic sufficiently to run around any big bottlenecks.
This doesn't work in practice, however, because the companies running the Internet simply aren't very agile. Ask any network engineer who has had to switch his company's data transmissions from one carrier to another, and you'll hear how it's a royal pain, something that can take days or even weeks. The systems linking all these carriers together don't talk to each other very well, resulting in semi-chaos when transferring traffic from one service to another. Besides, ISPs of any size have little incentive to help a customer that wants to leave.
The upshot? Any big outage at a major carrier pushes the Internet as a whole dangerously close to the edge. Over the long haul, the various carriers would do well to get together and plot out more specific temporary handoff plans for these emergency situations to give customers better continuity.
SEAMLESS SWITCHING. That's something AT&T is considering now. According to company spokesperson Dave Johnson, it has asked the folks at its Bell Labs subsidiary to study this or any other means for quickly remedying similar data outages in the future. Other carriers would be wise to take similar steps right now.
The ISPs and data backbone providers should have organized ways to switch customers from malfunctioning systems onto working ones and also ways to seamlessly move traffic to avoid malfunctioning parts of the Internet. The routers themselves are supposed to automatically route around bottlenecks. They eventually did so on Oct. 3 but not without major disruptions that lasted much of the day -- not exactly an indication of success. "The Internet is robust, but if one large carrier starts stammering, it will have an impact on the overall Internet," says AT&T's Johnson.
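The "route around bottlenecks" behavior the routers are supposed to deliver automatically can be sketched with a toy topology. The cities and links below are made up for illustration -- the point is simply that when a link fails, traffic finds an alternate path, though the recomputation across thousands of real routers is what took most of Oct. 3:

```python
from collections import deque

# Hypothetical backbone topology: each pair is a bidirectional link.
# (Illustrative only; not UUNet's actual network map.)
LINKS = {
    ("LA", "DC"), ("LA", "Chicago"), ("Chicago", "DC"),
    ("DC", "NYC"), ("Chicago", "NYC"),
}

def neighbors(node, links):
    """All nodes directly linked to `node`."""
    out = set()
    for a, b in links:
        if a == node:
            out.add(b)
        if b == node:
            out.add(a)
    return out

def shortest_path(src, dst, links):
    """Breadth-first search: a fewest-hops path from src to dst, or None."""
    seen, queue = {src}, deque([[src]])
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(neighbors(path[-1], links)):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path("LA", "NYC", LINKS))    # -> ['LA', 'Chicago', 'NYC']

# The LA-Chicago link fails; traffic automatically reroutes via DC:
failed = LINKS - {("LA", "Chicago")}
print(shortest_path("LA", "NYC", failed))   # -> ['LA', 'DC', 'NYC']
```

In the toy version the fallback is instant. On the real Internet, every affected router must learn of the failure, recompute, and converge -- and a misconfiguration, unlike a clean link failure, keeps feeding the network bad information, which is why the disruption dragged on.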
At a minimum, big data carriers should always tell customers when major software or hardware upgrades are coming in prime hours. Better still, they should make sure customers actually have input into the timing of these potentially hazardous maneuvers, even though that process could be a little sticky. And the carriers should go back to the drawing board and figure out how to hand off customers quickly and easily in case of major network outages.
Those simple steps might help avoid more serious disruptions -- or even blunt the impact of a terrorist attack -- down the road.
Editor's Note: After this article was published, WorldCom contacted BusinessWeek Online and provided updated information. According to WorldCom, it conducted a routine software upgrade during its normal maintenance window of 3 a.m. to 6 a.m. However, it says this software upgrade was not connected to the Oct. 3 outage. WorldCom says the outage actually resulted from a technician who, while repairing a router that was experiencing problems at approximately 8 a.m. EST, input an incorrect configuration. This was a normal repair activity, called a "break/fix," and the configuration error caused the traffic outages.

Salkever is Technology editor for BusinessWeek Online and covers computer security issues weekly in his Security Net column.