How Malformed Packets Caused CenturyLink's 37-Hour, Nationwide Outage (arstechnica.com)

Ars Technica reports on what went wrong last December when CenturyLink had a nationwide, 37-hour outage that disrupted 911 service for millions of Americans and prevented completion of at least 886 calls to 911. From the report: Problems began the morning of December 27 when "a switching module in CenturyLink's Denver, Colorado node spontaneously generated four malformed management packets," the FCC report said. CenturyLink and Infinera, the vendor that supplied the node, told the FCC that "they do not know how or why the malformed packets were generated." Malformed packets "are usually discarded immediately due to characteristics that indicate that the packets are invalid," but that didn't happen in this case, the FCC report explained: "In this instance, the malformed packets included fragments of valid network management packets that are typically generated. Each malformed packet shared four attributes that contributed to the outage: 1) a broadcast destination address, meaning that the packet was directed to be sent to all connected devices; 2) a valid header and valid checksum; 3) no expiration time, meaning that the packet would not be dropped for being created too long ago; and 4) a size larger than 64 bytes."
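
Those four attributes map directly onto the checks that would normally catch a bad packet. As a rough sketch (the field names and validation logic below are hypothetical, not Infinera's actual code), a packet with a valid checksum, a broadcast destination, no expiration time, and a payload over 64 bytes passes every test that ordinarily gets a malformed packet dropped:

    from dataclasses import dataclass
    import time
    import zlib

    BROADCAST = "ff:ff:ff:ff:ff:ff"     # "send to all connected devices"

    @dataclass
    class MgmtPacket:
        destination: str
        payload: bytes
        checksum: int                   # checksum over the payload
        expires_at: float | None        # None = the packet never expires

    def should_discard(pkt: MgmtPacket) -> bool:
        """Return True if the packet looks invalid and should be dropped."""
        if pkt.checksum != zlib.crc32(pkt.payload):
            return True                 # corrupted header/payload -> drop
        if pkt.expires_at is not None and pkt.expires_at < time.time():
            return True                 # created too long ago -> drop
        return False                    # otherwise accept and forward

    payload = bytes(65)                 # attribute 4: larger than 64 bytes
    storm_packet = MgmtPacket(
        destination=BROADCAST,          # attribute 1: broadcast destination
        payload=payload,
        checksum=zlib.crc32(payload),   # attribute 2: valid header and checksum
        expires_at=None,                # attribute 3: no expiration time
    )

    # The valid checksum and missing expiration time mean nothing drops the
    # packet; the broadcast destination and oversized payload then make every
    # copy fan out to all connected devices and consume processing power.
    print(should_discard(storm_packet))     # False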

The switching module sent these malformed packets "as network management instructions to a line module," and the packets "were delivered to all connected nodes," the FCC said. Each node that received the packet then "retransmitted the packet to all its connected nodes." The report continued: "Each connected node continued to retransmit the malformed packets across the proprietary management channel to each node with which it connected because the packets appeared valid and did not have an expiration time. This process repeated indefinitely. The exponentially increasing transmittal of malformed packets resulted in a never-ending feedback loop that consumed processing power in the affected nodes, which in turn disrupted the ability of the nodes to maintain internal synchronization. Specifically, instructions to output line modules would lose synchronization when instructions were sent to a pair of line modules, but only one line module actually received the message. Without this internal synchronization, the nodes' capacity to route and transmit data failed. As these nodes failed, the result was multiple outages across CenturyLink's network."
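
The feedback loop is easy to reproduce in miniature. The toy simulation below (a sketch of the behavior the report describes, not of Infinera's proprietary management channel) has every node re-broadcast each copy it receives to all of its neighbors; with no expiration time and no duplicate suppression, the number of copies in flight grows geometrically:

    # Toy broadcast storm: a small mesh of nodes, each of which forwards every
    # management packet it receives to all of its neighbors.
    NODES = ["denver", "omaha", "chicago", "dallas"]
    NEIGHBORS = {n: [m for m in NODES if m != n] for n in NODES}  # full mesh for simplicity

    # Each in-flight packet is just (current_node, packet_id); no TTL, no dedup.
    in_flight = [("denver", i) for i in range(4)]   # the four original malformed packets

    for step in range(1, 6):
        next_round = []
        for node, pkt in in_flight:
            for neighbor in NEIGHBORS[node]:
                next_round.append((neighbor, pkt))  # retransmit to every connected node
        in_flight = next_round
        print(f"step {step}: {len(in_flight)} copies in flight")

    # step 1: 12 copies, step 2: 36, step 3: 108 ... the load keeps multiplying
    # until it consumes the processing capacity of every node.
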
While CenturyLink dispatched network engineers to log in to affected nodes and removed the Denver node that had generated the malformed packets, the outage continued because "the malformed packets continued to replicate and transit the network, generating more packets as they echoed from node to node," the FCC wrote. Just after midnight, at least 20 hours after the problem began, CenturyLink engineers "began instructing nodes to no longer acknowledge the malformed packets." They also "disabled the proprietary management channel, preventing it from further transmitting the malformed packets."
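
In effect, the fix was a filter pushed to every node plus a kill switch for the channel itself. A minimal sketch of what "no longer acknowledge the malformed packets" might look like at a single node (the matching rule uses the attributes from the FCC's description; the function and constants are hypothetical):

    MAX_MGMT_SIZE = 64                  # management packets larger than this are suspect
    channel_enabled = True              # the proprietary management channel kill switch

    def acknowledge(destination: str, size: int, expires_at: float | None) -> bool:
        """Return True only if the packet should be acknowledged and retransmitted."""
        if not channel_enabled:
            return False                # channel disabled: nothing is forwarded at all
        if destination == "broadcast" and expires_at is None and size > MAX_MGMT_SIZE:
            return False                # matches the malformed-packet fingerprint: drop
        return True

    print(acknowledge("broadcast", 65, None))          # False -- dropped, the storm dies out
    print(acknowledge("node-7", 48, 1735344000.0))     # True  -- normal traffic still flows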

The FCC report said that CenturyLink could have prevented the outage or lessened its negative effects by disabling the system features that were not in use, using stronger filtering to prevent the malformed packets from propagating, and setting up "memory and processor utilization alarms" in its network monitoring.
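
The last of those recommendations is the easiest to picture: poll each node's processor and memory usage and alarm when either crosses a threshold, so a runaway feedback loop shows up in monitoring before it takes nodes down. A minimal sketch, with made-up thresholds and readings:

    CPU_ALARM_PCT = 85                  # illustrative thresholds, not CenturyLink's
    MEM_ALARM_PCT = 85

    def check_node(name: str, cpu_pct: float, mem_pct: float) -> list[str]:
        """Return alarm messages for one node's current utilization readings."""
        alarms = []
        if cpu_pct >= CPU_ALARM_PCT:
            alarms.append(f"{name}: CPU at {cpu_pct:.0f}% (threshold {CPU_ALARM_PCT}%)")
        if mem_pct >= MEM_ALARM_PCT:
            alarms.append(f"{name}: memory at {mem_pct:.0f}% (threshold {MEM_ALARM_PCT}%)")
        return alarms

    # During a packet storm, readings like these would page an operator within one
    # polling interval instead of surfacing as a multi-state 911 outage:
    for alarm in check_node("denver-node", cpu_pct=97.0, mem_pct=91.0):
        print("ALERT:", alarm)
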
  • Sounds almost like an AI trying to come to life.

  • A throwback! [slashdot.org]
  • ... riiiight .... (Score:5, Insightful)

    by pz ( 113803 ) on Monday August 19, 2019 @08:02PM (#59104070) Journal

    ... spontaneously generated four malformed management packets

    Improbable event #1. Not one malformed packet, but FOUR.

    Each malformed packet shared four [sic] attributes that contributed to the outage: 1) a broadcast destination address, meaning that the packet was directed to be sent to all connected devices;

    Improbable event #2.

    2) a valid header and valid checksum;

    Improbable events #3 (valid header) and #4 (valid checksum).

    3) no expiration time, meaning that the packet would not be dropped for being created too long ago;

    Improbable event #5, so highly improbable that it should count as three improbabilities.

    4) a size larger than 64 bytes.

    Improbable event #6.

    Right. They have no idea how these four packets spontaneously happened, packets that appear to have been perfectly designed to exploit a bug that would bring their network to its knees.

    The questions we SHOULD be asking are: what other networks of devices use the same protocol, and have they now been hardened as well?

    • by guruevi ( 827432 )

      From the sound of it, it was so poorly designed that two bit flips leading to an overflow could've led to all of this. Parity checks can easily be fooled by an even number of bit flips.
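
      For example, a single parity bit only catches an odd number of flipped bits; flip two and the check still passes (a quick hypothetical sketch):

      def parity(word: int) -> int:
          return bin(word).count("1") % 2        # single parity bit over the word

      original = 0b10110100
      stored_parity = parity(original)

      corrupted = original ^ 0b00000011          # two bit flips -- an even number
      print(parity(corrupted) == stored_parity)  # True: the parity check is fooled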

      • They still:

        do not have a valid DKIM signature for their domain (email is often used in emergencies)
        allow TLS connections to perform client-initiated renegotiation
        have no DNSSEC
        do not implement or consume signed resources (RPKI), the absence of which can lead to country-level outages

        That is a pretty large failure if you ask me.

    • by chill ( 34294 )

      Infinera. I'm not familiar with them, and they seem to be a relatively new player in telecom -- at least compared to Cisco, Nokia/Alcatel/Lucent, Ericsson, Juniper, etc. They claim fairly wide adoption, though.

      The dedicated management channel is enabled on all equipment from Infinera (good). CenturyLink knew it was there (good). They didn't use it (bad). They didn't disable it (worse). (FCC Report, A-9)

      Despite its name, the management channel is NOT designed to send management instructions. (FCC Report, A-9)

    • by AnriL ( 657435 )
      Question zero: what kind of logging does one have in place to be able to go back through the network traffic from before a broadcast storm and actually identify 4 random packets that happened on a wire? That would imply they effectively have a continuous Wireshark tap on the whole network.
    • Next week on Slashdot: "Disgruntled/former employee charged with malicious IT attack in CenturyLink incident"

    • by Tree131 ( 643930 )

      ... spontaneously generated four malformed management packets

      Improbable event #1. Not one malformed packet, but FOUR.

      immhackulate conception?

    • My unprofessional and limited understanding of the science here: there is one theory out there for how viruses may be spontaneously created in complex life-forms. Most genetic-based messages in the human body that might change gene expression go beyond the simple signals of the nervous system, the chemical changes of the endocrine system, and the systemic changes of the immune system. Part of these "packets of genetic message" are the correct keys to get access to cells, a message that says "copy me", and it

  • by msauve ( 701917 ) on Monday August 19, 2019 @09:36PM (#59104282)
    So, it was a broadcast storm. What's with all the verbiage? No indication of how these frames with valid headers and checksums were "malformed." The report as described stinks of obfuscation.
    • So, it was a broadcast storm. What's with all the verbiage? No indication of how these frames with valid headers and checksums were "malformed." The report as described stinks of obfuscation.

      This is just one more episode in the CenturyLink Soap Opera. They're a godawful ISP. If you can imagine it, they're even worse than Cox Communications.

    • The report as described stinks of obfuscation.

      I'm willing to wager that it was written by their billing department.

    • by magister ( 9423 )

      So, it was a broadcast storm. What's with all the verbiage?

      I was thinking exactly the same thing and was wondering if the network engineers were just too dumb and had disabled the default features that prevent this kind of problem... Then I read the FCC report. It's not clear if the equipment CenturyLink uses consists of actual routers or just glorified switches. It also describes how they have an intentional network loop set up for 'redundancy', but there is no indication of any routing or switching protocols running to prevent loops. So it's probably a combination of inexperienced network

  • Agent: Have you tried turning them off and on again?

    CenturyLink: Yes.

    Agent: No, really. Turn them all off, then turn them all on again.

    • by pz ( 113803 )

      And the funny thing is that this approach would have worked, and much faster than 37 hours, if they were all off at the same time for a brief period!

    • They spent so much time logging into each node individually and poking around and totally missed this option. It probably would have caused a bigger disruption, but for a much shorter time.

  • After all they've done for us? Quaint.

  • Malformed packets "are usually discarded immediately due to characteristics that indicate that the packets are invalid"

    So are they saying that malformed packets have been generated commonly for a long time? Sounds like something worth addressing.

    • It seems like a very strange problem to have now. I had an older Cisco gateway router that could be brought down by malformed packets in the early 2000s, and the solution was to set up an ACL that would drop such packets. Do these guys not have network engineers, and is the hardware they're using not capable of filtering these kinds of packets? It seems like such a 1990s kind of problem.
