
Internet Outages: A Case of When, Not If

Robbie Mitchell
Senior Communication and Technology Advisor, Internet Society
Categories:
Resilience
December 5, 2023

There’s never a good time to lose Internet connectivity. Most of us can fix day-to-day connectivity issues by turning our router off and back on or turning flight mode on and then off. But problems closer to the Internet’s core require more technical troubleshooting by the engineers who oversee the affected networks.

Last month, we reported that one-third of Australian Internet and mobile phone users were left without Internet connectivity for several hours due to a minor technical slip-up by the country’s second-largest operator.

Two weeks later, one of the world’s largest Internet Exchange Points (IXPs), Amsterdam Internet Exchange (AMS-IX), also experienced a minor technical fault for several hours, which reduced its traffic by nearly 80%.

Time series graph showing the Internet traffic flowing through AMS-IX Amsterdam's network.
Figure 1 — AMS-IX traffic dropped from its peak average of 10 Tb/s to 2 Tb/s. Source: AMS-IX.

The causes and effects of these outages were very different, given that the two networks serve two different purposes.

Peering is Key to Keeping the Internet Going

In the first case, the Internet Service Provider (ISP), Optus, directly serves the Internet and telecommunication needs of around 10 million individual customers and indirectly serves the rest of the Australian population by providing emergency service operations, banking operations, and e-government services.

As Figure 2 shows, the effect of the outage on Australia’s overall Internet connectivity was relatively small (-8%). This reflects both the diversity of retail ISPs in Australia and the country’s strong peering culture, which ranks first in the Asia Pacific as per the Pulse Country Reports.

Time series graph showing the Internet connectivity for Australia from 7-9 November 2023
Figure 2 — Internet connectivity in Australia dropped nearly 10% during the outage. Source: IODA.

The connectivity issues experienced by Optus might have been mitigated through increased peering with other local networks and IXPs. But this remains speculative, as we have yet to review a detailed root cause analysis of the incident.

In the second case, AMS-IX doesn’t directly serve individual Internet users, per se, but it does serve many indirectly. IXPs are physical locations where ISPs and Content Delivery Networks (CDNs) exchange traffic with each other, reducing latency and cost. AMS-IX has 16 such sites in the Netherlands, serving around 870 networks. It also has locations in North and South America, Africa, the Middle East, and Asia, so any outage can be felt globally.

Even though AMS-IX is the dominant IXP in the Netherlands, there are others, including NL-IX, which, as per a RIPE Labs analysis (Figure 3), received plenty of the traffic that was rerouted during the outage.

Three time series graphs showing routes being rerouted from AMS-IX to other IXPs and other networks.
Figure 3 — Paths to AMS-IX started to be rerouted to non-IXPs (middle chart) and other large and local IXPs such as DE-CIX, LINX, and NL-IX (bottom chart). Source: RIPE Labs.

In the end, neither the Netherlands (Figure 4) nor any other country where AMS-IX has locations saw a drop in Internet connectivity due to the outage.

Time series graph showing Internet connectivity for Netherlands.
Figure 4 — Internet connectivity (#Visible /24s) for the Netherlands was not affected during the two timeframes AMS-IX experienced its outage. Source: IODA.

This is how the Internet should work and does work in environments with strong peering ecosystems. It is part of why the Netherlands ranks in the top five most resilient local Internet networks globally as per the Pulse Internet Resilience Index.

Timely Communication Necessary to Reduce Effect

A subplot to these two incidents is how the two organizations have communicated about their respective outages.

Optus has yet to release a Root Cause Analysis (RCA) of its outage. Providing such analysis promptly helps other networks globally understand the issue and test their own networks to prevent a similar outage and maintain connectivity for rerouted traffic. This point is not lost on the Australian government, which is investigating the adequacy of Optus’ communications on the day of the outage as part of an ongoing inquiry.

An RCA plays a pivotal role in the Internet community. When an organization chooses not to share its RCA findings, it withholds valuable learning opportunities from other operators, who may then repeat the same mistakes. Sharing insights is crucial in an ecosystem that relies so heavily on interconnection. We can only build a more robust and reliable global network, and collectively enhance the resilience of the Internet, by openly discussing and learning from each other’s experiences and errors.

On the other hand, AMS-IX quickly provided a detailed timeline and analysis of its outage, which was welcomed by the networking community at the RIPE 87 conference in Rome last week.

Outages are a fact of life for Internet providers. It’s not a case of if but when. Luckily, plenty of providers and countries as a whole apply best current practices to keep the Internet working when something goes wrong.

Watch the “Indexing Europe’s Internet Resilience” presentation at RIPE 87.