Optus Outage Exposes Australia's Internet Resilience

15 November 2023

Senior Manager, Internet Technology - Asia-Pacific, Internet Society

Categories: Resilience

A resilient Internet is crucial for a flourishing economy and the advancement of a country. It supports businesses, fuels innovation, and connects communities, forming a backbone for modern society's progress and prosperity.

Network resiliency is about much more than preventing network failures or recovering from disasters. It acknowledges that such incidents are inevitable and prioritizes both rapid restoration of services by the network operations team and the pre-emptive planning and robust system design that limit the impact of outages.

In the early hours of 8 November 2023, a widespread service disruption left subscribers of Optus, Australia's second-largest telecom provider, without Internet connectivity. This extensive outage affected the fixed broadband and mobile communications of over 10 million individuals and 400,000 businesses and services, including 000 emergency services, hospitals, banks, and public transport services.

Optus’ mobile virtual network operator (MVNO) providers, such as Amaysim and Coles Mobile, and Optus mobile users abroad were also affected.

As per Figure 1, services began to be restored around 13:00 AEDT, but complete restoration took several hours, with some customers reporting continued issues well into the evening.

Figure 1 — Internet traffic to Optus network (AS4804). Source: Kentik.

From what we know so far, a minor technical slip-up started a chain reaction that led to a major Internet blackout. The lack of robust, tested resilience protocols prolonged the recovery.

We need to look at this mistake together and learn from it. If we don't, the same kind of problem will happen to someone else, as we have already seen in Canada and Italy in the last 18 months.

Read:

Italy’s Internet Outage a Perfect Storm

Rogers Outage: What do we Know After Two Months?

Taking Responsibility

Optus and government officials quickly assessed and communicated that the outage was not due to a cyberattack. However, Optus has yet to release a proper Root Cause Analysis (RCA), a practice we've become accustomed to due to the open policies of companies such as Cloudflare, Fastly, and, in the case of Australia, Telstra and Aussie Broadband. These organizations set a benchmark for transparency by promptly providing thorough causal assessments after service disruptions, an expectation that is now standard in service provision and customer communication.

Australian journalists are trying to fill this void by actively pursuing every available avenue to gather information related to the incident, with multiple news articles highlighting the possible cause of the incident and citing unofficial sources from Optus. As quoted in the Sydney Morning Herald post:

Optus said those updates were sent “following a routine software upgrade”. The Optus source added that the software update occurred at the Singtel Internet Exchange, not on the Optus network.

Once received, the routing information changes then propagated through multiple layers in the Optus network and exceeded “preset safety levels” on key routers that could not handle them, resulting in the routers disconnecting from Optus’ core network “to protect themselves”.

In response to that news article, Jared Mauch (a network architect at Akamai) posted on X (formerly Twitter), highlighting the issues mentioned in the article that are related to the Border Gateway Protocol (BGP) Maximum-Prefix filter.

What is a BGP Maximum-Prefix Filter?

The BGP Maximum-Prefix filter is a safety mechanism to protect a router from being overwhelmed by too many prefixes. When the number of prefixes received from a neighbor exceeds the configured maximum, the filter can trigger a warning and, by default, shut down the BGP session.

This filter helps prevent routing table explosion, which can lead to network outages. It is beneficial in mitigating the effects of misconfigurations or routing instabilities that can propagate large numbers of prefixes unexpectedly.
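The mechanism can be illustrated with a small Python sketch. This is a toy model rather than any vendor's implementation: the class name, the event strings, and the 90 percent warning threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MaxPrefixSession:
    """Toy model of one BGP session guarded by a maximum-prefix limit."""

    max_prefixes: int
    warn_percent: int = 90  # assumed warning threshold, as a percentage of the limit
    state: str = "Established"
    warned: bool = False
    prefixes: set = field(default_factory=set)

    def receive(self, prefix: str) -> str:
        """Process one announced prefix and return the resulting event."""
        if self.state != "Established":
            return "ignored"  # a torn-down session accepts nothing
        self.prefixes.add(prefix)
        count = len(self.prefixes)
        if count > self.max_prefixes:
            # Limit exceeded: drop the session and everything learned over it.
            self.state = "Idle"
            self.prefixes.clear()
            return "teardown"
        if not self.warned and count * 100 >= self.max_prefixes * self.warn_percent:
            self.warned = True
            return "warning"
        return "accepted"


session = MaxPrefixSession(max_prefixes=10)
events = [session.receive(f"10.0.{i}.0/24") for i in range(12)]
print(events)
```

In this run the ninth prefix triggers the warning, the eleventh tears the session down, and anything announced afterwards is ignored until the session is brought back up.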

In all major router platforms, such as Cisco, Juniper, and Arista, when the BGP maximum prefix limit is reached, the default behavior is to tear down the BGP session. The session will remain down until it has been cleared manually. In the case of Cisco routers, it will stay down unless the following command is used to re-establish the session:

clear ip bgp x.x.x.x

In his thread, Jared provided BGP Maximum-Prefix filter commands for Cisco and Juniper routers.

Cisco example (set restart time):
neighbor 10.4.9.5 maximum-prefix 1000 90 restart 60

Juniper example:
family inet prefix-limit maximum 1000 teardown idle-timeout 60

ALSO for Juniper look at using accepted-prefix-limit instead, even if you filter the routes out they will count
— jared mauch (@jaredmauch) November 14, 2023

The Cisco "restart" and Juniper “teardown idle-timeout” options, each configured with a time interval, allow a BGP peering session that was shut down for exceeding the maximum-prefix limit to reconnect automatically once the interval has passed. Network operators then do not need to intervene manually to re-establish the session. However, the session will go down again if the condition that caused the shutdown still exists.
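A minimal Python sketch of this reconnect behavior follows. It is a simplification under stated assumptions: time advances in whole minutes, the neighbor's announcements are reduced to a single prefix count per minute, and the function and parameter names are hypothetical.

```python
def simulate_restart(offered_per_minute, max_prefixes, restart_minutes):
    """Return the session state for each simulated minute.

    offered_per_minute[i] is how many prefixes the neighbor would announce
    in minute i if the session were up.
    """
    states = []
    down_until = 0  # first minute at which the session may come back up
    for minute, offered in enumerate(offered_per_minute):
        if minute < down_until:
            states.append("down")  # still waiting out the restart interval
            continue
        if offered > max_prefixes:
            # The session comes up, exceeds the limit, and is torn down again.
            down_until = minute + restart_minutes
            states.append("teardown")
        else:
            states.append("up")
    return states


offered = [500, 1500, 1500, 1500, 1500, 800, 800]
states = simulate_restart(offered, max_prefixes=1000, restart_minutes=2)
print(states)
```

The session flaps between "teardown" and "down" for as long as the neighbor keeps offering more than 1,000 prefixes, and it stays up only once the underlying cause has cleared.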

Signs of Lack of Resilience Were Clear to See

If we want to reflect on the impact of this significant Internet outage in Australia, it is crucial to take a holistic view. Looking at Optus' resilience profile from a technical perspective, several warning signs were apparent before the outage occurred.

The Internet Society embarked on a mission last year to measure the Internet Resilience Index (IRI), which tracks open-source Internet resiliency metrics to support the development of policies and infrastructure to improve Internet resilience at local, regional, and global levels.

Adopting good network operation norms and implementing key technologies such as Resource Public Key Infrastructure (RPKI) shows intent from network operators to make their networks more resilient to network outages.

Whether or not Optus had implemented RPKI played no role in this outage. However, Optus' decreasing RPKI adoption rate, compared with the adoption trends of other Australian networks, is a worrying sign for its overall resilience.

Advertising routes with invalid Route Origin Authorizations (ROAs) reflects an oversight in committing to a secure routing framework. Not implementing route validation based on RPKI (Route Origin Validation, or ROV) can be seen as a lapse in promoting network security. These are some of the actions recommended in the MANRS initiative to improve routing security, with large Australian operators, including Telstra and Vocus, already participating. For more details, check the MANRS website.

Figure 2 — Optus (AS4804) RPKI adoption. Source: MANRS Observatory.
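For readers unfamiliar with Route Origin Validation, the Python sketch below shows the basic decision it makes for each route: valid, invalid, or not-found. It follows the general logic of RFC 6811 in simplified form. The ROA entries use documentation prefixes and AS numbers, and the type and function names are assumptions made for illustration, not data about any real network.

```python
import ipaddress
from typing import NamedTuple


class ROA(NamedTuple):
    prefix: str      # prefix the ROA covers
    max_length: int  # longest prefix length the holder authorizes
    origin_asn: int  # AS authorized to originate the prefix


def validate_origin(route_prefix: str, origin_asn: int, roas: list) -> str:
    """Classify a route as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # Only ROAs whose prefix covers the route are relevant.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if roa.origin_asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but matched by none means invalid.
    return "invalid" if covered else "not-found"


# Hypothetical ROA set using documentation address space and AS numbers.
roas = [ROA("203.0.113.0/24", 24, 64500)]

results = {
    "authorized origin": validate_origin("203.0.113.0/24", 64500, roas),
    "wrong origin": validate_origin("203.0.113.0/24", 64501, roas),
    "too specific": validate_origin("203.0.113.0/25", 64500, roas),
    "no covering ROA": validate_origin("192.0.2.0/24", 64500, roas),
}
print(results)
```

A network that performs ROV would typically drop routes classified as invalid, which is why advertising prefixes with invalid ROAs reduces their reachability.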

Another clue was Optus’ choice not to engage in local peering within Australia's dynamic interconnection landscape, where most local operators peer locally. While this was possibly a strategic business decision, it overlooks an opportunity to bolster network resilience through diverse connection points.

Review Must Also Evaluate Market Resilience

Any reflection should evaluate the unintentional effects of market consolidation and centralized control, grappling with the complexities of gauging Internet resilience.

At this point, it’s well-known how deeply Internet technologies are woven into society, from commerce and healthcare to transportation and political systems. Before delving into this reflective analysis, those involved in Internet operations must gain a comprehensive grasp of the details of the Optus incident.

On 9 November 2023, in response to the Optus incident, the Australian Senate referred this matter to the Environment and Communications References Committee for inquiry, and its first public hearing is set for 17 November 2023. More details about that hearing are available here.

In the submission the Internet Society made in response to the Australian Cyber Security Strategy Discussion Paper 2023-2030, we highlighted some important points about routing security mandates. We want to re-emphasize those points here from a network resiliency perspective.

The Internet ecosystem is rapidly evolving, with new technologies, standards, protocols, platforms, and services constantly emerging. As such, it can be challenging to develop and enforce laws and regulations that keep up with the changing nature of the Internet. While there are proactive actions the Australian Federal Government can take to improve network resiliency, we caution against prescriptive mandates across the board, which could have serious unintended consequences. We firmly believe that norms, rather than laws, better govern the Internet industry.

Norms are more flexible and adaptable than laws, which can be rigid and slow to change. As the Internet industry evolves, norms can quickly adapt to new technologies and practices. They are often developed collaboratively by stakeholders in the Internet industry, including expert tech groups, civil society organizations, and governments. This collaborative approach can promote shared values and goals, leading to greater stakeholder cooperation and trust. It also encourages innovation by promoting best practices, encouraging experimentation, and fostering a culture of continuous improvement.

In the world of Internet operations, operators need to learn from each other's mistakes. Keeping operational failures secret doesn't help anyone. It only increases the chance that small, hidden problems will become significant. The government could encourage big network providers to share openly about their outages so everyone can learn and improve together.
