Comparing Batch and Streaming Detection of Internet Outages

6 November 2025

Erica Stutz

Guest Author | University of Southern California

Categories: Resilience

In short

Batch and streaming Internet outage detection methods share similar frameworks and so produce similar results.
In nearly all of the 16% of cases where results differ, only one system is confident enough to report, so using both methods can improve visibility.
Batch detection is preferable when high accuracy is required, while streaming provides a reliable, near-real-time alternative.

Many commercial and academic Internet outage detection systems use Trinocular to evaluate network reliability.

The original Trinocular system processed data in batches every three months. In 2016, we deployed a near-real-time Trinocular that streams results, posting new data to our outage website. Both methods remain widely used, because algorithms that require days of data cannot run in near real time.

Even though the two methods share the same underlying conceptual algorithms, Dr. John Heidemann, Yuri Pradkin, and I recently compared their performance to understand how much difference specific algorithms make.

Our overall result showed that batch and streaming Trinocular agreed more than 84% of the time over an eight-day period.

So, What About the Other 16%?

When we evaluated the instances in which the two systems disagreed, we found:

They produce conflicting results 0.2% of the time, suggesting that streaming is quite reliable but not identical.
In almost all cases of non-agreement (15% of overall time), only one system is confident enough to report. We trace this difference to the long-term algorithms, which need days of data and so cannot run in near real time.
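To make the three categories concrete, here is a minimal sketch of how paired observations could be tallied. It assumes each system emits one of three hypothetical labels ("up", "down", "unknown") per /24 block per time bin; the labels, the function name, and the sample data are illustrative assumptions and are not taken from Trinocular itself.

```python
from collections import Counter


def compare_timelines(batch, streaming):
    """Return the fraction of time bins in each comparison class.

    batch, streaming: equal-length lists of "up" / "down" / "unknown".
    This sketch counts identical labels (including unknown/unknown) as
    agreement, which is a simplifying assumption.
    """
    if len(batch) != len(streaming):
        raise ValueError("timelines must cover the same time bins")
    if not batch:
        raise ValueError("timelines must not be empty")

    counts = Counter()
    for b, s in zip(batch, streaming):
        if b == s:
            counts["agree"] += 1
        elif "unknown" in (b, s):
            # Only one system was confident enough to report a state.
            counts["one_reports"] += 1
        else:
            # One system says up while the other says down.
            counts["conflict"] += 1

    total = len(batch)
    return {k: counts[k] / total for k in ("agree", "one_reports", "conflict")}


# Made-up timelines for a single block, five time bins each.
batch = ["up", "up", "down", "unknown", "up"]
streaming = ["up", "down", "down", "up", "up"]
result = compare_timelines(batch, streaming)
print(result)
```

In this toy input, three bins agree, one bin has a single confident system, and one bin is a true conflict. The study's figures (84%, 15%, 0.2%) come from the real eight-day dataset, not from code like this.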

We selected two representative network events from the complete set of observed outages during our study period, each affecting more than 20 /24 IP blocks.

Figure 1 shows the key differences between batch and streaming detection. The top-left and top-middle panels show the outages as detected by the batch and streaming systems, respectively. Each horizontal line represents a /24 block, with colored segments indicating periods of unreachability and white indicating reachability.

Figure 1: Visualizing differences between batch and streaming outage detection (five scatter plots showing outage detection results using batch and streaming methods).

The first example, marked in pink (br-1 and br-2), shows two 5-hour outages separated by a 5.5-hour period of reachability. These events began on 2 March 2021 at 7:00 and 18:00 in the G7 Telecom Ltd (AS263015) network in Bahia, Brazil, affecting 23 /24 blocks across five /16 prefixes.

The second example, in green (kr), shows a 4-hour outage starting at 8:00 on 2 March 2021 in the LG POWERCOMM (AS17858) network in Seoul, South Korea, affecting 27 /24 blocks across five /16 prefixes.

While the top panels appear similar overall, streaming (top-middle graph) detects several extended outages that are absent in batch, highlighted as horizontal lines labeled (long-1).

The bottom panels emphasize differences: the bottom-left highlights outages seen by batch but missed by streaming, while the bottom-right shows the reverse. In both, colored segments reflect discrepancies between the two approaches. Notably:

Batch outages tend to start earlier than streaming outages; the Korean outage lasts longer in streaming than in batch; and a brief outage in Brazil around hour 38, labeled (br-3), is detected only by streaming.
The long outages labeled (long-1) are exclusive to streaming but do not appear in the batch-up/streaming-down comparison, indicating they were never detected by batch. These differences result from algorithmic trade-offs inherent in batch and streaming detection system design.
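The two difference panels can be thought of as filters over paired states. The sketch below is an assumption-laden illustration, not the code behind Figure 1: it supposes each system's output is a mapping from a hypothetical (block, time_bin) key to "up", "down", or "unknown", and it uses a documentation prefix (192.0.2.0/24) with made-up states.

```python
def difference_panels(batch, streaming):
    """Split disagreeing keys into the two difference sets.

    batch, streaming: dicts mapping (block, time_bin) to
    "up" / "down" / "unknown". Keys missing from either mapping are
    treated as "unknown" (an assumption of this sketch).
    Returns (batch_down_streaming_up, batch_up_streaming_down).
    """
    batch_down_streaming_up = set()
    batch_up_streaming_down = set()
    for key in batch.keys() | streaming.keys():
        b = batch.get(key, "unknown")
        s = streaming.get(key, "unknown")
        if b == "down" and s == "up":
            batch_down_streaming_up.add(key)
        elif b == "up" and s == "down":
            batch_up_streaming_down.add(key)
        # Any pair involving "unknown" falls into neither set.
    return batch_down_streaming_up, batch_up_streaming_down


BLOCK = "192.0.2.0/24"  # documentation prefix, illustrative only
batch = {
    (BLOCK, 7): "down",
    (BLOCK, 8): "down",
    (BLOCK, 9): "up",
    (BLOCK, 10): "unknown",
}
streaming = {
    (BLOCK, 7): "up",
    (BLOCK, 8): "down",
    (BLOCK, 9): "down",
    (BLOCK, 10): "down",
}
batch_only, streaming_only = difference_panels(batch, streaming)
print(sorted(batch_only), sorted(streaming_only))
```

With these toy states, hour 7 lands in the batch-down/streaming-up set (an earlier start in batch) and hour 9 lands in the batch-up/streaming-down set (a longer outage in streaming). Hour 10 appears in neither set because batch reports no confident state there.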

Overall, streaming tends to slightly overreport outages, especially in cases where batch confirms reachability or where available information is limited. Consequently, batch detection is preferable when high accuracy is required, while streaming provides a reliable, near-real-time alternative.

Our findings underscore the importance of validating independent implementations, even when they share the same underlying conceptual algorithms, to ensure robust and trustworthy outage detection.

Erica Stutz began this work as an undergraduate at Swarthmore College while collaborating remotely with the University of Southern California. She is now pursuing her PhD in Computational Biology and Biomedical Informatics at Yale University.

Contributors: John Heidemann and Yuri Pradkin.

The views expressed by the authors of this blog are their own and do not necessarily reflect the views of the Internet Society.

Photo by NaJina McEnany via Wikimedia Commons
