Digits and junk

"It's theirs": what really happens when we say the internet is down?

A look back at the worst digital service disruptions of 2025, a particularly disastrous year that highlights shortcomings in the network of networks.

19/12/2025

Barcelona. This year, 2025, has repeatedly reminded us that we live in an internet-dependent society. Digital networks and services have suffered some of the most massive outages in recent history, affecting hundreds of millions of users worldwide. Worryingly, many of these outages were caused by human error, badly deployed updates or faulty configurations. What happened to the distributed internet that was supposed to be so resilient?

The most serious incident occurred on October 20th, when an Amazon Web Services (AWS) outage generated more than 17 million reports on DownDetector. But this figure only counts users frustrated enough to actually visit a third-party website and click the red button. Industry estimates suggest that for every person who actively complains, between 20 and 100 simply stay silent and restart their router. Taking a middle value of 50, we're talking about some 850 million people affected: roughly one in every six or seven internet users worldwide was left hanging because of an Amazon error.
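The extrapolation above is simple arithmetic, and it can be reproduced as a back-of-the-envelope sketch. The 17 million reports and the range of 20 to 100 silent users per complaint come from the text; the global internet population of about 5.5 billion is an assumed round figure, not a number from the article. Here each complainer is counted alongside the silent users behind them.

```python
# Back-of-the-envelope estimate of how many people the AWS outage reached.
REPORTS = 17_000_000            # DownDetector reports cited in the article
INTERNET_USERS = 5_500_000_000  # assumption: approximate global internet population


def estimated_affected(reports: int, silent_per_report: int) -> int:
    """Complainers plus the silent users assumed to stand behind each one."""
    return reports * (1 + silent_per_report)


for silent in (20, 50, 100):  # low, middle and high end of the range in the text
    total = estimated_affected(REPORTS, silent)
    share = total / INTERNET_USERS
    print(f"{silent:>3} silent per report: {total / 1e6:,.0f} million people, "
          f"{share:.1%} of users (about 1 in {1 / share:.1f})")
```

With the middle value of 50, the result lands at about 867 million people, in line with the article's "some 850 million", while the two ends of the range show how wide the uncertainty really is.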

The outage, which lasted more than 15 hours, originated in the automated DNS management system for the DynamoDB database service in the US-EAST-1 region. This is the internet's messy storage room, located in Virginia, where layers of obsolete technology pile up. It is a single point of failure that every systems architect knows to avoid, yet everyone uses it, because it is where Amazon deploys new features first.

The problem was a technical blunder of unbelievable stupidity. An automated update caused two processes to try to write to the DNS system at the same time. Instead of letting one take precedence over the other, the system got confused and ended up deleting the DNS records that pointed to the database. The result: the database was working perfectly, but nobody could find the way to reach it. It was as if someone had erased the AP-7 highway from every GPS: the highway is still there, but cars can't find it.
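The kind of race described above can be illustrated with a toy model. Everything in this sketch is hypothetical: the `DnsTable` class, the plan version numbers, the endpoint name and the cleanup rule are illustrative stand-ins, loosely modeled on the article's description, not AWS's actual DNS automation. A delayed process writes a stale plan over a newer one, a cleanup step then deletes records belonging to "old" plans, and the name stops resolving even though nothing is wrong with the database behind it.

```python
from dataclasses import dataclass, field


@dataclass
class DnsTable:
    # Hypothetical model: hostname -> (plan version, list of IP addresses)
    records: dict = field(default_factory=dict)

    def apply_unchecked(self, name, version, ips):
        # Buggy writer: overwrites whatever is there, even with an older plan.
        self.records[name] = (version, ips)

    def apply_checked(self, name, version, ips):
        # Safer writer: refuses to replace a newer plan with an older one.
        current = self.records.get(name)
        if current is not None and current[0] > version:
            return False
        self.records[name] = (version, ips)
        return True

    def cleanup(self, name, newest_version):
        # Deletes the record if it belongs to a plan older than the newest one.
        current = self.records.get(name)
        if current is not None and current[0] < newest_version:
            del self.records[name]


def simulate(checked):
    table = DnsTable()
    apply = table.apply_checked if checked else table.apply_unchecked
    name = "db.example.internal"           # hypothetical endpoint name
    apply(name, 2, ["10.0.0.2"])           # fast process applies the new plan
    apply(name, 1, ["10.0.0.1"])           # delayed process applies a stale plan
    table.cleanup(name, newest_version=2)  # cleanup removes "old" plans
    return table.records.get(name)


print(simulate(checked=False))  # the record is gone: the name no longer resolves
print(simulate(checked=True))   # the newer plan survives
```

The guarded writer, which compares versions before writing, is one common way to make the second write lose cleanly instead of leaving the system with an empty record.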

The ultimate irony is that AWS's own dashboards also depended on this DNS, so when Amazon engineers tried to get into the system to fix the problem, they couldn't. Among the affected users were owners of Eight Sleep smart beds, who couldn't control them: some people need AWS to sleep.

The second most notorious incident was the PlayStation Network outage on February 7, with nearly 4 million complaints. Its 116 million monthly users were unable to play for 24 hours, the second-longest outage in PSN history, surpassed only by the one in 2011. Most frustratingly, it coincided with the launch of the Monster Hunter Wilds beta. The PSN outage is a brutal reminder that we don't own our games. When you buy a game for 70 euros in the digital store, you are buying the right to play for as long as Sony wants to, and is able to, keep the servers running. Even single-player games stopped working if they needed to connect to validate their license.
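Whether a single-player game survives a server outage depends on where the license check lives. The following is a minimal, entirely hypothetical sketch that does not describe how PlayStation licensing actually works: the first function asks a license server on every launch, while the second, an assumed alternative design, falls back to a locally cached license verified within an assumed 30-day grace period. All names are illustrative.

```python
from datetime import datetime, timedelta


class LicenseServerDown(Exception):
    """The (hypothetical) license server could not be reached."""


def verify_online(game_id: str, server_up: bool) -> bool:
    # Stand-in for a network call to the store's license server.
    if not server_up:
        raise LicenseServerDown(f"cannot verify {game_id}")
    return True


def can_launch_online_only(game_id: str, server_up: bool) -> bool:
    # Every launch must be confirmed by the server: no server, no game.
    try:
        return verify_online(game_id, server_up)
    except LicenseServerDown:
        return False


def can_launch_with_grace(game_id: str, server_up: bool, last_verified: datetime,
                          now: datetime, grace: timedelta = timedelta(days=30)) -> bool:
    # Alternative design: trust a recently verified, locally cached license.
    try:
        return verify_online(game_id, server_up)
    except LicenseServerDown:
        return now - last_verified <= grace


outage_day = datetime(2025, 2, 7, 20, 0)
print(can_launch_online_only("single-player-game", server_up=False))
print(can_launch_with_grace("single-player-game", server_up=False,
                            last_verified=datetime(2025, 2, 1), now=outage_day))
```

In the first design the outage blocks the game outright; in the second, it only blocks players whose cached license has gone stale.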

Cloudflare, a company whose business is precisely protecting websites against outages, also suffered major disruptions. The one on November 18 affected Spotify, ChatGPT and Discord for almost five hours. The cause was prosaic: an engineer made a change to a database used by the bot-detection system. The change caused an internal query to return duplicate data, which made a configuration file grow beyond the size the software could read. When Cloudflare's thousands of servers received this oversized file, the software panicked and the servers entered an endless restart loop. CEO Matthew Prince admitted it was "the worst incident since 2019."
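The failure pattern described above, a hard size limit combined with a reader that gives up on bad input, can be sketched in a few lines. The 200-entry limit, the feature names and both loader functions are illustrative assumptions, not Cloudflare's code. The fragile loader raises and takes its caller down with it; the fallback loader rejects the oversized file and keeps the last configuration that worked.

```python
MAX_FEATURES = 200  # hypothetical hard limit the reader was built with


class ConfigTooLarge(Exception):
    """Raised when a configuration file exceeds what the reader can handle."""


def parse_features(rows: list[str]) -> list[str]:
    if len(rows) > MAX_FEATURES:
        raise ConfigTooLarge(f"{len(rows)} entries exceed the limit of {MAX_FEATURES}")
    return list(rows)


def load_fragile(rows: list[str]) -> list[str]:
    # No error handling: an oversized file propagates the exception to the caller,
    # the Python equivalent of the software "panicking".
    return parse_features(rows)


def load_with_fallback(rows: list[str], last_good: list[str]) -> list[str]:
    # Reject the bad file but keep serving traffic with the previous configuration.
    try:
        return parse_features(rows)
    except ConfigTooLarge as err:
        print(f"rejected new config: {err}; keeping last known good")
        return last_good


good = [f"feature_{i}" for i in range(120)]
bad = good + good  # a query returning every row twice doubles the file to 240 entries

active = load_with_fallback(bad, last_good=good)
print(len(active))
```

Keeping a last known good configuration trades freshness for availability: the servers run on slightly stale data instead of not running at all.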

Closer to home, the year has been tough too. On April 28, Spain and Portugal suffered the largest blackout in their recent history. Internet traffic plummeted by 80 to 90 percent for more than 36 hours. Mobile networks went down as their backup batteries ran out. The Spanish economy suffered estimated losses of €1.6 billion. Three weeks later, on May 20, Spain was once again partially disconnected, this time by a failed Telefónica network upgrade. Madrid, Barcelona, Valencia, Seville and Bilbao reported massive outages, and the 112 emergency number stopped working in many autonomous communities. "All services were restored, except for a couple," Telefónica's chief operating officer said afterwards, with impressive nonchalance.

Why do outages affect so many people?

The answer is as simple as it is worrying: the internet is far more centralized than we'd like to believe. Companies like AWS, Cloudflare, Microsoft Azure and Google Cloud dominate the market, and when one of them goes down, it drags down thousands of applications that depend on it. AWS holds approximately 32% of the global cloud computing market; when its service fails, platforms like Netflix, Spotify and Roblox become inaccessible. The October incident hit Delta, preventing passengers from checking in. Cloudflare, for its part, provides services to millions of websites, so when its systems fail, completely unrelated websites disappear at the same time.

To detect these outages, several distributed monitoring systems are combined. Platforms like ThousandEyes and Catchpoint use thousands of monitoring points around the world that analyze billions of measurements a day, drawing on protocols such as BGP (Border Gateway Protocol) and DNS. When BGP routes change anomalously, these systems can detect an outage within minutes. DownDetector, owned by Ookla, takes a different approach: it aggregates reports from affected users. It is less technically precise, but very effective at gauging the real impact.

When a massive outage is detected, a race against time begins. Engineers must first identify the cause of the problem in immensely complex systems. Modern companies rely on automated systems to roll back the most recent changes, but even a fix takes time to reach everywhere: it took Cloudflare more than five hours to recover fully because the corrected configuration had to be propagated across all its data centers worldwide. After power failures, recovery is slower and more physical: operators must restore service node by node, antenna by antenna. Backup batteries provide approximately eight hours of autonomy, which falls short during prolonged blackouts.

The lessons of 2025

This year has taught us that perhaps we should reconsider our blind faith in the cloud. US Senator Elizabeth Warren summed it up after the AWS incident: "If one company can break the entire internet, it's too big. Period." We've also learned that human error is inevitable, but recovery systems are far too slow. When a misconfiguration can leave millions of users without service for hours, we need to rethink how we deploy updates to critical infrastructure.
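One widely used way to limit the damage of a bad update is a staged (or "canary") rollout with automatic rollback: push the change to a small slice of servers first, check their health, and only then widen the wave. The sketch below assumes a fleet of 1,000 hypothetical servers and a trivial health check; the stage percentages and all function names are illustrative, not any provider's actual deployment system.

```python
def staged_rollout(servers, stages_pct, apply_change, is_healthy, rollback):
    """Apply a change in growing waves; revert everything at the first unhealthy wave.

    stages_pct holds the cumulative percentage of the fleet updated after each wave.
    Returns (succeeded, number_of_servers_that_received_the_change).
    """
    updated = []
    start = 0
    for pct in stages_pct:
        end = len(servers) * pct // 100  # integer maths avoids float rounding
        wave = servers[start:end]
        for server in wave:
            apply_change(server)
        updated.extend(wave)
        if not all(is_healthy(server) for server in wave):
            for server in updated:  # undo every server touched so far
                rollback(server)
            return False, len(updated)
        start = end
    return True, len(updated)


fleet = [f"server-{i}" for i in range(1000)]  # hypothetical fleet
config = {name: "v1" for name in fleet}


def push_bad_config(name):
    config[name] = "v2-broken"


def healthy(name):
    return config[name] != "v2-broken"


def revert(name):
    config[name] = "v1"


ok, exposed = staged_rollout(fleet, (1, 10, 50, 100), push_bad_config, healthy, revert)
print(ok, exposed)
```

With a broken change, only the first 1% of the fleet ever receives it before the rollback, instead of every server at once. The cost is speed: a good change also takes several waves to reach everyone.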

We've discovered that the promise of a distributed internet architecture is more marketing slogan than reality. Three or four companies control the essential infrastructure of the global network. Automation, which was sold to us as the cure for human error, has become an error amplifier that spreads failures at the speed of light across thousands of servers before any human can say, "Hey, stop the machines!"

2025 isn't over yet, but it has already been eloquent enough. The internet is incredibly useful when it works, but catastrophically useless when it doesn't. And it's becoming increasingly controlled by fewer and fewer people. How convenient.