What Just Happened Over at Facebook?

by Hadean Platform

The debacle at Facebook on Monday evening demonstrated the perils of relying on a single, centralised infrastructure. Their suite of products went offline for six hours, wiping millions from their valuation. Rather comically, attempts to resolve the issues were slowed because employees were unable to get into the offices: their security passes could not be verified. Likewise, Facebook Workplace, the internal communication system Facebook employees use, went down.

Although it sits at the extreme end, this type of outage is not unheard of. Facebook released scant details of what exactly went wrong, but essentially the problem lay at the heart of their IT infrastructure. During a fairly routine update, a command affecting a central piece of it, the Border Gateway Protocol (BGP), proved to be its undoing. BGP is the protocol through which a network announces its presence to the other networks making up the internet, so that the best route for data to travel can be found. With Facebook’s BGP routes down, we essentially had a situation where the entirety of Facebook’s data store was inaccessible. But how can a routine update cause such massive disruption?
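As a rough illustration of why missing routes make a network unreachable, here is a toy model in Python. It is a sketch under stated assumptions, not a description of Facebook’s systems or of a real BGP implementation: the `RoutingTable` class, the network names and the address ranges (reserved documentation prefixes) are all hypothetical.

```python
import ipaddress


class RoutingTable:
    """Toy model of the routes one network has learned from its neighbours."""

    def __init__(self):
        # Maps an announced prefix to the (hypothetical) network that announced it.
        self.routes = {}

    def announce(self, prefix, origin):
        """Record that `origin` says it can be reached at `prefix`."""
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        """Forget a previously announced prefix."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [p for p in self.routes if addr in p]
        if not matches:
            return None  # no route: the destination is unreachable
        best = max(matches, key=lambda p: p.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "example-social-network")
table.announce("198.51.100.0/24", "some-other-network")

before = table.lookup("192.0.2.10")   # a route exists
table.withdraw("192.0.2.0/24")        # the faulty update removes it
after = table.lookup("192.0.2.10")    # nothing left to follow
print(before, after)
```

In this simplified picture the servers themselves are untouched; once the announcement disappears, the rest of the internet simply has no path to them.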

Here we see the dangers of an over-centralised system. Of course, a mistake in an update causing an outage isn’t uncommon, but usually these issues can be resolved quickly. The extreme delay that set this situation apart is what makes us question the make-up of their IT system. We basically had a case of Facebook’s gates breaking while the tools needed to fix them were also behind those broken gates. So how can this problem be avoided?

One solution is rather simple: spread your infrastructure across a number of different cloud environments. With a multi-cloud setup like this, a failure in one place is unlikely to cause a complete outage. And when a problem does occur, repairing and reconfiguring is much easier, as it’s unlikely the fault has affected the entirety of your infrastructure. But while multi-cloud is a step in the right direction, it doesn’t provide the highest level of flexibility.
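The multi-cloud idea can be sketched in a few lines of Python. This is a minimal illustration only: the environment names, the `first_healthy` helper and the health check are hypothetical stand-ins, and a real setup would need actual health probes and traffic routing.

```python
def first_healthy(environments, is_healthy):
    """Return the first environment that passes its health check, or None."""
    for name in environments:
        if is_healthy(name):
            return name
    return None


# Hypothetical environments, listed in order of preference.
environments = ["cloud-a", "cloud-b", "on-premise"]

# Pretend a faulty update has taken the first environment offline.
down = {"cloud-a"}

chosen = first_healthy(environments, lambda name: name not in down)
print(chosen)
```

Because the environments fail independently, losing one only changes which one serves traffic. If everything sits in a single environment, the list has one entry and there is nothing to fall back on.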

Moreover, purely cloud environments themselves have come under scrutiny for increasing costs and for limitations around things like sensitive data or technologies such as IoT. These concerns are driving the use of on-premise and edge computing, which are better enabled through distributed cloud. Distributed cloud offers the highest level of decentralisation and flexibility. Each facet of your infrastructure can be mapped onto the form of compute it suits most, whether it be edge, on-premise, hybrid cloud, etc. Undoubtedly, this requires a significant amount of orchestration. Platforms will therefore need to offer a single, concise control point for these various environments to avoid an overly convoluted management system.

Big firms like Facebook require an update to their overly centralised infrastructure to avoid the risk of repeating this rather embarrassing scenario.