The UK ICT industry has loomed large in the newspapers this last week, sadly for the wrong reasons. On Thursday 6th December the O2 mobile network suffered a major outage. For most of the day some 32 million subscribers were without data service as a major part of the network stopped operating. The outage also affected MVNOs hosted on the O2 network, including GiffGaff, Sky Mobile, Lyca and Tesco Mobile, and it disrupted some emergency services and train platform information services.
As a network failure it represents one of the most significant loss-of-service events in UK mobile history. It also reminded wider society of the strategic importance of the UK’s telephony networks for daily life, entertainment and business. Small businesses reliant on card payments taken through phone apps may have lost a whole day’s takings, and many activities that Thursday couldn’t gracefully fall back to Wi-Fi, not least in-car navigation.
Initial investigation published in the press suggests that the root cause of the failure was the unexpected expiry of hard-coded SW certificates in the SGSN-MME. These nodes were legacy SW that was in the process of leaving the network*. O2 and Ericsson both have hardworking, capable, smart and diligent engineers. Both companies have a strong tradition of engineering quality, of always aiming to do the best job for their customers, and a great history of service.
In truth it is too easy to point to an expired SW certificate, talk about remedies through contracts and ignore the wider systemic failure mode underlying this event: modern networks are incredibly sophisticated and contain a multitude of networking nodes from a plethora of suppliers. No single network architect can understand the full SW stack in detail. Furthermore, all actors in the industry are under commercial constraints and resources are limited. It could have been an outage on any of the UK’s networks that Thursday, and any of more than 20 SW and HW suppliers could have had expiring SW certificates. And we should not gloss over the fundamental truth that 5G and IoT only make our networks more complex, with more nodes, and are designed to encourage interworking across an even larger pool of industry actors.
Things aren’t going to get easier!
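On the narrow point of certificate expiry, one modest control is a routine audit of a certificate inventory that flags anything expired or close to expiring. The following is a minimal sketch only, not a description of how any operator or supplier works: the `CertRecord` type, the `expiring_within` function, the 90-day default and the node and supplier names are all hypothetical, and it assumes the expiry dates have already been collected into an inventory, which for hard-coded certificates is often the difficult part.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class CertRecord:
    """One inventory entry (hypothetical structure for illustration)."""
    node: str        # network node carrying the certificate
    supplier: str    # supplier responsible for the software
    not_after: date  # last day on which the certificate is valid


def expiring_within(records, today, warn_days=90):
    """Return records already expired or expiring within warn_days,
    ordered with the earliest expiry first."""
    cutoff = today + timedelta(days=warn_days)
    flagged = [r for r in records if r.not_after <= cutoff]
    return sorted(flagged, key=lambda r: r.not_after)


# Illustrative data only: node names, suppliers and dates are invented.
INVENTORY = [
    CertRecord("node-a", "supplier-x", date(2018, 12, 6)),
    CertRecord("node-b", "supplier-y", date(2019, 6, 1)),
    CertRecord("node-c", "supplier-z", date(2018, 11, 15)),
]

if __name__ == "__main__":
    for rec in expiring_within(INVENTORY, date(2018, 12, 1), warn_days=30):
        print(rec.node, rec.supplier, rec.not_after.isoformat())
```

A check like this only covers certificates that someone has recorded, so its value depends on suppliers disclosing what is embedded in their software in the first place.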
As the industry evolves and becomes ever more important to its customers, risk management, risk control and the pursuit of network robustness may well become a defining value differentiator amongst service providers. As Volvo so ably proved, building safer cars created an enduring market amongst customers who were ultimately prepared to pay a premium for that feature. We can also learn from the fashion apparel industry. Not more than 5 years ago there must have been senior executives who thought it was impossible to make a football and prove no child labour had been involved. Industries have cleaned up their supply chains and put clear controls in place. The chemical industry is another case in point, where single points of failure are engineered out of processes.
In summary, our mobile networks are increasing in complexity and it is that very complexity that brings risk. However, we have a choice. We can manage that risk and take clear proof points of success from other industries. Last week’s outage gives us all a mandate to review our networks and how we manage the risk within them. Forewarned is forearmed. We hope 2019 is the year when we begin to address this matter systematically. After all, our customers depend on us.
Azenby is a forward-leaning consultancy. Our specialism is helping mobile operators. We have implemented a number of strategic projects for leading MNOs around the world, helping them find ways to increase revenues as well as reduce the cost of doing business. Our work has included in-depth technical due diligence and risk management. Why not get in touch with us to learn more?
*NOTE 21/12/2018: please be aware the phrase “These nodes were legacy SW that was in the process of leaving the network” was incorrect. We have subsequently been informed that the SW in question was an in-life version, fully within support and maintenance contracts, and was not in the process of leaving the network. Our apologies for the inaccuracy.