Was the AT&T Outage a Change Control Problem?
Posted February 29, 2024 by Kevin Finch
On Thursday, February 22, 2024, communications carrier AT&T experienced an outage of its cellular network that affected most of the 50 largest cities in the US. The outage lasted several hours. CNN reported that the outage impacted not only AT&T itself, but also eight other carriers that rely on AT&T’s cellular infrastructure.
AT&T claims that the outage was not the result of malicious intent, but rather that it was caused by a software update. They told ABC News in a statement that it was caused by “the application and execution of an incorrect process used as we were expanding our network.”
(As a disclaimer, I don’t work for AT&T, I have never worked for AT&T, and I don’t have any fundamental understanding of their internal business processes. I was a customer for several years but I am no longer. Much of the rest of this essay will be based on my personal experience, inferences, and speculations.)
Based on their statement to the press, it’s clear to me that there were unintended consequences to AT&T’s software update, and that brings me to today’s topic: Change Control, and how it can be a huge component of Business Resiliency.
“…We’re working very hard to see if we can get to the ground truth of exactly what happened…”
John Kirby, White House National Security Council spokesman, Press Conference on 2/22/24
Let’s talk for a minute about Change Management. Generally, Change Management involves different strategies to help individuals, teams, and organizations better handle changes. It includes methods to adjust or redefine how resources are used, business procedures, budget allocations, and other operations that change the way things are done in an organization. As a subset of that, Change Control is a process that looks specifically at how software (and sometimes hardware) changes are introduced into the environment. (Yes, I know that there is also “Change Control” as a methodology in project management, but that’s not the version of Change Control we are talking about today.)
The Change Control process should be governed by some sort of Change Advisory Board (CAB). That board will have different names and different compositions at different companies, but it is basically “A group that advises and supports the assessment, prioritization, authorization and scheduling of change requests.” The basic idea is that engineers, managers, and subject matter experts meet periodically to scrutinize potential changes to the environment. Most organizations also have a process allowing emergency changes to be implemented. Whether they’re implemented on a normal timeline or an emergency basis, Change Control processes all have the same end goal: to minimize negative impacts when changes are implemented.
I have worked at companies with varying levels of scrutiny on Change Control, and it has been my personal experience that the companies with the least Change Control tend to have the most system outages. In an extreme case, a Kansas-based insurance carrier I did some work for back when I was a System Administrator (20+ years ago) experienced some sort of system outage on more than 50% of its working days, and its change control process could almost be described as unmonitored. It was so bad that they left the door to their server room unlocked, and programmers were free to come in and release untested program code into the user environment during the workday.
At the other extreme, I’ve worked for some financial services firms that take Change Control incredibly seriously. They require that changes be vetted by multiple departments and managers and that those changes be announced well in advance of implementation. They also have multiple calendar-based “freeze” periods during the year that completely prevent non-emergency changes from coming into the environment. Every change that goes in needs to be signed off by a high-level manager and needs to include a rollback plan in case the change creates problems. The Change Control procedures at those companies were restrictive and highly structured, but outages caused by changes to their system environment were practically eliminated.
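To make the stricter kind of process concrete, here is a minimal sketch in Python of the checks described above: multi-party sign-off, a mandatory rollback plan, and calendar-based freeze periods that only emergency changes can bypass. Everything in it is hypothetical. The `ChangeRequest` fields, the approver names, and the freeze dates are assumptions I made up for illustration, and it does not represent how any particular company (AT&T included) actually runs its process.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ChangeRequest:
    """Hypothetical change record; the field names are illustrative only."""
    summary: str
    scheduled_for: date
    approvals: set[str] = field(default_factory=set)
    rollback_plan: str = ""
    emergency: bool = False


# Assumed policy values, not taken from any real organization.
REQUIRED_APPROVALS = {"engineering", "operations", "senior_manager"}
FREEZE_WINDOWS = [(date(2024, 12, 15), date(2025, 1, 5))]


def in_freeze(day: date) -> bool:
    """True if the day falls inside any configured freeze window."""
    return any(start <= day <= end for start, end in FREEZE_WINDOWS)


def review(change: ChangeRequest) -> list[str]:
    """Return the reasons a change is blocked; an empty list means it may proceed."""
    problems = []
    missing = REQUIRED_APPROVALS - change.approvals
    if missing:
        problems.append("missing sign-off: " + ", ".join(sorted(missing)))
    if not change.rollback_plan.strip():
        problems.append("no rollback plan")
    # Assumption: emergency changes skip the freeze check but nothing else.
    if in_freeze(change.scheduled_for) and not change.emergency:
        problems.append("scheduled during a freeze period")
    return problems


if __name__ == "__main__":
    change = ChangeRequest(
        summary="Expand network capacity",
        scheduled_for=date(2024, 12, 20),
        approvals={"engineering"},
    )
    for problem in review(change):
        print("BLOCKED:", problem)
```

In this sketch an emergency change still needs every sign-off and a rollback plan; it is only exempt from the freeze calendar. A real Change Advisory Board would set its own rules for that trade-off.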
“Change is inevitable, growth is optional.”
John C. Maxwell
So, while there are many ways to run a Change Control process, it’s been my experience that it’s an important part of keeping vital business systems operational. Given that AT&T said in their statement that there was “execution of an incorrect process” during their software implementation, I think we can infer that they had some gaps in their Change Control procedures. Either the wrong procedure was planned into the change, or the change itself was improperly executed. In either case the outcome was the same: a large portion of AT&T’s more than 100 million customers (along with customers from 8 other carriers) lost service.
The impact this outage will have on AT&T’s reputation will be huge, but I think the financial impacts are going to be far-reaching as well. It stands to reason that AT&T is going to have to give refunds and/or pay penalties to many of their own customers because of the outage. I think it’s also very likely that AT&T has Service Level Agreements (SLAs) with all of those other companies that rely on AT&T’s infrastructure, and I doubt that those SLAs were met in the 8 hours or so that it took to fully recover from the outage; that’s probably going to mean more fines and penalties. The Federal Communications Commission (FCC) will do its own investigation of this, and AT&T will have to answer to the agency in one way or another. It wouldn’t surprise me if the FCC includes AT&T’s Change Control process as a part of that investigation.
“It takes 20 years to build a reputation and five minutes to ruin it. If you think about that, you’ll do things differently.”
Warren Buffett
We will probably have to wait until AT&T’s next quarterly earnings report to get some idea of how much this outage cost them financially, but it could easily be in the millions of dollars. Could all of this have been prevented by better Change Control processes? In my professional opinion, any process that has the potential to prevent a multi-million-dollar outage along with huge reputational impacts definitely has a place in Business Resiliency.
Confused about Change Control processes and how they work? Interested in learning more about how proper Change Controls can help prevent outages at your business? Sayers is here to help. Our team of experts can help you not only understand the interdependencies in your environment that can be affected by change, but also improve your business processes to make sure that changes are properly vetted before implementation.