What’s Old is New: “Technical Debt” in Resiliency
Posted July 2, 2024 by Kevin Finch
Technical Debt is a concept that’s foreign to a lot of people on the surface, but it’s something that people in business contend with continuously. It’s also a problem I have encountered for as long as I have been working in resiliency, and in a lot of cases people don’t even recognize the phenomenon until it’s pointed out to them. Hopefully I’ll be able to teach you a little bit about Technical Debt today, and then I’ll give some specific examples of how it continues to be a problem that maturity programs face. This is the third blog in my “What’s Old Is New” blog series.
“Technical debt is a measure of the amount of duct tape holding your system together, plus the amount of rust that it has accumulated.”
Itzy Sabo, CTO
What is Technical Debt?
There are several definitions of Technical Debt out there, but one thing that most of them have in common is that they define Technical Debt as a form of compromise brought about by a lack of resources.
In a classic example, a programming team might decide to leave out product features and scrimp on quality testing to make a software release date. (This is often done with the intent of adding those features and fixes in a future patch or product release.) The Technical Debt here is that they have essentially “borrowed time” from future projects, probably adding to the scope of those projects too, in order to have something delivered sooner. It’s kind of like borrowing against your credit card instead of paying cash: you get it faster, but you pay more in the long run.
I was first exposed to the idea of Technical Debt outside of the realm of programming, and the way that I tend to think of it is from the perspective of a System Administrator. Businesses tend to put off spending on technology (compromise) because they don’t have the time, money, skills, or staff (resources) to implement changes. The debt here is that Systems and technology are constantly evolving, so the longer a company goes without making improvements, the more they will generally have to spend to get things current. It’s almost like companies accrue interest by not applying resources today to keep technology up to date. I saw this kind of debt employed at nearly every company I worked at back when I was a System Administrator, so this is nothing new.
“Left unchecked, technical debt will ensure that the only work that gets done is unplanned work!”
Gene Kim, Bestselling Author and CTO
If technical debt is managed knowledgeably and responsibly, it can be a helpful way for businesses to reduce costs in the short term. One place I worked over 15 years ago routinely “accrued” Technical Debt in the way that they handled their server environment. Rather than replacing servers every three years as their warranties ran out, this company would routinely keep servers active in their production environment that were five to seven years old. They understood that there was an increased risk of hardware failures from using older equipment, so they mitigated this by keeping a tremendous number of spare parts around. As older servers were decommissioned, they were often added to the spare parts inventory rather than being redeployed or recycled. For parts that failed frequently (like Compaq/HP merger-era RAID controllers) we bought extras so we would always have some in stock. In this case, for this company, accruing Technical Debt was a deliberate and prudent business strategy.
Types of Technical Debt
Something I didn’t realize until I started doing research for this post is that there are actually four separate classifications of Technical Debt. Technical Debt can either be prudent (like my example above) or reckless. It can also be deliberate (again, like my example above) or it can be inadvertent. Combining those characteristics, we end up with a sort of quadrant: prudent and deliberate, prudent and inadvertent, reckless and deliberate, and reckless and inadvertent.
Where things get problematic, of course, is when different people from the same organization have conflicting perspectives on the nature of the Technical Debt. Management might think they’re being prudent by keeping an old server running, but their systems administrators might find that same behavior reckless because of reliability concerns. There’s some really fertile ground for thought and conversation here, and honestly doing this research has changed my perspective about the whole Technical Debt concept. Dealing with each one of these is probably worth another whole article.
How Does it Relate to Business Resiliency?
As it relates to business resiliency, however, acquiring Technical Debt is usually on the “Reckless” side of the matrix. Technology always ages, and as it does, it nearly always becomes less reliable. In that sense, acquiring Technical Debt feels a lot like denial and procrastination. Everybody knows that as servers and network infrastructure get older, physical parts are more likely to fail. Everybody knows that the longer things go without having security patches applied, the more likely some vulnerability will be exploited. In the same vein, the older your recovery plans get, the less accurate and useful they will be because of other changes in the environment. Also, the longer you go between Resiliency tests, the more likely it is that your recovery workflow won’t go like you expect (usually because your testing expectations are based on some previous state of your environment, not the current state).
There’s a second aspect of Technical Debt in dealing with resiliency, however, and I think it’s a little less obvious than putting off a server purchase or Resiliency test. (I think it’s also the aspect of Technical Debt that I most often see in existing resiliency programs.) Many companies inadvertently create Technical Debt in their resiliency programs by not keeping their data up to date. Taking a business process-focused approach to business resiliency and following best practices generates a lot of data. Data is gathered at the beginning during a Business Impact Analysis (BIA) that spells out the requirements of the business, including a tolerance for downtime and data loss. Data is gathered in the planning stages, when departments and employees are matched up to those business processes, and recovery plans are created.
Data is also created whenever a test happens or an incident occurs, giving important feedback to the resiliency program about what worked and what can be improved. All of that data needs to be updated from time to time so it remains pertinent and useful, and if the company decides to take on Technical Debt by letting the data age, the data loses its value.
Now, I’m not saying that all data needs to be kept up to date all the time, and that’s unrealistic for most companies anyway. However, Resiliency data does need to be updated regularly and methodically. Some static things should be updated annually, like policies, BIA data, and plan content. Other more dynamic things should be updated more often, like vendor or employee contact information. Letting that data languish creates Technical Debt a lot more quickly than most companies realize, because if that information gets too far out of date it becomes nearly useless and extraordinary measures will need to be taken to bring it back up to date.
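If you want to see what “regularly and methodically” could look like in practice, here is a minimal sketch of a review-cadence check. The annual interval for policies, BIA data, and plan content comes from the paragraph above. The 90-day and 30-day intervals for contact data are purely my assumptions for illustration, as are the category names and the `overdue_items` function. None of this reflects any particular BCM product.

```python
from datetime import date, timedelta

# Review intervals in days. The annual entries follow the article;
# the contact-data intervals are assumed values for illustration only.
REVIEW_INTERVAL_DAYS = {
    "policies": 365,
    "bia_data": 365,
    "plan_content": 365,
    "vendor_contacts": 90,    # assumption
    "employee_contacts": 30,  # assumption
}


def overdue_items(last_reviewed, today):
    """Return the categories whose last review is older than their interval."""
    overdue = []
    for name, reviewed_on in last_reviewed.items():
        limit = timedelta(days=REVIEW_INTERVAL_DAYS[name])
        if today - reviewed_on > limit:
            overdue.append(name)
    return sorted(overdue)


# Hypothetical review dates for two categories.
last_reviewed = {
    "policies": date(2024, 1, 1),
    "employee_contacts": date(2024, 1, 1),
}
print(overdue_items(last_reviewed, today=date(2024, 3, 1)))
```

With these made-up dates, only the employee contact data is flagged, because 60 days is past the assumed 30-day interval but well inside the annual one.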
Here’s a simple example of this, and I’ve seen this scenario play out at several companies over the past 20 years. Let’s say that it takes you 4 hours a month to keep your employee information up to date in your Business Continuity Management (BCM) software, and you’ve got 1% per month of employee turnover. You can update that data a little bit at a time every week, take a day out of the month and polish it off, or delegate it however you need, but that maintenance needs to get done because people get hired or leave the company every single day. It’s Technical Debt – if you put that task off, your data gets further and further out of date, and that time just stacks up because you “owe” it to keep the system updated. If you let it go all year without updating anything, can you execute your resiliency plans if 12% of your employee information is wrong? Are you going to have a spare week (12 months X 4 hours = 48 hours) at the end of the year to get caught up? Trying to figure out how to pay off that time debt could be painful.
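The arithmetic in that example can be sketched in a few lines. This is a rough model, not a forecast: the function name `maintenance_backlog` is hypothetical, and it assumes the monthly effort and turnover rate stay constant. It reports the stale share two ways: the simple sum used above (1% a month for 12 months is 12%), and a slightly lower figure that assumes each month's turnover hits 1% of the records that are still accurate.

```python
def maintenance_backlog(hours_per_month, monthly_turnover, months_skipped):
    """Estimate the time owed and the share of stale records after skipping updates."""
    # Time debt: every skipped month adds its full maintenance effort.
    backlog_hours = hours_per_month * months_skipped
    # Simple estimate used in the article: turnover just adds up.
    stale_simple = monthly_turnover * months_skipped
    # Alternative estimate (assumption): turnover applies only to still-accurate records.
    stale_compounded = 1 - (1 - monthly_turnover) ** months_skipped
    return backlog_hours, stale_simple, stale_compounded


hours, simple, compounded = maintenance_backlog(4, 0.01, 12)
print(f"Hours owed: {hours}")
print(f"Stale records (simple sum): {simple:.1%}")
print(f"Stale records (compounded): {compounded:.1%}")
```

Either way you slice it, roughly one employee record in nine is wrong by the end of the year, and the 48 hours are still waiting for you.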
There’s also a third aspect of Technical Debt in resiliency, and that’s the kind that I think most companies find themselves battling with (unless they are in a heavily regulated industry). This third type of Technical Debt is never starting a program in the first place. This is a little harder to reliably place in the quadrant above because some companies have made deliberate decisions to not create a program, but they also think they are being prudent in that decision because they don’t feel the potential risk is worth the potential cost. Some companies don’t even realize they are being reckless by taking the stance that they will just “deal with it” if a large business interruption occurs, and then they go forward without a formal plan on how they’ll respond. (I would contend that companies have a fiduciary duty to their shareholders and the public to at least put some effort in preparing for inevitable cyber-attacks and system outages, but that’s probably a topic for another blog post.)
“Pay off your debt first. Freedom from debt is worth more than any amount you can earn.”
Mark Cuban
Don’t worry, Technical Debt is not an insurmountable problem when it comes to your resiliency program. It can be solved with a little persistence and consistent effort. However, sometimes you need some expertise from outside your organization to help line up what needs to be done. Sayers is here to help. Our Business Resiliency team has decades of experience in helping companies like yours prioritize work while aligning with best practices. We’re here to help you get the most from the time you spend, and from your program.