Introduction: You Can’t Avoid Tech Debt
As a software engineer, you’ve definitely struggled with technical debt. It’s the accretion of short-term decisions that get the story finished within the sprint but guarantee a future code re-factoring epic. We might think the choice is between working quickly now and re-working in the future, when the impact of the compromises is clear, or working more deliberately and methodically now to avoid re-work in the future. Except that this isn’t the choice: contexts change quickly, wasting all the methodical work done and necessitating re-work anyway.
It’s clear that the less wasteful approach is to work quickly and iteratively, to fail fast and learn often. But if we don’t address the debts we accrue, we fail fast and fail often and just plain fail. It’s in the addressing of the right debts in the right ways that we learn and improve.
Technical Debt in ML Applications is Different
Machine learning applications have the distinction of being prone to accruing the common classes of software engineering technical debt, as well as some classes all of their own. Let’s look at some of my favourites.
Visibility Debt: Unexpected, and ultimately unwanted, coupling of upstream components to downstream components. It often signals an issue in interface design, such as poor interface segregation. We try to adhere to SOLID principles to avoid this issue.
Experimental Debt: Experimental code left in production for too long. An experiment doesn’t end when we identify actionable learnings; it ends when we decommission the experimental code (hopefully replacing it with productionised code). Experimental code tends to be more “brittle”, so maintaining it for too long carries an unnecessary ops overhead.
Non-determinism Debt: ML applications are seldom strictly deterministic; putting in the same input values twice doesn’t necessarily lead to the same output values the second time. Automated testing, e.g. pre-production testing, needs tuning so that a lack of exact reproducibility isn’t mistaken for the model not working. If the test is too accommodating, though, real regressions slip through and the application accrues debt.
Loss Aversion: Humans have a cognitive bias to feel losses more keenly than gains. ML models shouldn’t suffer from loss aversion as long as they are trained on a dataset that considers all cases. If we want our ML application to behave as a human would, we need to include risk sensitivity in the model training. In other words, we need to build in some loss aversion training debt.
Implicit Knowledge Debt: Progress depends on extending and improving, not repeating, previous work. A well-designed ML model understands its data context. A well-designed ML application understands the needs of its users. This contextual knowledge shouldn’t remain implicit in the training of the model or the UX flow of the app.
Explainability Debt: We can live with some level of inefficiency due to accretion of technical debt. What is that number? We will only know if we can quantify the inefficiency, then measure its impact. To do this, we need explainability tools, otherwise we’re guessing.
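To make the non-determinism debt described above a little more concrete, here is a minimal sketch of a tolerance-based test. Everything in it is an assumption made for illustration: `noisy_model` is a hypothetical stand-in for a real model, and the tolerance values are arbitrary. The idea is simply that we compare an average over several runs against an expected value, rather than demanding an exact match on a single run.

```python
import random
import statistics


def noisy_model(x: float, rng: random.Random) -> float:
    """Hypothetical non-deterministic model: the signal 2*x plus small random noise."""
    return 2.0 * x + rng.gauss(0.0, 0.05)


def passes_tolerance_test(x, expected, tolerance, runs=200, seed=None):
    """Average several runs and check that the mean is within `tolerance` of `expected`."""
    rng = random.Random(seed)
    outputs = [noisy_model(x, rng) for _ in range(runs)]
    mean_output = statistics.fmean(outputs)
    return abs(mean_output - expected) <= tolerance


# A sensible tolerance passes a healthy model and catches a wrong expectation.
print(passes_tolerance_test(3.0, expected=6.0, tolerance=0.05, seed=0))  # True
print(passes_tolerance_test(3.0, expected=6.5, tolerance=0.05, seed=0))  # False
# An overly accommodating tolerance hides the same 0.5 discrepancy.
print(passes_tolerance_test(3.0, expected=6.5, tolerance=1.0, seed=0))   # True
```

The last call is the debt: the test is green, but only because it has been tuned to forgive too much.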
Of course, there are many more fun ones to look out for. If you’re developing an ML app I encourage you to also stay alert to analysis debt (tendency of reinforcement learning models with poorly understood feedback loops to reinforce their own biases), dependency debt (underutilised or codependent data, libraries, features or even version numbers), hidden logic debt (business logic mistakenly implemented as part of abstractions, symptomatic of integration and data model issues, perhaps the wrong Boundary object), configuration debt (if not expressed as code, configs become hard to maintain), glue code debt (because there are no standards yet for input-output in ML algorithms and models, glue is needed to help ML system components understand each other), and domain transfer debt (models trained in one data domain cannot necessarily be used directly in another).
A Solution: Focus on Operational Excellence
There’s no avoiding re-factoring, so any solution to dealing with tech debt should focus on the best way to re-factor without sacrificing velocity. At Massive Analytic, we favour an Operational Excellence (OE) approach, which I like to visualise as how to eat cake whilst on a diet. Every sprint has a thin OE slice, but every so often we give ourselves permission to scoff a large OE slice, and feel better for it. Or in actual sprint terminology, we reserve a percentage of story points every sprint for OE work, and the last sprint of each Programme Increment (PI) for OE in its entirety.
The other side of this approach is to have a very inclusive definition of OE. All of the following are encouraged in the OE capacity reservation:
Software version upgrades
Proof of concept spikes
Desktop research spikes
Market engagement (esp. with consumers, conferences)
Learning and development
The only things that aren’t part of OE are P1 bug fixes; these are fixed either in out-of-sprint patches or using feature development capacity, depending on service level obligation.
The downside of this approach is that we dedicate quite a lot of capacity to OE. Whilst actual story point allocation is flexible, we aim for 10% of capacity allocated to OE within each sprint and every 6th sprint fully allocated to OE. Over six sprints that is five sprints at 10% plus one full sprint, or 1.5 sprints out of 6, which sums to 25% of total capacity. There’s still a lot of corporate cognitive dissonance at the realisation that once you take away the effort reserved for planning, testing, documenting and OE, only 50% is left to actually build new things. Moreover, and counter-intuitively, if we increase the effort for feature delivery at the expense of other things, we actually reduce velocity.
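The 25% figure is easy to check. The sketch below is only a worked calculation under the stated targets (a 10% reservation in ordinary sprints, and the last of six sprints given over to OE entirely); the function name `oe_share` is mine, not part of any tool we use.

```python
from fractions import Fraction


def oe_share(per_sprint_share: Fraction, sprints_per_pi: int) -> Fraction:
    """Share of total capacity spent on OE across one Programme Increment.

    Assumes the last sprint of the PI is entirely OE and every other sprint
    reserves `per_sprint_share` of its capacity for OE.
    """
    regular_sprints = sprints_per_pi - 1
    oe_capacity = regular_sprints * per_sprint_share + 1  # the final sprint counts as 1 whole sprint
    return oe_capacity / sprints_per_pi


share = oe_share(Fraction(1, 10), 6)
print(share)  # 1/4
```

Changing either input shows how sensitive the total is: the full OE sprint contributes far more to the 25% than the thin slices do.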
Conclusion: Debt is Good, as Long as You Pay it Back
Let’s finish this blog by turning the original premise on its head. We started from the position that the accrual of tech debt is an unavoidable consequence of software development of any kind. I want to put it to you that measuring tech debt is a good proxy indicator for how well we can develop software.
How many tech debt stories we write (and how well they’re articulated) indicates how well we understand the tech debt we’ve accrued, which is a proxy for how well we understand our domain;
If the tech debt story points increase over time, this tells us that we’re unable to keep up with the rate of feature development that we need, so we need to staff up in the areas where the debt is increasing;
Conversely, if the tech debt story points decrease over time, this tells us that either we don’t understand the domain (or haven’t estimated the debt) well enough and there’s an unpleasant surprise in our future, or we have spare capacity to do more interesting stuff and can afford to rebalance effort;
The quiescent level of tech debt story points compared to feature development story points tells us how well we’ve developed our domain knowledge and skill.
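As a rough illustration of the indicators above, the sketch below computes a debt-to-feature ratio for a sprint and the trend in tech debt story points across sprints. The `SprintPoints` structure and the numbers in `history` are invented for the example, and a least-squares slope is just one simple way to summarise a trend, not a description of any particular tooling.

```python
from dataclasses import dataclass


@dataclass
class SprintPoints:
    """Story points recorded for one sprint (hypothetical fields)."""
    feature: int
    tech_debt: int


def debt_ratio(sprint: SprintPoints) -> float:
    """Tech debt story points relative to feature development story points."""
    return sprint.tech_debt / sprint.feature


def debt_trend(history: list[SprintPoints]) -> float:
    """Least-squares slope of tech debt story points per sprint."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two sprints to estimate a trend")
    xs = range(n)
    ys = [s.tech_debt for s in history]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Made-up data: debt grows by two points per sprint.
history = [SprintPoints(40, 8), SprintPoints(42, 10), SprintPoints(38, 12), SprintPoints(41, 14)]
print(debt_trend(history))     # 2.0
print(debt_ratio(history[0]))  # 0.2
```

A positive slope corresponds to the "staff up" case, a negative slope to the "either a surprise is coming or we have spare capacity" case, and a ratio that settles over time is the quiescent level.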
Measuring how we manage our tech debt turns out to be very revealing of our ability to develop software. Our objective is therefore not to eliminate tech debt, but to monitor and manage it, which turns it from being a burden to being a tool for continuous improvement.
This article owes a large debt (pun intended) to Google’s excellent paper from 2015, which is still relevant almost 10 years later. Highly recommended reading. https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf