Guru Kini

Chasing the wrong Engineering Metrics


Organizations have moved from a short-term, output-based view to a medium-term, outcome-based one. Measuring outcomes and tracing the causality back to software engineering is incredibly hard, but that doesn't mean we should keep believing that the age-old metrics are still relevant.

In this post, we will go through some commonly used metrics that have little bearing on the outcome. In fact, chasing these metrics may distract the team, create busywork, and lull everyone into a false sense of productivity.


Why are these metrics still being used then? Because they are easy to measure and easy to improve upon!


Focusing on LOC

“Measuring software productivity by lines of code is like measuring progress on an airplane by how much it weighs.” - Bill Gates

Let us start with an easy one. Lines of Code (LOC) stopped being a sensible measure some three decades ago and is no longer used as a metric of productivity (one hopes). LOC and derived estimation techniques like COCOMO no longer fit a software development lifecycle that focuses on delivering incremental value to customers frequently.

Today any non-trivial software product could have various components written in different languages with different programming frameworks. Open-source and third-party libraries power a significant portion of the logic. Using LOC as a proxy for productivity and value is just misleading.


In fact, the only LOC number engineering leaders should worry about is zero. If a team has had no output for several weeks, something may be blocking them.


Instead, try this...

Track Code Churn and Technical Debt.


Code Churn is somewhat hard to measure, and on top of that it has several competing definitions. The simplest one we like is the amount of code that has been rewritten repeatedly within a short span of time (say 2-3 weeks). High code churn could indicate poor requirements management, poor quality, or both.
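
For a quick look at churn before adopting a dedicated tool, a short script over version-control history is often enough. The sketch below is one possible approximation of the definition above; the 21-day window and the minimum-commit threshold are assumptions to tune for your team.

```python
# A minimal sketch of one way to approximate Code Churn straight from `git log`.
# The 21-day window and the "touched in 3+ commits" threshold are assumptions to
# tune for your team, not a canonical definition.
import subprocess
from collections import defaultdict

def code_churn(repo_path: str, since_days: int = 21, min_commits: int = 3):
    """Return files rewritten repeatedly within the window, with lines touched."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since_days} days ago",
         "--numstat", "--format="],
        capture_output=True, text=True, check=True,
    ).stdout

    commits_per_file = defaultdict(int)
    lines_per_file = defaultdict(int)
    for line in out.splitlines():
        parts = line.split("\t")
        if len(parts) != 3 or parts[0] == "-":  # skip blank lines and binary files
            continue
        added, deleted, path = parts
        commits_per_file[path] += 1
        lines_per_file[path] += int(added) + int(deleted)

    return {
        path: {"commits": commits_per_file[path], "lines_touched": lines_per_file[path]}
        for path in commits_per_file
        if commits_per_file[path] >= min_commits
    }
```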


Code Hotspots are related to churn. These are modules/services that attract an inordinate number of changes compared to the rest of the codebase. This may be the 20% of the code responsible for 80% of the defects. Finding the root cause may lead to significant improvements in team efficiency.

Explore Adam Tornhill's Code as a Crime Scene tool for visualizing Code Hotspots. His associated research may be very helpful for engineering managers and senior engineers.
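
Before reaching for a dedicated tool, a rough hotspot list can be pulled straight from version control. The sketch below ranks files by change frequency weighted by current line count as a crude complexity proxy; it only loosely follows the "crime scene" idea, and the 12-month window, the weighting, and the top-10 cutoff are illustrative assumptions.

```python
# A rough first pass at hotspot detection: rank files by how often they change,
# weighted by current line count as a crude complexity proxy. The window, the
# weighting, and the cutoff are illustrative assumptions.
import subprocess
from collections import Counter
from pathlib import Path

def hotspots(repo_path: str, since: str = "12 months ago", top_n: int = 10):
    """Return (path, change_count, line_count, score) tuples, highest score first."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}", "--name-only", "--format="],
        capture_output=True, text=True, check=True,
    ).stdout
    change_counts = Counter(path for path in out.splitlines() if path)

    ranked = []
    for path, changes in change_counts.items():
        full = Path(repo_path) / path
        if not full.is_file():  # skip files that have since been deleted or renamed
            continue
        with full.open(errors="ignore") as f:
            lines = sum(1 for _ in f)
        ranked.append((path, changes, lines, changes * lines))
    # Frequently changed *and* large files float to the top.
    ranked.sort(key=lambda r: r[3], reverse=True)
    return ranked[:top_n]
```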


Tech Debt, like Code Churn, is hard to measure accurately but thankfully tools like SonarQube do give a good enough approximation. And just like Code Churn, not all Tech Debt is bad. Not every piece of recognized Tech Debt needs to be fixed immediately. As long as the trend shows an overall reduction, the team is handling the debt well enough!
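
Since the trend is what matters, it can be pulled programmatically rather than eyeballed in a dashboard. The sketch below assumes a recent SonarQube server that exposes the measures-history endpoint and the sqale_index metric (technical debt in minutes); the URL, project key, token, and the five-sample trend window are placeholders to adapt to your setup.

```python
# A minimal sketch for watching the Tech Debt *trend* rather than its absolute value.
# Assumes a recent SonarQube server exposing /api/measures/search_history and the
# sqale_index metric (technical debt in minutes).
import requests

def tech_debt_history(base_url: str, project_key: str, token: str):
    """Return (date, debt-in-hours) points for the project, oldest first."""
    resp = requests.get(
        f"{base_url}/api/measures/search_history",
        params={"component": project_key, "metrics": "sqale_index", "ps": 1000},
        auth=(token, ""),  # SonarQube accepts the token as the basic-auth username
        timeout=30,
    )
    resp.raise_for_status()
    history = resp.json()["measures"][0]["history"]
    return [(p["date"], int(p["value"]) / 60) for p in history if "value" in p]

def debt_is_trending_down(history: list[tuple[str, float]], window: int = 5) -> bool:
    """Compare the average of the last `window` samples against the previous `window`."""
    if len(history) < 2 * window:
        return False  # not enough data to call it a trend
    recent = sum(v for _, v in history[-window:]) / window
    earlier = sum(v for _, v in history[-2 * window:-window]) / window
    return recent < earlier
```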

 

Increasing Code "throughput"

At first this seems like a sensible goal. After all, every Pull Request (PR) that makes it into production does add some incremental value. So the more PRs we have, the better.

However, optimizing this metric comes with some caveats:

  1. Not all PRs are the same. A 2-line PR could fix an annoying bug and stop hundreds of unhappy users from unsubscribing. A 2K-LOC PR may deliver a feature used by less than 1% of users once a month. So going just by the PR count can be plain misleading.

  2. Creating individual goals to increase PRs per week or month will only lead to resentment and gaming. Why would one not split a PR into 2 (or more) PRs and score higher on this metric? There would be a perceived increase in output but a net reduction in actual throughput, since extra time goes into PR overheads (including additional review cycles).

  3. Releasing a change is only the start of the user's journey, and the metrics for that journey often do not even feed back into the engineering metrics. In the mad rush to release as many PRs as possible, the team often has to compromise on quality or take on new technical debt.

As with LOC, the only value leaders should worry about is zero. It may be worth understanding why a team has not merged any PRs into production for several weeks.

The only value for PR throughput that the leaders should worry about is zero per unit time.

Instead, try this...

Focus on reducing Deployment Time.


DORA Metrics are widely regarded as leading indicators, and Lead Time To Change (LTTC) often gives the true pulse of how efficiently a code change reaches production. In other words: how quickly you deliver value to your users.

LTTC can be broken down into the time needed for coding, reviewing, testing, and deploying. Deployment Time is then the time it takes for a code change (viz., a PR) to be deployed after it is reviewed and ready to go. These days, there is no reason for Deployment Time to be measured in days or even several hours. This part can be compressed to minutes with minimal investment in CI/CD automation - an investment that will save so much time and effort that it pays for itself within a few weeks.
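
Measuring Deployment Time does not require anything fancy. The sketch below assumes you can export, per change, the timestamp when the PR was approved/ready and the timestamp when it reached production (e.g. from your review tool and CD pipeline logs); the field names and sample data are purely illustrative.

```python
# A minimal sketch of tracking Deployment Time from "ready to go" to "in production".
# Field names ('ready_at', 'deployed_at') and the sample data are illustrative.
from datetime import datetime
from statistics import median

def deployment_times(changes: list[dict]) -> list[float]:
    """Per-change deployment time in minutes, from 'ready_at' to 'deployed_at'."""
    times = []
    for change in changes:
        ready = datetime.fromisoformat(change["ready_at"])
        deployed = datetime.fromisoformat(change["deployed_at"])
        times.append((deployed - ready).total_seconds() / 60)
    return times

changes = [
    {"ready_at": "2024-05-06T10:15:00", "deployed_at": "2024-05-06T10:32:00"},
    {"ready_at": "2024-05-06T14:02:00", "deployed_at": "2024-05-07T09:40:00"},
]
minutes = deployment_times(changes)
# The overnight wait in the second change drags the median up - exactly the kind
# of signal worth investigating.
print(f"median deployment time: {median(minutes):.0f} min")
```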

Ten years ago, automating may have been daunting for many organizations. But today the tools have been democratized and the costs are constantly coming down, so there is no real reason to avoid this any more!

Often people point out that a PR is ready to go but it is not "urgent" to release, so it just languishes in some branch biding its time. The counterpoint would be: "Why were developer cycles spent on it then?" - it sounds like a planning and prioritization issue that needs correction.

 

Increasing Velocity

The web is full of very polarized views on story points and velocity. Whether to use story points at all is a separate debate; what's important is not to over-focus on velocity as a productivity metric. Using story points to size a Sprint may make sense for a mature team, but keeping a story-point "target" across teams makes no sense at all.

"Increasing Team Velocity" is often misaligned goal which may lead to inflated story sizes or leading to one person (or a couple of people) deciding what the size of a story should be. Moreover, an increased velocity doesn't necessarily translate to better outcomes. It doesn't really guarantee more customer value is generated. Story points are a measure of how much work the team can finish in a given timeline (viz., a Sprint) before they start working, it should not be used as a measure of productivity after the work is done.


Instead, try this...

Focus on how many Sprints have been overloaded or underloaded. Look for:

  • What is the trend of spill-over stories and tasks from Sprints?

  • How often do teams finish their Sprint stories/tasks well before the end?

This trend analysis may indicate if certain teams need help with their sizing or planning. It is important to note that the trend matters more than just one or two Sprints.
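
As a rough illustration of that trend analysis, the sketch below assumes committed vs. completed story points per Sprint can be exported from your project tracker; the 15% slack threshold separating "on target" from overloaded/underloaded is an arbitrary assumption to tune.

```python
# A minimal sketch labelling each Sprint as overloaded, underloaded, or on target.
# The 15% slack threshold and the sample data are illustrative assumptions.
def sprint_load_trend(sprints: list[dict], slack: float = 0.15) -> list[tuple[str, str]]:
    labels = []
    for s in sprints:
        ratio = (s["committed"] - s["completed"]) / s["committed"]
        if ratio > slack:
            labels.append((s["name"], "overloaded"))    # noticeable spill-over
        elif ratio < -slack:
            labels.append((s["name"], "underloaded"))   # finished early, pulled in extra work
        else:
            labels.append((s["name"], "on target"))
    return labels

sprints = [
    {"name": "Sprint 41", "committed": 40, "completed": 31},
    {"name": "Sprint 42", "committed": 38, "completed": 37},
    {"name": "Sprint 43", "committed": 42, "completed": 49},
]
print(sprint_load_trend(sprints))
# [('Sprint 41', 'overloaded'), ('Sprint 42', 'on target'), ('Sprint 43', 'underloaded')]
```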

 

Reducing Defects

The general thinking is that the fewer defects reported, the better the quality. However, mandating a blanket reduction in defects creates perverse incentives for the team. On the contrary, everyone on the team should be encouraged to log defects whenever they spot something wrong. The number of defects reported should neither be penalized nor rewarded.

Quality should be owned by everyone - an Andon-style practice of highlighting problems as soon as they are spotted should be encouraged.

In organizations that have separate development and QA teams, there is often a tension between the two. Rewarding the QA team for the number of defects raised can lead to unnecessary defect reports, while the development team spends time pushing back on defects because they "make them look bad". Detecting and defusing this tension is very important for an engineering leader; defect counting is a step in the wrong direction.

Everyone is responsible for quality.

The metric that counts is how many defects reached production (and how many users were impacted). These Escaped Defects are not only more expensive to fix, they can erode customer satisfaction very quickly. Once again, just counting the number of escaped defects is not sufficient.


Instead, try this...
  • Track the trends for escaped defects.

  • For every significant "escaped defect", do an RCA (Root Cause Analysis). Significance can be defined by how many users were impacted, how much monetary loss (real or perceived) it may have caused, whether it caused a security incident, whether it was a repeated error, etc.

  • If the defect could have been avoided upstream, tag it with the reason.

A simple report based on tags gives a clear picture of the most common types of avoidable defects and the services/modules most prone to them. This helps identify which preventative actions to take.
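
Such a report needs very little tooling once the tags are in place. The sketch below assumes escaped defects can be exported with their tags and owning module; the field names and tags are made up for the example, not tied to any particular tracker.

```python
# A minimal sketch of the tag-based report: counts of escaped defects by reason tag
# and by owning module. Field names and tags are illustrative.
from collections import Counter

def escaped_defect_report(defects: list[dict]):
    """Return (reason, count) and (module, count) lists, most common first."""
    by_reason = Counter(tag for d in defects for tag in d["tags"])
    by_module = Counter(d["module"] for d in defects)
    return by_reason.most_common(), by_module.most_common()

defects = [
    {"module": "billing", "tags": ["requirements-gap"]},
    {"module": "billing", "tags": ["missing-test", "requirements-gap"]},
    {"module": "search",  "tags": ["missing-test"]},
]
reasons, modules = escaped_defect_report(defects)
print(reasons)  # [('requirements-gap', 2), ('missing-test', 2)]
print(modules)  # [('billing', 2), ('search', 1)]
```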


Some tips:
  • The defect fix doesn't have to wait for the RCA (and tagging). The RCA is to identify preventative actions in the future, not to hold up the immediate corrective action (i.e., the fix).

  • Try to limit the tags to a reasonably small, well-defined, and well-understood set. Going overboard with too many fine-grained tags may make it hard to do a trend analysis. At a minimum, it is good to capture at which stage the defect was introduced (requirements, implementation, testing, deployment, etc.).

  • Alternatively, use a custom field to flag the "Escaped" defects and another to indicate the stage at which they were introduced. Tags can then be used for more fine-grained categorization.

Tagging defects
 

Increasing Deployment Frequency

Deployment Frequency (DF) is a well-established DORA Metric, usually considered a leading indicator of how quickly an organization delivers value to users. More deployments mean more value; more value translates to more customer loyalty, less customer churn, and so on. DF is a terrific metric to use in conjunction with other metrics.

However, focusing on increasing this one metric in isolation could create other problems. DORA research indicates that high-performing organizations deploy multiple times a day. But "multiple deployments per day" may not be your organization's requirement at all, and increasing DF per team is meaningless by itself.

Deployments are not always evenly spread over time. A team may have to spend the first half of a Sprint or Release on prep work, during which deployments will be fewer. That doesn't mean the prep period should be rushed - it is every bit as important for quality.
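
One way to see that unevenness, rather than chase a per-day target, is to look at the distribution of deployments over time. The sketch below buckets deployment timestamps (e.g. exported from your CD tool) by ISO week; the sample data is illustrative.

```python
# A minimal sketch that buckets deployments by ISO week. Looking at the distribution
# over time, rather than at a per-day target, makes the unevenness visible.
from collections import Counter
from datetime import datetime

def deployments_per_week(timestamps: list[str]) -> dict[str, int]:
    weeks = Counter()
    for ts in timestamps:
        year, week, _ = datetime.fromisoformat(ts).isocalendar()
        weeks[f"{year}-W{week:02d}"] += 1
    return dict(sorted(weeks.items()))

print(deployments_per_week([
    "2024-05-02T11:00:00", "2024-05-03T16:20:00", "2024-05-14T09:45:00",
]))
# {'2024-W18': 2, '2024-W20': 1}
```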

"Each team should deploy to production more times per day" is not the desired goal.

The right interpretation of the metric should be: "Teams can deploy several times a day, when needed". For this, teams need the right processes, service-level abstractions, tooling, CI/CD automation, etc. They should be able to deploy without waiting for a go-ahead from other teams or going through lengthy change-request approval cycles.


Instead, try this...

Focus on the Deployment Time.

As discussed above, this is the only component of the LTTC metric that can be objectively improved without compromising quality.

Improving Deployment Frequency without investing in Deployment Time is not really possible. More deployments mean the deployment overheads are paid more often, so the goal should be to drive those overheads down.


More importantly, the organization structure should enable teams to deploy independently of each other. For a complex product, a loosely coupled set of services owned by independent teams makes for the most robust configuration. It may take investment in architecture, team composition, CI/CD automation, etc. to make that possible.


In conclusion

These are just a few commonly misinterpreted metrics, often used because the alternatives are much harder to define or measure. The truth is that no single metric can really capture productivity (see "Rethinking Productivity in Software Engineering"). And improving a single metric almost always comes with the trade-off of sacrificing something else. The alternatives suggested here are non-trivial and need engineering leadership to act.


Engineering leaders today have to pull information from several sources: project management tools, source control, APM tools, logs, customer reviews, etc. - and the list keeps growing. It is getting harder to find the time, or build the expertise, to navigate all of them.


Our team at Praximax is tackling exactly this problem. By gathering signals from these various sources and connecting the dots, our solution bubbles up anomalies. As an engineering leader, you should know where to focus your attention right now, not waste time diving into a dozen tools to piece together what you should be doing.


Any real improvements have to be brought about by you, not by a tool. Praximax just helps you get there faster. A lot faster.

