In this blog, I will explain the importance of unit metrics for FinOps by describing a realistic scenario in which a unit metric is derived from an app function, and then used to build a workload efficiency KPI.
Unit metrics
Unit metrics are normalized metrics that help you consistently measure something important. They are not exclusive to IT or the cloud, but that is of course the context we will discuss them in. Unit metrics can be used for a variety of purposes and are often confused with KPIs (Key Performance Indicators). A KPI is used to monitor one or more unit metrics, and if your unit metric is not well designed, your KPI is unlikely to be of much use, at least not for what you initially intended it for.
To measure cloud efficiency despite fluctuations in overall cloud spend, it is important to know whether costs are following a certain trend in relation to something else. Normally that is what a KPI, based on a unit metric, tells you. And that trend is only as reliable as the unit metric(s) behind it.
If unit metrics are so important, then why are they not being used everywhere to measure cloud efficiency? Mostly because complex organizations require unit metrics at the workload level. Even with a mature FinOps team, it is not possible for a centralized team to analyze efficiency for every workload. Efficiency KPIs are unique to each team, product line, BU, service, and so on. However, a mature FinOps team can have a big impact by driving a data-centric approach that leads to unit metrics. And those are fundamental for good KPIs.
The central role of the FinOps team
A FinOps team can stimulate DevOps teams to create their own efficiency metrics and track them. These metrics can relate exclusively to cloud, but they should also tie back to the business. The FinOps team cannot define unit metrics for all workloads, but it can require DevOps teams to do so (with some guidance, of course) and measure the adoption of these practices and guidelines. The question then becomes: how many teams have created their own efficiency metrics in relation to cloud?
Let me be more prescriptive with an example. The FinOps team issues a guideline to help teams build their own Efficiency KPIs. All the FinOps team should track is how many teams/workloads there are, and how many are tracking their efficiency KPIs. Something like this:
Central Cloud Efficiency Measure
- Goal: Ensure workload efficiency is being tracked across the organization by the DevOps teams.
- How: each DevOps team reports their workloads to the CCoE (FinOps Team) and defines workload metrics that will be tracked;
- Target: Within 12 months all workloads must have efficiency dashboards/reports (enabled by the FinOps team, but managed and tracked by the DevOps teams)
- FinOps Team Central Measures: # of workloads, and # of workloads with dashboards/reports
- Metric: % of workloads with dashboards/reports (see the sketch after this list)
- Rationale: Ensures the organization is focusing on cloud efficiency, and not on cloud cost without context.
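As a minimal sketch, this is how the central adoption metric could be computed, assuming a simple workload inventory (the workload names and fields here are purely illustrative):

```python
# Illustrative adoption tracking for the central FinOps measure above.
# The inventory and its fields are assumptions, not a real data source.
workloads = [
    {"name": "track-and-trace-app", "has_efficiency_dashboard": True},
    {"name": "billing-service", "has_efficiency_dashboard": False},
    {"name": "route-planner", "has_efficiency_dashboard": True},
]

tracked = sum(1 for w in workloads if w["has_efficiency_dashboard"])
adoption_pct = 100 * tracked / len(workloads)
print(f"{tracked}/{len(workloads)} workloads with efficiency dashboards ({adoption_pct:.0f}%)")
```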
Note that the FinOps team will not be responsible for the individual results of each workload, but for driving the adoption of the guidelines and eventually helping DevOps teams optimize their workloads. By following these guidelines, DevOps teams can understand how workload output is measured against business success. Each workload typically has a small set of major outputs that indicate performance. The DevOps teams can define which unit metrics to use. These metrics will be used to understand the efficiency of the workload and/or the cost of each business output.
Efficiency KPIs – a practical FinOps example
So far I have said that efficiency KPIs require good unit metrics, and that in more complex organizations they can be influenced and tracked centrally by a FinOps team, but not created by it. But what is a good unit metric for meaningful efficiency KPIs?
To answer that, let’s look at how a team can arrive at an efficiency KPI. Take for example a logistics company that delivers packages, called “Global Post Services”, or GPS for short. We all love a good track and trace app. Basically, the app is based on a database table with a unique ID, the track and trace code (TTC), and all the info about the package. Every time the package moves to a new location it gets scanned, and the update is added to the table.
When the TTC is generated, it is sent to the customer who can use it to query the table through the lovely app and get the latest info about their package. Let’s assume here that once a package is marked as delivered, the information is deleted within 45 days, and the TTC is decommissioned.
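As a rough sketch of the data model described above (field names and types are assumptions for illustration, not GPS’s actual schema):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Illustrative shape of one row in the tracking table described above.
@dataclass
class TrackingRecord:
    ttc: str                                   # unique track and trace code
    package_info: dict                         # destination, size, service level, etc.
    scans: list = field(default_factory=list)  # one entry per location scan
    delivered_at: Optional[datetime] = None    # set when the package is marked delivered

    def add_scan(self, location: str) -> None:
        """Record a scan event each time the package reaches a new location."""
        self.scans.append((datetime.now(), location))
```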
A unit metric that relates to the business could be how much a TTC costs (cost per TTC). A KPI would then be to keep the cost per TTC lower than $X. The math is simple: total cloud cost of the app, divided by the number of TTCs. If that is lower than $X, your KPI is met. One can easily fall under the impression that such a KPI gives you real insight into your cloud efficiency, but in reality, it doesn’t.
Let’s put this into numbers: take a KPI target that says 100 TTCs should cost less than $1.06. Now, if in January 100 codes cost $1.05, and in February 100 codes cost you $1.10, at a glance you would think that January was more efficient than February. But what can you do with that? How can you tell whether your DevOps team is being inefficient? It can very well be that 100% of the increase is generated by customer behavior, which in turn could have been driven by factors the DevOps team does not control. But if you are part of the team developing that app, you are aware of all the tiny measures and variables that can impact how much your app costs. Let us elaborate on our scenario to clarify that.
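As a minimal sketch of that naive check, using the illustrative numbers above:

```python
# Naive cost-per-TTC KPI check with the example numbers (not real data).
KPI_TARGET_PER_100_TTC = 1.06  # target: 100 TTCs should cost less than $1.06

def cost_per_100_ttc(total_cloud_cost: float, ttc_count: int) -> float:
    return total_cloud_cost / ttc_count * 100

january = cost_per_100_ttc(1.05, 100)   # $1.05 for 100 codes
february = cost_per_100_ttc(1.10, 100)  # $1.10 for 100 codes

print(january < KPI_TARGET_PER_100_TTC)   # True  -> KPI met
print(february < KPI_TARGET_PER_100_TTC)  # False -> KPI missed, but it says nothing about why
```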
Independent FinOps variables
Imagine I send an important document to a notary, and I want to make sure it arrives. I pay a little extra for next-day delivery and a TTC. Since I am sending it to a notary in a nearby town, my package is likely to arrive within a business day. I have that expectation, so I will check the app maybe twice a day until it is delivered. That is, unless a full business day goes by with no delivery confirmation; then I will be refreshing that app every hour and am likely to click the “track my package” button 5-10 times. Let’s not forget that, depending on how the app is designed, each time the “track my package” button is pressed it could trigger an actual query that generates cost.
Now, if I have purchased 12 bottles of Italian wine from Sicily and want them delivered to The Netherlands, I will get a different expected delivery time (EDT), let’s say 10 days. I am likely to check the code once a day. That is, until I get an update. Because if on the first day it moved from Sicily to Milan, in my head there is a chance it arrives in The Netherlands in less than 10 days. In that scenario, I am likely to check every morning, afternoon, and evening, anxious for an update.
This not-so-far-fetched example illustrates that customer behavior is an important variable that only someone close to the workload can identify. If you can say that a change in customer behavior is driving the increase, you can explore how to cope with it. Our company GPS cannot guarantee that external conditions won’t influence the delivery time, but it can assume that packages with an estimated delivery time longer than 2 days will not have significant updates every hour or every four. The team can then decide to cache results and refresh them every 12 hours, or even once a day, which limits the influence that customer behavior has on the app cost.
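A minimal sketch of that caching idea, assuming a hypothetical fetch_status_from_db() that stands in for the costly live query:

```python
import time

CACHE_TTL_SECONDS = 12 * 3600  # refresh long-EDT packages at most every 12 hours
_cache: dict = {}              # ttc -> (timestamp, status)

def fetch_status_from_db(ttc: str) -> dict:
    # Placeholder for the real (costly) database query.
    return {"ttc": ttc, "status": "in transit"}

def get_status(ttc: str, edt_days: int) -> dict:
    """Return tracking status, serving cached results for long-EDT packages."""
    if edt_days <= 2:
        return fetch_status_from_db(ttc)       # short EDT: always query live
    cached = _cache.get(ttc)
    if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]                       # cached result, no query cost
    status = fetch_status_from_db(ttc)
    _cache[ttc] = (time.time(), status)
    return status
```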
Putting it in numbers
If I am the product owner (PO) and I am looking for a more meaningful efficiency metric, I can look at a unit metric that considers the types of TTC (distance, international, size…) and how long they are active. The new unit metric then becomes TTC cost per type, per active day.
A type 1 TTC could be the simplest: it is national and is expected to be active for 2 days. If I have 2000 TTCs that were active for 1 day and 2000 that were active for 2 days, that makes a total of 4000 codes with an average active duration of 1.5 days.
If the total cost of the solution for the month is $500 and I want to know the cost per code per active day, I divide the total cost by the number of codes and by the average active days ($500 / 4000 / 1.5 = $0.0833).
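Here is the January calculation as a minimal sketch, using the illustrative numbers above:

```python
# January: cost per code per active day for TTC type 1 (example numbers only).
codes_by_active_days = {1: 2000, 2: 2000}  # active days -> number of TTCs
total_cost = 500.0                         # monthly cloud cost of the app in $

total_codes = sum(codes_by_active_days.values())
avg_active_days = sum(days * n for days, n in codes_by_active_days.items()) / total_codes

cost_per_code_per_active_day = total_cost / total_codes / avg_active_days
print(avg_active_days)                          # 1.5
print(round(cost_per_code_per_active_day, 4))   # 0.0833
```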
The following month, the scenario for TTC type 1 changes a bit. We still had 4000 codes, but 500 were active for 1 day and 3500 were active for 2 days. The average active duration is 1.875 days. This month, our app cost us $700.
Here is the moment of truth for our unit metric “average active duration”. Will it be a valid one? If our KPI based on this unit metric has increased, then the answer is yes.
Our assumption was that customer behavior influences the TTC app cost, and that the longer a TTC is active, the more expensive our app will be. Again, the total cost is divided by the number of codes and the average active days ($700 / 4000 / 1.875 = $0.0933). In this case we are onto something: we have a driving factor that directly relates to our KPI.
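Putting the two months side by side, a minimal sketch (illustrative numbers only):

```python
# Compare the unit metric across the two example months.
def unit_cost(total_cost: float, codes_by_active_days: dict) -> float:
    total_codes = sum(codes_by_active_days.values())
    avg_days = sum(d * n for d, n in codes_by_active_days.items()) / total_codes
    return total_cost / total_codes / avg_days

january = unit_cost(500.0, {1: 2000, 2: 2000})   # ~0.0833
february = unit_cost(700.0, {1: 500, 2: 3500})   # ~0.0933
print(february > january)  # True: the unit cost rose along with the average active duration
```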
What if it was not? What if the TTC cost per type, per active day was $0.0833 in January and decreased to $0.08 in February? Then this is not the right unit metric for us, since an increase in active time should increase the cost. We would need to find a more reliable driving factor to be our unit metric, and there is no better place to start looking for it than with your DevOps team.
Unit metrics are unique to their context. Building KPIs around arbitrarily defined unit metrics is a shot in the dark when it comes to cloud efficiency. Doing that and then pinning any efficiency decrease on a DevOps team is nothing but unfair.
Concluding
To summarize, we must be conscious of the limitations a centralized FinOps team has, but also of its important influence over cloud efficiency. To increase your cloud efficiency, FinOps teams should work with DevOps teams so they can identify good unit metrics and build meaningful KPIs. Get your DevOps teams involved and engaged with your FinOps practice. They will learn more about their workloads and do their jobs better. I hope this example helps you show them how!