W23 - A Stability Metric That Accurately Describes Customer Impact
Background
Why did I start thinking about stability metrics? Several recent events converged on the topic. First, I was preparing Yugong's disaster recovery work. Second, improving availability is the team's O2 this year, and I need to help Zhongze with some of the planning. Third, a course on data-driven thinking that Jerry recommended a while ago gave me significant insight.
Some basic concepts
We often say a service should be highly available, highly reliable, and highly stable. Loosely speaking, one might assume a single success-rate-like metric is enough to cover all three. Strictly speaking, however, availability, reliability, and stability are not equivalent.
Availability. The proportion of time a system operates normally during an observed period. It describes whether the system can be used.
Reliability. The probability that a system runs continuously without failure over a given time interval and under given conditions. It describes whether the system behaves correctly.
Stability. The probability of errors, and the trend of performance degradation, within a runtime cycle under a certain load and over continuous operation time. For front-end software, runtime is decentralized and distributed across individual devices, so stability describes how well the software performs across different device models and network conditions.
Their relationship can be understood like this: poor reliability will to some extent affect availability, but the reverse is not always true. Being available is a prerequisite for being reliable; stability is a further improvement of reliability.
Next I’ll briefly introduce the concept of expectation. This part comes from the data-driven thinking course mentioned above, which explained Murphy’s Law in terms of expectation, a very useful application.
Expectation is defined as the probability-weighted average of the possible outcomes, so the expected value is entirely determined by the probability distribution. By the law of large numbers, given enough trials, the observed average converges to the expected value. Expectation is therefore a prediction made beforehand, whereas what we commonly call the average is a statistic computed after the fact. English draws a similar distinction: the expected value is the 'mean', while the observed statistic is the 'average'.
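To make the difference between the predicted value and the observed statistic concrete, here is a minimal sketch using a fair six-sided die. The die, the function names, and the fixed random seed are all illustrative assumptions and do not come from the course.

```python
import random


def expected_value(outcomes):
    """Probability-weighted average of (value, probability) pairs."""
    total_p = sum(p for _, p in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(v * p for v, p in outcomes)


def sample_average(outcomes, trials, seed=0):
    """Observed average over `trials` random draws (seeded for repeatability)."""
    rng = random.Random(seed)
    values = [v for v, _ in outcomes]
    weights = [p for _, p in outcomes]
    draws = rng.choices(values, weights=weights, k=trials)
    return sum(draws) / trials


# Hypothetical example: a fair die, each face with probability 1/6.
DIE = [(face, 1 / 6) for face in range(1, 7)]

if __name__ == "__main__":
    print("expected value:", expected_value(DIE))
    for n in (10, 1_000, 100_000):
        print(n, "trials, observed average:", sample_average(DIE, n))
```

The expected value is known before any roll happens (3.5 for a fair die), while the observed average only approaches it as the number of trials grows.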
Existing availability metric calculation schemes
The SLA in software delivery specifications is a set of expectations.
There is a common industry definition for availability: Availability = MTTF / (MTTF + MTTR).
MTTF, Mean Time To Failure. The average time before failure: how long the system typically runs normally before an incident occurs. The longer the MTTF, the more reliable the system.
MTTR, Mean Time To Repair. The average repair time: for repairable systems, the average time from failure occurrence to completion of repair. The shorter the MTTR, the better the system’s maintainability.
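As a small sketch of the industry formula, the function below computes availability from MTTF and MTTR. The function name and the sample figures (999 hours of normal operation, 1 hour of repair) are assumptions chosen only for illustration.

```python
def availability(mttf, mttr):
    """Availability = MTTF / (MTTF + MTTR); both arguments use the same time unit."""
    if mttf < 0 or mttr < 0:
        raise ValueError("MTTF and MTTR must be non-negative")
    if mttf + mttr == 0:
        raise ValueError("MTTF + MTTR must be positive")
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    # Hypothetical figures: 999 hours between failures, 1 hour to repair.
    print(f"{availability(999, 1):.3%}")
    # Halving the repair time improves availability without changing MTTF.
    print(f"{availability(999, 0.5):.3%}")
```

With these figures availability is about 99.9%, and shortening MTTR raises it even when MTTF stays the same, which is why both terms matter.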
Within the company, availability is defined more concretely: repair time is specified in detail, and incident loss data is converted into an equivalent amount of downtime.
The formula is: Availability = (525600 - sum(min{1440 * loss, (t2 - t1)})) / 525600.
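Here 525600 is the number of minutes in a 365-day year and 1440 is the number of minutes in a day. loss is the daily order loss ratio; for non-transactional businesses you can use pageview loss or the increase in complaint volume instead. t2 - t1 is the duration of the incident in minutes. Each incident therefore counts as the smaller of its loss-equivalent time (1440 * loss) and its actual duration, and the sum runs over all incidents in the year.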
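The company formula can be sketched as follows. The function names and the two sample incidents are hypothetical; only the constants and the min/sum structure come from the formula above.

```python
MINUTES_PER_YEAR = 525_600  # 365 days * 1440 minutes
MINUTES_PER_DAY = 1_440


def incident_minutes(loss_ratio, duration_minutes):
    """Downtime charged to one incident: min(1440 * loss, t2 - t1)."""
    return min(MINUTES_PER_DAY * loss_ratio, duration_minutes)


def annual_availability(incidents):
    """incidents: iterable of (daily loss ratio, duration in minutes) pairs."""
    downtime = sum(incident_minutes(loss, dur) for loss, dur in incidents)
    return (MINUTES_PER_YEAR - downtime) / MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical incidents for one year:
    #   5% daily order loss over a 120-minute incident -> charged 72 minutes
    #   50% daily order loss over a 30-minute incident -> charged 30 minutes
    incidents = [(0.05, 120), (0.5, 30)]
    print(f"{annual_availability(incidents):.5%}")
```

Note how the min caps each incident in both directions: a long incident with little loss is charged by its loss, and a short incident with heavy loss is charged by its duration.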
Calculation scheme for the X metric
I tried to find a metric that can describe customer impact more accurately. Since I haven't settled on a good name yet, I'll call it the X metric for now.
If we define the expected customer impact as M (Mean), the probability of customer impact occurring as P (Probability), and the severity of customer impact as D (Degree), then M = P * D. For clarity, I use a transactional business as the example, where P = incident duration / annual time and D = lost orders / annual orders.
Then the X metric can be calculated as follows:
(1 - sqrt((lost orders / annual orders) * (incident duration / annual time))) * 100%
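A minimal sketch of the X metric calculation follows. The function name, parameter names, and sample numbers are assumptions for illustration. Reading the square root as the geometric mean of P and D is my own interpretation of the formula, not something stated above.

```python
import math


def x_metric(lost_orders, annual_orders, incident_minutes, annual_minutes=525_600):
    """X = (1 - sqrt(D * P)) * 100, returned as a percentage.

    D = lost_orders / annual_orders          (severity of customer impact)
    P = incident_minutes / annual_minutes    (probability of customer impact)
    """
    if annual_orders <= 0 or annual_minutes <= 0:
        raise ValueError("annual totals must be positive")
    d = lost_orders / annual_orders
    p = incident_minutes / annual_minutes
    return (1 - math.sqrt(d * p)) * 100


if __name__ == "__main__":
    # Hypothetical year: 1,000 of 1,000,000 orders lost (D = 0.1%),
    # with 525.6 minutes of incidents (P = 0.1%).
    print(f"X = {x_metric(1_000, 1_000_000, 525.6):.3f}%")
```

With D and P both at 0.1%, the metric comes out at roughly 99.9%. If either the lost orders or the incident duration is zero, the metric is 100%.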
How to put this metric into production still needs careful thought.