W23 - A Stability Metric That Accurately Describes Customer Impact

Background

Why did I start thinking about stability metrics? Several recent events together led to this line of thought. First, I was preparing disaster recovery work for Yugong; second, improving availability is the team's O2 this year, and I need to help Zhongze with some planning; third, a course on data-driven thinking that Jerry recommended earlier gave me considerable insight.

Some basic concepts

We often say services should be highly available, highly reliable, and highly stable. Broadly speaking, one might think a single success-rate–type metric is sufficient. But narrowly defined, availability, reliability, and stability are not equivalent.

Availability. The proportion of time a system operates normally during an observation period. It describes whether the system can be used.

Reliability. The probability that a system runs continuously without failure over a given time interval and under given conditions. It describes whether the system behaves correctly.

Stability. The probability of errors or of performance degradation during a run period under given stress conditions. For front-end systems, the runtime is decentralized and distributed across devices. Therefore, stability describes how well the software performs across different device models and network conditions.

Their relationship can be understood this way: poor reliability will to some extent affect availability, but the reverse is not necessarily true. Availability is a prerequisite for reliability; stability is a further enhancement of reliability.

Next I’ll briefly introduce the concept of expectation. This section comes from the data-driven thinking course mentioned above, which used expectation to explain Murphy’s Law, a very useful application.

Expectation is the probability-weighted average of the possible outcomes, so the expected value is entirely determined by the probability distribution. By the law of large numbers, with enough trials the observed average approaches the expectation. Expectation is therefore an ex ante predictive value, whereas the average we commonly refer to is a post hoc statistical value. English distinguishes the two as well: the expected value (the mean of the distribution) versus the average of observed data.
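To make the distinction concrete, here is a minimal sketch using a fair six-sided die. The die, the function names, and the trial count are my own illustration and do not come from the course. The expectation is computed from the distribution alone, while the average is computed from simulated rolls.

```python
import random

# Hypothetical example: a fair six-sided die (not from the course).
OUTCOMES = [1, 2, 3, 4, 5, 6]
PROBABILITIES = [1 / 6] * 6


def expected_value(outcomes, probabilities):
    """Ex ante value: probability-weighted average of possible outcomes."""
    return sum(x * p for x, p in zip(outcomes, probabilities))


def sample_average(n_trials, seed=0):
    """Post hoc value: plain average of n_trials simulated rolls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rolls = [rng.choice(OUTCOMES) for _ in range(n_trials)]
    return sum(rolls) / len(rolls)


expectation = expected_value(OUTCOMES, PROBABILITIES)  # 3.5, up to float rounding

# As the number of trials grows, the observed average drifts toward 3.5.
for n in (10, 1_000, 100_000):
    print(n, sample_average(n))
```

With few rolls the average can sit noticeably away from 3.5; with many rolls it settles close to it, which is the law of large numbers at work.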

Existing availability metric calculation schemes

The SLAs in software delivery specifications are a set of expectations.

There is a common industry definition for availability: Availability = MTTF / (MTTF + MTTR).

MTTF, Mean Time To Failure: the average time a system operates normally before a failure occurs. The longer the MTTF, the more reliable the system.

MTTR, Mean Time To Repair: the average repair time for a repairable system, i.e., the time from failure occurrence to repair completion. The shorter the MTTR, the more maintainable the system.
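As a quick illustration of the industry formula, the sketch below computes availability from MTTF and MTTR. The function name and the sample numbers are assumptions chosen for the example only.

```python
def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR), both in the same time unit."""
    if mttf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTTF and MTTR must be non-negative")
    total = mttf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTTF + MTTR must be positive")
    return mttf_hours / total


# Hypothetical system: runs 999 hours between failures, takes 1 hour to repair.
three_nines = availability(999, 1)
print(f"{three_nines:.3%}")
```

Note that shortening MTTR raises availability just as lengthening MTTF does, which is why repair speed matters as much as failure avoidance.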

Within the company, availability is defined more concretely. Repair time is defined in detail, and fault loss data is converted into an equivalent amount of time.

The calculation formula is: Availability = (525600 - sum( min {1440 * loss, (t2 - t1)} )) / 525600.

525600 represents 525,600 minutes per year. 1440 represents 1,440 minutes per day. loss represents the daily order loss ratio; for non-transactional businesses you can choose page-view loss or increased complaint volume. t2 - t1 represents the duration of the fault.
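The formula above can be sketched as follows. The function names and the two sample faults are hypothetical; only the constants and the min(...) rule come from the formula itself.

```python
MINUTES_PER_YEAR = 525_600  # 365 days * 1,440 minutes
MINUTES_PER_DAY = 1_440


def fault_minutes(loss, duration_minutes):
    """Time charged to one fault: min(1440 * loss, t2 - t1)."""
    return min(MINUTES_PER_DAY * loss, duration_minutes)


def annual_availability(faults):
    """faults: iterable of (daily loss ratio, fault duration in minutes)."""
    charged = sum(fault_minutes(loss, duration) for loss, duration in faults)
    return (MINUTES_PER_YEAR - charged) / MINUTES_PER_YEAR


# Hypothetical year with two faults:
#   5% daily order loss over 120 minutes -> charged min(72, 120) = 72 minutes
#   1% daily order loss over 10 minutes  -> charged min(14.4, 10) = 10 minutes
faults = [(0.05, 120), (0.01, 10)]
print(f"{annual_availability(faults):.5%}")
```

The min(...) means a fault is never charged more than its actual duration, and a long fault with little order loss is charged only the loss-equivalent time.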

Calculation scheme for the X metric

I tried to find a metric that more accurately describes customer impact. Since I hadn’t settled on a good name yet, I’ll call it the X metric for now.

If we define the expected customer impact as M (Mean), the probability of customer impact occurring as P (Probability), and the severity of customer impact as D (Degree), then M = P * D. For clarity, I use a transactional business as an example. In such a business, P = fault duration / annual time, and D = lost orders / annual orders.

The X metric can then be calculated as follows:

X = (1 - sqrt( (lost orders / annual orders) * (fault duration / annual time) )) * 100%
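A minimal sketch of the X metric for a transactional business, following the formula exactly as written (including the square root applied to D * P). The function name, the parameter names, and the sample figures are assumptions for illustration, and fault duration is measured in minutes against a 525,600-minute year.

```python
import math

MINUTES_PER_YEAR = 525_600


def x_metric(lost_orders, annual_orders, fault_duration_minutes,
             annual_minutes=MINUTES_PER_YEAR):
    """Return the X metric as a percentage in the range 0 to 100."""
    if annual_orders <= 0 or annual_minutes <= 0:
        raise ValueError("annual totals must be positive")
    degree = lost_orders / annual_orders                   # D: severity of impact
    probability = fault_duration_minutes / annual_minutes  # P: chance of impact
    return (1 - math.sqrt(degree * probability)) * 100


# Hypothetical year: 100 of 1,000,000 orders lost across 52.56 minutes of faults.
# D = 1e-4 and P = 1e-4, so sqrt(D * P) = 1e-4.
print(f"{x_metric(100, 1_000_000, 52.56):.4f}%")
```

Because the two ratios are multiplied, a long fault that loses few orders and a short fault that loses many orders can yield the same score.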

Implementing this metric in production requires some careful, practical thinking.
