W29 - Business Metric Monitoring & Full-Chain Gradual Rollout

Two takeaways emerged from Wednesday's US online payment incident.

First, a brief review and postmortem. I participated in several critical stages of this case, from problem detection, response, and investigation through to eliminating the impact. Although the final scope and consequences were not particularly severe, in hindsight the risk exposure over the course of the incident revealed some of our systemic vulnerabilities and is well worth learning from.

  • Problem detection and response were passive and slow. The incident occurred on the online banking payment channel: after users were redirected to online banking, both upstream and downstream systems were waiting for the unified payment success callback. The anomaly happened on the bank's side of the payment flow, and neither the cashier front end nor the back end detected it, so the issue was discovered only when customers reported it to the business team, which then informed the cashier team. As a result, when a large volume of complaints reached us on Wednesday afternoon, we had no clear idea how long the issue had been occurring or which release had introduced it. That greatly complicated the investigation phase.

  • Investigation was chaotic and disorderly. As noted above, because we could not determine which release introduced the problem, the investigation had to cover the entire chain from the front end to the farthest downstream systems, pulling everyone in and creating chaos on site. The first reaction when handling an incident should be to stop the loss and roll back promptly, but in our situation no one knew which node, at what point in time, or which deployed version had caused the issue, so root-cause localization took considerably longer.

From the above, I drew two lessons.

  • First: firmly advance business-metric monitoring. For payments, the most important business metric is the payment success rate. At first I did not fully appreciate the benefits of monitoring this metric in real time, but this case gave me the impetus to push for real-time business-metric monitoring. In scenarios that depend on third parties, such as redirects to banks, WeChat, or Alipay, we cannot perceive anomalies on the dependency's side except by monitoring the payment success rate in real time. Currently, the B-side payment success rate is produced only as a D+1 report; it is neither real-time nor monitored. There are two implementation paths toward real-time monitoring. One is to work with the cashier back end and the data team to build the capability; we tried this before, but the other side lacked motivation, and this incident may change their attitude. The other is for the front end to build it independently via Raptor: the front end can define custom composite metrics to track the payment success rate trend, and although the absolute values measured on the front end are not precise, they are sufficient to detect problems in time (see the sketch after this list). In the Q3 plan, building the cashier front-end monitoring system will be prioritized.

  • Second: advance full-chain canary releases for B2B payments. After this incident, the US payments team set up a working group to improve monitoring and alerting, and we can use the opportunity to push canary releases across both the front end and the back end. Currently the cashier front end has two optional canary strategies. One routes by the trailing digits of the tradeno, which can send the same user's successive orders to different cashier versions and severely harms the user experience. The other deploys canary instances on the front end and validates online with a whitelist of users before a full rollout. Neither is optimal. In Q3 I hope we can establish full-chain canary releases keyed by a unique user identifier, to ensure safe, smooth version iterations and to lay the groundwork for A/B testing; a sketch of user-keyed bucketing appears after the monitoring example below.
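To make the front-end path concrete, here is a minimal sketch of tracking the payment success rate in a sliding window on the client. The `SuccessRateTracker` class and the `reportMetric` callback are illustrative assumptions, not the actual Raptor API; in practice the report step would call whatever custom composite metric interface the platform provides.

```typescript
// Sliding-window success-rate tracking on the cashier front end.
// `reportMetric` is a hypothetical stand-in for the monitoring platform's
// custom-metric API; the aggregation logic is the point of the sketch.

type Outcome = "success" | "failure";

interface PaymentEvent {
  channel: string;   // e.g. "online_banking", "wechat", "alipay"
  outcome: Outcome;
  timestamp: number; // epoch milliseconds
}

class SuccessRateTracker {
  private events: PaymentEvent[] = [];

  constructor(
    private windowMs: number,
    private reportMetric: (channel: string, rate: number, samples: number) => void,
  ) {}

  // Record the outcome of one payment attempt.
  record(channel: string, outcome: Outcome, now = Date.now()): void {
    this.events.push({ channel, outcome, timestamp: now });
    this.evict(now);
  }

  // Compute and report the success rate per channel over the sliding window.
  flush(now = Date.now()): void {
    this.evict(now);
    const byChannel = new Map<string, { ok: number; total: number }>();
    for (const e of this.events) {
      const agg = byChannel.get(e.channel) ?? { ok: 0, total: 0 };
      agg.total += 1;
      if (e.outcome === "success") agg.ok += 1;
      byChannel.set(e.channel, agg);
    }
    for (const [channel, { ok, total }] of byChannel) {
      this.reportMetric(channel, ok / total, total);
    }
  }

  // Drop events that have fallen out of the window.
  private evict(now: number): void {
    this.events = this.events.filter((e) => now - e.timestamp <= this.windowMs);
  }
}

// Usage: record an outcome when the callback (or a timeout) resolves, and flush
// periodically so the trend updates even when traffic is low.
const tracker = new SuccessRateTracker(5 * 60 * 1000, (channel, rate, samples) => {
  console.log(`success_rate channel=${channel} rate=${rate.toFixed(3)} n=${samples}`);
});
tracker.record("online_banking", "failure");
tracker.record("online_banking", "success");
tracker.flush();
```

Reporting the sample count alongside the rate lets the alerting side discount low-traffic windows where the rate is inherently noisy.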
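And here is a minimal sketch of canary bucketing keyed by a unique user identifier, assuming a stable user id is available at routing time. The `fnv1a32` and `inCanary` helpers are hypothetical names used for illustration; the key property is that the same user always lands in the same bucket, unlike routing by tradeno digits.

```typescript
// Deterministic user-keyed canary bucketing: hash the user id into [0, 100)
// and compare against the rollout percentage, so one user stays on one
// cashier version across all of their orders.

function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Returns true if this user falls into the canary cohort at the given percentage.
function inCanary(userId: string, rolloutPercent: number): boolean {
  const bucket = fnv1a32(userId) % 100; // stable bucket in [0, 100)
  return bucket < rolloutPercent;
}

// Usage: route the cashier front end by the flag; one way to make the rollout
// full-chain rather than front-end only is to forward the same flag downstream,
// for example as a request header, so back-end services pick the matching version.
const version = inCanary("user-12345", 5) ? "cashier-canary" : "cashier-stable";
console.log(version);
```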
