W29 - Business Metric Monitoring & Full-Chain Gradual Release

From the online incident in US payments on Wednesday, two lines of thought emerged.

First, a brief review and after-action reflection. I was involved in several key stages of this case, from problem detection, response, and investigation through to eliminating the impact. Although the final scope and consequences were not particularly severe, a look back at the timeline shows we were continuously exposed to risk, which to some extent revealed our own systemic weaknesses and is well worth learning from.

  • Problem detection and response were passive and slow. The incident occurred in the online banking payment channel. After users were redirected to the bank's site, both upstream and downstream systems waited for a unified payment-success callback. The anomaly in the user payment flow happened on the bank's side, and neither the checkout frontend nor the backend detected it, so the issue was discovered through customer reports: customers reported to the business team, and the business team reported to the checkout team. Even after receiving a large number of reports on Wednesday afternoon, we still didn't know how long the problem had been occurring or which release had introduced it. That made the investigation much harder.

  • The investigation was chaotic and unstructured. As noted above, because we couldn't identify the release that introduced the problem, the investigation became a full end-to-end sweep from the frontend down to the most downstream services, involving every role and creating disorder on the scene. The first response in handling an incident should be loss containment and a timely rollback, but in that situation nobody knew which node, at what time, or which deployed version had caused the issue, so fault localization took a long time.

The above leads me to two insights.

  • First: resolutely advance business-metric monitoring. For payments, the most important business metric is payment success rate. At first I didn't fully appreciate the benefits of monitoring this metric in real time, but this case gave me the impetus to push for real-time business-metric monitoring. In scenarios that depend on third parties, such as redirects to banks, WeChat, or Alipay, we cannot perceive abnormalities on the dependent side; only real-time monitoring of payment success rate can reveal them. Currently the B-side payment success rate is produced as a D+1 report: not real-time and not monitored. There are two implementation paths for real-time monitoring. One is to drive capability building in the checkout backend and the data team; we tried that before but met with insufficient motivation on their side, though their attitude may change after this incident. The other is for the frontend to build it independently in Raptor: the frontend can define composite custom metrics to monitor the payment-success-rate trend period over period. While frontend absolute values may not be precise, they are sufficient to detect issues promptly. In Q3 planning, building the monitoring system for the checkout frontend will be a priority.

  • Second: advance end-to-end canary releases for B2B payments. After the incident, the US payments team formed a dedicated group to optimize monitoring and alerting; we can use this opportunity to push for full-stack canary releases. Currently the checkout frontend has two optional canary strategies. One is traffic splitting by the last digit of the tradeno, but this can route multiple orders from the same user to different checkout versions, severely hurting the experience. The other is to deploy canary machines on the frontend and configure a whitelist of users for online validation, then roll out to all users once validation passes. Neither is optimal. In Q3 we should aim to establish end-to-end canary releases keyed on a unique user identifier, ensuring safe and smooth version iterations going forward and laying the groundwork for A/B testing capabilities.
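The period-over-period success-rate check from the first insight can be sketched as below. This is a minimal illustration, not the actual Raptor metric API: the window representation (success/total counts), function names, and the 20% drop threshold are all assumptions for the sketch.

```python
def success_rate(success: int, total: int) -> float:
    """Payment success rate for one time window; treat an empty window as healthy."""
    return success / total if total else 1.0

def pop_drop(current: tuple[int, int], previous: tuple[int, int]) -> float:
    """Relative period-over-period drop in success rate (0.0 means no drop).

    `current` and `previous` are (success, total) counts for the same window
    in this period and the comparison period (e.g. same hour yesterday).
    """
    cur = success_rate(*current)
    prev = success_rate(*previous)
    if prev == 0:
        return 0.0
    return max(0.0, (prev - cur) / prev)

def should_alert(current: tuple[int, int],
                 previous: tuple[int, int],
                 threshold: float = 0.2) -> bool:
    """Fire an alert when the rate drops more than `threshold` period over period."""
    return pop_drop(current, previous) > threshold

# Rate fell from 95% to 70%, a ~26% relative drop, so the check fires.
print(should_alert((700, 1000), (950, 1000)))
```

Comparing the trend against the same window in a prior period, rather than against a fixed absolute threshold, is what makes imprecise frontend counts usable: the bias cancels out between periods.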
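For the second insight, routing by a stable user identifier instead of the tradeno's last digit keeps every order from the same user on the same checkout version. A minimal sketch, assuming a string user id is available end to end; the hashing scheme and bucket count here are illustrative choices, not our actual routing layer:

```python
import hashlib

def canary_bucket(user_id: str, buckets: int = 100) -> int:
    """Map a stable user identifier to a deterministic bucket in [0, buckets)."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def in_canary(user_id: str, rollout_percent: int) -> bool:
    """True when this user falls inside the current canary rollout percentage.

    Unlike splitting by the last digit of tradeno, the same user always lands
    in the same bucket, so repeat orders see a consistent checkout version.
    """
    return canary_bucket(user_id) < rollout_percent
```

A gradual release then just raises `rollout_percent` step by step (e.g. 1, 5, 25, 100), and the same user-keyed bucketing can later drive A/B test assignment.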
