W05 - Risk Exposure of the Payment SDK

In the B2B payment QuickDonkey experience optimization initiative, we redefined a set of user-session-level metrics so that we could measure changes before and after optimization and describe B2B user experience and service performance more objectively. One task was to extend the data reported by the current JS SDK to match the new metric structure.

The JS SDK is a critical component of the checkout: the vast majority of checkout traffic uses it to access payment services, with an estimated 500k calls per day. Because of its large impact surface, I personally set a high bar for making changes to the SDK. Its logic has also been very stable: in the year I have worked on B2B payments, it has not gone through a single iteration.

The development effort and scope to add the tracking data were small, but I delayed the release for a week. It wasn’t procrastination; I was hesitant. The release process exposes a chain of risks, and it takes time to mentally prepare for them.

The SDK release process feels uncontrollable. It lacks visibility and has no clear control points at key stages. The SDK is hosted on Burst, and judging by Burst’s dashboard and roadmap, the platform itself looks somewhat neglected. After the code is committed, deployment relies on a single command-line tool provided by Burst. That command runs as a black box; when it finishes, Burst’s origin machines have been updated. Then we must manually refresh the CDN, which is like blowing on a dandelion: you have no idea where the seeds will land. We have to guess from experience how far the CDN nodes have refreshed, whether they are fully updated, and, because of client-side caching, when end users will actually receive the new SDK. The release cannot be throttled for a gradual rollout, there is no fast rollback plan, and, more importantly, the whole flow is silent: there is no approval step and no notification to the relevant people.
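To make the missing "throttle" concrete, here is a minimal sketch of percentage-based gradual release, assuming a small loader decides which SDK build a session receives. Everything in it (`RolloutConfig`, `pickSdkUrl`, the example URLs) is hypothetical and is not part of Burst or the current SDK.

```typescript
// Hypothetical rollout configuration; none of these names come from Burst or the real SDK.
interface RolloutConfig {
  stableUrl: string;     // last known-good build
  canaryUrl: string;     // new build being rolled out
  canaryPercent: number; // 0..100; setting it to 0 acts as a rollback switch
}

// FNV-1a style 32-bit hash over UTF-16 code units.
// It is deterministic, so a given session always lands in the same bucket.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Map a session id to a bucket in the range 0..99.
function bucketOf(sessionId: string): number {
  return fnv1a(sessionId) % 100;
}

// Sessions whose bucket falls below the canary percentage get the new build.
function pickSdkUrl(sessionId: string, config: RolloutConfig): string {
  const percent = Math.min(100, Math.max(0, config.canaryPercent));
  return bucketOf(sessionId) < percent ? config.canaryUrl : config.stableUrl;
}
```

With this shape, raising `canaryPercent` step by step would be the gradual release, and dropping it back to 0 would be the fast rollback. It only helps if the configuration itself can be changed quickly, which is an assumption the current setup does not obviously satisfy.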

After release, we can only monitor SDK internal exceptions. For metrics like call volume and service stability, we depend on information from Burst and the CDN provider, but the completeness and accuracy of that data are questionable.
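One way to stop guessing at refresh progress would be for the SDK itself to report which version each session is running. The sketch below assumes such a version-tagged report exists; the `SdkHeartbeat` shape and the `versionShare` helper are hypothetical and are not something the SDK reports today.

```typescript
// Hypothetical report emitted once per session by the SDK.
interface SdkHeartbeat {
  sdkVersion: string;
  sessionId: string;
}

// Compute the fraction of (version, session) pairs seen on each SDK version.
// Duplicate heartbeats from the same session on the same version count once.
function versionShare(beats: SdkHeartbeat[]): Map<string, number> {
  const sessionsByVersion = new Map<string, Set<string>>();
  beats.forEach((beat) => {
    let sessions = sessionsByVersion.get(beat.sdkVersion);
    if (!sessions) {
      sessions = new Set<string>();
      sessionsByVersion.set(beat.sdkVersion, sessions);
    }
    sessions.add(beat.sessionId);
  });

  let total = 0;
  sessionsByVersion.forEach((sessions) => {
    total += sessions.size;
  });

  const share = new Map<string, number>();
  sessionsByVersion.forEach((sessions, version) => {
    share.set(version, total === 0 ? 0 : sessions.size / total);
  });
  return share;
}
```

Watching the share of the new version climb over time would give a first-party view of how far a release has actually reached end users, independent of the CDN provider's numbers.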

I consulted Owl and LX about the issues above. None of us had a perfect solution, but their input gave us ideas for future release governance. A major difference between the JS SDK and native SDKs is the version-management challenge introduced by full dynamism. If the business side embeds an SDK URL that must not change, that effectively hides the concept of versioning from them.
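As an illustration of that versioning problem, here is a minimal sketch, assuming the fixed URL served only a thin loader that resolves an immutable, versioned file from a small manifest. The `ReleaseManifest` shape, the `/<version>/sdk.js` path layout, and the function names are all hypothetical; this is not what Owl or LX proposed, and not how the SDK is hosted today.

```typescript
// Hypothetical manifest that a thin loader at the fixed URL would read.
interface ReleaseManifest {
  current: string;  // version served to everyone
  previous: string; // last known-good version, kept for rollback
}

// Build the immutable, versioned URL for a given release.
// The "<base>/<version>/sdk.js" layout is an assumption for illustration.
function versionedUrl(baseUrl: string, version: string): string {
  if (!/^\d+\.\d+\.\d+$/.test(version)) {
    throw new Error(`invalid version: ${version}`);
  }
  return `${baseUrl.replace(/\/+$/, "")}/${version}/sdk.js`;
}

// Rolling back swaps the two entries; no published file is overwritten.
function rollback(manifest: ReleaseManifest): ReleaseManifest {
  return { current: manifest.previous, previous: manifest.current };
}
```

The business side would still embed one URL that never changes, while each release would live at its own address, so a CDN node could never hold a half-refreshed copy of a given version.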

The SDK also has other internal issues, such as polluting the global window object. This year we plan to include SDK governance in the B2B front-end roadmap. With current daily volumes of hundreds of thousands of calls it’s manageable; at millions per day it would become a serious risk.
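For the global-pollution issue, a minimal sketch of one common remedy: expose a single, frozen namespace object and keep everything else in closure scope. The `PaySdk` interface, the `__PAY_SDK__` key, and `installSdk` are hypothetical names chosen for illustration, not the SDK's real API.

```typescript
// Hypothetical public surface of the SDK.
interface PaySdk {
  version: string;
  pay(orderId: string): string;
}

// Install the SDK under exactly one property of `root` (for example globalThis).
// Internal state stays in the closure instead of leaking onto the global object.
function installSdk(root: Record<string, unknown>, name = "__PAY_SDK__"): PaySdk {
  const existing = root[name] as PaySdk | undefined;
  if (existing) {
    return existing; // already installed: do not overwrite
  }

  let callCount = 0; // private; not reachable from outside

  const sdk: PaySdk = Object.freeze({
    version: "0.0.0-example",
    pay(orderId: string): string {
      callCount += 1;
      return `${orderId}#${callCount}`;
    },
  });

  // Non-writable and non-enumerable, so other scripts cannot replace it
  // by accident and it does not show up when the global object is enumerated.
  Object.defineProperty(root, name, {
    value: sdk,
    writable: false,
    configurable: false,
    enumerable: false,
  });
  return sdk;
}
```

In a browser this would be called as `installSdk(globalThis as unknown as Record<string, unknown>)`; passing the root explicitly also keeps the function testable outside a page.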


Last updated 3 years ago