W52 - Frontend Production Safety as Seen from D2

I recently skimmed the talks from D2, which this year included a track on frontend production safety for the first time. The speakers shared some of Alibaba’s thinking and practice in production safety. I found it fairly enlightening, and it gave me a clear sense of the direction in which production-safety infrastructure should iterate.

They divided building the whole system from 0 to 1 into three stages: single-point production-safety assurance, production-safety assurance across multiple independent silos, and systematic frontend production-safety assurance.

On that scale, our current state corresponds roughly to the end of the first stage, around the 0.3 mark. We are basically at the single-point assurance stage: Raptor online monitoring, plus some auxiliary protection strategies that are neither standardized nor user-friendly and deliver limited benefit, such as static code scanning, engineering-standard checks, and coverage reports.

We need to fill capability gaps across UI automated regression, canary (gray) monitoring, intelligent problem diagnosis, and automatic fault recovery. Completing these will bring us close to the third stage.

The third stage connects the various protection strategies, systems, and platforms into a comprehensive, systematic production-safety environment. For example: closing the gap between separate frontend and backend releases to improve coordinated canary deployments and full-chain load-testing capability; leveraging Cloud IDEs and developer behavior data to better integrate the entire development pipeline; and improving automation collaboration with testing to raise the proportion of tests that can be skipped, thereby increasing efficiency.

In short, the guiding principle of production safety remains unchanged: protect people through machines and mechanisms.
