W46 - Database Migration Case Study

I read a post, "How We Migrated 1 Billion Records from DB1 to DB2 Without Downtime", which reviews the lessons from a zero-downtime migration of a financial database cluster holding one billion records.

It resonated strongly with our core approach to technical risk control: incremental risk mitigation, where each step can be verified and rolled back independently, and observability first, where continuous monitoring of everything from database WAL logs and cache hit rates to business metrics such as order volume and revenue traffic lets the team detect anomalies quickly. I also noticed a clever trick used in database and backend-service migrations: Shadow Reads.

Data migration is, at its core, a complex system design problem. The author outlines five key design points.

First, perform bulk migration of "cold data" using sharding, parallel processing, and by disabling indexes.
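The bulk step above can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the in-memory dicts stand in for the old and new databases, and the chunk size and worker count are made-up values.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the old and new databases.
old_db = {i: f"record-{i}" for i in range(1, 1001)}
new_db = {}

CHUNK = 250  # rows per batch; a real migration tunes this per shard

def copy_chunk(start, end):
    """Copy one contiguous ID range. In a real bulk load, indexes on the
    target would be disabled here and rebuilt after the load finishes."""
    batch = {i: old_db[i] for i in range(start, end) if i in old_db}
    new_db.update(batch)  # one bulk write per chunk, not row-by-row
    return len(batch)

def migrate_cold_data(max_id, workers=4):
    # Shard the keyspace into ranges, then copy the shards in parallel.
    ranges = [(s, min(s + CHUNK, max_id + 1))
              for s in range(1, max_id + 1, CHUNK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: copy_chunk(*r), ranges))

migrated = migrate_cold_data(1000)
```

The key idea is that cold data is immutable during the copy, so each range can be verified (row counts, checksums) independently.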

Second, introduce a dual-write mechanism combined with a Kafka retry queue to keep real-time traffic synchronized between the old and new databases while preserving idempotency.
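A rough sketch of that dual-write path, under stated assumptions: `queue.Queue` stands in for the Kafka retry topic, the dicts stand in for the two databases, and `fail_new` simulates a transient outage. Idempotency here comes from writes being keyed upserts.

```python
import queue

old_db, new_db = {}, {}
retry_queue = queue.Queue()  # stand-in for the Kafka retry topic

def write(record_id, value, fail_new=False):
    """Dual write: the old DB stays the source of truth. A failed write
    to the new DB is queued for replay instead of failing the request."""
    old_db[record_id] = value  # primary write, always succeeds here
    try:
        if fail_new:
            raise ConnectionError("new DB unavailable")
        new_db[record_id] = value  # keyed upsert, so replays are safe
    except ConnectionError:
        retry_queue.put((record_id, value))

def drain_retries():
    """Replaying a message twice is harmless: the write is idempotent."""
    while not retry_queue.empty():
        record_id, value = retry_queue.get()
        new_db[record_id] = value

write(1, "a")
write(2, "b", fail_new=True)  # simulate a transient failure on the new DB
drain_retries()
```

Because every message carries the record key, consuming the retry topic more than once converges to the same state, which is what keeps the two databases synchronized despite failures.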

Third, use Shadow Reads for online validation. User requests still read from the old database, but each query is quietly executed against the new database in the background and the results are compared. This process ran for weeks and uncovered many issues that testing environments never caught — for example, timezone handling differences, default behaviors for NULL values, and different collation rules. It reminded me that last year’s checkout backend migration likely used a similar tactic, though it’s much harder to replicate on the frontend.
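The mechanic is simple to sketch. In this hypothetical example the two stores disagree only in timezone representation, one of the mismatch classes the post mentions; in a real system the shadow query and comparison would run off the request path.

```python
mismatches = []

# Same logical record, but the new DB serializes the timestamp differently.
old_db = {"order:1": {"total": 100, "ts": "2024-01-01T00:00:00Z"}}
new_db = {"order:1": {"total": 100, "ts": "2024-01-01T00:00:00+00:00"}}

def shadow_read(key):
    """Serve the old DB's result to the caller; quietly compare the new
    DB's answer in the background and record any divergence."""
    primary = old_db.get(key)
    shadow = new_db.get(key)  # real systems do this asynchronously
    if shadow != primary:
        mismatches.append((key, primary, shadow))
    return primary  # the user always sees the old database

result = shadow_read("order:1")
```

The user-facing result is unchanged, yet the mismatch log surfaces exactly the kind of subtle divergence (timezones, NULL defaults, collation) that staging environments miss.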

Fourth, perform the cutover during low-traffic periods with caution, using cache warm-up, rollback plans, and close monitoring to ensure a smooth transition.
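The cutover itself often reduces to a single flag flip guarded by monitoring. A minimal sketch, assuming a hypothetical `READ_FROM_NEW` flag and an invented 1% error-rate rollback threshold:

```python
old_db = {"k": "old"}
new_db = {"k": "new"}
READ_FROM_NEW = False  # hypothetical flag, flipped in a low-traffic window

def read(key):
    return (new_db if READ_FROM_NEW else old_db).get(key)

def cutover(error_rate_after_flip):
    """Flip reads to the new DB, but roll back immediately if the
    post-cutover error rate breaches the (hypothetical) threshold."""
    global READ_FROM_NEW
    READ_FROM_NEW = True
    if error_rate_after_flip > 0.01:
        READ_FROM_NEW = False  # the rollback plan is one flag flip away
        return "rolled back"
    return "cut over"

for key in new_db:
    new_db.get(key)  # cache warm-up: touch hot keys before flipping

status = cutover(0.001)
```

Keeping rollback this cheap is the point: the riskiest moment of the migration becomes a reversible decision rather than a one-way door.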

Finally, the author emphasizes the crucial role of strong observability throughout the migration.

Shadow Reads are hard to apply to the frontend because each user's view is exclusive: once they see the new version, they cannot simultaneously see the old one. Database Shadow Reads are an invisible comparison of result sets; frontend correctness is far more complex than data consistency, spanning interactive behavior, perceived performance, accessibility, and cross-device compatibility, none of which has a low-cost verification method. Gradual traffic rollouts, user behavior analytics, business-metric monitoring, and manual spot checks remain the most effective approaches.
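A gradual rollout is usually implemented with stable hash bucketing, so each user keeps seeing the same version across visits. A small sketch, where the 10% rollout figure is an arbitrary example:

```python
import hashlib

ROLLOUT_PERCENT = 10  # hypothetical: 10% of users get the new frontend

def sees_new_version(user_id: str) -> bool:
    """Deterministic bucketing: the same user always lands in the same
    bucket, so their experience does not flip between versions."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT

# Roughly ROLLOUT_PERCENT of a large user population should qualify.
share = sum(sees_new_version(f"user-{i}") for i in range(10000)) / 10000
```

Unlike a Shadow Read, this gives no automatic result comparison; it only bounds the blast radius while the behavior analytics and business metrics do the verifying.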
