Migrating One Region Without a Maintenance Window Meltdown

A retail platform needed to leave an overpriced region before contract renewal. Leadership wanted a single Saturday cutover; engineering pushed back with a phased plan instead.

Phase one: stateless frontends

The team duplicated load balancers and autoscaling groups in the target region, lowered DNS TTLs a week ahead, and shifted 10% of traffic with weighted records. Errors surfaced within minutes, while only a tenth of traffic was exposed, rather than after the whole stack had moved.
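
The gating logic behind a weighted shift like this can be sketched in a few lines. This is only an illustration, not the team's tooling: the function name, the step sizes after 10%, and the 1% error ceiling are all assumptions made for the example.

```python
def next_weight(current_pct, error_rate, max_error_rate=0.01,
                steps=(10, 25, 50, 100)):
    """Return the next share of traffic (percent) for the target region.

    Assumed policy: roll back to 0 when the observed error rate exceeds
    the ceiling, otherwise advance to the next step in the ramp.
    Only the 10% first step comes from the write-up; the rest is hypothetical.
    """
    if error_rate > max_error_rate:
        return 0  # roll back: send all traffic to the old region
    for step in steps:
        if step > current_pct:
            return step
    return 100  # already fully shifted


if __name__ == "__main__":
    print(next_weight(0, 0.0))     # first shift
    print(next_weight(10, 0.002))  # healthy, advance
    print(next_weight(10, 0.05))   # unhealthy, roll back
```

A rollback only takes effect as fast as resolvers drop the old answer, which is why the TTLs were lowered a week before the first shift.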

Phase two: data paths

Stateful services ran as read replicas in the target region for two weeks. Writes kept going to the old region until replication lag had stayed under an agreed threshold for seven consecutive days.
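
The seven-day rule is easy to state and easy to get subtly wrong, so here is a minimal sketch of the check. It assumes one lag reading per day (the worst value observed that day, in seconds), ordered oldest to newest; the function name and the units are hypothetical, and the threshold itself is whatever the team agreed on.

```python
def ready_for_write_cutover(daily_max_lag_seconds, threshold_seconds,
                            required_days=7):
    """True if the most recent `required_days` readings are all under threshold.

    The streak resets on any day that meets or exceeds the threshold, so
    a good week followed by a bad day does not count.
    """
    streak = 0
    for lag in daily_max_lag_seconds:
        if lag < threshold_seconds:
            streak += 1
        else:
            streak = 0
    return streak >= required_days


if __name__ == "__main__":
    history = [4.0, 1.2, 0.8, 0.9, 1.1, 0.7, 0.6, 0.9]  # made-up readings
    print(ready_for_write_cutover(history, threshold_seconds=2.0))
```

Counting only the trailing streak is a deliberate choice here: the gate should describe the current state of replication, not the best week it ever had.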

What almost went wrong

A hard-coded region name in a background job sent exports to the wrong bucket for a weekend. The job had never been on the traffic map because it was "internal only."
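
A plain text search for the old region's name, run across every repository including internal jobs, is one cheap way to catch this class of problem before cutover. The sketch below is an assumption about how such a check might look, not what the team ran; the function name and the region string "old-region-1" are placeholders.

```python
def find_hardcoded_region(source_text, region):
    """Return (line_number, stripped_line) for each line mentioning `region`.

    Uses a simple substring match on purpose, so that names embedding
    the region (for example a bucket name) are reported too.
    """
    hits = []
    for line_no, line in enumerate(source_text.splitlines(), start=1):
        if region in line:
            hits.append((line_no, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical job config; the names are invented for illustration.
    sample = (
        'EXPORT_BUCKET = "exports-old-region-1"\n'
        'REGION = "old-region-1"\n'
        'BATCH_SIZE = 500\n'
    )
    for line_no, text in find_hardcoded_region(sample, "old-region-1"):
        print(line_no, text)
```

A substring match will produce some false positives, but for a one-off migration audit that is usually a better trade than missing a job that was never on the traffic map.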

Takeaway: migrations are sequencing problems. Dates on a slide are less useful than ordered traffic shifts and honest TTL planning.