How to revert a failed or regression-causing change in production
This is the standard rollback procedure referenced from Change Management → Rollback. Use it when a deploy fails, a regression is detected, or monitoring alerts indicate that a change is affecting customers.
Default to rollback if any of the following are true:
The error rate, latency, or saturation alert threshold has been exceeded.
A correctness defect is producing or could produce wrong customer-facing output.
The change involves a destructive or non-idempotent operation that is likely to compound on retry.
The on-call engineer has not yet identified a root cause.
Choose roll-forward only when the fix is small, well-understood, already prepared, and the rollback path itself is risky (e.g., a database migration that cannot be cleanly reversed).
The decision is owned by the on-call engineer; the CTO is informed if the rollback affects customer-impacting services.
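The decision rule above can be sketched as a small function. This is an illustrative reading, not part of the procedure: the `ChangeState` fields and the returned labels are hypothetical names, and the "no-action" result for a change with no trigger is an assumption, since the procedure only describes what to do once a trigger fires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeState:
    """Hypothetical snapshot of what on-call knows; field names are illustrative."""
    alert_threshold_exceeded: bool = False   # error rate, latency, or saturation
    wrong_output_possible: bool = False      # correctness defect, actual or potential
    compounds_on_retry: bool = False         # destructive or non-idempotent operation
    root_cause_identified: bool = True
    fix_small: bool = False
    fix_well_understood: bool = False
    fix_prepared: bool = False
    rollback_path_risky: bool = False        # e.g. migration that cannot be cleanly reversed


def rollback_triggers(state):
    """Return the names of the default-to-rollback conditions that currently hold."""
    checks = {
        "alert threshold exceeded": state.alert_threshold_exceeded,
        "wrong customer-facing output possible": state.wrong_output_possible,
        "operation compounds on retry": state.compounds_on_retry,
        "root cause not identified": not state.root_cause_identified,
    }
    return [name for name, hit in checks.items() if hit]


def roll_forward_allowed(state):
    """Roll-forward needs all four conditions at once, not just some of them."""
    return all((state.fix_small, state.fix_well_understood,
                state.fix_prepared, state.rollback_path_risky))


def decide(state):
    """Return 'rollback', 'roll-forward', or 'no-action' for the given state."""
    if not rollback_triggers(state):
        return "no-action"  # assumption: the procedure does not cover this case
    if roll_forward_allowed(state):
        return "roll-forward"
    return "rollback"
```

For example, `decide(ChangeState(alert_threshold_exceeded=True))` yields `"rollback"`, because none of the roll-forward conditions are met. The function only encodes the checklist; the on-call engineer still owns the decision.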
Forward-compatible migrations (additive columns, new tables) — the revert is safe because the old code ignores the new schema.
Backwards-incompatible migrations (column rename, drop, type change) — do not run a reverse migration as part of rollback. Roll forward with a hotfix that tolerates the new schema, or isolate the failing surface behind a flag while the fix is prepared.
For any migration touching customer data, the CTO and the on-call DBA approve the rollback path before execution.
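As a rough sketch of the migration rules above, the function below maps a migration's operations to a rollback path and the approvals it needs. The operation names and return values are hypothetical labels rather than any migration tool's vocabulary, and treating unrecognised operations as backwards-incompatible is a conservative assumption.

```python
# Hypothetical operation labels; map your migration tool's operations onto them.
ADDITIVE_OPS = {"add_column", "create_table"}
BREAKING_OPS = {"rename_column", "drop_column", "change_column_type"}


def rollback_path(operations, touches_customer_data):
    """Pick a rollback path for a migration given as a list of operation names."""
    ops = set(operations)
    if ops <= ADDITIVE_OPS:
        # Old code ignores the new schema, so reverting the code is safe.
        action = "revert-code"
    else:
        # Any breaking (or unrecognised) operation: no reverse migration.
        action = "roll-forward-hotfix-or-flag"
    # Customer data requires sign-off on the path before execution.
    approvers = ["CTO", "on-call DBA"] if touches_customer_data else []
    return {"action": action, "approvers": approvers}
```

A migration that mixes additive and breaking operations falls into the stricter branch, so a single `drop_column` is enough to rule out a plain revert.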
If rollback restores a previous credential surface (e.g., the change rotated a secret), confirm with the Secrets Management procedure that the previous credential is still valid; if not, rotate forward to a fresh credential rather than rolling back.
For changes affecting customers, post a status-page update per Customer Communications. The on-call lead drafts; the CTO or CISO approves before posting.
Open a retrospective entry via the Emergency Change Retro intake within 24 hours. Document: timeline, blast radius, the trigger, the rollback action taken, and follow-ups (test gaps, monitoring gaps, runbook updates).
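A minimal sketch of how the retrospective requirements could be checked before submission. The section keys are hypothetical, and measuring the 24-hour window from the time of the rollback is an assumption; the procedure does not name the reference point.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical keys for the sections the retrospective must document.
REQUIRED_SECTIONS = ("timeline", "blast_radius", "trigger",
                     "rollback_action", "follow_ups")
RETRO_WINDOW = timedelta(hours=24)


def retro_due_by(rollback_time):
    """Latest time to open the retro, assuming the window starts at rollback."""
    return rollback_time + RETRO_WINDOW


def missing_sections(entry):
    """Return required sections that are absent or empty in a draft entry."""
    return [key for key in REQUIRED_SECTIONS if not entry.get(key)]
```

For instance, a draft containing only `timeline` and `trigger` would report `blast_radius`, `rollback_action`, and `follow_ups` as missing.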
If rollback is not feasible (e.g., destructive migration, irreversible third-party API call, customer-side state change), treat this as an incident under the Incident Response Plan and escalate to the CISO and CTO. The IR plan governs from that point.