2020-09-01 users are unable to change subscriptions
Summary
Users were unable to change subscriptions: changing a plan returned 500 errors.
The errors occurred only while canary and production were out of sync. The canary deployment included a database migration, and the code running in production was not compatible with that change.
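The failure mode described above can be illustrated with a minimal sketch. It is not the actual migration from gitlab-org/gitlab!40470: the table names, the rename, and the helper functions are all hypothetical, and an in-memory SQLite database stands in for the shared production database. It only shows how code written against the old schema starts failing once a newer deployment has migrated the database both versions share.

```python
import sqlite3

# All names below are hypothetical; the real schema change is not reproduced here.

def old_change_plan(db, user_id, plan):
    # Stand-in for "production" code, written against the pre-migration schema.
    db.execute("UPDATE subscriptions SET plan = ? WHERE user_id = ?", (plan, user_id))

def new_change_plan(db, user_id, plan):
    # Stand-in for "canary" code, written against the post-migration schema.
    db.execute("UPDATE plan_subscriptions SET plan = ? WHERE user_id = ?", (plan, user_id))

def canary_migration(db):
    # Hypothetical migration shipped with the canary deployment.
    db.execute("ALTER TABLE subscriptions RENAME TO plan_subscriptions")

def try_change(change_plan, db):
    # Map a database error to the kind of 500 response a user would see.
    try:
        change_plan(db, 1, "gold")
        return "ok"
    except sqlite3.OperationalError as exc:
        return f"500: {exc}"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE subscriptions (user_id INTEGER PRIMARY KEY, plan TEXT)")
db.execute("INSERT INTO subscriptions VALUES (1, 'free')")

before = try_change(old_change_plan, db)     # old code, old schema
canary_migration(db)                         # canary deploys and migrates the shared database
after_old = try_change(old_change_plan, db)  # old code, new schema
after_new = try_change(new_change_plan, db)  # new code, new schema

print(before, after_old, after_new, sep="\n")
```

In this sketch the old code succeeds before the migration and fails afterwards, while the new code works against the migrated schema. This matches the observation that the errors disappeared once canary and production ran the same version.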
Timeline
All times UTC.
2020-08-31
- 17:49 - canary deployment finishes; the deployment includes this change: gitlab-org/gitlab!40470 (merged). 500 errors are thrown when changing a plan: https://gitlab.com/gitlab-org/gitlab/-/issues/243616
2020-09-01
- 01:00 - canary is promoted to production, errors stop
- 07:05 - a developer suspects that a change which landed in canary but not in production is causing the issues, and reaches out to sre-oncall to ask for canary to be promoted to production
- 07:27 - mwasilewski declares an incident in Slack using the /incident declare command; the scope of impact is undetermined at this point
- 07:34 - prod and canary are confirmed to be in sync, 500 errors are confirmed to be gone
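The intervals implied by the timeline above can be worked out with a short calculation. This is a sketch under one assumption: that the errors began when the canary deployment finished (17:49) and ended when canary was promoted (01:00). The variable names are illustrative, and the result is an estimate taken from the timeline, not a measured downtime figure.

```python
from datetime import datetime, timezone

def minutes_between(start, end):
    """Whole minutes elapsed between two timezone-aware datetimes."""
    return int((end - start).total_seconds() // 60)

# Timestamps taken from the timeline (all UTC).
canary_deployed = datetime(2020, 8, 31, 17, 49, tzinfo=timezone.utc)
canary_promoted = datetime(2020, 9, 1, 1, 0, tzinfo=timezone.utc)
incident_declared = datetime(2020, 9, 1, 7, 27, tzinfo=timezone.utc)
sync_confirmed = datetime(2020, 9, 1, 7, 34, tzinfo=timezone.utc)

# Assumed error window: canary deployment finished -> canary promoted.
error_window = minutes_between(canary_deployed, canary_promoted)

# Time from incident declaration to confirmation that the errors were gone.
declared_to_confirmed = minutes_between(incident_declared, sync_confirmed)

print(f"estimated error window: {error_window} min")
print(f"declaration to confirmation: {declared_to_confirmed} min")
```

Under that assumption the error window is 431 minutes (7 h 11 min), and confirmation followed the declaration by 7 minutes.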
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
5 Whys
Lessons Learned
Corrective Actions
Guidelines
Edited by Michal Wasilewski