2021-07-15: increase in 500 errors - each UNION query must have the same number of columns
Current Status
A database migration caused an inconsistent version of the database schema to be cached in Rails processes. As a result, some processes generated SQL queries that were invalid against the new schema. Restarting the Rails processes forced them to reload the schema and resolved the errors.
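For reference, the same cache can also be refreshed in place with standard ActiveRecord APIs. A minimal sketch, assuming a `Member` model backed by the affected `members` table (illustrative only, not the remediation we ran):

```ruby
# Minimal sketch: force one Rails process to drop its cached view of the schema.
# Member stands in for the model backed by the affected members table.
ActiveRecord::Base.connection.schema_cache.clear! # drop the connection-level cache
Member.reset_column_information                   # make the model re-read its columns

Member.column_names.include?("invite_email_success") # => true once the cache is fresh
```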
Timeline
View recent production deployment and configuration events / gcp events (internal only)
All times UTC.
2021-07-15
- 18:28 - Migration job for canary started
- 18:31 - Observed large spike in 500 errors
- 18:35 - Identified malformed queries leading to errors
- 18:42 - Posted first event to the Status Page (customer comms start)
- 18:45 - Identified that the errors were triggering a very high volume of rolled-back transactions in Postgres
- 18:53 - Identified a time correlation between the migrations and the errors
- 18:54 - Confirming with the DB team whether there were any PK migration changes
- 19:00 - Reached out to a wider group of development teams, specifically the database team, for additional observations
- 19:05 - Identified that clearing Rails schema cache via restart may resolve the errors
- 19:10 - Completed restart and confirmation of fix on first web node
- 19:10 - Started rolling restarts through the Puma fleet and Kubernetes nodes so all services would pick up the new schema (web, Sidekiq, API, web sockets, Git)
- 19:14 - Kibana logs indicate 500s are now almost completely gone for the web fleet; seeing improvement in the API as well
- 19:20 - Incident marked as mitigated
Corrective Actions
Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Set up a mixed deployment test environment
- Revert the MR that added the new column to `members`, to prevent similar issues with self-managed installs
- Correct the queries that triggered the described incident
- Detect possible DB query issues in CI/test
- Remove the `ActiveRecord::QueryMethods#build_select` override
- Create the ability to restart Puma on Kubernetes-deployed services
- Consider a RuboCop rule that avoids building a subquery with `attribute_names` (see the sketch below)
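To illustrate that last item, a hypothetical sketch of the pattern such a rule would flag: one side of a UNION built from `attribute_names` (which reads the per-process schema cache) while the other side uses `members.*`, so the two select lists can disagree after a migration. The scopes here are illustrative, not the exact queries from the incident:

```ruby
# Hypothetical sketch of the pattern a RuboCop rule could flag.
# Member.attribute_names is read from the per-process schema cache, so after a
# migration it can describe a different column set than members.* resolves to.
enumerated = Member.select(Member.attribute_names.map { |c| "members.#{c}" })
wildcard   = Member.select("members.*")

# When the two sides disagree on column count, Postgres rejects the UNION
# and Rails raises ActiveRecord::StatementInvalid.
Member.from("(#{enumerated.to_sql} UNION #{wildcard.to_sql}) members")
```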
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, laid out in our handbook page. This might include the summary, timeline, or other bits of information. Any such confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
- Service(s) affected: Web, API, and Sidekiq
- Team attribution:
- Time to detection: ~3m
- Minutes downtime or degradation: ~49
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All customers of GitLab.com
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - 500-style errors when accessing projects via the web, and on API calls for group information, among other calls that referenced the problem table in the database. Sometimes this manifested as a 5xx error on an API request verifying whether a user had proper credentials.
- How many customers were affected?
  - Via web rails, 18,560 unique user accounts encountered the `ActiveRecord::StatementInvalid` exception during this incident.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
- What were the root causes?
  - Several queries on the `members` table mixed `SELECT members.*` with a `UNION` statement that enumerated the column names explicitly (for instance `members.id, members.created_at`, etc.). The `SELECT members.*` relied on a cached representation of the columns (without our newly added `invite_email_success` column), and that didn't match the explicitly enumerated version that had the new `invite_email_success` column.
  - The addition of `invite_email_success` in gitlab-org/gitlab!65078 (merged) triggered the above condition and produced the invalid queries that caused the `500` errors.
  - Taken from https://gitlab.com/gitlab-org/gitlab/-/issues/333562#note_630076014
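To make the failure mode concrete, a minimal sketch of the failing shape; the column lists are illustrative except for `invite_email_success`, and the real queries were generated by ActiveRecord:

```ruby
# Minimal sketch of the failing UNION. One side reflects a stale cache that
# predates the migration; the other includes the new invite_email_success column.
stale = %w[members.id members.created_at]
fresh = %w[members.id members.created_at members.invite_email_success]

sql = <<~SQL
  SELECT #{stale.join(', ')} FROM members
  UNION
  SELECT #{fresh.join(', ')} FROM members
SQL

# Postgres aborts with "ERROR: each UNION query must have the same number of
# columns", which Rails surfaces as ActiveRecord::StatementInvalid.
ActiveRecord::Base.connection.execute(sql)
```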
Incident Response Analysis
- How was the incident detected?
  - An incident was declared via the woodhouse tool at 18:34 UTC. This was most likely a manual observation of the symptoms.
- How could detection time be improved?
  - The manually created incident was the first alert, but more notifications were sent to the EOC shortly after it. Our time to detection was roughly 3 minutes.
- How was the root cause diagnosed?
  - By examining failed database queries and testing the hypothesis that the Rails DB schema cache was stale after the canary migrations. The test was to restart/HUP a single web node and monitor for the specific exceptions (a non-restart check for cache drift is sketched after this list).
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - Once a single-node restart confirmed the hypothesis that the DB schema cache was out of date, a plan was put in place to restart all of the affected VM and GKE instances.
- How could time to mitigation be improved?
  - Executing rolling restarts of the VM fleet was simple enough, but our GKE tooling could be improved so that restarts can be initiated and run more easily, with less chance for error.
- What went well?
  - Many eyes on the problem made running down information quick. Having a Google Doc up quickly, with good documentation of our actions, is very helpful when reviewing the incident.
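Expanding on the diagnosis step above, one could also check for cache drift from inside a running process instead of restarting it. A rough sketch, assuming that comparing the cached and live column lists is sufficient (the APIs are standard ActiveRecord; we did not use this during the incident):

```ruby
# Rough sketch: detect schema-cache drift without restarting the process.
# Member.column_names reads the per-process schema cache, while asking the
# connection adapter for columns issues a fresh query against the database.
cached = Member.column_names.sort
live   = ActiveRecord::Base.connection.columns("members").map(&:name).sort

drift = live - cached
warn "schema cache stale; missing columns: #{drift.join(', ')}" if drift.any?
```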
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - There have been past incidents caused by a migration in Canary creating a production schema problem, but in this case the specific problem appears to have been the schema cache.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Set up a mixed deployment test environment - having a test environment that reproduces the same cny/main stages of GPRD could help surface these issues before they affect GitLab.com users
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
Lessons Learned
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)
- Google Doc