2021-07-15: increase in 500 errors - each UNION query must have the same number of columns
Current Status
A database migration caused an inconsistent version of the database schema to be cached in Rails processes. As a result, some processes generated SQL queries that were invalid against the new schema. Restarting the Rails processes forced them to reload the schema and resolved the errors.
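For reference, the same cache can also be refreshed in place with standard ActiveRecord APIs. A minimal sketch, assuming a `Member` model backed by the affected `members` table (illustrative only, not the remediation we ran):

```ruby
# Minimal sketch: force one Rails process to drop its cached view of the schema.
# Member stands in for the model backed by the affected members table.
ActiveRecord::Base.connection.schema_cache.clear! # drop the connection-level cache
Member.reset_column_information                   # make the model re-read its columns

Member.column_names.include?("invite_email_success") # => true once the cache is fresh
```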
Timeline
View recent production deployment and configuration events / gcp events (internal only)
All times UTC.
2021-07-15
- 18:28 - Migration job for canary started
- 18:31 - Observed large spike in 500 errors
- 18:35 - Identified malformed queries leading to errors
- 18:42 - Posted first event to the Status Page (customer comms start)
- 18:45 - Identified that the errors were triggering a very high volume of rolled-back transactions in Postgres
- 18:53 - Identified a time correlation between the migrations and the errors
- 18:54 - Confirming with the DB team whether there were any PK migration changes
- 19:00 - Reached out to a wider group of development teams, specifically the database team, for additional observations
- 19:05 - Identified that clearing Rails schema cache via restart may resolve the errors
- 19:10 - Completed restart and confirmation of fix on first web node
- 19:10 - Started rolling restarts through the Puma fleet and Kubernetes nodes so all services would pick up the new schema (web, Sidekiq, API, web sockets, Git)
- 19:14 - Kibana logs indicate 500s are now almost completely gone for the web fleet; seeing improvement in the API as well
- 19:20 - Incident marked as mitigated
Corrective Actions
Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Set up a mixed deployment test environment
- Revert the MR that added the new column to `members`, to prevent similar issues with self-managed installs
- Correct the queries that triggered the described incident
- Detect possible DB query issues in CI/test
- Remove the `ActiveRecord::QueryMethods#build_select` override
- Create the ability to restart Puma on Kubernetes-deployed services
- Consider a RuboCop rule that avoids building a subquery with `attribute_names` (see the sketch below)
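To illustrate that last item, a hypothetical sketch of the pattern such a rule would flag: one side of a UNION built from `attribute_names` (which reads the per-process schema cache) while the other side uses `members.*`, so the two select lists can disagree after a migration. The scopes here are illustrative, not the exact queries from the incident:

```ruby
# Hypothetical sketch of the pattern a RuboCop rule could flag.
# Member.attribute_names is read from the per-process schema cache, so after a
# migration it can describe a different column set than members.* resolves to.
enumerated = Member.select(Member.attribute_names.map { |c| "members.#{c}" })
wildcard   = Member.select("members.*")

# When the two sides disagree on column count, Postgres rejects the UNION
# and Rails raises ActiveRecord::StatementInvalid.
Member.from("(#{enumerated.to_sql} UNION #{wildcard.to_sql}) members")
```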
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, laid out in our handbook page. This might include the summary, timeline, or other bits of information. Any such confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
- Service(s) affected: Web, API, and Sidekiq
- Team attribution:
- Time to detection: ~3m
- Minutes downtime or degradation: ~49
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All customers of GitLab.com
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - 500-style errors when accessing projects via the web, and on API calls for group information, among other calls that referenced the problem table in the database. Sometimes this manifested as a 5xx error on an API request verifying whether a user had proper credentials.
- How many customers were affected?
  - Via web rails, 18,560 unique user accounts encountered the `ActiveRecord::StatementInvalid` exception during this incident.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
- What were the root causes?
  - Several queries on the `members` table mixed `SELECT members.*` with a `UNION` statement that enumerated the column names explicitly (for instance `members.id, members.created_at`, etc.). The `SELECT members.*` relied on a cached representation of the columns (without our newly added `invite_email_success` column), and that didn't match the explicitly enumerated version that had the new `invite_email_success` column.
  - The addition of `invite_email_success` in gitlab-org/gitlab!65078 (merged) triggered the above condition and produced the invalid queries that caused the `500` errors.
  - Taken from https://gitlab.com/gitlab-org/gitlab/-/issues/333562#note_630076014
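To make the failure mode concrete, a minimal sketch of the failing shape; the column lists are illustrative except for `invite_email_success`, and the real queries were generated by ActiveRecord:

```ruby
# Minimal sketch of the failing UNION. One side reflects a stale cache that
# predates the migration; the other includes the new invite_email_success column.
stale = %w[members.id members.created_at]
fresh = %w[members.id members.created_at members.invite_email_success]

sql = <<~SQL
  SELECT #{stale.join(', ')} FROM members
  UNION
  SELECT #{fresh.join(', ')} FROM members
SQL

# Postgres aborts with "ERROR: each UNION query must have the same number of
# columns", which Rails surfaces as ActiveRecord::StatementInvalid.
ActiveRecord::Base.connection.execute(sql)
```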
Incident Response Analysis
- How was the incident detected?
  - An incident was declared via the woodhouse tool at 18:34 UTC. This was most likely a manual observation of the symptoms.
- How could detection time be improved?
  - The manually created incident was the first alert, but more notifications were sent to the EOC shortly after it. Our time to detection was roughly 3 minutes.
- How was the root cause diagnosed?
  - By examining failed database queries and testing the hypothesis that the Rails DB schema cache was stale after the canary migrations. The test was to restart/HUP a single web node and monitor for the specific exceptions (a non-restart check for cache drift is sketched after this list).
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - Once a single-node restart confirmed the hypothesis that the DB schema cache was out of date, a plan was put in place to restart all of the affected VM and GKE instances.
- How could time to mitigation be improved?
  - Executing rolling restarts of the VM fleet was simple enough, but our GKE tooling could be improved so that restarts can be initiated and run more easily, with less chance for error.
- What went well?
  - Many eyes on the problem made running down information quick. Having a Google Doc up quickly, with good documentation of our actions, is very helpful when reviewing the incident.
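Expanding on the diagnosis step above, one could also check for cache drift from inside a running process instead of restarting it. A rough sketch, assuming that comparing the cached and live column lists is sufficient (the APIs are standard ActiveRecord; we did not use this during the incident):

```ruby
# Rough sketch: detect schema-cache drift without restarting the process.
# Member.column_names reads the per-process schema cache, while asking the
# connection adapter for columns issues a fresh query against the database.
cached = Member.column_names.sort
live   = ActiveRecord::Base.connection.columns("members").map(&:name).sort

drift = live - cached
warn "schema cache stale; missing columns: #{drift.join(', ')}" if drift.any?
```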
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - There have been past incidents caused by a migration in Canary creating a production schema problem, but in this case the specific problem appears to have been the schema cache.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Set up a mixed deployment test environment - having a test environment that reproduces the same cny/main stages of GPRD could help surface these issues before they affect GitLab.com users
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
Lessons Learned
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)
- Google Doc