Increased errors on GitLab.com during deploy of GitLab EE 10.3.0-rc3
Summary
On Wed, Dec 20th GitLab EE 10.3.0-rc3 was deployed to GitLab.com. Immediately after the deployment there were several issues with major customer impact:
- Patched Gitaly binary: A patched version of Gitaly was deployed to GitLab.com that was not compatible with version 10.3.0-rc3. To correct this we removed the custom Gitaly binary path so that Gitaly used the bundled omnibus version.
- Sidekiq database misconfiguration: The pgbouncers on the standby PostgreSQL servers were not configured to map the gitlabhq_production_sidekiq database to gitlabhq_production, and Sidekiq started to connect to the standbys in this release. To correct this the mapping configuration was updated on the standby PostgreSQL servers.
- Feature check loading the db: Increased load on the primary database was causing Sidekiq jobs to time out, resulting in processed jobs backing up. Checking the feature flag for Prometheus measurements was the cause of this additional load. To correct this a patch was applied to disable the feature flag check.
- CI/CD artifacts failing: Newly created artifacts stopped working for all customers on GitLab.com. Because of a change to where artifacts were stored on disk, they were no longer being written to the storage server but were instead being written locally. To correct this the artifact mountpoint was changed and artifacts that had been written locally to the api fleet were migrated to the storage server.
Patched Gitaly binary
Analysis
- How was the incident detected?
- Critical alert and page for increased error rates on Gitlab.com.
- Is there anything that could have been done to improve the time to detection?
- Alerting sooner on a Gitaly version mismatch.
- How was the root cause discovered?
- Discussion on Slack.
- Analyzing 500 errors in Sentry: https://sentry.gitlap.com/gitlab/gitlabcom/issues/113417/
- Was this incident triggered by a change?
- It was triggered by a configuration change where we were running a custom Gitaly binary instead of the omnibus version. The deployment of GitLab EE 10.3.0-rc3 triggered this because the release was not compatible with the custom binary.
- Was there an existing issue that would have either prevented this incident or reduced the impact?
- New version of Gitaly compiled with Go 1.9 in the omnibus package: gitlab-org/omnibus-gitlab#3046 (closed)
- Preprod, to create an environment in lockstep with production: preprod provisioning (multiple)
- Would it have been possible to have caught this issue on staging?
- If staging had been kept in lockstep configuration with production, with artificial load and continuous QA, we would have caught this before production.
Corrective Actions
- Block (or at the very least prompt) in takeoff if there is a custom binary set for Gitaly. https://gitlab.com/gitlab-org/takeoff/issues/42
- Version checking between clients and gitaly gitlab-org/gitaly#853 (closed)
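The corrective actions above can be sketched as a deploy-time preflight check. This is a hypothetical illustration, not takeoff's actual API: the attribute names (gitaly_binary_path, gitaly_version) and the config shape are assumptions made for the example.

```python
# Hypothetical preflight check of the kind the takeoff corrective action
# describes: refuse to proceed when a node is configured to run a custom
# Gitaly binary whose version differs from the one bundled in omnibus.
# Attribute names and config shape are illustrative, not takeoff's real API.

def gitaly_preflight(node_config, omnibus_version):
    """Return (ok, reason). node_config is a dict of chef-style attributes."""
    custom_path = node_config.get("gitaly_binary_path")  # None => omnibus default
    if custom_path is None:
        return True, "using bundled omnibus Gitaly"
    custom_version = node_config.get("gitaly_version")
    if custom_version != omnibus_version:
        return False, (
            f"custom Gitaly at {custom_path} is {custom_version}, "
            f"but this release bundles {omnibus_version}; blocking deploy"
        )
    return True, "custom Gitaly matches bundled version"
```

A deploy tool could run this per node and either block outright or prompt the operator, as the corrective action suggests.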
Sidekiq database misconfiguration
Analysis
- How was the incident detected?
- Errors in Sentry; inspecting the pgbouncer logs revealed the missing-database errors.
- Is there anything that could have been done to improve the time to detection?
- There may be specific alarms related to pgbouncer; emitting metrics for errors in the pgbouncer logs could have directed us to the misconfiguration sooner.
- How was the root cause discovered?
- By inspecting the pgbouncer logs and searching through issues to understand why we were using a separate db for sidekiq.
- Was this incident triggered by a change?
- The new version of GitLab, in which Sidekiq started connecting to this database, was the change that triggered this.
- Was there an existing issue that would have either prevented this incident or reduced the impact?
- No.
- Would it have been possible to have caught this issue on staging?
- If staging had been kept in lockstep configuration with production, with artificial load and continuous QA, we would have caught this before production.
Corrective Actions
- Alert on pgbouncer errors https://gitlab.com/gitlab-com/infrastructure/issues/3442
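A minimal sketch of the log-based detection this corrective action calls for: scan pgbouncer log lines for "no such database" errors and count them per database name, so a metric or alert could fire on the misconfigured mapping. The exact log line layout is an assumption based on pgbouncer's error wording; only the error phrase is matched.

```python
# Count "no such database" errors per database name in pgbouncer logs.
# The surrounding log format is assumed; we match only the error phrase.
import re
from collections import Counter

NO_DB = re.compile(r"no such database: (?P<db>\S+)")

def count_missing_db_errors(log_lines):
    """Return a Counter mapping database name -> number of error lines."""
    counts = Counter()
    for line in log_lines:
        m = NO_DB.search(line)
        if m:
            counts[m.group("db")] += 1
    return counts
```

Emitting these counts as a metric would have pointed at the missing gitlabhq_production_sidekiq mapping much sooner than manual log inspection.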
Feature check loading the db
Analysis
- How was the incident detected?
- Multiple alerts for the database and sidekiq jobs piling up.
- Is there anything that could have been done to improve the time to detection?
- We could have better, more specific alerting.
- How was the root cause discovered?
- Looking at graphs showing that job time had increased, and looking at Postgres tuple statistics.
- Was this incident triggered by a change?
- Deployment of GitLab EE 10.3.0-rc3 triggered this incident.
- Was there an existing issue that would have either prevented this incident or reduced the impact?
- Yes, this change had not yet made it into the release:
- Would it have been possible to have caught this issue on staging?
- If staging had been kept in lockstep configuration with production, with artificial load and continuous QA, we would have caught this before production.
Corrective Actions
- Alerting for spike in sequential reads https://gitlab.com/gitlab-com/infrastructure/issues/3443
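The load in this incident came from checking a feature flag in the database on every job. Beyond disabling the check, one generic mitigation is to memoize flag lookups with a short TTL so the primary sees one query per flag per interval instead of one per job. This is an illustrative sketch, not GitLab's actual flag API; the lookup function is a stand-in for the expensive database check.

```python
# Wrap an expensive per-flag lookup in a TTL cache so repeated checks
# (e.g. one per Sidekiq job) hit the database at most once per interval.
# `lookup` is a stand-in for the real flag check; not GitLab's API.
import time

def cached_flag_checker(lookup, ttl=60.0, clock=time.monotonic):
    cache = {}  # flag name -> (value, fetched_at)

    def enabled(flag):
        hit = cache.get(flag)
        if hit is not None and clock() - hit[1] < ttl:
            return hit[0]            # fresh cached value, no db round trip
        value = lookup(flag)         # the expensive database check
        cache[flag] = (value, clock())
        return value

    return enabled
```

The trade-off is that flag flips take up to `ttl` seconds to propagate, which is usually acceptable for a measurement flag.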
CI/CD artifacts failing
Analysis
- How was the incident detected?
- Users reporting issues with broken artifacts
- Is there anything that could have been done to improve the time to detection?
- Monitoring of artifact errors
- How was the root cause discovered?
- A GitLab developer realized that a change in this release altered the location of artifacts on disk. Investigating the API nodes confirmed this theory.
- Was this incident triggered by a change?
- Deployment of GitLab EE 10.3.0-rc3 triggered this incident
- Was there an existing issue that would have either prevented this incident or reduced the impact?
- No.
- Would it have been possible to have caught this issue on staging?
- If staging had been kept in lockstep configuration with production, with artificial load and continuous QA, we would have caught this before production.
Corrective Actions
- Attach runners to staging and have a way to validate artifact related configuration. https://gitlab.com/gitlab-com/infrastructure/issues/3439
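One piece of the configuration validation this corrective action asks for can be sketched as a simple mount check: before returning a node to rotation, verify that the artifacts directory actually sits on a mounted filesystem (the NFS share) rather than on the local / partition. The path here is illustrative; the real omnibus path may differ.

```python
# Verify the artifacts directory is on a mounted filesystem rather than
# the local root partition. The default path is an assumption, not
# necessarily the real omnibus layout.
import os

def artifacts_on_mount(path="/var/opt/gitlab/gitlab-rails/shared/artifacts"):
    """True only if `path` exists and is itself a mountpoint."""
    return os.path.isdir(path) and os.path.ismount(path)
```

Run as part of a node health check, this would have flagged the api fleet writing artifacts locally before users reported broken downloads.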
Timeline
- 08:50 - deploy of GitLab EE 10.3.0-rc3 started
- 09:23 - deploy migrations completed
- 09:25 - alert for increased error rate
- 09:31 - report of increased error rate internally on slack
- 09:37 - verify the running gitaly version with the gitaly team
- 09:37 - tweet sent about increased error rate
- 09:45 - gitaly version switched to the omnibus location
- 09:47 - pingdom reports gitlab.com is down
- 09:48 - pingdom page for gitlab.com back up
- 09:55 - unicorn hup sent to front end fleet
- 10:00 - unicorn hup and restart sent to sidekiq fleet
- 10:00 - alert for increased error rate cleared
- 10:08 - deploy continues
- 10:21 - PD alert for a PostgreSQL replication slot with a stale xmin, which can cause bloat on the primary
- 10:35 - stale xmin alert cleared
- 10:35 - alert for Sidekiq: large number of queued ProcessCommitWorker jobs (6585); we believe this is due to database load
- 11:12 - @stanhu notices that feature_gates is being hammered
- 12:00 - We update the pgbouncer config on 01 and 02 so that it has the proper db mapping for sidekiq. Not sure if this is related, but there are not as many errors in the logs now:
gitlabhq_production = host=127.0.0.1 port=5432 auth_user=pgbouncer
gitlabhq_production_sidekiq = host=127.0.0.1 port=5432 pool_size=150 auth_user=pgbouncer dbname=gitlabhq_production
- 12:07 - Considering a patch on production https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/15800/diffs (slack)
- 12:41 - Applied patch to sidekiq - https://dev.gitlab.org/gitlab/post-deployment-patches/merge_requests/18
- 13:07 - We see that there is an error spike in the object storage upload worker. This does not have customer impact but should be disabled; applied a config update:
-gitlab_rails['artifacts_object_store_enabled'] = true
+gitlab_rails['artifacts_object_store_enabled'] = false
- 13:19 - Restarted sidekiq besteffort on prod for the configuration update above.
- 13:22 - Deleted the object_storage_upload queue to remove pending jobs.
- 14:30 - The artifacts_object_store_enabled change did have customer impact, as reported below in the comments; reverted the configuration change and forced chef-client runs to update.
- 14:40 - Started to distribute the go-1.9 version of gitaly-0.58.0 across the fleet, src: https://gitlab.com/gitlab-com/infrastructure/issues/3392#note_52100255
- 15:02 - Gitaly is updated across the fleet.
- 15:41 - We are applying this configuration change to disable artifact uploading https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1423 . We believe the reason for the artifact download problem is that artifacts are being deleted on disk after the object storage upload fails.
"artifacts_enabled": true,
"artifacts_object_store_enabled": true,
"artifacts_object_store_remote_directory": "gitlab-artifacts",
"artifacts_object_store_background_upload": false,
- 16:04 - Configuration change has been applied across the GitLab.com fleet, setting artifacts_object_store_background_upload = false
- 16:30 - We've identified an issue where the artifact location on disk changed in the new release. This is resulting in artifacts being written to the / partition instead of the shared NFS server. We are working to resolve this by changing the mountpoint on the api fleet and migrating the existing artifacts.
- 16:30 - Testing: pulling api-03 out of rotation and rsyncing disks for validation.
- 16:41 - api-03 synced and placed back in rotation; monitoring, all results positive so far.
- 16:43 - Verified that all web and sidekiq nodes are pointing at the right mount locations.
- 16:46 - Start sync of all api-xx servers in groups of three at a time.
- 17:12 - Half of the API fleet returned to production with correct mounts.
- 17:53 - All API servers returned to production and verified.
- 17:54 - Tweeted "Artifacts access has been restored on GitLab.com"
Edited by John Jarvis