
Increased errors on GitLab.com during deploy of GitLab EE 10.3.0-rc3

Summary

On Wednesday, December 20th, GitLab EE 10.3.0-rc3 was deployed to GitLab.com. Immediately after the deployment there were several issues with major customer impact:

  1. Patched Gitaly binary: A patched version of Gitaly was deployed to GitLab.com that was not compatible with version 10.3.0-rc3. To correct this, we removed the custom Gitaly binary path so that Gitaly used the bundled omnibus version.
  2. Sidekiq database misconfiguration: The pgbouncers on the standby PostgreSQL servers were not configured to map the gitlabhq_production_sidekiq database alias to the gitlabhq_production database, and Sidekiq started connecting to the standbys in this release. To correct this, the mapping configuration was updated on the standby PostgreSQL servers (the mapping is shown in the timeline below).
  3. Feature check loading the db: There was increased load on the primary database that was causing Sidekiq jobs to time out, resulting in jobs backing up. Checking the feature flag for Prometheus measurements was the cause of this additional load. To correct this, a patch was applied to disable the feature flag check (see the sketch after this list).
  4. CI/CD artifacts failing: New artifacts stopped working for all customers on GitLab.com. Because of a change in this release to where artifacts are stored on disk, they were no longer being written to the storage server but were instead being written locally. To correct this, the artifact mountpoint was changed and the artifacts that had been written locally on the API fleet were migrated to the storage server.
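
A minimal, self-contained sketch (not GitLab's actual code) of why the per-call feature flag check in item 3 loads the database, and the effect of the patch that disabled it. GitLab's Feature.enabled? helper is backed by the feature_gates table; the stub below stands in for that lookup so the example runs on its own, and the flag name is illustrative only.

    # Stub standing in for GitLab's Feature API; each enabled? call represents
    # one query against the feature_gates table on the primary database.
    class Feature
      @queries = 0
      class << self
        attr_reader :queries
        def enabled?(_name)
          @queries += 1 # stands in for a SELECT against feature_gates
          false
        end
      end
    end

    # Before the patch: every measurement re-checks the flag, one query per
    # call, which multiplied across the Sidekiq fleet keeps load on the primary.
    10_000.times { Feature.enabled?(:prometheus_measurements) }
    puts Feature.queries # => 10000

    # The applied patch skipped this check (returning a constant), so no
    # per-call query is issued.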

Patched Gitaly binary

Analysis

  • How was the incident detected?
      • Critical alert and page for increased error rates on GitLab.com.
  • Is there anything that could have been done to improve the time to detection?
      • Alerting sooner on a Gitaly version mismatch.
  • How was the root cause discovered?
  • Was this incident triggered by a change?
      • It was triggered by a configuration change: we were running a custom Gitaly binary instead of the omnibus version, and the deployment of GitLab EE 10.3.0-rc3 was not compatible with the custom binary.
  • Was there an existing issue that would have either prevented this incident or reduced the impact?
  • Would it have been possible to have caught this issue on staging?
      • If staging had been kept in lockstep configuration with production, with artificial load and continuous QA, we would have caught this before production.

Corrective Actions

Sidekiq database misconfiguration

Analysis

  • How was the incident detected?
      • Errors in Sentry, and inspecting the pgbouncer logs and noticing the missing database errors.
  • Is there anything that could have been done to improve the time to detection?
      • There may be specific alarms related to pgbouncer; we could possibly emit metrics for errors in the pgbouncer logs that would have directed us to the misconfiguration sooner.
  • How was the root cause discovered?
      • By inspecting the pgbouncer logs and searching through issues to understand why we were using a separate database for Sidekiq.
  • Was this incident triggered by a change?
      • The new version of GitLab, in which Sidekiq started connecting to this database, was the change that triggered this.
  • Was there an existing issue that would have either prevented this incident or reduced the impact?
      • No.
  • Would it have been possible to have caught this issue on staging?
      • If staging had been kept in lockstep configuration with production, with artificial load and continuous QA, we would have caught this before production.

Corrective Actions

Feature check loading the db

Analysis

  • How was the incident detected?
      • Multiple alerts for the database and for Sidekiq jobs piling up.
  • Is there anything that could have been done to improve the time to detection?
      • We could have better, more specific alerting.
  • How was the root cause discovered?
      • By looking at graphs and seeing that job time had increased, and by looking at Postgres tuple statistics.
  • Was this incident triggered by a change?
      • The deployment of GitLab EE 10.3.0-rc3 triggered this incident.
  • Was there an existing issue that would have either prevented this incident or reduced the impact?
  • Would it have been possible to have caught this issue on staging?
      • If staging had been kept in lockstep configuration with production, with artificial load and continuous QA, we would have caught this before production.

Corrective Actions

CI/CD artifacts failing

Analysis

  • How was the incident detected?
      • Users reporting issues with broken artifacts.
  • Is there anything that could have been done to improve the time to detection?
      • Monitoring of artifact errors.
  • How was the root cause discovered?
      • A GitLab developer realized that a change in this release moved the location of artifacts on disk. Investigating the API nodes confirmed this theory.
  • Was this incident triggered by a change?
      • The deployment of GitLab EE 10.3.0-rc3 triggered this incident.
  • Was there an existing issue that would have either prevented this incident or reduced the impact?
      • No.
  • Would it have been possible to have caught this issue on staging?
      • If staging had been kept in lockstep configuration with production, with artificial load and continuous QA, we would have caught this before production.

Corrective Actions

Timeline

  • 08:50 - deploy of GitLab EE 10.3.0-rc3 started
  • 09:23 - deploy migrations completed
  • 09:25 - alert for increased error rate
  • 09:31 - report of increased error rate internally on slack
  • 09:37 - verify the running gitaly version with the gitaly team
  • 09:37 - tweet sent about increased error rate
  • 09:45 - gitaly version switched to the omnibus location
  • 09:47 - pingdom reports gitlab.com is down
  • 09:48 - pingdom page for gitlab.com back up
  • 09:55 - unicorn hup sent to front end fleet
  • 10:00 - unicorn hup and restart sent to sidekiq fleet
  • 10:00 - alert for increased error rate cleared
  • 10:08 - deploy continues
  • 10:21 - PD alert for a PostgreSQL replication slot with a stale xmin, which can cause bloat on the primary
  • 10:35 - stale xmin alert cleared
  • 10:35 - alert for Sidekiq: large number of queued ProcessCommitWorker jobs (6585); we believe this is due to database load
  • 11:12 - @stanhu notices that the feature_gates table is being hammered
  • 12:00 - We update the pgbouncer config on 01 and 02 so that it has the proper database mapping for Sidekiq. Not sure if this is related, but there are not as many errors in the logs now:
gitlabhq_production = host=127.0.0.1 port=5432 auth_user=pgbouncer
gitlabhq_production_sidekiq = host=127.0.0.1 port=5432 pool_size=150 auth_user=pgbouncer dbname=gitlabhq_production
    -gitlab_rails['artifacts_object_store_enabled'] = true
    +gitlab_rails['artifacts_object_store_enabled'] = false
  • 13:19 - Restarted sidekiq besteffort on prod for the configuration update above.
  • 13:22 - Deleted the object_storage_upload queue to remove pending jobs.
  • 14:30 - The artifacts_object_store_enabled change did have customer impact, as reported below in the comments; reverted the configuration change and forced chef-client runs to update.
  • 14:40 - Started to distribute the go-1.9 build of gitaly-0.58.0 across the fleet, src: https://gitlab.com/gitlab-com/infrastructure/issues/3392#note_52100255
  • 15:02 - Gitaly is updated across the fleet.
  • 15:41 - We are applying this configuration change to disable artifact uploading: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1423. We believe the reason for the artifact download problem is that artifacts are being deleted on disk after the object storage upload fails.
  "artifacts_enabled": true,
          "artifacts_object_store_enabled": true,
          "artifacts_object_store_remote_directory": "gitlab-artifacts",
          "artifacts_object_store_background_upload": false,
  • 16:04 - The configuration change setting artifacts_object_store_background_upload = false has been applied across the GitLab.com fleet
  • 16:30 - We've identified an issue where the artifact location on disk changed in the new release. This is resulting in artifacts being written to the / partition instead of the shared NFS server. We are working to resolve this by changing the mountpoint on the api fleet and migrating the existing artifacts (a configuration sketch follows the timeline).
  • 16:30 - Testing pulling api-03 out of rotation and rsyncing its disks for validation
  • 16:41 - api-03 synced and placed back in rotation; monitoring, all results positive so far.
  • 16:43 - Verified that all web and Sidekiq nodes are pointing at the right mount locations.
  • 16:46 - Start sync of all api-xx servers in groups of three at a time.
  • 17:12 - Half of API fleet returned to production with correct mounts.
  • 17:53 - All API servers returned to production and verified.
  • 17:54 - Tweeted "Artifacts access has been restored on GitLab.com"
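
The mountpoint fix above came down to making sure the Rails artifacts path on the API fleet resolved to the shared storage mount rather than the local disk. A minimal gitlab.rb sketch of that idea, assuming the artifacts_path setting and an illustrative mountpoint (the exact paths used on GitLab.com are not shown in this issue):

    # Keep artifacts on the shared storage mount; if this path does not resolve
    # to the mounted filesystem, new artifacts land on the local / partition.
    gitlab_rails['artifacts_enabled'] = true
    gitlab_rails['artifacts_path'] = '/mnt/storage/artifacts' # illustrative mountpoint, not the production path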