
Increased errors on GitLab.com during deploy of GitLab EE 10.3.0-rc3

Summary

On Wednesday, December 20th, GitLab EE 10.3.0-rc3 was deployed to GitLab.com. Immediately after the deployment there were several issues with major customer impact:

  1. Patched Gitaly binary: A patched version of Gitaly was deployed to GitLab.com that was not compatible with version 10.3.0-rc3. To correct this, we removed the custom Gitaly binary path so that Gitaly used the bundled omnibus version.
  2. Sidekiq database misconfiguration: The pgbouncers on the standby PostgreSQL servers were not configured to map the gitlabhq_production_sidekiq database alias to the gitlabhq_production database, and Sidekiq started connecting to the standbys in this release. To correct this, the mapping configuration was updated on the standby PostgreSQL servers (the mapping is shown in the timeline below).
  3. Feature check loading the db: There was increased load on the primary database that was causing Sidekiq jobs to time out, resulting in jobs backing up. Checking the feature flag for Prometheus measurements was the cause of this additional load. To correct this, a patch was applied to disable the feature flag check (see the sketch after this list).
  4. CI/CD artifacts failing: New artifacts stopped working for all customers on GitLab.com. Because of a change in this release to where artifacts are stored on disk, they were no longer being written to the storage server but were instead being written locally. To correct this, the artifact mountpoint was changed and the artifacts that had been written locally on the API fleet were migrated to the storage server.
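
A minimal, self-contained sketch (not GitLab's actual code) of why the per-call feature flag check in item 3 loads the database, and the effect of the patch that disabled it. GitLab's Feature.enabled? helper is backed by the feature_gates table; the stub below stands in for that lookup so the example runs on its own, and the flag name is illustrative only.

    # Stub standing in for GitLab's Feature API; each enabled? call represents
    # one query against the feature_gates table on the primary database.
    class Feature
      @queries = 0
      class << self
        attr_reader :queries
        def enabled?(_name)
          @queries += 1 # stands in for a SELECT against feature_gates
          false
        end
      end
    end

    # Before the patch: every measurement re-checks the flag, one query per
    # call, which multiplied across the Sidekiq fleet keeps load on the primary.
    10_000.times { Feature.enabled?(:prometheus_measurements) }
    puts Feature.queries # => 10000

    # The applied patch skipped this check (returning a constant), so no
    # per-call query is issued.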

Patched Gitaly binary

Analysis

  • How was the incident detected?
      • Critical alert and page for increased error rates on GitLab.com.
  • Is there anything that could have been done to improve the time to detection?
      • Alerting sooner on a Gitaly version mismatch.
  • How was the root cause discovered?
  • Was this incident triggered by a change?
      • It was triggered by a configuration change: we were running a custom Gitaly binary instead of the omnibus version, and the deployment of GitLab EE 10.3.0-rc3 was not compatible with the custom binary.
  • Was there an existing issue that would have either prevented this incident or reduced the impact?
  • Would it have been possible to have caught this issue on staging?
      • If staging had been kept in lockstep configuration with production, with artificial load and continuous QA, we would have caught this before production.

Corrective Actions

Sidekiq database misconfiguration

Analysis

  • How was the incident detected?
      • Errors in Sentry, and inspecting the pgbouncer logs and noticing the missing database errors.
  • Is there anything that could have been done to improve the time to detection?
      • There may be specific alarms related to pgbouncer; we could possibly emit metrics for errors in the pgbouncer logs that would have directed us to the misconfiguration sooner.
  • How was the root cause discovered?
      • By inspecting the pgbouncer logs and searching through issues to understand why we were using a separate database for Sidekiq.
  • Was this incident triggered by a change?
      • The new version of GitLab, in which Sidekiq started connecting to this database, was the change that triggered this.
  • Was there an existing issue that would have either prevented this incident or reduced the impact?
      • No.
  • Would it have been possible to have caught this issue on staging?
      • If staging had been kept in lockstep configuration with production, with artificial load and continuous QA, we would have caught this before production.

Corrective Actions

Feature check loading the db

Analysis

  • How was the incident detected?
      • Multiple alerts for the database and for Sidekiq jobs piling up.
  • Is there anything that could have been done to improve the time to detection?
      • We could have better, more specific alerting.
  • How was the root cause discovered?
      • By looking at graphs and seeing that job time had increased, and by looking at Postgres tuple statistics.
  • Was this incident triggered by a change?
      • The deployment of GitLab EE 10.3.0-rc3 triggered this incident.
  • Was there an existing issue that would have either prevented this incident or reduced the impact?
  • Would it have been possible to have caught this issue on staging?
      • If staging had been kept in lockstep configuration with production, with artificial load and continuous QA, we would have caught this before production.

Corrective Actions

CI/CD artifacts failing

Analysis

  • How was the incident detected?
      • Users reporting issues with broken artifacts.
  • Is there anything that could have been done to improve the time to detection?
      • Monitoring of artifact errors.
  • How was the root cause discovered?
      • A GitLab developer realized that a change in this release moved the location of artifacts on disk. Investigating the API nodes confirmed this theory.
  • Was this incident triggered by a change?
      • The deployment of GitLab EE 10.3.0-rc3 triggered this incident.
  • Was there an existing issue that would have either prevented this incident or reduced the impact?
      • No.
  • Would it have been possible to have caught this issue on staging?
      • If staging had been kept in lockstep configuration with production, with artificial load and continuous QA, we would have caught this before production.

Corrective Actions

Timeline

  • 08:50 - deploy of GitLab EE 10.3.0-rc3 started
  • 09:23 - deploy migrations completed
  • 09:25 - alert for increased error rate
  • 09:31 - report of increased error rate internally on slack
  • 09:37 - verify the running gitaly version with the gitaly team
  • 09:37 - tweet sent about increased error rate
  • 09:45 - gitaly version switched to the omnibus location
  • 09:47 - pingdom reports gitlab.com is down
  • 09:48 - pingdom page for gitlab.com back up
  • 09:55 - unicorn hup sent to front end fleet
  • 10:00 - unicorn hup and restart sent to sidekiq fleet
  • 10:00 - alert for increased error rate cleared
  • 10:08 - deploy continues
  • 10:21 - PD alert for a PostgreSQL replication slot with a stale xmin, which can cause bloat on the primary
  • 10:35 - stale xmin alert cleared
  • 10:35 - alert for Sidekiq: large number of queued ProcessCommitWorker jobs (6585); we believe this is due to database load
  • 11:12 - @stanhu notices that the feature_gates table is being hammered
  • 12:00 - We update the pgbouncer config on 01 and 02 so that it has the proper database mapping for Sidekiq. Not sure if this is related, but there are not as many errors in the logs now:
gitlabhq_production = host=127.0.0.1 port=5432 auth_user=pgbouncer
gitlabhq_production_sidekiq = host=127.0.0.1 port=5432 pool_size=150 auth_user=pgbouncer dbname=gitlabhq_production
    -gitlab_rails['artifacts_object_store_enabled'] = true
    +gitlab_rails['artifacts_object_store_enabled'] = false
  • 13:19 - Restarted sidekiq besteffort on prod for the configuration update above.
  • 13:22 - Deleted the object_storage_upload queue to remove pending jobs.
  • 14:30 - The artifacts_object_store_enabled change did have customer impact, as reported below in the comments; reverted the configuration change and forced chef-client runs to update.
  • 14:40 - Started to distribute the go-1.9 build of gitaly-0.58.0 across the fleet, src: https://gitlab.com/gitlab-com/infrastructure/issues/3392#note_52100255
  • 15:02 - Gitaly is updated across the fleet.
  • 15:41 - We are applying this configuration change to disable artifact uploading: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1423. We believe the reason for the artifact download problem is that artifacts are being deleted on disk after the object storage upload fails.
  "artifacts_enabled": true,
          "artifacts_object_store_enabled": true,
          "artifacts_object_store_remote_directory": "gitlab-artifacts",
          "artifacts_object_store_background_upload": false,
  • 16:04 - The configuration change setting artifacts_object_store_background_upload = false has been applied across the GitLab.com fleet
  • 16:30 - We've identified an issue where the artifact location on disk changed in the new release. This is resulting in artifacts being written to the / partition instead of the shared NFS server. We are working to resolve this by changing the mountpoint on the api fleet and migrating the existing artifacts (a configuration sketch follows the timeline).
  • 16:30 - Testing pulling api-03 out of rotation and rsyncing its disks for validation
  • 16:41 - api-03 synced and placed back in rotation; monitoring, all results positive so far.
  • 16:43 - Verified that all web and Sidekiq nodes are pointing at the right mount locations.
  • 16:46 - Start sync of all api-xx servers in groups of three at a time.
  • 17:12 - Half of API fleet returned to production with correct mounts.
  • 17:53 - All API servers returned to production and verified.
  • 17:54 - Tweeted "Artifacts access has been restored on GitLab.com"
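
The mountpoint fix above came down to making sure the Rails artifacts path on the API fleet resolved to the shared storage mount rather than the local disk. A minimal gitlab.rb sketch of that idea, assuming the artifacts_path setting and an illustrative mountpoint (the exact paths used on GitLab.com are not shown in this issue):

    # Keep artifacts on the shared storage mount; if this path does not resolve
    # to the mounted filesystem, new artifacts land on the local / partition.
    gitlab_rails['artifacts_enabled'] = true
    gitlab_rails['artifacts_path'] = '/mnt/storage/artifacts' # illustrative mountpoint, not the production path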