Try out new circuitbreaker updates in canary

The circuitbreaker has been adjusted on several fronts:

  1. https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/15426: The check now happens inside a separate request to the unicorn of each host, triggered every second by a process running on that host. This means the check no longer has to be performed on every user request (a rough sketch of this model follows the list).

  2. https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/15612: Improved metrics for the storage checks become available as soon as the process is running, without the circuitbreaker actually blocking access to storage.

  3. https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/15613: The health page should open again.
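
A minimal sketch of that per-host check model, for illustration only: the local endpoint path, port, and interval handling below are assumptions, and the real behaviour lives in the merge requests above.

```python
# Sketch: a small process on each host asks the local unicorn once per second
# to run the storage access check, so user requests no longer pay for it.
import time
import urllib.request

LOCAL_UNICORN_CHECK_URL = "http://127.0.0.1:8080/-/storage_check"  # hypothetical endpoint
CHECK_INTERVAL_SECONDS = 1


def trigger_storage_check() -> bool:
    """Ask the local unicorn to run its storage access check once."""
    request = urllib.request.Request(LOCAL_UNICORN_CHECK_URL, method="POST")
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return response.status == 200
    except OSError:
        # Unicorn unreachable or the check timed out: count it as a failure.
        return False


if __name__ == "__main__":
    while True:
        ok = trigger_storage_check()
        print("storage check", "passed" if ok else "failed")
        time.sleep(CHECK_INTERVAL_SECONDS)
```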

We should try this out again after 10.3 is deployed.

What are we going to do?

Try out the changes to the circuit breaker in canary.

Why are we doing it?

To validate the new behaviour so that we can turn it on in production.

When are we going to do it?

  • Start time: ___

  • Duration: ___

  • Estimated end time: ___

TBD

How are we going to do it?

  1. Enable the process that calls out to the unicorn to perform access checks.
  2. Check that the health page opens: https://gitlab.com/admin/health_check
  3. Check that metrics are arriving in Prometheus (metric name: circuitbreaker_storage_check_duration_seconds); a rough verification sketch follows this list.
  4. Use the network gnome (https://gitlab.com/gl-infra/network-gnome) to block access to the NFS shard where gitlab-ce lives.
  5. Check that the failure count goes up on the health page: https://gitlab.com/admin/health_check
  6. Check the metrics for the failing storage in Prometheus.
  7. Enable the circuitbreaker (https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1269).
  8. Check that pages of gitlab-ce that do not touch the repository are still accessible.
  9. Check that the repositories of other projects are still accessible.
  10. Re-enable access to the NFS shard.
  11. Check that the service recovers.
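
A rough verification sketch for steps 2–3 and 5–6, not part of the runbook tooling: the Prometheus address, the `_count` histogram suffix, and unauthenticated access to the health page are assumptions; only the metric name comes from the plan above.

```python
import json
import urllib.parse
import urllib.request

HEALTH_CHECK_URL = "https://canary.gitlab.com/admin/health_check"  # assumed canary URL
PROMETHEUS_URL = "http://prometheus.internal.example:9090"          # hypothetical address
METRIC = "circuitbreaker_storage_check_duration_seconds"


def health_page_opens() -> bool:
    """Steps 2 and 5: the admin health check page should return HTTP 200."""
    with urllib.request.urlopen(HEALTH_CHECK_URL, timeout=10) as response:
        return response.status == 200


def metric_is_reported() -> bool:
    """Steps 3 and 6: Prometheus should return samples for the check metric."""
    params = urllib.parse.urlencode({"query": METRIC + "_count"})
    url = f"{PROMETHEUS_URL}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=10) as response:
        payload = json.load(response)
    return bool(payload.get("data", {}).get("result"))


if __name__ == "__main__":
    print("health page opens:", health_page_opens())
    print("metric reported:", metric_is_reported())
```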

How are we preparing for it?

We need to enable this setting on all the nodes of the canary fleet that have unicorns installed.

What can we check afterwards to ensure that it's working?

https://canary.gitlab.com/gitlab-org/gitlab-ce should break on the repository page, while other pages should still work. https://canary.gitlab.com/fdroid/fdroidclient/ should remain available.
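
A quick sketch of that check, assuming that an HTTP 5xx response or a connection error counts as "broken":

```python
import urllib.error
import urllib.request

# Expected outcome while the NFS shard hosting gitlab-ce is blocked.
EXPECTED_AVAILABLE = {
    "https://canary.gitlab.com/gitlab-org/gitlab-ce": False,
    "https://canary.gitlab.com/fdroid/fdroidclient/": True,
}


def is_available(url: str) -> bool:
    """Return True when the page responds with something below HTTP 500."""
    try:
        with urllib.request.urlopen(url, timeout=15) as response:
            return response.status < 500
    except urllib.error.HTTPError as error:
        return error.code < 500
    except OSError:
        return False


if __name__ == "__main__":
    for url, expected in EXPECTED_AVAILABLE.items():
        actual = is_available(url)
        verdict = "as expected" if actual == expected else "UNEXPECTED"
        print(f"{url}: available={actual} expected={expected} ({verdict})")
```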

Impact

  • Type of impact: Production should not be impacted.

  • What will happen: ___

  • Do we expect downtime? (set the override in pagerduty): ___

What is the rollback plan?

Revert the cookbook change (https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1269).

Monitoring

  • Graphs to check for failures:


  • Graphs to check for improvements:


  • Alerts that may trigger:


[IF NEEDED]

Google Doc to follow during the change (remember to link in the on-call log)


Scheduling

Schedule a downtime in the production calendar twice as long as your worst duration estimate; be pessimistic (better safe than sorry).

When things go wrong (downtime or service degradation)

  • Label the change issue as outage

  • Perform a blameless post mortem

References

cc @ilyaf @DouweM @andrewn