Enable NFS circuit-breaker in production
What are we going to do?
We want to test the behavior of the circuit-breaker when a storage shard is failing in a production-like environment.
We chose to try this on canary because it has several shards available, with viewable data.
To try this out we will set the `GIT_STORAGE_CIRCUIT_BREAKER` env variable to `true` in the canary fleet, and then block access to a shard using iptables.
Why are we doing it?
To verify that the circuit-breaker correctly blocks access to a failing shard while keeping projects on other shards accessible.
To measure the real-world performance overhead of the circuit-breaker.
When are we going to do it?
- Start time: 2017-11-25 19:00 UTC
- Duration: 3h
- End time: 2017-11-25 22:00 UTC
How are we going to do it?
- Enable the circuit-breaker by setting the environment variable: see this MR (an illustrative sketch of what this amounts to follows this list).
- Run a script to temporarily block access to a certain shard.
- Disable the script that blocks access to the storage shard.
- Disable the circuit-breaker by removing the `GIT_STORAGE_CIRCUIT_BREAKER` env variable.
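As a rough sketch, assuming the circuit-breaker is toggled purely by this variable being present in the environment of the Rails processes (the actual change is applied via the MR referenced above, not by hand):

```sh
# Illustrative only -- the real change is applied via the MR referenced above.
# The variable needs to end up in the environment of the Rails application
# processes on every canary app node:
export GIT_STORAGE_CIRCUIT_BREAKER=true
# ...then restart the application workers on each node so they inherit it.
```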
How are we preparing for it?
We need to write a script that blocks access to a shard.
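A minimal sketch of what such a script could look like, assuming we block traffic to the NFS server backing the shard with iptables; `SHARD_IP` and the script name are placeholders:

```sh
#!/bin/sh
# block_shard.sh -- hedged sketch of the shard-blocking script (run as root).
SHARD_IP="10.0.0.42"   # placeholder for the NFS server backing the shard under test
DURATION="${1:-300}"   # seconds to keep the shard blocked (default: 5 minutes)

# Drop all outgoing traffic to the shard so NFS operations start failing,
# which is exactly what the circuit-breaker should detect.
iptables -I OUTPUT -d "$SHARD_IP" -j DROP

sleep "$DURATION"

# Remove the rule again so the shard becomes reachable.
iptables -D OUTPUT -d "$SHARD_IP" -j DROP
```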
What can we check before starting?
We need to record the transaction timings before starting, so we can determine the influence of the circuit-breaker afterwards.
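One simple way to capture a baseline, assuming end-to-end request timings are good enough for the comparison (the project URL is a placeholder):

```sh
# Record a rough timing baseline for a project page on canary before the test.
PROJECT_URL="https://canary.gitlab.com/some-group/some-project"  # placeholder
for i in $(seq 1 20); do
  curl -so /dev/null -w '%{time_total}\n' "$PROJECT_URL"
done
```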
What can we check afterwards to ensure that it's working?
- Accessing projects that are not on the disabled shard should work without delay.
- Accessing projects on the disabled shard should render a 503 immediately (a spot-check sketch follows this list).
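A minimal spot-check of both expectations; the two project URLs are placeholders for a project on the blocked shard and one on a healthy shard:

```sh
# Placeholders: one project on the blocked shard, one on a healthy shard.
BLOCKED="https://canary.gitlab.com/group-a/project-on-blocked-shard"   # placeholder
HEALTHY="https://canary.gitlab.com/group-b/project-on-healthy-shard"   # placeholder

# Should return a 503 quickly once the circuit is broken.
curl -so /dev/null -w 'blocked shard: %{http_code} in %{time_total}s\n' "$BLOCKED"
# Should return a 200 without extra delay.
curl -so /dev/null -w 'healthy shard: %{http_code} in %{time_total}s\n' "$HEALTHY"
```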
Impact
- Type of impact: internal
- What will happen: We don't expect any impact on GitLab.com during this test, but it might cause extra I/O.
- Do we expect downtime? (set the override in pagerduty): No
How are we communicating this to our customers?
We expect no influence on GitLab.com, so no customer communication is needed.
What is the rollback plan?
- Removing the env variable will bring canary back to normal.
- Stop the script that's blocking access to the shard (a manual rollback sketch follows this list).
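In case the blocking script does not clean up after itself, a manual rollback could look like this sketch (run as root; `SHARD_IP` and the script name are the same placeholders as above):

```sh
# Stop the blocking script if it is still running, ignoring errors if it isn't.
pkill -f block_shard.sh || true
# Drop the iptables rule so the shard is reachable again.
iptables -D OUTPUT -d "$SHARD_IP" -j DROP 2>/dev/null || true
# Finally, revert the MR that set GIT_STORAGE_CIRCUIT_BREAKER and restart the workers.
```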
Monitoring
- Graphs to check for failures:
- Alerts that may trigger:
  - We should get `Inaccessible` and `CircuitBroken` errors from canary for the shard we're blocking (a quick log check follows this list).
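A quick way to confirm those errors are actually being raised, assuming the standard Omnibus log location (the path may differ on the canary fleet):

```sh
# Look for the circuit-breaker error classes in the Rails production log
# on a canary app node; the log path is an assumption.
sudo grep -E 'Inaccessible|CircuitBroken' /var/log/gitlab/gitlab-rails/production.log | tail
```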
References
- Making changes to GitLab.com
- Infrastructure links
- On-Call Log
- Blameless Postmortems Guideline
- Monitoring
/label change
Summary of the experiment
The behaviour was as expected at first.
- 503 errors on projects being blocked
- Pages with information in cache and unrelated projects were accessible
- Gitaly being unreachable was also handled correctly (I suspect most of the access was going through Gitaly already)
Then we suddenly started seeing 500 errors; I haven't found those in Sentry yet.
We could not see the status of the circuit-breaker itself (because of https://gitlab.com/gitlab-com/infrastructure/issues/3302), and we need more insight into the circuit-breaker timings. Those are currently only accessible in Prometheus when Unicorn scraping is enabled. We need https://canary.gitlab.com/gitlab-org/gitlab-ce/issues/39698 for the next experiment.
Until https://canary.gitlab.com/gitlab-com/infrastructure/issues/3302 & https://canary.gitlab.com/gitlab-org/gitlab-ce/issues/39698 are resolved, I don't think we should enable this in production.
Details available in this Google doc: https://docs.google.com/document/d/1Hk9JuhU81biPbXNU4INNPCR5qI3hWxVLcva9MGgk3Hs/edit#