Enable NFS circuit-breaker in production
What are we going to do?
We want to test the behavior of the circuit-breaker when a storage shard is failing in a production-like environment.
We chose to try this on canary because it has several shards available, with viewable data.
To try this out we will set the `GIT_STORAGE_CIRCUIT_BREAKER` env variable to `true` in the canary fleet, and then block access to a shard using iptables.
Why are we doing it?
To verify that the circuit-breaker correctly blocks access to a failing shard while keeping projects on other shards accessible.
To measure the real-world performance overhead of the circuit-breaker.
When are we going to do it?
- Start time: 2017-11-25 19:00 UTC
- Duration: 3h
- End time: 2017-11-25 22:00 UTC
How are we going to do it?
- Enable the circuit-breaker by setting the environment variable: see this MR (an illustrative sketch of what this amounts to follows this list).
- Run a script to temporarily block access to a certain shard.
- Disable the script that blocks access to the storage shard.
- Disable the circuit-breaker by removing the `GIT_STORAGE_CIRCUIT_BREAKER` env variable.
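As a rough sketch, assuming the circuit-breaker is toggled purely by this variable being present in the environment of the Rails processes (the actual change is applied via the MR referenced above, not by hand):

```sh
# Illustrative only -- the real change is applied via the MR referenced above.
# The variable needs to end up in the environment of the Rails application
# processes on every canary app node:
export GIT_STORAGE_CIRCUIT_BREAKER=true
# ...then restart the application workers on each node so they inherit it.
```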
How are we preparing for it?
We need to write a script that blocks access to a shard.
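A minimal sketch of what such a script could look like, assuming we block traffic to the NFS server backing the shard with iptables; `SHARD_IP` and the script name are placeholders:

```sh
#!/bin/sh
# block_shard.sh -- hedged sketch of the shard-blocking script (run as root).
SHARD_IP="10.0.0.42"   # placeholder for the NFS server backing the shard under test
DURATION="${1:-300}"   # seconds to keep the shard blocked (default: 5 minutes)

# Drop all outgoing traffic to the shard so NFS operations start failing,
# which is exactly what the circuit-breaker should detect.
iptables -I OUTPUT -d "$SHARD_IP" -j DROP

sleep "$DURATION"

# Remove the rule again so the shard becomes reachable.
iptables -D OUTPUT -d "$SHARD_IP" -j DROP
```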
What can we check before starting?
We need to record the transaction timings before starting, so we can determine the influence of the circuit-breaker afterwards.
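One simple way to capture a baseline, assuming end-to-end request timings are good enough for the comparison (the project URL is a placeholder):

```sh
# Record a rough timing baseline for a project page on canary before the test.
PROJECT_URL="https://canary.gitlab.com/some-group/some-project"  # placeholder
for i in $(seq 1 20); do
  curl -so /dev/null -w '%{time_total}\n' "$PROJECT_URL"
done
```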
What can we check afterwards to ensure that it's working?
- Accessing projects that are not on the disabled shard should work without delay.
- Accessing projects on the disabled shard should render a 503 immediately (a spot-check sketch follows this list).
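A minimal spot-check of both expectations; the two project URLs are placeholders for a project on the blocked shard and one on a healthy shard:

```sh
# Placeholders: one project on the blocked shard, one on a healthy shard.
BLOCKED="https://canary.gitlab.com/group-a/project-on-blocked-shard"   # placeholder
HEALTHY="https://canary.gitlab.com/group-b/project-on-healthy-shard"   # placeholder

# Should return a 503 quickly once the circuit is broken.
curl -so /dev/null -w 'blocked shard: %{http_code} in %{time_total}s\n' "$BLOCKED"
# Should return a 200 without extra delay.
curl -so /dev/null -w 'healthy shard: %{http_code} in %{time_total}s\n' "$HEALTHY"
```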
Impact
- Type of impact: internal
- What will happen: We don't expect any impact on GitLab.com during this test, but it might cause extra I/O.
- Do we expect downtime? (set the override in pagerduty): No
How are we communicating this to our customers?
We expect no influence on GitLab.com, so no customer communication is needed.
What is the rollback plan?
- Removing the env variable will bring canary back to normal.
- Stop the script that's blocking access to the shard (a manual rollback sketch follows this list).
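In case the blocking script does not clean up after itself, a manual rollback could look like this sketch (run as root; `SHARD_IP` and the script name are the same placeholders as above):

```sh
# Stop the blocking script if it is still running, ignoring errors if it isn't.
pkill -f block_shard.sh || true
# Drop the iptables rule so the shard is reachable again.
iptables -D OUTPUT -d "$SHARD_IP" -j DROP 2>/dev/null || true
# Finally, revert the MR that set GIT_STORAGE_CIRCUIT_BREAKER and restart the workers.
```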
Monitoring
- Graphs to check for failures:
- Alerts that may trigger:
  - We should get `Inaccessible` and `CircuitBroken` errors from canary for the shard we're blocking (a quick log check follows this list).
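A quick way to confirm those errors are actually being raised, assuming the standard Omnibus log location (the path may differ on the canary fleet):

```sh
# Look for the circuit-breaker error classes in the Rails production log
# on a canary app node; the log path is an assumption.
sudo grep -E 'Inaccessible|CircuitBroken' /var/log/gitlab/gitlab-rails/production.log | tail
```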
References
- Making changes to GitLab.com
- Infrastructure links
- On-Call Log
- Blameless Postmortems Guideline
- Monitoring
/label change
Summary of the experiment
The behaviour was as expected at first.
- 503 errors on projects being blocked
- Pages with information in cache and unrelated projects were accessible
- Gitaly being unreachable was also handled correctly (I suspect most of the access was going through Gitaly already)
Then we suddenly started seeing 500 errors; I haven't found those in Sentry yet.
We could not see the status of the circuit-breaker itself (because of https://gitlab.com/gitlab-com/infrastructure/issues/3302), and we need more insight into the circuit-breaker timings. Those are currently only accessible in Prometheus when Unicorn scraping is enabled. We need https://canary.gitlab.com/gitlab-org/gitlab-ce/issues/39698 for the next experiment.
Until https://canary.gitlab.com/gitlab-com/infrastructure/issues/3302 & https://canary.gitlab.com/gitlab-org/gitlab-ce/issues/39698 are resolved, I don't think we should enable this in production.
Details available in this Google doc: https://docs.google.com/document/d/1Hk9JuhU81biPbXNU4INNPCR5qI3hWxVLcva9MGgk3Hs/edit#