Simulate primary database server failure (staging)

Context

This is part of the work to upgrade and migrate the GitLab.com container registry to a new version backed by a metadata database and online garbage collection (&5523 (closed)). This will be achieved following the gradual migration plan detailed in container-registry#374 (closed).

Before moving to production we will test a series of failure scenarios that were previously identified and documented in gitlab-com/runbooks!3628 (merged).

Scenario

Primary database server failure (e.g. goes offline or becomes unresponsive).

Expectations

Impact

  • API unable to serve requests.
  • GC unable to process tasks.

Behavior

  • API and GC handle refused or timed-out database connections gracefully;
  • Failed connections are retried once (at the database driver level). If the retry also fails, the connection is discarded and the request is halted with a 503 Service Unavailable response;
  • A new request leads to a new connection attempt.
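
The expected behavior above can be sketched in Go as follows. This is a minimal illustration, not the registry's actual code: the `pinger` interface, `downDB` fake, `withRetry`, `handler`, and `serveOnce` are hypothetical names, and the single retry is modeled explicitly in application code, whereas this issue expects it to happen at the database driver level. The `/v2/` path is used only as a sample request target.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// pinger is a hypothetical stand-in for the database handle.
// *sql.DB satisfies it through its PingContext method.
type pinger interface {
	PingContext(ctx context.Context) error
}

// downDB is a fake that simulates a primary refusing every connection
// and counts how many attempts were made against it.
type downDB struct{ attempts int }

func (d *downDB) PingContext(ctx context.Context) error {
	d.attempts++
	return errors.New("connection refused")
}

// withRetry runs op and, if it fails, runs it exactly once more.
func withRetry(ctx context.Context, op func(context.Context) error) error {
	if err := op(ctx); err == nil {
		return nil
	}
	return op(ctx)
}

// handler responds with 503 when the database is still unreachable after
// one retry. No connection state is kept between requests, so the next
// request starts with a fresh attempt.
func handler(db pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// The 2s timeout is an arbitrary value chosen for this sketch.
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := withRetry(ctx, db.PingContext); err != nil {
			http.Error(w, "database unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// serveOnce sends one in-memory request through the handler and
// returns the resulting HTTP status code.
func serveOnce(db pinger) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/v2/", nil)
	handler(db).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	db := &downDB{}
	fmt.Println(serveOnce(db), db.attempts) // prints: 503 2
	fmt.Println(serveOnce(db), db.attempts) // prints: 503 4
}
```

Each request makes two attempts (the initial one plus one retry) before failing with 503, and the second request triggers two new attempts rather than reusing the failed state.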

Observability

  • Errors show up in Sentry and logs;
  • Grafana dashboards reflect the impact scale.

Recovery

API and GC resume normal operation, without external intervention, once the primary database server becomes operational again.
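
The recovery expectation can be illustrated with a small simulation of a worker loop. This is a sketch under assumptions: `flakyDB` and `simulate` are hypothetical names, the outage is modeled by toggling a flag, and a real worker would wait between iterations rather than looping immediately.

```go
package main

import (
	"errors"
	"fmt"
)

// flakyDB is a hypothetical fake database that fails while healthy is false.
type flakyDB struct{ healthy bool }

func (d *flakyDB) Ping() error {
	if !d.healthy {
		return errors.New("connection refused")
	}
	return nil
}

// simulate runs a fixed number of worker iterations. The database is down
// for iterations in [outageStart, outageEnd). Each iteration makes a fresh
// attempt, so work resumes as soon as the database is back, without any
// restart or manual reset.
func simulate(iterations, outageStart, outageEnd int) []string {
	db := &flakyDB{healthy: true}
	results := make([]string, 0, iterations)
	for i := 0; i < iterations; i++ {
		db.healthy = i < outageStart || i >= outageEnd
		if err := db.Ping(); err != nil {
			// Skip this task and try again on the next iteration.
			results = append(results, "skipped")
			continue
		}
		results = append(results, "ok")
	}
	return results
}

func main() {
	fmt.Println(simulate(5, 1, 3)) // prints: [ok skipped skipped ok ok]
}
```

During the staging test, the equivalent signal would be error entries stopping and successful requests or GC task runs reappearing after the primary is brought back, with no registry restart in between.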

Target Environment

This should be tested in staging, as that is the only environment with a custom PostgreSQL cluster like the one we will find in production.

Edited by João Pereira