Simulate primary database server failure (staging)
Context
This is part of the work to upgrade and migrate the GitLab.com container registry to a new version backed by a metadata database and online garbage collection (&5523 (closed)). This will be achieved following the gradual migration plan detailed in container-registry#374 (closed).
Before moving to production we will test a series of failure scenarios that were previously identified and documented in gitlab-com/runbooks!3628 (merged).
Scenario
Primary database server failure (e.g. goes offline or becomes unresponsive).
Expectations
Impact
- API unable to serve requests.
- GC unable to process tasks.
Behavior
- API and GC handle refused or timed-out database connections gracefully;
- Connections are retried once (at the database driver level). In case of failure, connections are discarded and requests halted with a 503 Service Unavailable response;
- A new request leads to a new connection attempt.
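As a rough illustration of the behavior listed above, here is a minimal Go sketch. It is not the registry's actual code: `pinger`, `fakeDB`, `withRetryOnce`, `handler` and `statusFor` are hypothetical names, and the single retry is modeled in application code for clarity, whereas the expectation above places it at the database driver level. The only real APIs used are from the Go standard library (`net/http`, `net/http/httptest`).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// pinger is a hypothetical stand-in for the database handle.
// *sql.DB satisfies it through its PingContext method.
type pinger interface {
	PingContext(ctx context.Context) error
}

// fakeDB simulates a primary that can be taken offline (assumption: test double only).
type fakeDB struct {
	down  bool
	calls int
}

func (f *fakeDB) PingContext(ctx context.Context) error {
	f.calls++
	if f.down {
		return errors.New("connection refused")
	}
	return nil
}

// withRetryOnce runs op and, if it fails, tries exactly one more time.
func withRetryOnce(ctx context.Context, op func(context.Context) error) error {
	if err := op(ctx); err == nil {
		return nil
	}
	return op(ctx)
}

// handler halts the request with 503 when the database cannot be reached.
// Nothing is cached between requests, so each new request attempts a new connection.
func handler(db pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := withRetryOnce(r.Context(), db.PingContext); err != nil {
			http.Error(w, "database unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// statusFor sends one in-memory request and returns the response status code.
func statusFor(db pinger) int {
	rec := httptest.NewRecorder()
	handler(db).ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/v2/", nil))
	return rec.Code
}

func main() {
	db := &fakeDB{down: true}
	fmt.Println(statusFor(db)) // primary offline: 503 after two attempts
	db.down = false
	fmt.Println(statusFor(db)) // primary back: 200 without intervention
}
```

Under these assumptions, the first request fails after two attempts and returns 503, while the next request succeeds once the simulated primary is back, which mirrors the recovery expectation of resuming without external intervention.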
Observability
- Errors show up in Sentry and logs;
- Grafana dashboards reflect the impact scale.
Recovery
API and GC resume operations normally, without external intervention, once the server becomes operational.
Target Environment
This should be tested in staging, as it is the only environment with a custom PostgreSQL cluster like the one we will find in production.