Validate behaviour during a database failover scenario
Problem
During a database failover scenario, the database server might be unreachable (connection refused or timeout) for a short period of time. During this time, the registry should handle connection failures gracefully.
Expected Behaviour
The registry should discard broken database connections upfront and handle them gracefully, returning an internal server error to clients as a response to any HTTP API requests that require a database connection, logging a proper error message, and reporting the error to Sentry.
New connections should be attempted for each subsequent request. Once the database server is up again, the registry should establish new connections and resume normal operation.
Solution
Create integration tests to assert the expected behaviour. I have tested this manually: the underlying database driver handles broken connections gracefully, and so does the registry, but we must prove it with tests.
We should use a programmable TCP proxy to simulate network and system conditions between the registry and the database server. I recommend Shopify's Toxiproxy, using its native Go client.
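The shape such a test could take, as an added example in Go that is not code from the registry project or this issue. It assumes a constructed in-process TCP stand-in for the database (hypothetical names startFakeDB, serveRequest and dbUp) so it runs offline; with a real Toxiproxy, the dbUp toggle corresponds to proxy.Disable() and proxy.Enable() on the Go client.

```go
package main

import (
	"log"
	"net"
	"net/http"
	"sync/atomic"
	"time"
)

// dbUp is the kill switch; storing false plays the part of proxy.Disable().
var dbUp atomic.Bool

// startFakeDB runs a constructed TCP stand-in for the database. While dbUp is
// false it closes each accepted connection without the handshake greeting.
func startFakeDB() string {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	dbUp.Store(true)
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			if dbUp.Load() {
				c.Write([]byte("ready\n")) // healthy database greets the client
			}
			c.Close() // during an outage the client reads EOF instead
		}
	}()
	return ln.Addr().String()
}

// serveRequest is the registry-side behaviour under test: a fresh connection
// per request, a deadline so failure is fast, and any error mapped to a 500.
func serveRequest(dbAddr string) int {
	c, err := net.DialTimeout("tcp", dbAddr, time.Second)
	if err == nil {
		defer c.Close()
		c.SetReadDeadline(time.Now().Add(time.Second))
		_, err = c.Read(make([]byte, 16))
	}
	if err != nil {
		log.Printf("database unavailable: %v", err) // Sentry report goes here
		return http.StatusInternalServerError
	}
	return http.StatusOK
}
```

The assertions to make are that a request during the outage gets a 500 and a logged error, and that the first request after the stand-in is enabled again succeeds without any restart, which proves a broken connection was not reused.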