Proposals to improve stable branch pipelines' reliability
Summary
The RSpec tests usually clean up their test environments after they finish, whether they succeed or fail. This guarantees that repeated test executions are idempotent and reliable.
Not everything is cleaned up or reset automatically by RSpec; resources created on a Kubernetes cluster, for instance, are not.
For stable branches we intentionally don't destroy our test environments, because we want long-lived environments in which we can test chart upgrade scenarios. However, maintaining a long-lived environment exposes us to Day-2 operational concerns that are not strictly related to testing a chart upgrade. Some common Day-2 problems we often see:
- Our MinIO PVs are full. Example: #5080 (comment 1638129548).
- Our Prometheus PVs are full. Example: #5138 (comment 1673534352).
- We run backup/restore tests in our pipelines. If these tests fail for whatever reason, they can leave our long-lived environment in a broken state, and the failure is not necessarily caused by a problem in the backup/restore logic itself. For instance, a lack of cluster memory can delay the kubectl commands we send to the cluster, so the specs fail with a timeout while waiting for an operation to complete.
- When one of these specs fails and leaves the environment in an unexpected state, subsequent pipelines fail for other reasons, even though they would pass if the environment had been reset. This recent issue covers some of the problems described here: #5138 (closed)
- Our Kubernetes auto-scaler scales pods down, but our specs rely on specific pod names. When a pod is scaled down and recreated, it gets a new name that the test does not know about, and the test fails: Make spec/features/backups_spec.rb more robust ... (#5002)
Proposals
Detect spec failures and run a cleanup script when they occur
Update our backup spec to include something like:
before(:all) do
  @exceptions = []
end

after(:each) do |example|
  # example.exception is nil for passing examples, so @exceptions.any?
  # below is only true when at least one example failed.
  @exceptions << example.exception
end

after(:all) do
  # cleanup_logic would execute something like our usual mitigation steps:
  # - resetting the Postgres Volume and Pod
  # - resetting the Prometheus and MinIO Volumes
  # - killing pods in CrashLoopBackOff state (Runner and Prometheus are common cases)
  # - re-triggering a new "deploy environment" job
  # - something else?
  cleanup_logic if @exceptions.any?
end
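As a rough illustration of what cleanup_logic could look like, the sketch below shells out to kubectl for some of the mitigation steps listed in the comments. This is only a sketch: the namespace handling, the PVC/pod names, and the omitted re-trigger step are assumptions for illustration, not the actual values or helpers used in our pipelines.

# Hypothetical sketch of cleanup_logic; namespace and resource names are placeholders.
def cleanup_logic
  namespace = ENV.fetch('KUBE_NAMESPACE', 'default')

  # Delete pods stuck in CrashLoopBackOff so their controllers recreate them
  # (Runner and Prometheus are the common cases mentioned above).
  crashlooping = `kubectl -n #{namespace} get pods --no-headers`.lines
                   .select { |line| line.include?('CrashLoopBackOff') }
                   .map { |line| line.split.first }
  crashlooping.each { |pod| system('kubectl', '-n', namespace, 'delete', 'pod', pod) }

  # Resetting a full volume could mean deleting the PVC and its pod so the
  # StatefulSet recreates both; the names below are examples only.
  # system('kubectl', '-n', namespace, 'delete', 'pvc', 'export-minio-0')
  # system('kubectl', '-n', namespace, 'delete', 'pod', 'minio-0')

  # Re-triggering the "deploy environment" job would go through the pipeline
  # trigger API or a manual job; omitted here because it is CI-specific.
end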
Improve our Day-2 tooling
- Get an alert/warning when our Prometheus and MinIO volumes start to run out of space.
- Get an alert/warning when our clusters run low on memory and need to be scaled up manually.
- Run a separate job to scan our release and automatically fix potential known problems.
- Run such a scan/cleanup job periodically, on a schedule, rather than only within pipelines (a sketch follows this list).
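A minimal sketch of what such a periodic scan could check, assuming a scheduled job with kubectl access to the environment; the namespace, pod names, and mount path below are placeholders, not the real values:

# Hypothetical periodic scan; namespace, pod names, and paths are placeholders.
namespace = ENV.fetch('KUBE_NAMESPACE', 'default')

# Report pods stuck in CrashLoopBackOff.
crashlooping = `kubectl -n #{namespace} get pods --no-headers`.lines
                 .select { |line| line.include?('CrashLoopBackOff') }
                 .map { |line| line.split.first }
puts "Pods in CrashLoopBackOff: #{crashlooping.join(', ')}" unless crashlooping.empty?

# Report disk usage inside pods whose volumes tend to fill up.
%w[minio-0 prometheus-server-0].each do |pod|
  usage = `kubectl -n #{namespace} exec #{pod} -- df -h /data 2>/dev/null`
  puts "Disk usage for #{pod}:\n#{usage}" unless usage.empty?
end

A job like this could either just report (feeding the alerts proposed above) or go on to apply the same mitigations as cleanup_logic.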
Other proposals?
- TBD