Proposals to improve stable branch pipelines' reliability
Summary
The RSpec tests usually clean up their test environments after they finish, whether they succeed or fail. This guarantees that repeated test executions are idempotent and reliable.
Not everything is cleaned up or reset automatically by RSpec; resources created on a Kubernetes cluster, for instance, are not.
For stable branches we intentionally don't destroy our test environments, because we want long-lived environments in which we can test chart upgrade scenarios. However, maintaining a long-lived environment exposes us to Day-2 operational concerns that are not strictly related to testing a chart upgrade. Some common Day-2 problems we often see:
- Our MinIO PVs are full. Example: #5080 (comment 1638129548).
- Our Prometheus PVs are full. Example: #5138 (comment 1673534352).
- We run backup/restore tests in our pipelines. If these tests fail for whatever reason, they can leave our long-lived environment in a broken state, and the failure is not necessarily caused by a problem in the backup/restore logic itself. For instance, a lack of cluster memory can delay the kubectl commands we send to the cluster, so the specs fail with a timeout while waiting for an operation to complete.
- When one of these specs fails and leaves the environment in an unexpected state, subsequent pipelines fail for other reasons, even though they would pass if the environment had been reset. This recent issue covers some of the problems described here: #5138 (closed)
- Our Kubernetes auto-scaler scales pods down, but our specs rely on specific pod names. When a pod is scaled down and recreated, it gets a new name that the test does not know about, and the test fails: Make spec/features/backups_spec.rb more robust ... (#5002)
Proposals
Detect spec failures and run a cleanup script when they occur
Update our backup spec to include something like:
before(:all) do
  @exceptions = []
end

after(:each) do |example|
  # example.exception is nil for passing examples, so @exceptions.any?
  # below is only true when at least one example failed.
  @exceptions << example.exception
end

after(:all) do
  # cleanup_logic would execute something like our usual mitigation steps:
  # - resetting the Postgres Volume and Pod
  # - resetting the Prometheus and MinIO Volumes
  # - killing pods in CrashLoopBackOff state (Runner and Prometheus are common cases)
  # - re-triggering a new "deploy environment" job
  # - something else?
  cleanup_logic if @exceptions.any?
end
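As a rough illustration of what cleanup_logic could look like, the sketch below shells out to kubectl for some of the mitigation steps listed in the comments. This is only a sketch: the namespace handling, the PVC/pod names, and the omitted re-trigger step are assumptions for illustration, not the actual values or helpers used in our pipelines.

# Hypothetical sketch of cleanup_logic; namespace and resource names are placeholders.
def cleanup_logic
  namespace = ENV.fetch('KUBE_NAMESPACE', 'default')

  # Delete pods stuck in CrashLoopBackOff so their controllers recreate them
  # (Runner and Prometheus are the common cases mentioned above).
  crashlooping = `kubectl -n #{namespace} get pods --no-headers`.lines
                   .select { |line| line.include?('CrashLoopBackOff') }
                   .map { |line| line.split.first }
  crashlooping.each { |pod| system('kubectl', '-n', namespace, 'delete', 'pod', pod) }

  # Resetting a full volume could mean deleting the PVC and its pod so the
  # StatefulSet recreates both; the names below are examples only.
  # system('kubectl', '-n', namespace, 'delete', 'pvc', 'export-minio-0')
  # system('kubectl', '-n', namespace, 'delete', 'pod', 'minio-0')

  # Re-triggering the "deploy environment" job would go through the pipeline
  # trigger API or a manual job; omitted here because it is CI-specific.
end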
Improve our Day-2 tooling
- Get an alert/warning when our Prometheus and MinIO volumes start to run out of space.
- Get an alert/warning when our clusters run low on memory and need to be scaled up manually.
- Run a separate job to scan our release and automatically fix potential known problems.
- Run such a scan/cleanup job periodically, on a schedule, rather than only within pipelines (a sketch follows this list).
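A minimal sketch of what such a periodic scan could check, assuming a scheduled job with kubectl access to the environment; the namespace, pod names, and mount path below are placeholders, not the real values:

# Hypothetical periodic scan; namespace, pod names, and paths are placeholders.
namespace = ENV.fetch('KUBE_NAMESPACE', 'default')

# Report pods stuck in CrashLoopBackOff.
crashlooping = `kubectl -n #{namespace} get pods --no-headers`.lines
                 .select { |line| line.include?('CrashLoopBackOff') }
                 .map { |line| line.split.first }
puts "Pods in CrashLoopBackOff: #{crashlooping.join(', ')}" unless crashlooping.empty?

# Report disk usage inside pods whose volumes tend to fill up.
%w[minio-0 prometheus-server-0].each do |pod|
  usage = `kubectl -n #{namespace} exec #{pod} -- df -h /data 2>/dev/null`
  puts "Disk usage for #{pod}:\n#{usage}" unless usage.empty?
end

A job like this could either just report (feeding the alerts proposed above) or go on to apply the same mitigations as cleanup_logic.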
Other proposals?
- TBD