Postgres backup restore pipelines (gitlab-restore): on failures, show errors in CI/CD pipeline output
When a gitlab-restore pipeline fails, quite often we cannot see the actual error in the pipeline output. Example: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15668, where the "verify" job has only this error:
```
ERROR: (gcloud.compute.instances.delete) Failed to fetch some instances:
 - The resource 'projects/gitlab-restore/zones/us-west1-a/instances/restore-postgres-gprd-1184314' was not found
ERROR: Job failed: exit code 1
```
– which is not helpful for RCA.
This happens because, in such cases, the restore exceeds the configured CI timeout. In gitlab-restore, the CI/CD timeout is 6h, which is quite generous, but restoring the gprd DB takes even longer.
What happens, step by step:
- The CI job "restore" creates an instance with a startup script (`bootstrap.sh`) that encapsulates the logic of PGDATA creation and backup retrieval.
- If `bootstrap.sh` finishes successfully, `verify_callback` is executed, triggering the next CI job, "verify": https://ops.gitlab.net/gitlab-com/gl-infra/gitlab-restore/postgres-gprd/-/blob/master/bootstrap.sh#L81
- If some error occurs during `bootstrap.sh` execution (such as out of disk space, as in the example above), the instance is cleaned up automatically and we lose useful logs/error messages.
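The failure path in that last step could be instrumented so that diagnostics are printed before the instance disappears. A minimal sketch, assuming a bash `ERR` trap inside the startup script (the `collect_diagnostics` helper and the log paths are hypothetical, not the actual `bootstrap.sh` code):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical diagnostics collector: gathers the bits that usually
# explain a failed restore. Paths are illustrative only.
collect_diagnostics() {
  echo "=== df -h ==="
  df -h
  echo "=== tail of Postgres log (if present) ==="
  tail -n 50 /var/log/postgresql/*.log 2>/dev/null || echo "(no Postgres log found)"
}

# On any failing command, dump diagnostics to stdout (captured in the
# startup-script / serial console output) before cleanup happens.
trap 'echo "bootstrap failed at line $LINENO"; collect_diagnostics' ERR

# ... PGDATA creation and backup retrieval would go here ...
```

Anything the startup script writes to its output can then be pulled from the CI job while the instance still exists, e.g. via `gcloud compute instances get-serial-port-output`.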
This makes it very difficult to troubleshoot. Currently, when we have a failure, we need to run a new pipeline specifying `NO_CLEANUP=1` to keep the instance if an error occurs, and then connect to it manually and investigate (runbook: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/postgresql-backups-wale-walg.md#database-backups-restore-testing). This may take 10+ hours if the error occurs only at the end of the DB restoration (exactly as in cases where the disk size is slightly too small), which is inconvenient and time-consuming.
The idea of this issue is to find and implement a way to deliver error messages to the CI pipeline's output. I see two possible paths:
- There is already a `report_failure` function that is triggered on an error in `bootstrap.sh`; it is currently used for alerting, and we could add some logic there. We just need to find a way to deliver error messages (plus, maybe, additional diagnostics such as the tail of the Postgres log, `df` output, the syslog tail, etc.) to the CI output.
- We could avoid auto-deletion of failed instances and allow them to live for a short period of time, say 1 day, adjusting the auto-cleanup logic here: https://ops.gitlab.net/gitlab-com/gl-infra/gitlab-restore/postgres-gprd/-/blob/master/verify_and_clean.sh#L17. Pros: we have the whole machine in hand to investigate. Cons: this would increase the bill for gitlab-restore if failures are frequent, and we would still need to connect to the machine to see the errors (unless we find a way to duplicate them in the CI output).
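For the second path, the cleanup logic could compare each failed instance's creation timestamp against a TTL instead of deleting immediately. A sketch of the age check only, assuming GNU `date` (the `instance_expired` function name is hypothetical; the `gcloud` call that would supply the timestamp in `verify_and_clean.sh` is shown in a comment):

```shell
#!/usr/bin/env bash
set -euo pipefail

TTL_SECONDS=$((24 * 60 * 60))  # keep failed instances for 1 day

# Decide whether an instance created at the given timestamp has outlived
# the TTL. In verify_and_clean.sh the timestamp would come from e.g.:
#   gcloud compute instances describe "$NAME" --format='value(creationTimestamp)'
instance_expired() {
  local created_at="$1"
  local created_epoch now_epoch
  created_epoch=$(date -d "$created_at" +%s)  # GNU date
  now_epoch=$(date +%s)
  [ $((now_epoch - created_epoch)) -gt "$TTL_SECONDS" ]
}

# Example: an instance created two days ago is past the TTL.
if instance_expired "$(date -d '2 days ago' '+%Y-%m-%dT%H:%M:%S')"; then
  echo "would delete instance"
else
  echo "would keep instance"
fi
```

Running the sketch prints "would delete instance", since a two-day-old instance exceeds the one-day TTL; a just-created instance would be kept.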