Postgres backup restore pipelines (gitlab-restore): on failures, show errors in CI/CD pipeline output
When a gitlab-restore pipeline fails, quite often we cannot see the actual error in the pipeline output. Example: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15668, where the "verify" job has only this error:
```
ERROR: (gcloud.compute.instances.delete) Failed to fetch some instances:
 - The resource 'projects/gitlab-restore/zones/us-west1-a/instances/restore-postgres-gprd-1184314' was not found
ERROR: Job failed: exit code 1
```
– which is not helpful for RCA.
This happens because, in such cases, the restore exceeds the configured CI timeout. In gitlab-restore, the CI/CD timeout is 6h, which is quite generous, but restoring the gprd DB takes even longer.
What happens, step by step:
- The CI job "restore" creates an instance with a startup script (`bootstrap.sh`) that encapsulates the logic of PGDATA creation and backup retrieval.
- If `bootstrap.sh` finishes successfully, `verify_callback` is executed, triggering the next CI job, "verify": https://ops.gitlab.net/gitlab-com/gl-infra/gitlab-restore/postgres-gprd/-/blob/master/bootstrap.sh#L81
- If some error occurs during `bootstrap.sh` execution (such as out of disk space, as in the example above), the instance is cleaned up automatically and we lose useful logs/error messages.
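The failure path in that last step could be instrumented so that diagnostics are printed before the instance disappears. A minimal sketch, assuming a bash `ERR` trap inside the startup script (the `collect_diagnostics` helper and the log paths are hypothetical, not the actual `bootstrap.sh` code):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical diagnostics collector: gathers the bits that usually
# explain a failed restore. Paths are illustrative only.
collect_diagnostics() {
  echo "=== df -h ==="
  df -h
  echo "=== tail of Postgres log (if present) ==="
  tail -n 50 /var/log/postgresql/*.log 2>/dev/null || echo "(no Postgres log found)"
}

# On any failing command, dump diagnostics to stdout (captured in the
# startup-script / serial console output) before cleanup happens.
trap 'echo "bootstrap failed at line $LINENO"; collect_diagnostics' ERR

# ... PGDATA creation and backup retrieval would go here ...
```

Anything the startup script writes to its output can then be pulled from the CI job while the instance still exists, e.g. via `gcloud compute instances get-serial-port-output`.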
This makes it very difficult to troubleshoot. Currently, when we have a failure, we need to run a new pipeline specifying `NO_CLEANUP=1` to keep the instance if an error occurs, and then connect to it manually and investigate (runbook: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/postgresql-backups-wale-walg.md#database-backups-restore-testing). This may take 10+ hours if the error occurs only at the end of the DB restoration (exactly as in cases where the disk size is slightly too small), which is inconvenient and time-consuming.
The idea of this issue is to find and implement a way to deliver error messages to the CI pipeline's output. I see two possible paths:
- There is already a `report_failure` function that is triggered on an error in `bootstrap.sh`; it is currently used for alerting, and we could add some logic there. We just need to find a way to deliver error messages (plus, maybe, additional diagnostics such as the tail of the Postgres log, `df` output, the syslog tail, etc.) to the CI output.
- We could avoid auto-deletion of failed instances and allow them to live for a short period of time, say 1 day, adjusting the auto-cleanup logic here: https://ops.gitlab.net/gitlab-com/gl-infra/gitlab-restore/postgres-gprd/-/blob/master/verify_and_clean.sh#L17. Pros: we have the whole machine in hand to investigate. Cons: this would increase the bill for gitlab-restore if failures are frequent, and we would still need to connect to the machine to see the errors (unless we find a way to duplicate them in the CI output).
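For the second path, the cleanup logic could compare each failed instance's creation timestamp against a TTL instead of deleting immediately. A sketch of the age check only, assuming GNU `date` (the `instance_expired` function name is hypothetical; the `gcloud` call that would supply the timestamp in `verify_and_clean.sh` is shown in a comment):

```shell
#!/usr/bin/env bash
set -euo pipefail

TTL_SECONDS=$((24 * 60 * 60))  # keep failed instances for 1 day

# Decide whether an instance created at the given timestamp has outlived
# the TTL. In verify_and_clean.sh the timestamp would come from e.g.:
#   gcloud compute instances describe "$NAME" --format='value(creationTimestamp)'
instance_expired() {
  local created_at="$1"
  local created_epoch now_epoch
  created_epoch=$(date -d "$created_at" +%s)  # GNU date
  now_epoch=$(date +%s)
  [ $((now_epoch - created_epoch)) -gt "$TTL_SECONDS" ]
}

# Example: an instance created two days ago is past the TTL.
if instance_expired "$(date -d '2 days ago' '+%Y-%m-%dT%H:%M:%S')"; then
  echo "would delete instance"
else
  echo "would keep instance"
fi
```

Running the sketch prints "would delete instance", since a two-day-old instance exceeds the one-day TTL; a just-created instance would be kept.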