When pages has an update, deployer appears to stop and fail the job
In the latest deploy, https://ops.gitlab.net/gitlab-com/gl-infra/deployer/pipelines/107586, the production web-pages deploy is failing:
- https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/847606
- https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/847047
- https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/847737
These jobs include three changes:
- An update to pages: gitlab-org/gitlab!23023 (merged)
- An update to ansible: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/merge_requests/222
- An update to the ansible configuration: https://ops.gitlab.net/gitlab-com/gl-infra/deploy-tooling/merge_requests/228
The jobs behave oddly: each one ends in the middle of its log output, before ansible is even halfway through the checking procedure we have in place. This particular job is the most stressful one because it takes the longest, since pages takes an extraordinarily long time to boot up. I do not suspect the pages update itself. While I was investigating the first deploy failure, the healthcheck eventually started to pass, and the node was healthy and back in rotation before I could finish the investigation. Pages taking a long time to start up is not new to us. Because of this, I suspect that one of the two ansible changes is responsible.