We recently noticed an increased error rate related to a mysterious code 137 error in the GitLab QA pipeline. It is probably related to the OOM killer and the Docker engine. I suspect that we reuse GCP machines a little too often and thus run out of memory, but this needs further investigation.
For the time being I'm going to add `retry: 1` to the QA tests, but this is only a temporary workaround; we still need to find the root cause and resolve it.
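For illustration only (the real QA job names and scripts differ), the workaround looks roughly like this in `.gitlab-ci.yml`:

```yaml
# Illustrative sketch; the actual QA jobs are defined differently.
qa:internal:                # placeholder job name
  stage: test
  script:
    - bundle exec rake qa   # placeholder command
  retry: 1                  # re-run the job once if it fails, e.g. after a code 137 kill
```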
Grzegorz Bizon changed title from Reduce intermittent errors related to lack of memory (code 137) to Intermittent errors related to lack of memory (code 137)
This is what I learned from checking the runner config and inspecting a running machine:
The issue seems to hit two runners specifically:

- `build-trigger-runner-manager-gitlab-org`, which QA uses
- `package-promotion-runner`, which omnibus-gitlab uses to upload packages to staging
Both of these runners have `coreos-stable-1520-8-0-v20171026` as the `google-machine-image` in `MachineOptions`. This is a pretty old CoreOS version.
While inspecting the running machine, both @marin and I saw a broadcast message from locksmithd, the reboot manager for CoreOS, announcing that a reboot would happen in 5 minutes.
CoreOS has an auto-update feature, so while our jobs are running, the updater is running in the background and can initiate a reboot whenever it decides to. When that reboot lands in the middle of a job, the job is killed, which produces the 137 error.
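For reference, these reboots are driven by locksmithd's reboot strategy, which CoreOS lets you configure through cloud-config. A minimal sketch (not our actual machine configuration) that turns the automatic reboots off looks like this:

```yaml
#cloud-config
# Sketch of a CoreOS cloud-config: setting the reboot strategy to "off" stops
# locksmithd from rebooting the machine after update-engine fetches an update.
coreos:
  update:
    reboot-strategy: "off"
```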
gitlab-omnibus-builder!77 (merged) adds a stage to build a new machine image with the auto-updater disabled. This MR lives in the gitlab-omnibus-builder project because both of the runners in question are associated with the same GCP project, so we already have the infrastructure in place.
In addition to the above MR, the fix requires updating our runner configuration roles in the cookbooks so that `google-machine-image` points to the new image we create.
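Roughly, baking such an image amounts to booting an instance from the old CoreOS image, disabling the update and reboot services, and snapshotting the disk. The sketch below shows what a job in that image-baking stage could do; the job name, image name, zone, and exact commands are illustrative assumptions, not the actual contents of the MR:

```yaml
# Hypothetical image-baking job; all names and flags below are assumptions.
bake-coreos-image:
  stage: images
  script:
    # Boot a throwaway instance from the current CoreOS image.
    - gcloud compute instances create coreos-bake --zone us-east1-d --image coreos-stable-1520-8-0-v20171026 --image-project coreos-cloud
    # Disable the auto-updater and the reboot manager before snapshotting.
    - gcloud compute ssh coreos-bake --zone us-east1-d --command "sudo systemctl stop update-engine locksmithd && sudo systemctl mask update-engine locksmithd"
    # Snapshot the boot disk into a reusable image and clean up.
    - gcloud compute instances stop coreos-bake --zone us-east1-d
    - gcloud compute images create coreos-stable-no-autoreboot --source-disk coreos-bake --source-disk-zone us-east1-d
    - gcloud compute instances delete coreos-bake --zone us-east1-d --quiet
```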