SAST job is timing out
Description
As can be seen in https://gitlab.com/gitlab-org/gitlab-ce/-/jobs/201117795, the `sast` job pulls a few Docker images:
- `registry.gitlab.com/gitlab-org/security-products/sast:11-10-stable` => 5.38 MiB
- `registry.gitlab.com/gitlab-org/security-products/analyzers/brakeman:latest` => 334.75 MiB
- `registry.gitlab.com/gitlab-org/security-products/analyzers/nodejs-scan:latest` => 29.82 MiB
- `registry.gitlab.com/gitlab-org/security-products/analyzers/eslint:latest` => 34.46 MiB
- `registry.gitlab.com/gitlab-org/security-products/analyzers/secrets:latest` => 41.65 MiB
And then this last analyzer seems to time out (from https://gitlab.com/gitlab-org/security-products/analyzers/common/blob/b22938f24b108a128bc8bb3ffe278f87a4bedca9/orchestrator/orchestrator.go#L155):

```
2019/04/23 19:29:35 Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2019/04/23 19:29:36 Timeout when running the analyzer
```
Workaround
One can increase the timeout that SAST uses when running an analyzer image by setting the `SAST_RUN_ANALYZER_TIMEOUT` environment variable (the "run analyzer" phase is where the `sast` job spends its time hanging). Here's how to set it in the `.gitlab-ci.yml` config file:
```yaml
include:
  template: SAST.gitlab-ci.yml

sast:
  variables:
    SAST_RUN_ANALYZER_TIMEOUT: 1h
```
Remember, though, that the total job run time cannot exceed the job timeout set in the project or runner configuration.
Status
Investigation complete.
It has become clear that the problem is most likely disk IO contention on the runner nodes that execute SAST jobs. The full research results, conclusions, and possible mitigation actions are presented below.
Current findings
- This bug is 100% reproducible on some commits; I've snapshotted a couple of them as branches (example) and am running my investigations against them. Why some commits cause this timeout while others don't is still unclear.
- The problem is 100% related to the `secrets` analyzer: running SAST with `secrets` removed from `SAST_DEFAULT_ANALYZERS` always succeeds.
- I've enhanced test versions of the common library and the `secrets` analyzer with `log.Printf` statements, which proved that the container is created and started successfully; the error happens while waiting for its completion (SAST branch that uses the enhanced dependencies).
- However, the logs from the `secrets` analyzer are not copied to the SAST output stream, so I'm currently unable to tell from the logs which `secrets` analysis phase causes the timeout.
- Running against the problematic branch on my local machine (MacBook Pro, 16 GB RAM, 2.2 GHz Intel Core i7), it took 1 minute from `ContainerStart` to displaying the results of the `secrets` analyzer run, and the `sast` job succeeded.
- I'm concerned about the gap between `ContainerCreate` and `ContainerStart`: on GitLab.com for `gitlab-ce` it takes about 2 minutes. Maybe the delays are connected to the general load on our `docker` infrastructure 🤔.
- Managed to run the job to completion by setting a 100-minute timeout for the analyzer container run.
- Tried running the logging-enhanced `secrets` analyzer against one of the "green commits" to find out why it doesn't fail for any of them, and found that the problem does not lie in "broken"/"correct" source code in the branch but is related to the runner node load during job execution (CPU load, IOPS, etc.).
- Thanks to Ben Kochie from the ~Monitor team, we've managed to find out that the problem is most likely disk IO contention on the runner nodes that execute SAST jobs. This graph shows that the runner node CPU spends most of its time in `iowait` mode.
- According to the Infra team (Slack message, internal):
  > We don't do any specific configuration of disks. We're using GCP VMs for hosting the containers where jobs are being executed. The only difference is that for `gitlab-shared-runners-manager-X` and `private-runners-manager-X` autoscaled machines we're using SSD, while for `shared-runners-managers-X` autoscaled machines we're using GCP's default.
- Spotted a SAST job that was executed successfully against EE at 2019-06-06 8:54 PM UTC, where the `secrets` analyzer did its job pretty quickly (in around 2 minutes). That job was run by `private-runners-manager-3.gitlab.com (#395503)`, which (as stated in the previous item) is backed by an SSD drive. This could be another small clue to understanding the cause of the issue.
- Did a small investigation to locate the source of the I/O bottleneck and to study job execution on different types of runner nodes.
- Suggested possible mitigation actions for the issue.
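Given the finding that removing `secrets` from `SAST_DEFAULT_ANALYZERS` always succeeds, a second interim workaround is to disable that analyzer in `.gitlab-ci.yml`. A sketch, assuming the analyzer names match the image names pulled by the job (the exact expected values should be checked against the SAST documentation):

```yaml
include:
  template: SAST.gitlab-ci.yml

sast:
  variables:
    # Assumption: names correspond to the analyzer images listed above;
    # `secrets` is deliberately omitted until the timeout is resolved.
    SAST_DEFAULT_ANALYZERS: "brakeman, nodejs-scan, eslint"
```

Note that this trades away secret detection entirely, so it is only appropriate while the timeout issue is being investigated.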