SAST job is timing out
Description
As can be seen here: https://gitlab.com/gitlab-org/gitlab-ce/-/jobs/201117795
For the sast job it seems we pull a few Docker images:
- Status: Downloaded newer image for registry.gitlab.com/gitlab-org/security-products/sast:11-10-stable => 5.38 MiB
- Status: Downloaded newer image for registry.gitlab.com/gitlab-org/security-products/analyzers/brakeman:latest => 334.75 MiB
- Status: Downloaded newer image for registry.gitlab.com/gitlab-org/security-products/analyzers/nodejs-scan:latest => 29.82 MiB
- Status: Downloaded newer image for registry.gitlab.com/gitlab-org/security-products/analyzers/eslint:latest => 34.46 MiB
- Status: Downloaded newer image for registry.gitlab.com/gitlab-org/security-products/analyzers/secrets:latest => 41.65 MiB
And then this last analyzer seems to time out somehow (from https://gitlab.com/gitlab-org/security-products/analyzers/common/blob/b22938f24b108a128bc8bb3ffe278f87a4bedca9/orchestrator/orchestrator.go#L155):
2019/04/23 19:29:35 Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2019/04/23 19:29:36 Timeout when running the analyzer
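For context, the pattern behind that "Timeout when running the analyzer" message is roughly a bounded wait on the container run. The Go sketch below is illustrative only, not the actual orchestrator code: the helper names, the default duration, and the way the `SAST_RUN_ANALYZER_TIMEOUT` variable is parsed are assumptions.

```go
package main

import (
	"errors"
	"log"
	"os"
	"time"
)

// runAnalyzer stands in for the code that creates, starts, and waits for the
// analyzer container (e.g. the secrets analyzer image). Hypothetical helper.
func runAnalyzer(image string, done chan<- error) {
	// ... ContainerCreate / ContainerStart / wait for completion ...
	done <- nil
}

// runWithTimeout runs the analyzer and gives up once the configured timeout
// elapses, which is the failure mode seen in the job log above.
func runWithTimeout(image string) error {
	timeout := 10 * time.Minute // assumed default, not the real one
	if v := os.Getenv("SAST_RUN_ANALYZER_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil { // accepts values like "1h"
			timeout = d
		}
	}

	done := make(chan error, 1)
	go runAnalyzer(image, done)

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errors.New("Timeout when running the analyzer")
	}
}

func main() {
	image := "registry.gitlab.com/gitlab-org/security-products/analyzers/secrets:latest"
	if err := runWithTimeout(image); err != nil {
		log.Fatal(err)
	}
}
```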
Workaround
One can increase the timeout SAST uses when running the analyzer image by setting the SAST_RUN_ANALYZER_TIMEOUT environment variable (the phase where the sast job spends its time hanging is the "run analyzer" phase).
Here's how to use it in the .gitlab-ci.yml config file:
```yaml
include:
  template: SAST.gitlab-ci.yml

sast:
  variables:
    SAST_RUN_ANALYZER_TIMEOUT: 1h
```
Remember, though, that the total job run time cannot exceed the job timeout set in the project or runner configuration.
Status
Investigation complete.
It has become clear that the problem is most likely disk I/O contention on the runner nodes that execute SAST jobs. The full research results, conclusions, and possible mitigation actions are presented below.
Current findings
- This bug is 100% reproducible on some commits; I've snapshotted a couple of them as branches (example) and I'm running my investigations over them. Why some commits cause this timeout while others don't is still unclear.
- The problem is 100% related to the `secrets` analyzer: running SAST with `secrets` removed from `SAST_DEFAULT_ANALYZERS` always succeeds.
- I've enhanced test versions of the common library and the `secrets` analyzer with `log.Printf` statements, which proved that the container is created and started successfully; the error happens while waiting for its completion (SAST branch that uses the enhanced dependencies). See the instrumentation sketch after this list.
- However, the logs from the `secrets` analyzer don't get copied to the SAST output stream, so I'm currently unable to tell from the logs which `secrets` analysis phase causes the timeout.
- Running on my local machine (MacBook Pro, RAM: 16 GB, CPU: 2.2 GHz Intel Core i7) against the problematic branch, it took 1 minute from `ContainerStart` to displaying the results of the `secrets` analyzer run, and the `sast` job succeeded.
- I'm concerned about the gap between `ContainerCreate` and `ContainerStart`: on GitLab.com for `gitlab-ce` it takes about 2 minutes. Maybe the delays are connected to the general load on our `docker` infrastructure 🤔.
- Managed to run the job to the end by setting the analyzer container run timeout to 100 minutes.
- Tried running the logging-enhanced `secrets` analyzer against one of the "green commits" to find out why it doesn't fail for any of them, and found that the problem does not lie in "broken"/"correct" source code in the branch but is related to the runner node load during job execution (CPU load, IOPS, etc.).
- Thanks to Ben Kochie from the ~Monitor team, we've found that the problem is most likely disk I/O contention on the runner nodes that execute SAST jobs. This graph shows that the runner node CPU spends most of its time in `iowait` mode (see the `iowait` sampling sketch after this list).
- According to the Infra team (Slack message, internal): "We don't do any specific configuration of disks. We're using GCP VMs for hosting the containers where jobs are being executed. The only difference is that for `gitlab-shared-runners-manager-X` and `private-runners-manager-X` autoscaled machines we're using SSD, while for `shared-runners-managers-X` autoscaled machines we're using GCP's default."
- Spotted a SAST job that was executed successfully against EE at 2019-06-06 8:54 PM UTC, where the `secrets` analyzer did its job pretty quickly (in around 2 minutes). This job was run by `private-runners-manager-3.gitlab.com (#395503)`, which (as stated in the previous item) is backed by an SSD drive. Could be another small clue to understanding the cause of the issue.
- Did a small investigation to locate the source of the I/O bottleneck and to study job execution on different types of runner nodes.
- Suggested possible mitigation actions for the issue.
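The per-phase `log.Printf` instrumentation mentioned above can be as simple as a timing wrapper around each container lifecycle call. This is a sketch only; the `timed` helper and the stage names are illustrative, not the code actually used in the enhanced branches:

```go
package main

import (
	"log"
	"time"
)

// timed logs when a stage starts and how long it took, so the job trace shows
// which container lifecycle phase (create, start, wait) is the slow one.
func timed(stage string, fn func() error) error {
	log.Printf("%s: starting", stage)
	start := time.Now()
	err := fn()
	log.Printf("%s: finished in %s (err=%v)", stage, time.Since(start), err)
	return err
}

func main() {
	// The closures below stand in for the real Docker SDK calls
	// (ContainerCreate, ContainerStart, and waiting for completion).
	_ = timed("ContainerCreate", func() error { return nil })
	_ = timed("ContainerStart", func() error { return nil })
	_ = timed("ContainerWait", func() error { return nil })
}
```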
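To confirm the `iowait` observation directly on a runner node, one can sample the aggregate `cpu` line of `/proc/stat` twice and compute the share of time spent waiting on I/O. This is a sketch assuming a Linux node; the field layout follows proc(5):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// cpuSample returns total and iowait jiffies from the aggregate "cpu" line of
// /proc/stat. Field order per proc(5): user nice system idle iowait irq softirq steal ...
func cpuSample() (total, iowait uint64, err error) {
	f, err := os.Open("/proc/stat")
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 6 || fields[0] != "cpu" {
			continue
		}
		for i, v := range fields[1:] {
			n, err := strconv.ParseUint(v, 10, 64)
			if err != nil {
				return 0, 0, err
			}
			total += n
			if i == 4 { // fifth value after "cpu" is iowait
				iowait = n
			}
		}
		return total, iowait, nil
	}
	return 0, 0, fmt.Errorf("no cpu line in /proc/stat")
}

func main() {
	t1, w1, err := cpuSample()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	time.Sleep(5 * time.Second)
	t2, w2, err := cpuSample()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Percentage of CPU time spent waiting on I/O during the sample window.
	fmt.Printf("iowait: %.1f%%\n", 100*float64(w2-w1)/float64(t2-t1))
}
```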