SAST job is timing out

Description

As can be seen here: https://gitlab.com/gitlab-org/gitlab-ce/-/jobs/201117795

For SAST, it seems we pull a few Docker images:

  • Status: Downloaded newer image for registry.gitlab.com/gitlab-org/security-products/sast:11-10-stable => 5.38 MiB
  • Status: Downloaded newer image for registry.gitlab.com/gitlab-org/security-products/analyzers/brakeman:latest => 334.75 MiB
  • Status: Downloaded newer image for registry.gitlab.com/gitlab-org/security-products/analyzers/nodejs-scan:latest => 29.82 MiB
  • Status: Downloaded newer image for registry.gitlab.com/gitlab-org/security-products/analyzers/eslint:latest => 34.46 MiB
  • Status: Downloaded newer image for registry.gitlab.com/gitlab-org/security-products/analyzers/secrets:latest => 41.65 MiB

And then the last analyzer seems to time out somehow (the message comes from https://gitlab.com/gitlab-org/security-products/analyzers/common/blob/b22938f24b108a128bc8bb3ffe278f87a4bedca9/orchestrator/orchestrator.go#L155):

2019/04/23 19:29:35 Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2019/04/23 19:29:36 Timeout when running the analyzer
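
For context, the message above comes from the orchestrator's "run analyzer" phase, which races the analyzer run against a configurable deadline. Below is a minimal sketch of that pattern, not the actual common-library code: runAnalyzer and the 10-minute default are made up for illustration, but the handling of SAST_RUN_ANALYZER_TIMEOUT mirrors how a Go duration-style value (like the 1h used in the workaround below) could be consumed.

// timeout_sketch.go: minimal illustration (not the real orchestrator code)
// of how a configurable timeout can cut off a long-running analyzer step.
package main

import (
	"context"
	"errors"
	"log"
	"os"
	"time"
)

// runAnalyzer is a stand-in for the real "run analyzer" phase, which pulls
// the analyzer image, creates the container and waits for it to finish.
func runAnalyzer(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Minute): // pretend the analyzer takes 2 minutes
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Hypothetical default; the real default lives in the common library.
	timeout := 10 * time.Minute
	if v := os.Getenv("SAST_RUN_ANALYZER_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			timeout = d
		}
	}

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	if err := runAnalyzer(ctx); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			// Corresponds to the "Timeout when running the analyzer"
			// message seen in the job log above.
			log.Fatal("Timeout when running the analyzer")
		}
		log.Fatal(err)
	}
	log.Print("analyzer finished")
}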

Workaround

One can increase the timeout SAST uses when running the analyzer image with the SAST_RUN_ANALYZER_TIMEOUT environment variable (the "run analyzer" phase is where the sast job spends its time hanging).

Here's how to use it in the .gitlab-ci.yml config file:

include:
  template: SAST.gitlab-ci.yml

sast:
  variables:
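    # allow each analyzer container up to 1 hour before SAST gives up on it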
    SAST_RUN_ANALYZER_TIMEOUT: 1h

Remember, though, that the total job run time cannot exceed the job timeout set in the project or runner configuration.

Status

Investigation complete.

It has become clear that the problem is very likely disk I/O contention on the runner nodes that execute SAST jobs. The full research results, conclusions, and possible mitigation actions are presented below.

Current findings

  1. This bug is 100% reproducible on some commits. I've snapshotted a couple of them as branches (example) and am running my investigations against them; why some commits cause this timeout while others don't is still unclear.
  2. The problem is 100% related to the secrets analyzer: running SAST with secrets removed from SAST_DEFAULT_ANALYZERS always succeeds.
  3. I've enhanced test versions of the common library and the secrets analyzer with log.Printf statements, which proved that the container is created and started successfully; the error happens while waiting for its completion (SAST branch that uses enhanced dependencies). A sketch of this kind of timing instrumentation is included after this list.
  4. However, the logs from the secrets analyzer haven't been copied to the SAST output stream, so I'm currently unable to tell from the logs which phase of the secrets analysis causes the timeout.
  5. Running on my local machine (MacBook Pro, 16 GB RAM, 2.2 GHz Intel Core i7) against the problematic branch, it took 1 minute from ContainerStart to displaying the results of the secrets analyzer run, and the sast job succeeded.
  6. I'm concerned about the gap between ContainerCreate and ContainerStart: at GitLab.com for gitlab-ce it takes about 2 minutes. Maybe the delays are connected to the general load on our Docker infrastructure 🤔.
  7. Managed to run the job to completion by setting the analyzer container run timeout to 100 minutes.
  8. Tried running the logging-enhanced secrets analyzer against one of the "green commits" to find out why it doesn't fail for any of them, and found that the problem does not lie in "broken"/"correct" source code in the branch but is related to the runner node load during job execution (CPU load, IOPS, etc.).
  9. Thanks to Ben Kochie from the ~Monitor team, we've found that the problem is very likely disk I/O contention on the runner nodes that execute SAST jobs. This graph shows that the runner node CPU spends most of its time in iowait mode; a small sketch for checking this directly on a node is also included after the list.
  10. According to the Infra team (Slack message, internal),

We don't do any specific configuration of disks. We're using GCP VMs for hosting the containers where jobs are being executed. The only difference is that for gitlab-shared-runners-manager-X and private-runners-manager-X autoscaled machines we're using SSD, while for shared-runners-managers-X autoscaled machines we're using GCP's default.

  11. Spotted a SAST job that executed successfully against EE at 2019-06-06 8:54 PM UTC, where the secrets analyzer did its job pretty quickly (in around 2 minutes). This job was run by private-runners-manager-3.gitlab.com (#395503), which in turn (as stated in the previous item) is backed by an SSD drive. Could be another small clue to understanding the cause of the issue.
  12. Did a small investigation to locate the source of the I/O bottleneck and to study job execution on different types of runner nodes.
  13. Suggested possible mitigation actions for the issue.
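
For reference, below is the kind of timing instrumentation mentioned in items 3 and 6. This is not the patched common library (which talks to the Docker API through the Go SDK); it is a minimal sketch that shells out to the docker CLI so it stays self-contained, and it simply timestamps the create, start and wait phases separately so the log shows where the time goes. Using the secrets image here is only an example; without a project mounted it won't do a meaningful scan.

// timing_sketch.go: illustration of the log.Printf instrumentation described
// above. The real change was made inside the analyzers' common library; this
// sketch drives the docker CLI instead, just to keep the example runnable.
package main

import (
	"log"
	"os/exec"
	"strings"
	"time"
)

// timed runs a docker subcommand and logs how long it took, mirroring the
// ContainerCreate / ContainerStart / wait-for-completion phases.
func timed(phase string, args ...string) string {
	start := time.Now()
	out, err := exec.Command("docker", args...).CombinedOutput()
	if err != nil {
		log.Fatalf("%s failed after %s: %v\n%s", phase, time.Since(start), err, out)
	}
	log.Printf("%s took %s", phase, time.Since(start))
	return strings.TrimSpace(string(out))
}

func main() {
	// Example image only; the real jobs mount the project into the container.
	image := "registry.gitlab.com/gitlab-org/security-products/analyzers/secrets:latest"

	// Phase 1: create the container (equivalent of ContainerCreate).
	id := timed("create", "create", image)

	// Phase 2: start it (equivalent of ContainerStart).
	timed("start", "start", id)

	// Phase 3: wait for completion - this is the phase that was timing out.
	timed("wait", "wait", id)

	// Phase 4: fetch the analyzer's own logs, which were missing from the
	// SAST output stream in the failing jobs (item 4 above).
	log.Print(timed("logs", "logs", id))
}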
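
And for item 9, here is a minimal sketch of how the same iowait signal can be eyeballed directly on a Linux runner node by sampling /proc/stat twice; the actual graph referenced above came from our monitoring stack, so this is only an illustration of the underlying metric.

// iowait_sketch.go: report the share of CPU time spent in iowait over a
// 5-second window, based on the aggregate "cpu" line of /proc/stat.
package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
	"time"
)

// readCPUSample returns the per-state tick counters from the aggregate
// "cpu" line of /proc/stat (user, nice, system, idle, iowait, ...).
func readCPUSample() ([]uint64, error) {
	data, err := os.ReadFile("/proc/stat")
	if err != nil {
		return nil, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) > 5 && fields[0] == "cpu" {
			ticks := make([]uint64, 0, len(fields)-1)
			for _, f := range fields[1:] {
				v, err := strconv.ParseUint(f, 10, 64)
				if err != nil {
					return nil, err
				}
				ticks = append(ticks, v)
			}
			return ticks, nil
		}
	}
	return nil, fmt.Errorf("no aggregate cpu line found in /proc/stat")
}

func main() {
	before, err := readCPUSample()
	if err != nil {
		log.Fatal(err)
	}
	time.Sleep(5 * time.Second)
	after, err := readCPUSample()
	if err != nil {
		log.Fatal(err)
	}

	var total uint64
	for i := range after {
		total += after[i] - before[i]
	}
	// The fifth counter on the cpu line (index 4 here) is iowait.
	iowait := after[4] - before[4]

	fmt.Printf("iowait share over the last 5s: %.1f%%\n",
		100*float64(iowait)/float64(total))
}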