Some Docker GitLab Runner jobs fail at 4:20 AM

Summary

I run a scheduled pipeline on a GitLab Runner every night. Some of the jobs started at 4:20 AM fail, every night. However, not all of them fail, and it's not always the same jobs or the same number of jobs that fail. The error message (in journalctl) says:

ERROR: Job failed (system failure): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (executor_docker.go:837:120s) duration=3m29.686047725s
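
To check whether the failures really cluster around 4:20 AM, the journal output can be bucketed by minute. The following is only a minimal sketch: it assumes journalctl's default "short" output format (lines starting with a timestamp such as `Aug 01 04:20:15`), and the function name `failures_per_minute` is made up for illustration.

```python
import re
import sys
from collections import Counter

# Fragment of the error message quoted above.
ERROR_TEXT = "Cannot connect to the Docker daemon"

# Assumption: journalctl "short" format, e.g. "Aug 01 04:20:15 host unit[pid]: message".
# The capture group keeps only HH:MM so failures are grouped per minute.
TIMESTAMP = re.compile(r"^\w{3}\s+\d{1,2}\s+(\d{2}:\d{2}):\d{2}\s")


def failures_per_minute(lines):
    """Count log lines containing the Docker daemon error, keyed by HH:MM."""
    counts = Counter()
    for line in lines:
        if ERROR_TEXT not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Reads journal lines from standard input and prints one "HH:MM count" row per minute.
    for minute, count in sorted(failures_per_minute(sys.stdin).items()):
        print(minute, count)
```

It could be fed with something like `journalctl -u gitlab-runner | python3 failures.py`, where both the unit name `gitlab-runner` and the file name `failures.py` are assumptions about the local setup.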

Steps to reproduce

Configure a scheduled pipeline to run every day at 4 AM. Make sure to schedule enough jobs that the runner is still busy at 4:20 AM. In my case, there are 200 jobs that do basically nothing but echo a line to the console. The .gitlab-ci.yml looks like this:

.gitlab-ci.yml
stages:
  - build

.testjob: &testjob
  image: maven:3-jdk-8
  stage: build
  script:
    - echo DNS is working
    
Job1:
  <<: *testjob
  
Job2:
  <<: *testjob
  
Job3:
  <<: *testjob
  
Job4:
  <<: *testjob
  
Job5:
  <<: *testjob
 
Job6:
  <<: *testjob
  
Job7:
  <<: *testjob
  
Job8:
  <<: *testjob
  
Job9:
  <<: *testjob
  
Job10:
  <<: *testjob
  
Job11:
  <<: *testjob
  
Job12:
  <<: *testjob
  
Job13:
  <<: *testjob
  
Job14:
  <<: *testjob
  
Job15:
  <<: *testjob
  
Job16:
  <<: *testjob

Job17:
  <<: *testjob

Job18:
  <<: *testjob

Job19:
  <<: *testjob

Job20:
  <<: *testjob

Job21:
  <<: *testjob

Job22:
  <<: *testjob

Job23:
  <<: *testjob

Job24:
  <<: *testjob

Job25:
  <<: *testjob

Job26:
  <<: *testjob

Job27:
  <<: *testjob

Job28:
  <<: *testjob

Job29:
  <<: *testjob

Job30:
  <<: *testjob

Job31:
  <<: *testjob

Job32:
  <<: *testjob

Job33:
  <<: *testjob

Job34:
  <<: *testjob

Job35:
  <<: *testjob

Job36:
  <<: *testjob

Job37:
  <<: *testjob

Job38:
  <<: *testjob

Job39:
  <<: *testjob

Job40:
  <<: *testjob

Job41:
  <<: *testjob

Job42:
  <<: *testjob

Job43:
  <<: *testjob

Job44:
  <<: *testjob

Job45:
  <<: *testjob

Job46:
  <<: *testjob

Job47:
  <<: *testjob

Job48:
  <<: *testjob

Job49:
  <<: *testjob

Job50:
  <<: *testjob

Job51:
  <<: *testjob

Job52:
  <<: *testjob

Job53:
  <<: *testjob

Job54:
  <<: *testjob

Job55:
  <<: *testjob

Job56:
  <<: *testjob

Job57:
  <<: *testjob

Job58:
  <<: *testjob

Job59:
  <<: *testjob

Job60:
  <<: *testjob

Job61:
  <<: *testjob

Job62:
  <<: *testjob

Job63:
  <<: *testjob

Job64:
  <<: *testjob

Job65:
  <<: *testjob

Job66:
  <<: *testjob

Job67:
  <<: *testjob

Job68:
  <<: *testjob

Job69:
  <<: *testjob

Job70:
  <<: *testjob

Job71:
  <<: *testjob

Job72:
  <<: *testjob

Job73:
  <<: *testjob

Job74:
  <<: *testjob

Job75:
  <<: *testjob

Job76:
  <<: *testjob

Job77:
  <<: *testjob

Job78:
  <<: *testjob

Job79:
  <<: *testjob

Job80:
  <<: *testjob

Job81:
  <<: *testjob

Job82:
  <<: *testjob

Job83:
  <<: *testjob

Job84:
  <<: *testjob

Job85:
  <<: *testjob

Job86:
  <<: *testjob

Job87:
  <<: *testjob

Job88:
  <<: *testjob

Job89:
  <<: *testjob

Job90:
  <<: *testjob

Job91:
  <<: *testjob

Job92:
  <<: *testjob

Job93:
  <<: *testjob

Job94:
  <<: *testjob

Job95:
  <<: *testjob

Job96:
  <<: *testjob

Job97:
  <<: *testjob

Job98:
  <<: *testjob

Job99:
  <<: *testjob

Job100:
  <<: *testjob

Job101:
  <<: *testjob

Job102:
  <<: *testjob

Job103:
  <<: *testjob

Job104:
  <<: *testjob

Job105:
  <<: *testjob

Job106:
  <<: *testjob

Job107:
  <<: *testjob

Job108:
  <<: *testjob

Job109:
  <<: *testjob

Job110:
  <<: *testjob

Job111:
  <<: *testjob

Job112:
  <<: *testjob

Job113:
  <<: *testjob

Job114:
  <<: *testjob

Job115:
  <<: *testjob

Job116:
  <<: *testjob

Job117:
  <<: *testjob

Job118:
  <<: *testjob

Job119:
  <<: *testjob

Job120:
  <<: *testjob

Job121:
  <<: *testjob

Job122:
  <<: *testjob

Job123:
  <<: *testjob

Job124:
  <<: *testjob

Job125:
  <<: *testjob

Job126:
  <<: *testjob

Job127:
  <<: *testjob

Job128:
  <<: *testjob

Job129:
  <<: *testjob

Job130:
  <<: *testjob

Job131:
  <<: *testjob

Job132:
  <<: *testjob

Job133:
  <<: *testjob

Job134:
  <<: *testjob

Job135:
  <<: *testjob

Job136:
  <<: *testjob

Job137:
  <<: *testjob

Job138:
  <<: *testjob

Job139:
  <<: *testjob

Job140:
  <<: *testjob

Job141:
  <<: *testjob

Job142:
  <<: *testjob

Job143:
  <<: *testjob

Job144:
  <<: *testjob

Job145:
  <<: *testjob

Job146:
  <<: *testjob

Job147:
  <<: *testjob

Job148:
  <<: *testjob

Job149:
  <<: *testjob

Job150:
  <<: *testjob

Job151:
  <<: *testjob

Job152:
  <<: *testjob

Job153:
  <<: *testjob

Job154:
  <<: *testjob

Job155:
  <<: *testjob

Job156:
  <<: *testjob

Job157:
  <<: *testjob

Job158:
  <<: *testjob

Job159:
  <<: *testjob

Job160:
  <<: *testjob

Job161:
  <<: *testjob

Job162:
  <<: *testjob

Job163:
  <<: *testjob

Job164:
  <<: *testjob

Job165:
  <<: *testjob

Job166:
  <<: *testjob

Job167:
  <<: *testjob

Job168:
  <<: *testjob

Job169:
  <<: *testjob

Job170:
  <<: *testjob

Job171:
  <<: *testjob

Job172:
  <<: *testjob

Job173:
  <<: *testjob

Job174:
  <<: *testjob

Job175:
  <<: *testjob

Job176:
  <<: *testjob

Job177:
  <<: *testjob

Job178:
  <<: *testjob

Job179:
  <<: *testjob

Job180:
  <<: *testjob

Job181:
  <<: *testjob

Job182:
  <<: *testjob

Job183:
  <<: *testjob

Job184:
  <<: *testjob

Job185:
  <<: *testjob

Job186:
  <<: *testjob

Job187:
  <<: *testjob

Job188:
  <<: *testjob

Job189:
  <<: *testjob

Job190:
  <<: *testjob

Job191:
  <<: *testjob

Job192:
  <<: *testjob

Job193:
  <<: *testjob

Job194:
  <<: *testjob

Job195:
  <<: *testjob

Job196:
  <<: *testjob

Job197:
  <<: *testjob

Job198:
  <<: *testjob

Job199:
  <<: *testjob

Job200:
  <<: *testjob
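
Since the 200 jobs are identical apart from their names, the file above does not need to be written by hand. A small generator along these lines should produce an equivalent file; this is a sketch, and the function name `generate_ci_yaml` and its `job_count` parameter are made up for illustration.

```python
def generate_ci_yaml(job_count: int = 200) -> str:
    """Return the text of a .gitlab-ci.yml with `job_count` identical echo jobs."""
    # Shared header: one stage plus the YAML anchor that every job merges in.
    lines = [
        "stages:",
        "  - build",
        "",
        ".testjob: &testjob",
        "  image: maven:3-jdk-8",
        "  stage: build",
        "  script:",
        "    - echo DNS is working",
        "",
    ]
    # One job per number, Job1 .. Job<job_count>, each followed by a blank line.
    for n in range(1, job_count + 1):
        lines += [f"Job{n}:", "  <<: *testjob", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the file so it can be redirected into .gitlab-ci.yml.
    print(generate_ci_yaml(), end="")
```

Redirecting the output, for example `python3 generate_ci.py > .gitlab-ci.yml` (the script file name is an assumption), makes it easy to vary the number of jobs while reproducing the problem.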

Actual behavior

Some - but not all and not always the same - jobs fail.

Expected behavior

All jobs should succeed.

Relevant logs and/or screenshots

job log
Running with gitlab-runner 12.1.0 (de7731dd)
  on XXXX XXXXXXX
Using Docker executor with image maven:3-jdk-8 ...
Pulling docker image maven:3-jdk-8 ...
Using docker image sha256:2fa604c5c53b6adaeab11550b71048693a4b018a0e7ac6a93af87b09b25100af for maven:3-jdk-8 ...
Running on runner-XXXXXXXX-project-32509-concurrent-0 via XXXX...
Fetching changes with git depth set to 50...
Reinitialized existing Git repository in /builds/se-arc/gitlab-runner-test/.git/
Checking out 5d7f6d66 as master...

Skipping Git submodules setup
ERROR: Job failed (system failure): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (executor_docker.go:837:120s)

Environment description

config.toml contents
[[runners]]                                                            
  name = "XXXX"                                                 
  url = "https://XXXXXXXXXXXXX.de/"                                  
  token = "XXXXXXXXXXXXXXXXX"                                       
  executor = "docker"                                                  
  [runners.docker]                                                     
    tls_verify = false                                                 
    image = "alpine:latest"                                            
    privileged = false                                                 
    disable_entrypoint_overwrite = false                               
    oom_kill_disable = false                                           
    disable_cache = false                                              
    volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/cache"]  
    extra_hosts = ["XXXXXXXXXXXX:XXX.XXX.XXX.XXX"]                
    shm_size = 0                                                       
  [runners.cache]                                                      
    [runners.cache.s3]                                                 
    [runners.cache.gcs]                                                
The volume `"/var/run/docker.sock:/var/run/docker.sock"` was added in an attempt to fix the problem. The problem occurs with and without this volume.

Used GitLab Runner version

Version:      12.1.0
Git revision: de7731dd
Git branch:   12-1-stable
GO version:   go1.8.7
Built:        2019-07-19T13:53:04+0000
OS/Arch:      linux/amd64

Possible fixes

According to the error message, the error was thrown by executor_docker.go:837:120s. However, I don't know how this relates to the bug, and I don't know of any workarounds.
