Job timeouts not working
Overview
Support has received a couple of tickets where users mention that their pipelines get stuck and consume all of their CI minutes. The important part of the problem is that the users have defined a timeout, but GitLab didn't enforce it.
Tickets:
- https://gitlab.zendesk.com/agent/tickets/118754
- https://gitlab.zendesk.com/agent/tickets/117285
- https://gitlab.zendesk.com/agent/tickets/127619
Open Issues: https://gitlab.com/gitlab-com/support-forum/issues/3402
There can be many reasons for a pipeline getting stuck, but IMO the timeout should always be enforced.
Reported Jobs
- https://gitlab.com/gitlab-org/gitlab-ce/-/jobs/203204467
- https://gitlab.com/gitlab-org/gitlab-ee/-/jobs/216753214
- https://gitlab.com/gitlab-org/gitlab-ce/-/jobs/211063165
- https://gitlab.com/gitlab-org/gitlab-ee/-/jobs/209633442
- https://gitlab.com/gitlab-org/gitlab-ce/-/jobs/211820906
- https://gitlab.com/gitlab-org/gitlab-ee/-/jobs/215973105
- 232766253
- 241575275
- 233413000
- 227197451
- 227305043
Things to investigate
- As suggested by @tmaczukin in #4147 (comment 186416120): in the past, for some of the jobs whose duration was counted in days, there were trace patch/job update requests logged with a 429 HTTP response code (which basically means that GitLab can't handle the volume of requests). We have a backoff mechanism in the requesting client. Maybe this is somehow locking the jobs? For example: the Runner tries to send the last update request after the job was canceled/timed out, but receives a 429. So it delays for a moment and retries, again getting a 429. The situation repeats (and the Runner makes the delays longer and longer) until it stops sending requests at all. (A sketch of this theory follows below.)
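To make this theory concrete, here is a minimal sketch (in Go, like the Runner itself, but not the Runner's actual client code) of how an ever-growing backoff on 429 responses could keep the final job update from ever landing. The URL and the backoff cap are assumptions for illustration:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// sendFinalUpdate illustrates the theory above: the last job-update
// request keeps receiving 429, the client backs off for longer and
// longer, and the "job finished" update effectively never reaches GitLab.
func sendFinalUpdate(url string) {
	backoff := time.Second
	const maxBackoff = 5 * time.Minute // assumed cap, for illustration only

	for {
		req, err := http.NewRequest(http.MethodPut, url, nil)
		if err != nil {
			return
		}
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode != http.StatusTooManyRequests {
				return // update finally accepted (or rejected for another reason)
			}
		}
		fmt.Printf("got 429, retrying in %s\n", backoff)
		time.Sleep(backoff)
		if backoff < maxBackoff {
			backoff *= 2 // delays grow, so updates become rarer and rarer
		}
	}
}

func main() {
	sendFinalUpdate("https://gitlab.example.com/api/v4/jobs/1") // hypothetical URL
}
```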
Dev log
Wednesday, September 25th 2019
I only had time for this for an hour and a half, but I got a good understanding of the issue and started debugging the theory proposed in #4147 (comment 186416120); it doesn't seem that plausible (not going to rule it out yet). It's very hard to tell without having a job that is in this state, because I can't look at logs or inside of the Runner state. One possible solution is to ask infra whether I can get access to logs older than 2 days, which could help.
I also took another look at the code for how timeouts/cancellation/patches are handled, and it does seem sane, but of course that doesn't tell us much.
Action plan for next time:
- Find recent jobs that behaved like this
- Get logs of those jobs with the help of infra
- More poking around with traces to try and reproduce the issue
Thursday, September 26th 2019
I managed to reproduce this by forcing the API to send 429 and waiting for the timeout to be exceeded: I ended up with a timed-out job on the Runner and a job still running on GitLab. But with our backoff logic we keep pinging the same endpoint until it responds, and as soon as I forced the API to return the correct status code, the job was canceled on GitLab. (A sketch of such a stub appears below.)
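The reproduction can be approximated with a stub coordinator like the following sketch (not the actual setup I used; the port and recovery endpoint are made up): it answers every job request with 429 until told to recover, after which the Runner's next successful request lets GitLab cancel the job.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

var healthy atomic.Bool // starts false: answer everything with 429

func main() {
	// Stand-in for the coordinator's job endpoints.
	http.HandleFunc("/api/v4/jobs/", func(w http.ResponseWriter, r *http.Request) {
		if !healthy.Load() {
			w.WriteHeader(http.StatusTooManyRequests) // force the backoff path
			return
		}
		w.WriteHeader(http.StatusAccepted) // 202, mirroring the accepted responses in the real logs
	})
	// Flip the stub back to healthy to observe the job getting canceled.
	http.HandleFunc("/recover", func(w http.ResponseWriter, r *http.Request) {
		healthy.Store(true)
	})
	http.ListenAndServe(":8093", nil)
}
```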
I've sent an access request to get to the archived logs on production.
All reports seem to be related to the docker+machine executor, and I only managed to reproduce this using that executor, by turning off/removing the machine that the Runner manager is using to run the job. This is a completely plausible scenario as well, since a VM can be deleted at any time. Unfortunately/fortunately I got the correct behavior, although the job kept running until the actual timeout was hit.
Monday, September 30th 2019
- Found a failed job, started digging into the logs
Tuesday, October 1st 2019
- Found one of the root causes
- Implemented a fix in !1609 (merged)
Wednesday, October 2nd 2019
- Found more jobs, such as 308722314; checking whether it follows the same pattern as the others
Thursday, October 3rd 2019
- Found more jobs, such as 309441198; checking whether it follows the same pattern as the others
Log analysis
Job ID: 307405134
Runner Manager: shared-runners-manager-6.gitlab.com
Found a new case, 307405134, which has been running for 5 hours and is in a `timedout` state:
shared-runners-manager-6.gitlab.com url=https://gitlab.com/xxx/xxx/-/jobs/307405134 state=timedout stage=download_artifacts executor_stage=docker_run duration=4h13m26.904174977s
No 429s; all of the requests were 202, meaning that trace patching wasn't the issue here:
Found one 403, while updating the job:
{"time":"2019-09-30T07:24:46.564Z","severity":"INFO","duration":10.87,"db":2.73,"view":8.139999999999999,"status":403,"method":"PUT","path":"/api/v4/jobs/307405134","params":[{"key":"info","value":"{\"name\"=>\"gitlab-runner\", \"version\"=>\"12.3.0\", \"revision\"=>\"a8a019e0\", \"platform\"=>\"linux\", \"architecture\"=>\"amd64\", \"executor\"=>\"docker+machine\", \"shell\"=>\"bash\", \"features\"=>{\"variables\"=>\"[FILTERED]\", \"image\"=>nil, \"services\"=>nil, \"artifacts\"=>nil, \"cache\"=>nil, \"shared\"=>nil, \"upload_multiple_artifacts\"=>nil, \"upload_raw_artifacts\"=>nil, \"session\"=>nil, \"terminal\"=>nil, \"refspecs\"=>nil, \"masking\"=>nil, \"proxy\"=>nil}}"},{"key":"token","value":"[FILTERED]"},{"key":"state","value":"running"}],"host":"gitlab.com","remote_ip":"35.207.54.55, 10.216.1.37, 35.207.54.55","ua":"gitlab-runner 12.3.0 (12-3-stable; go1.8.7; linux/amd64)","route":"/api/:version/jobs/:id","queue_duration":7.09,"correlation_id":"8N9vqrTmb66","tag":"rails.api","environment":"gprd","hostname":"api-06-sv-gprd","fqdn":"api-06-sv-gprd.c.gitlab-production.internal","message":null}
The job seems to have been aborted/canceled, but it is still running:
{"code":403,"job":307405134,"job-status":"canceled","level":"warning","msg":"Submitting job to coordinator... aborted","runner":"ed2dce3a","tag":"runner","environment":"ci-prd","hostname":"shared-runners-manager-6","fqdn":"shared-runners-manager-6.gitlab.com"}
Docker machine creation:
{"created":"2019-09-30T06:58:36.228363945Z","docker":"tcp://10.142.2.217:2376","job":307405134,"level":"info","msg":"Starting docker-machine build...","name":"runner-ed2dce3a-srm-1569826716-5afeb19d","now":"2019-09-30T06:59:40.149131002Z","project":9162288,"runner":"ed2dce3a","usedcount":1,"tag":"runner","environment":"ci-prd","hostname":"shared-runners-manager-6","fqdn":"shared-runners-manager-6.gitlab.com"}
The following logs relate to the machine creation:
Job with ignored timeout | Normal job that finished |
---|---|
`docker ps` on the machine:
cb80faed8e22 quay.io/gitlab/gitlab-runner-docker-cleanup:latest "go-wrapper run" 5 weeks ago Up 5 hours gitlab-runner-docker-cleanup
813ce036918e registry.gitlab.com/gitlab-org/xxxxx/xxxx:0.3 "/sbin/init" 5 weeks ago Up 5 hours xxxxxxxxxxxxx
b7cd3fd6ce04 quay.io/prometheus/node-exporter:v0.18.1 "/bin/node_exporter …" 5 weeks ago Up 5 hours node-exporter
Metrics output:
# HELP node_boot_time_seconds Node boot time, in unixtime.
# TYPE node_boot_time_seconds gauge
node_boot_time_seconds 1.569826727e+09
# HELP node_context_switches_total Total number of context switches.
# TYPE node_context_switches_total counter
node_context_switches_total 1.0390607e+07
# HELP node_cpu_guest_seconds_total Seconds the cpus spent in guests (VMs) for each mode.
# TYPE node_cpu_guest_seconds_total counter
node_cpu_guest_seconds_total{cpu="0",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="0",mode="user"} 0
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 17345.03
node_cpu_seconds_total{cpu="0",mode="iowait"} 16.79
node_cpu_seconds_total{cpu="0",mode="irq"} 33.33
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="softirq"} 2.34
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 70.74
node_cpu_seconds_total{cpu="0",mode="user"} 158.84
# HELP node_disk_discard_time_seconds_total This is the total number of seconds spent by all discards.
# TYPE node_disk_discard_time_seconds_total counter
node_disk_discard_time_seconds_total{device="dm-0"} 0
node_disk_discard_time_seconds_total{device="sda"} 0
# HELP node_disk_discarded_sectors_total The total number of sectors discarded successfully.
# TYPE node_disk_discarded_sectors_total counter
node_disk_discarded_sectors_total{device="dm-0"} 0
node_disk_discarded_sectors_total{device="sda"} 0
# HELP node_disk_discards_completed_total The total number of discards completed successfully.
# TYPE node_disk_discards_completed_total counter
node_disk_discards_completed_total{device="dm-0"} 0
node_disk_discards_completed_total{device="sda"} 0
# HELP node_disk_discards_merged_total The total number of discards merged.
# TYPE node_disk_discards_merged_total counter
node_disk_discards_merged_total{device="dm-0"} 0
node_disk_discards_merged_total{device="sda"} 0
# HELP node_disk_io_now The number of I/Os currently in progress.
# TYPE node_disk_io_now gauge
node_disk_io_now{device="dm-0"} 0
node_disk_io_now{device="sda"} 0
# HELP node_disk_io_time_seconds_total Total seconds spent doing I/Os.
# TYPE node_disk_io_time_seconds_total counter
node_disk_io_time_seconds_total{device="dm-0"} 7.521
node_disk_io_time_seconds_total{device="sda"} 32.439
# HELP node_disk_io_time_weighted_seconds_total The weighted # of seconds spent doing I/Os.
# TYPE node_disk_io_time_weighted_seconds_total counter
node_disk_io_time_weighted_seconds_total{device="dm-0"} 22.704
node_disk_io_time_weighted_seconds_total{device="sda"} 437.327
# HELP node_disk_read_bytes_total The total number of bytes read successfully.
# TYPE node_disk_read_bytes_total counter
node_disk_read_bytes_total{device="dm-0"} 1.65404672e+08
node_disk_read_bytes_total{device="sda"} 4.0217088e+08
# HELP node_disk_read_time_seconds_total The total number of seconds spent by all reads.
# TYPE node_disk_read_time_seconds_total counter
node_disk_read_time_seconds_total{device="dm-0"} 22.704
node_disk_read_time_seconds_total{device="sda"} 47.182
# HELP node_disk_reads_completed_total The total number of reads completed successfully.
# TYPE node_disk_reads_completed_total counter
node_disk_reads_completed_total{device="dm-0"} 3563
node_disk_reads_completed_total{device="sda"} 12694
# HELP node_disk_reads_merged_total The total number of reads merged.
# TYPE node_disk_reads_merged_total counter
node_disk_reads_merged_total{device="dm-0"} 0
node_disk_reads_merged_total{device="sda"} 2688
# HELP node_disk_write_time_seconds_total This is the total number of seconds spent by all writes.
# TYPE node_disk_write_time_seconds_total counter
node_disk_write_time_seconds_total{device="dm-0"} 0
node_disk_write_time_seconds_total{device="sda"} 395.37600000000003
# HELP node_disk_writes_completed_total The total number of writes completed successfully.
# TYPE node_disk_writes_completed_total counter
node_disk_writes_completed_total{device="dm-0"} 0
node_disk_writes_completed_total{device="sda"} 9496
# HELP node_disk_writes_merged_total The number of writes merged.
# TYPE node_disk_writes_merged_total counter
node_disk_writes_merged_total{device="dm-0"} 0
node_disk_writes_merged_total{device="sda"} 16906
# HELP node_disk_written_bytes_total The total number of bytes written successfully.
# TYPE node_disk_written_bytes_total counter
node_disk_written_bytes_total{device="dm-0"} 0
node_disk_written_bytes_total{device="sda"} 6.15552512e+08
# HELP node_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which node_exporter was built.
# TYPE node_exporter_build_info gauge
node_exporter_build_info{branch="HEAD",goversion="go1.12.5",revision="3db77732e925c08f675d7404a8c46466b2ece83e",version="0.18.1"} 1
# HELP node_filefd_allocated File descriptor statistics: allocated.
# TYPE node_filefd_allocated gauge
node_filefd_allocated 800
# HELP node_filefd_maximum File descriptor statistics: maximum.
# TYPE node_filefd_maximum gauge
node_filefd_maximum 372928
# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/mapper/usr",fstype="ext4",mountpoint="/usr"} 1.99036928e+08
node_filesystem_avail_bytes{device="/dev/sda1",fstype="vfat",mountpoint="/boot"} 7.9199232e+07
node_filesystem_avail_bytes{device="/dev/sda6",fstype="ext4",mountpoint="/usr/share/oem"} 6.639616e+06
node_filesystem_avail_bytes{device="/dev/sda9",fstype="ext4",mountpoint="/"} 1.7441030144e+10
# HELP node_filesystem_device_error Whether an error occurred while getting statistics for the given device.
# TYPE node_filesystem_device_error gauge
node_filesystem_device_error{device="/dev/mapper/usr",fstype="ext4",mountpoint="/usr"} 0
node_filesystem_device_error{device="/dev/sda1",fstype="vfat",mountpoint="/boot"} 0
node_filesystem_device_error{device="/dev/sda6",fstype="ext4",mountpoint="/usr/share/oem"} 0
node_filesystem_device_error{device="/dev/sda9",fstype="ext4",mountpoint="/"} 0
node_filesystem_device_error{device="/dev/sda9",fstype="ext4",mountpoint="/var/lib/rkt/pods/run/56d4cf8d-58da-433d-8d08-bf85f410613e/stage1/rootfs/opt/stage2/oem-gce/rootfs/etc"} 1
node_filesystem_device_error{device="/dev/sda9",fstype="ext4",mountpoint="/var/lib/rkt/pods/run/56d4cf8d-58da-433d-8d08-bf85f410613e/stage1/rootfs/opt/stage2/oem-gce/rootfs/home"} 1
# HELP node_filesystem_files Filesystem total file nodes.
# TYPE node_filesystem_files gauge
node_filesystem_files{device="/dev/mapper/usr",fstype="ext4",mountpoint="/usr"} 260096
node_filesystem_files{device="/dev/sda1",fstype="vfat",mountpoint="/boot"} 0
node_filesystem_files{device="/dev/sda6",fstype="ext4",mountpoint="/usr/share/oem"} 32768
node_filesystem_files{device="/dev/sda9",fstype="ext4",mountpoint="/"} 5.928832e+06
# HELP node_filesystem_files_free Filesystem total free file nodes.
# TYPE node_filesystem_files_free gauge
node_filesystem_files_free{device="/dev/mapper/usr",fstype="ext4",mountpoint="/usr"} 239511
node_filesystem_files_free{device="/dev/sda1",fstype="vfat",mountpoint="/boot"} 0
node_filesystem_files_free{device="/dev/sda6",fstype="ext4",mountpoint="/usr/share/oem"} 32742
node_filesystem_files_free{device="/dev/sda9",fstype="ext4",mountpoint="/"} 5.835061e+06
# HELP node_filesystem_free_bytes Filesystem free space in bytes.
# TYPE node_filesystem_free_bytes gauge
node_filesystem_free_bytes{device="/dev/mapper/usr",fstype="ext4",mountpoint="/usr"} 2.52301312e+08
node_filesystem_free_bytes{device="/dev/sda1",fstype="vfat",mountpoint="/boot"} 7.9199232e+07
node_filesystem_free_bytes{device="/dev/sda6",fstype="ext4",mountpoint="/usr/share/oem"} 1.6031744e+07
node_filesystem_free_bytes{device="/dev/sda9",fstype="ext4",mountpoint="/"} 1.8492092416e+10
# HELP node_filesystem_readonly Filesystem read-only status.
# TYPE node_filesystem_readonly gauge
node_filesystem_readonly{device="/dev/mapper/usr",fstype="ext4",mountpoint="/usr"} 1
node_filesystem_readonly{device="/dev/sda1",fstype="vfat",mountpoint="/boot"} 0
node_filesystem_readonly{device="/dev/sda6",fstype="ext4",mountpoint="/usr/share/oem"} 0
node_filesystem_readonly{device="/dev/sda9",fstype="ext4",mountpoint="/"} 0
# HELP node_filesystem_size_bytes Filesystem size in bytes.
# TYPE node_filesystem_size_bytes gauge
node_filesystem_size_bytes{device="/dev/mapper/usr",fstype="ext4",mountpoint="/usr"} 1.03194624e+09
node_filesystem_size_bytes{device="/dev/sda1",fstype="vfat",mountpoint="/boot"} 1.32135936e+08
node_filesystem_size_bytes{device="/dev/sda6",fstype="ext4",mountpoint="/usr/share/oem"} 1.13229824e+08
node_filesystem_size_bytes{device="/dev/sda9",fstype="ext4",mountpoint="/"} 2.352863232e+10
# HELP node_forks_total Total number of forks.
# TYPE node_forks_total counter
node_forks_total 3401
# HELP node_intr_total Total number of interrupts serviced.
# TYPE node_intr_total counter
node_intr_total 3.774581e+06
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.09
# HELP node_load15 15m load average.
# TYPE node_load15 gauge
node_load15 0.01
# HELP node_load5 5m load average.
# TYPE node_load5 gauge
node_load5 0.04
# HELP node_memory_Active_anon_bytes Memory information field Active_anon_bytes.
# TYPE node_memory_Active_anon_bytes gauge
node_memory_Active_anon_bytes 5.23644928e+08
# HELP node_memory_Active_bytes Memory information field Active_bytes.
# TYPE node_memory_Active_bytes gauge
node_memory_Active_bytes 6.81238528e+08
# HELP node_memory_Active_file_bytes Memory information field Active_file_bytes.
# TYPE node_memory_Active_file_bytes gauge
node_memory_Active_file_bytes 1.575936e+08
# HELP node_memory_AnonHugePages_bytes Memory information field AnonHugePages_bytes.
# TYPE node_memory_AnonHugePages_bytes gauge
node_memory_AnonHugePages_bytes 2.70532608e+08
# HELP node_memory_AnonPages_bytes Memory information field AnonPages_bytes.
# TYPE node_memory_AnonPages_bytes gauge
node_memory_AnonPages_bytes 4.09952256e+08
# HELP node_memory_Bounce_bytes Memory information field Bounce_bytes.
# TYPE node_memory_Bounce_bytes gauge
node_memory_Bounce_bytes 0
# HELP node_memory_Buffers_bytes Memory information field Buffers_bytes.
# TYPE node_memory_Buffers_bytes gauge
node_memory_Buffers_bytes 3.766272e+07
# HELP node_memory_Cached_bytes Memory information field Cached_bytes.
# TYPE node_memory_Cached_bytes gauge
node_memory_Cached_bytes 1.091960832e+09
# HELP node_memory_CommitLimit_bytes Memory information field CommitLimit_bytes.
# TYPE node_memory_CommitLimit_bytes gauge
node_memory_CommitLimit_bytes 4.0846336e+09
# HELP node_memory_Committed_AS_bytes Memory information field Committed_AS_bytes.
# TYPE node_memory_Committed_AS_bytes gauge
node_memory_Committed_AS_bytes 1.478422528e+09
# HELP node_memory_DirectMap1G_bytes Memory information field DirectMap1G_bytes.
# TYPE node_memory_DirectMap1G_bytes gauge
node_memory_DirectMap1G_bytes 1.073741824e+09
# HELP node_memory_DirectMap2M_bytes Memory information field DirectMap2M_bytes.
# TYPE node_memory_DirectMap2M_bytes gauge
node_memory_DirectMap2M_bytes 2.854223872e+09
# HELP node_memory_DirectMap4k_bytes Memory information field DirectMap4k_bytes.
# TYPE node_memory_DirectMap4k_bytes gauge
node_memory_DirectMap4k_bytes 9.8553856e+07
# HELP node_memory_Dirty_bytes Memory information field Dirty_bytes.
# TYPE node_memory_Dirty_bytes gauge
node_memory_Dirty_bytes 0
# HELP node_memory_HardwareCorrupted_bytes Memory information field HardwareCorrupted_bytes.
# TYPE node_memory_HardwareCorrupted_bytes gauge
node_memory_HardwareCorrupted_bytes 0
# HELP node_memory_HugePages_Free Memory information field HugePages_Free.
# TYPE node_memory_HugePages_Free gauge
node_memory_HugePages_Free 0
# HELP node_memory_HugePages_Rsvd Memory information field HugePages_Rsvd.
# TYPE node_memory_HugePages_Rsvd gauge
node_memory_HugePages_Rsvd 0
# HELP node_memory_HugePages_Surp Memory information field HugePages_Surp.
# TYPE node_memory_HugePages_Surp gauge
node_memory_HugePages_Surp 0
# HELP node_memory_HugePages_Total Memory information field HugePages_Total.
# TYPE node_memory_HugePages_Total gauge
node_memory_HugePages_Total 0
# HELP node_memory_Hugepagesize_bytes Memory information field Hugepagesize_bytes.
# TYPE node_memory_Hugepagesize_bytes gauge
node_memory_Hugepagesize_bytes 2.097152e+06
# HELP node_memory_Hugetlb_bytes Memory information field Hugetlb_bytes.
# TYPE node_memory_Hugetlb_bytes gauge
node_memory_Hugetlb_bytes 0
# HELP node_memory_Inactive_anon_bytes Memory information field Inactive_anon_bytes.
# TYPE node_memory_Inactive_anon_bytes gauge
node_memory_Inactive_anon_bytes 1.04685568e+08
# HELP node_memory_Inactive_bytes Memory information field Inactive_bytes.
# TYPE node_memory_Inactive_bytes gauge
node_memory_Inactive_bytes 8.684544e+08
# HELP node_memory_Inactive_file_bytes Memory information field Inactive_file_bytes.
# TYPE node_memory_Inactive_file_bytes gauge
node_memory_Inactive_file_bytes 7.63768832e+08
# HELP node_memory_KernelStack_bytes Memory information field KernelStack_bytes.
# TYPE node_memory_KernelStack_bytes gauge
node_memory_KernelStack_bytes 2.617344e+06
# HELP node_memory_Mapped_bytes Memory information field Mapped_bytes.
# TYPE node_memory_Mapped_bytes gauge
node_memory_Mapped_bytes 1.3365248e+08
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 2.916696064e+09
# HELP node_memory_MemFree_bytes Memory information field MemFree_bytes.
# TYPE node_memory_MemFree_bytes gauge
node_memory_MemFree_bytes 2.172424192e+09
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 3.874308096e+09
# HELP node_memory_Mlocked_bytes Memory information field Mlocked_bytes.
# TYPE node_memory_Mlocked_bytes gauge
node_memory_Mlocked_bytes 0
# HELP node_memory_NFS_Unstable_bytes Memory information field NFS_Unstable_bytes.
# TYPE node_memory_NFS_Unstable_bytes gauge
node_memory_NFS_Unstable_bytes 0
# HELP node_memory_PageTables_bytes Memory information field PageTables_bytes.
# TYPE node_memory_PageTables_bytes gauge
node_memory_PageTables_bytes 5.59104e+06
# HELP node_memory_Percpu_bytes Memory information field Percpu_bytes.
# TYPE node_memory_Percpu_bytes gauge
node_memory_Percpu_bytes 462848
# HELP node_memory_SReclaimable_bytes Memory information field SReclaimable_bytes.
# TYPE node_memory_SReclaimable_bytes gauge
node_memory_SReclaimable_bytes 6.1435904e+07
# HELP node_memory_SUnreclaim_bytes Memory information field SUnreclaim_bytes.
# TYPE node_memory_SUnreclaim_bytes gauge
node_memory_SUnreclaim_bytes 2.8110848e+07
# HELP node_memory_ShmemHugePages_bytes Memory information field ShmemHugePages_bytes.
# TYPE node_memory_ShmemHugePages_bytes gauge
node_memory_ShmemHugePages_bytes 0
# HELP node_memory_ShmemPmdMapped_bytes Memory information field ShmemPmdMapped_bytes.
# TYPE node_memory_ShmemPmdMapped_bytes gauge
node_memory_ShmemPmdMapped_bytes 0
# HELP node_memory_Shmem_bytes Memory information field Shmem_bytes.
# TYPE node_memory_Shmem_bytes gauge
node_memory_Shmem_bytes 2.09473536e+08
# HELP node_memory_Slab_bytes Memory information field Slab_bytes.
# TYPE node_memory_Slab_bytes gauge
node_memory_Slab_bytes 8.9546752e+07
# HELP node_memory_SwapCached_bytes Memory information field SwapCached_bytes.
# TYPE node_memory_SwapCached_bytes gauge
node_memory_SwapCached_bytes 0
# HELP node_memory_SwapFree_bytes Memory information field SwapFree_bytes.
# TYPE node_memory_SwapFree_bytes gauge
node_memory_SwapFree_bytes 2.147479552e+09
# HELP node_memory_SwapTotal_bytes Memory information field SwapTotal_bytes.
# TYPE node_memory_SwapTotal_bytes gauge
node_memory_SwapTotal_bytes 2.147479552e+09
# HELP node_memory_Unevictable_bytes Memory information field Unevictable_bytes.
# TYPE node_memory_Unevictable_bytes gauge
node_memory_Unevictable_bytes 0
# HELP node_memory_VmallocChunk_bytes Memory information field VmallocChunk_bytes.
# TYPE node_memory_VmallocChunk_bytes gauge
node_memory_VmallocChunk_bytes 0
# HELP node_memory_VmallocTotal_bytes Memory information field VmallocTotal_bytes.
# TYPE node_memory_VmallocTotal_bytes gauge
node_memory_VmallocTotal_bytes 3.5184372087808e+13
# HELP node_memory_VmallocUsed_bytes Memory information field VmallocUsed_bytes.
# TYPE node_memory_VmallocUsed_bytes gauge
node_memory_VmallocUsed_bytes 0
# HELP node_memory_WritebackTmp_bytes Memory information field WritebackTmp_bytes.
# TYPE node_memory_WritebackTmp_bytes gauge
node_memory_WritebackTmp_bytes 0
# HELP node_memory_Writeback_bytes Memory information field Writeback_bytes.
# TYPE node_memory_Writeback_bytes gauge
node_memory_Writeback_bytes 0
# HELP node_network_receive_bytes_total Network device statistic receive_bytes.
# TYPE node_network_receive_bytes_total counter
node_network_receive_bytes_total{device="docker0"} 19250
node_network_receive_bytes_total{device="ens4v1"} 5.6529457e+07
node_network_receive_bytes_total{device="lo"} 51068
node_network_receive_bytes_total{device="veth0aad433"} 1426
# HELP node_network_receive_compressed_total Network device statistic receive_compressed.
# TYPE node_network_receive_compressed_total counter
node_network_receive_compressed_total{device="docker0"} 0
node_network_receive_compressed_total{device="ens4v1"} 0
node_network_receive_compressed_total{device="lo"} 0
node_network_receive_compressed_total{device="veth0aad433"} 0
# HELP node_network_receive_drop_total Network device statistic receive_drop.
# TYPE node_network_receive_drop_total counter
node_network_receive_drop_total{device="docker0"} 0
node_network_receive_drop_total{device="ens4v1"} 0
node_network_receive_drop_total{device="lo"} 0
node_network_receive_drop_total{device="veth0aad433"} 0
# HELP node_network_receive_errs_total Network device statistic receive_errs.
# TYPE node_network_receive_errs_total counter
node_network_receive_errs_total{device="docker0"} 0
node_network_receive_errs_total{device="ens4v1"} 0
node_network_receive_errs_total{device="lo"} 0
node_network_receive_errs_total{device="veth0aad433"} 0
# HELP node_network_receive_fifo_total Network device statistic receive_fifo.
# TYPE node_network_receive_fifo_total counter
node_network_receive_fifo_total{device="docker0"} 0
node_network_receive_fifo_total{device="ens4v1"} 0
node_network_receive_fifo_total{device="lo"} 0
node_network_receive_fifo_total{device="veth0aad433"} 0
# HELP node_network_receive_frame_total Network device statistic receive_frame.
# TYPE node_network_receive_frame_total counter
node_network_receive_frame_total{device="docker0"} 0
node_network_receive_frame_total{device="ens4v1"} 0
node_network_receive_frame_total{device="lo"} 0
node_network_receive_frame_total{device="veth0aad433"} 0
# HELP node_network_receive_multicast_total Network device statistic receive_multicast.
# TYPE node_network_receive_multicast_total counter
node_network_receive_multicast_total{device="docker0"} 0
node_network_receive_multicast_total{device="ens4v1"} 0
node_network_receive_multicast_total{device="lo"} 0
node_network_receive_multicast_total{device="veth0aad433"} 0
# HELP node_network_receive_packets_total Network device statistic receive_packets.
# TYPE node_network_receive_packets_total counter
node_network_receive_packets_total{device="docker0"} 327
node_network_receive_packets_total{device="ens4v1"} 17138
node_network_receive_packets_total{device="lo"} 293
node_network_receive_packets_total{device="veth0aad433"} 19
# HELP node_network_transmit_bytes_total Network device statistic transmit_bytes.
# TYPE node_network_transmit_bytes_total counter
node_network_transmit_bytes_total{device="docker0"} 1.3409502e+07
node_network_transmit_bytes_total{device="ens4v1"} 6.477103e+06
node_network_transmit_bytes_total{device="lo"} 51068
node_network_transmit_bytes_total{device="veth0aad433"} 3094
# HELP node_network_transmit_carrier_total Network device statistic transmit_carrier.
# TYPE node_network_transmit_carrier_total counter
node_network_transmit_carrier_total{device="docker0"} 0
node_network_transmit_carrier_total{device="ens4v1"} 0
node_network_transmit_carrier_total{device="lo"} 0
node_network_transmit_carrier_total{device="veth0aad433"} 0
# HELP node_network_transmit_colls_total Network device statistic transmit_colls.
# TYPE node_network_transmit_colls_total counter
node_network_transmit_colls_total{device="docker0"} 0
node_network_transmit_colls_total{device="ens4v1"} 0
node_network_transmit_colls_total{device="lo"} 0
node_network_transmit_colls_total{device="veth0aad433"} 0
# HELP node_network_transmit_compressed_total Network device statistic transmit_compressed.
# TYPE node_network_transmit_compressed_total counter
node_network_transmit_compressed_total{device="docker0"} 0
node_network_transmit_compressed_total{device="ens4v1"} 0
node_network_transmit_compressed_total{device="lo"} 0
node_network_transmit_compressed_total{device="veth0aad433"} 0
# HELP node_network_transmit_drop_total Network device statistic transmit_drop.
# TYPE node_network_transmit_drop_total counter
node_network_transmit_drop_total{device="docker0"} 0
node_network_transmit_drop_total{device="ens4v1"} 0
node_network_transmit_drop_total{device="lo"} 0
node_network_transmit_drop_total{device="veth0aad433"} 0
# HELP node_network_transmit_errs_total Network device statistic transmit_errs.
# TYPE node_network_transmit_errs_total counter
node_network_transmit_errs_total{device="docker0"} 0
node_network_transmit_errs_total{device="ens4v1"} 0
node_network_transmit_errs_total{device="lo"} 0
node_network_transmit_errs_total{device="veth0aad433"} 0
# HELP node_network_transmit_fifo_total Network device statistic transmit_fifo.
# TYPE node_network_transmit_fifo_total counter
node_network_transmit_fifo_total{device="docker0"} 0
node_network_transmit_fifo_total{device="ens4v1"} 0
node_network_transmit_fifo_total{device="lo"} 0
node_network_transmit_fifo_total{device="veth0aad433"} 0
# HELP node_network_transmit_packets_total Network device statistic transmit_packets.
# TYPE node_network_transmit_packets_total counter
node_network_transmit_packets_total{device="docker0"} 596
node_network_transmit_packets_total{device="ens4v1"} 13473
node_network_transmit_packets_total{device="lo"} 293
node_network_transmit_packets_total{device="veth0aad433"} 33
# HELP node_procs_blocked Number of processes blocked waiting for I/O to complete.
# TYPE node_procs_blocked gauge
node_procs_blocked 0
# HELP node_procs_running Number of processes in runnable state.
# TYPE node_procs_running gauge
node_procs_running 4
# HELP node_scrape_collector_duration_seconds node_exporter: Duration of a collector scrape.
# TYPE node_scrape_collector_duration_seconds gauge
node_scrape_collector_duration_seconds{collector="cpu"} 0.000450626
node_scrape_collector_duration_seconds{collector="diskstats"} 0.000150138
node_scrape_collector_duration_seconds{collector="filefd"} 7.226e-05
node_scrape_collector_duration_seconds{collector="filesystem"} 0.001281706
node_scrape_collector_duration_seconds{collector="loadavg"} 0.000104301
node_scrape_collector_duration_seconds{collector="meminfo"} 0.000855505
node_scrape_collector_duration_seconds{collector="netdev"} 0.00561608
node_scrape_collector_duration_seconds{collector="pressure"} 7.077e-05
node_scrape_collector_duration_seconds{collector="stat"} 0.00259188
node_scrape_collector_duration_seconds{collector="uname"} 1.4888e-05
# HELP node_scrape_collector_success node_exporter: Whether a collector succeeded.
# TYPE node_scrape_collector_success gauge
node_scrape_collector_success{collector="cpu"} 1
node_scrape_collector_success{collector="diskstats"} 1
node_scrape_collector_success{collector="filefd"} 1
node_scrape_collector_success{collector="filesystem"} 1
node_scrape_collector_success{collector="loadavg"} 1
node_scrape_collector_success{collector="meminfo"} 1
node_scrape_collector_success{collector="netdev"} 1
node_scrape_collector_success{collector="pressure"} 1
node_scrape_collector_success{collector="stat"} 1
node_scrape_collector_success{collector="uname"} 1
# HELP node_uname_info Labeled system information as provided by the uname system call.
# TYPE node_uname_info gauge
node_uname_info{domainname="(none)",machine="x86_64",nodename="packer-5d5e7d8e-b691-905d-70bb-659122f024a9.c.gitlab-ci-155816.i",release="4.19.23-coreos-r1",sysname="Linux",version="#1 SMP Mon Feb 25 23:40:01 -00 2019"} 1
Interesting error messages:
"error":"read |0: file already closed","level":"warning","msg":"Problem while reading command output","tag":"runner","environment":"ci-prd","hostname":"shared-runners-manager-6","fqdn":"shared-runners-manager-6.gitlab.com"}
DB info:
=> #<Ci::Build id: 307405134, status: "canceled", finished_at: "2019-09-30 07:24:42", trace: nil, created_at: "2019-09-30 06:59:38", updated_at: "2019-09-30 07:24:42", started_at: "2019-09-30 06:59:38", runner_id: 380987, coverage: nil, commit_id: 85398094, name: "build", options: {:image=>{:name=>"xxx"}, :artifacts=>{:paths=>["dist"]}, :script=>[xxx"]}, allow_failure: false, stage: "build", trigger_request_id: nil, stage_idx: 0, tag: false, ref: "master", user_id: xxx, type: "Ci::Build", target_url: nil, description: nil, project_id: xxx, erased_by_id: nil, erased_at: nil, artifacts_expire_at: nil, environment: nil, when: "on_success", yaml_variables: [], queued_at: "2019-09-30 06:59:38", token: nil, lock_version: 3, coverage_regex: nil, auto_canceled_by_id: nil, retried: true, stage_id: 150460386, protected: true, failure_reason: nil, scheduled_at: nil, token_encrypted: "xxxx", upstream_pipeline_id: nil, tag_list: nil>
Job ID: 304673282
Runner Manager: shared-runners-manager-4.gitlab.com
Rails Logs:
Traces:
There seems to be a large number of PUT requests.
Runner Logs:
All logs related to the Job ID:
All logs related to Runner machine:
DB info:
=> #<Ci::Build id: 304673282, status: "running", finished_at: nil, trace: nil, created_at: "2019-09-26 10:15:37", updated_at: "2019-09-30 12:23:33", started_at: "2019-09-30 08:19:40", runner_id: 44949, coverage: nil, commit_id: 84639452, name: "container_scanning", options: {:image=>{:name=>"docker:stable"}, :services=>[{:name=>"docker:stable-dind"}], :artifacts=>{:reports=>{:container_scanning=>["gl-container-scanning-report.json"]}}, :dependencies=>[], :before_script=>["apk update && apk add git", "export TAG=\"${CI_PIPELINE_ID}-$(git log --oneline HEAD | wc -l | sed 's/^ *//')-$(git rev-parse --short=10 HEAD)\"", "export IMAGE=\"registry.sandbox.glympse.com/enroute/web\"", "export GO111MODULE=on", "which ssh-agent || (apk update && apk add openssh-client)", "eval $(ssh-agent -s)", "echo \"$SSH_PRIVATE_KEY\" | tr -d '\\r' | ssh-add - > /dev/null", "mkdir -p ~/.ssh", "chmod 700 ~/.ssh", "echo -e \"Host github.com\\n\\tStrictHostKeyChecking no\\n\" >> ~/.ssh/config", "echo -e \"Host gitlab.com\\n\\tStrictHostKeyChecking no\\n\" >> ~/.ssh/config", "git config --global --add url.\"git@github.com:\".insteadOf \"https://github.com\"", "git config --global --add url.\"git@gitlab.com:\".insteadOf \"https://gitlab.com\"", "cat ~/.gitconfig", "git config --global url.\"https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com\".insteadOf \"https://gitlab.com\""], :script=>["docker login -u gitlab-ci-token -p $CI_JOB_TOKEN $CI_REGISTRY", "docker run -d --name db arminc/clair-db:latest", "docker run -p 6060:6060 --link db:postgres -d --name clair --restart on-failure arminc/clair-local-scan:v2.0.1", "apk add -U wget ca-certificates", "docker pull ${CI_APPLICATION_REPOSITORY}:${CI_APPLICATION_TAG}", "wget https://github.com/arminc/clair-scanner/releases/download/v8/clair-scanner_linux_amd64", "mv clair-scanner_linux_amd64 clair-scanner", "chmod +x clair-scanner", "touch clair-whitelist.yml", "while( ! wget -q -O /dev/null http://docker:6060/v1/namespaces ) ; do sleep 1 ; done", "retries=0", "echo \"Waiting for clair daemon to start\"", "while( ! wget -T 10 -q -O /dev/null http://docker:6060/v1/namespaces ) ; do sleep 1 ; echo -n \".\" ; if [ $retries -eq 10 ] ; then echo \" Timeout, aborting.\" ; exit 1 ; fi ; retries=$(($retries+1)) ; done", "./clair-scanner -c http://docker:6060 --ip $(hostname -i) -r gl-container-scanning-report.json -l clair.log -w clair-whitelist.yml ${CI_APPLICATION_REPOSITORY}:${CI_APPLICATION_TAG} || true"]}, allow_failure: true, stage: "security", trigger_request_id: nil, stage_idx: 3, tag: false, ref: "167932637_pin_reset_flow_ui", user_id: xxxx, type: "Ci::Build", target_url: nil, description: nil, project_id: xxx, erased_by_id: nil, erased_at: nil, artifacts_expire_at: nil, environment: nil, when: "on_success", yaml_variables: [{:key=>"LICENSE_MANAGEMENT_SETUP_CMD", :value=>"", :public=>true}, {:key=>"DOCKER_DRIVER", :value=>"overlay2", :public=>true}, {:key=>"CI_APPLICATION_REPOSITORY", :value=>"$CI_REGISTRY_IMAGE/$CI_COMMIT_REF_SLUG", :public=>true}, {:key=>"CI_APPLICATION_TAG", :value=>"$CI_COMMIT_SHA", :public=>true}, {:key=>"GIT_STRATEGY", :value=>"none", :public=>true}], queued_at: "2019-09-30 08:19:40", token: nil, lock_version: 4, coverage_regex: nil, auto_canceled_by_id: nil, retried: nil, stage_id: 148915407, protected: false, failure_reason: nil, scheduled_at: nil, token_encrypted: "xxx", upstream_pipeline_id: nil, tag_list: nil>
Looking right now (October 1st, 07:20 UTC), GitLab Runner is still sending updates for this job to /api/v4/jobs/304673282, and the state is still set to `running`:
Request payload GitLab Runner is sending to the GitLab coordinator:
{
"key": "info",
"value": "{\"name\"=>\"gitlab-runner\", \"version\"=>\"12.3.0\", \"revision\"=>\"a8a019e0\", \"platform\"=>\"linux\", \"architecture\"=>\"amd64\", \"executor\"=>\"docker+machine\", \"shell\"=>\"bash\", \"features\"=>{\"variables\"=>\"[FILTERED]\", \"image\"=>nil, \"services\"=>nil, \"artifacts\"=>nil, \"cache\"=>nil, \"shared\"=>nil, \"upload_multiple_artifacts\"=>nil, \"upload_raw_artifacts\"=>nil, \"session\"=>nil, \"terminal\"=>nil, \"refspecs\"=>nil, \"masking\"=>nil, \"proxy\"=>nil}}"
},
{
"key": "token",
"value": "[FILTERED]"
},
{
"key": "state",
"value": "running"
}
Observations
- The strange part about this is the state from the http://localhost:9402/debug/jobs/list?v=2 endpoint on gitlab-shared-runner-manager-4.gitlab.com, where we see the following: url=https://gitlab.com/xxxx/xxx/xxxx/-/jobs/304673282 state=timedout stage=get_sources executor_stage=docker_run duration=23h4m39.58570755s. Here we can see there is a "split-brain" issue, where `state` is `timedout` for the Runner, but when we send an update we set it to `running`. This is because in `touchJob`, which is sent periodically, we hardcode the state `running`. We can't send the real state of the job in `touchJob` because of issues like gitlab-foss#63972 (closed). (See the sketch after this list.)
- There is no log of the job being timed out (Screen_Shot_2019-10-01_at_10.08.20), but for some reason the state is still updated
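Here is a minimal sketch of that split-brain (hypothetical types; the real code is the Runner's periodic `touchJob`): the keep-alive update always reports `running`, even when the local state is already `timedout`.

```go
package main

import "fmt"

type job struct {
	token string
	state string // local state, e.g. "timedout"
}

type jobUpdate struct {
	Token string `json:"token"`
	State string `json:"state"`
}

// touch builds the periodic keep-alive update. The state is hardcoded to
// "running": the real local state can't be sent because of issues like
// gitlab-foss#63972, and that hardcoding is what produces the split-brain.
func (j *job) touch() jobUpdate {
	return jobUpdate{Token: j.token, State: "running"}
}

func main() {
	j := &job{token: "[FILTERED]", state: "timedout"}
	fmt.Printf("local=%s sent=%s\n", j.state, j.touch().State) // local=timedout sent=running
}
```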
Job ID: 308722314
State in GitLab Runner:
shared-runners-manager-6.gitlab.com url=https://gitlab.com/xxxx/xxxxxt/-/jobs/308722314 state=canceled stage=restore_cache executor_stage=docker_run duration=28h4m46.064828678s
State in GitLab DB:
=> #<Ci::Build id: 308722314, status: "canceled", finished_at: "2019-10-01 11:29:23", trace: nil, created_at: "2019-10-01 11:13:37", updated_at: "2019-10-01 11:29:23", started_at: "2019-10-01 11:17:42", runner_id: 380987, coverage: nil, commit_id: 85728000, name: "xxxx", options: {:cache=>{:paths=>["xxxx"], :key=>"default", :policy=>"pull-push"}, :script=>["xxxxx", "xxxxx"], :after_script=>[xxxxxx]}, allow_failure: false, stage: "deploy", trigger_request_id: nil, stage_idx: 2, tag: false, ref: "fix/migrationIBM", user_id: xxx, type: "Ci::Build", target_url: nil, description: nil, project_id: xxxx, erased_by_id: nil, erased_at: nil, artifacts_expire_at: nil, environment: nil, when: "on_success", yaml_variables: [], queued_at: "2019-10-01 11:17:41", token: nil, lock_version: 3, coverage_regex: nil, auto_canceled_by_id: nil, retried: nil, stage_id: 151117628, protected: false, failure_reason: nil, scheduled_at: nil, token_encrypted: "xxxx", upstream_pipeline_id: nil, tag_list: nil>
As we can see here, the job is canceled for the user, but for GitLab Runner it's still hanging around: it was canceled, but the build never finished, which verifies the issue described in #4147 (comment 224456292)
If we look at a normal canceled job, we see the following logs:
But in the case of 308722314 we only see logs up to the trace cancellation, not the environment cleanup
Job ID: 309441198
State in GitLab Runner:
shared-runners-manager-5.gitlab.com url=https://gitlab.com/xxx/xxxx/-/jobs/309441198 state=finished stage=upload_artifacts_on_success executor_stage=finish duration=26h33m53.364352303s
State in Rails DB:
<Ci::Build id: 309441198, status: "failed", finished_at: "2019-10-02 06:00:01", trace: nil, created_at: "2019-10-02 03:46:49", updated_at: "2019-10-02 06:00:01", started_at: "2019-10-02 03:46:50", runner_id: 380986, coverage: nil, commit_id: 85912260, name: "clang-7-nopcap", options: {:image=>{:name=>"xxxx"}, :before_script=>["xx", "xxx", "xxx", "xx", "xx", "xx", "xx", "xx"], :script=>["xx", "xx", "xx", "xx", "xxx", "xxx"], :after_script=>["fxxx", "xxx"], :retry=>{:max=>1}}, allow_failure: false, stage: "build", trigger_request_id: nil, stage_idx: 0, tag: false, ref: "master", user_id: xxx, type: "Ci::Build", target_url: nil, description: nil, project_id: xxxx, erased_by_id: nil, erased_at: nil, artifacts_expire_at: nil, environment: nil, when: "on_success", yaml_variables: [{:key=>"xxx", :value=>"xxx", :public=>true}, {:key=>"CC", :value=>"clang-7", :public=>true}, {:key=>"xxx", :value=>"xxxx", :public=>true}], queued_at: "2019-10-02 03:46:49", token: nil, lock_version: 3, coverage_regex: nil, auto_canceled_by_id: nil, retried: true, stage_id: 151479444, protected: true, failure_reason: "stuck_or_timeout_failure", scheduled_at: nil, token_encrypted: "NLDIMwAnPyTCYHJzkOZ5wax208LTZzT+ifolyf6Eq6KlmFPh", upstream_pipeline_id: nil, tag_list: nil>
The life cycle of the job on Runner Logs:
The life cycle of the Runner machine:
Observations
- Job marked as `failed` in the Rails DB with `stuck_or_timeout_failure`, but inside of the Runner it's marked as `Job Succeeded`
- There is a range mismatch on the Runner side. Looking at the Rails logs, there is a good amount of 202s apart from the last request; the Runner log shows the sent-log (see the trace-patch sketch after this list):
{"code":416,"job":309441198,"job-log":"0-2668956","job-status":"","level":"warning","msg":"Appending trace to coordinator... range mismatch","runner":"0277ea0f","sent-log":"267475-268078","status":"416 Requested Range Not Satisfiable","tag":"runner","environment":"ci-prd","hostname":"shared-runners-manager-5","fqdn":"shared-runners-manager-5.gitlab.com"}
- Given that the job succeeded but no final update/trace was sent from GitLab Runner, the `StuckCiJobsWorker` picked it up and marked it as failed. This leads me to think that this is the same issue as #4147 (comment 224456292)
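To unpack that 416: the trace is uploaded incrementally, and the Runner tells GitLab which byte range it is appending. When the offsets disagree (GitLab believes the trace covers 0-2668956 while the Runner sent 267475-268078), GitLab answers 416 and the Runner has to resynchronize its offset. A rough sketch of such a patch request, with the header names and offset format as assumptions for illustration:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// patchTrace sketches the incremental trace upload. Header names and the
// offset format are assumptions, not gitlab-runner's verified API. A 416
// response means the server's stored trace length disagrees with
// startOffset, and the client must resynchronize before retrying.
func patchTrace(url, token string, chunk []byte, startOffset int) (int, error) {
	req, err := http.NewRequest(http.MethodPatch, url, bytes.NewReader(chunk))
	if err != nil {
		return 0, err
	}
	req.Header.Set("JOB-TOKEN", token) // assumed auth header
	req.Header.Set("Content-Range",
		fmt.Sprintf("%d-%d", startOffset, startOffset+len(chunk)-1))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil // 202 on success, 416 on range mismatch
}

func main() {
	code, err := patchTrace("https://gitlab.example.com/api/v4/jobs/1/trace",
		"[FILTERED]", []byte("log line\n"), 0) // hypothetical job URL
	fmt.Println(code, err)
}
```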