Runner system failure rate
I was interested in the Runner system failure rate.
I ran some searches in Kibana covering the whole of the previous day:
"Job succeeded": 95480
"Job failed": 27782
"Job failed" "system failure": 512
"Job failed" "system failure" "Error: No such image": 134
"Job failed" "system failure" "Cannot connect to the Docker daemon": 290
"Job failed" "system failure" "invalid reference format": 30
"Job failed" "system failure" "Error response from daemon: Conflict": 6
"Job failed" "system failure" "executable file not found in": 10
"Job failed" "system failure" "Error relabeling upper directory: SELinux relabeling of": 16
"Job failed" "system failure" "No such container": 2
"Job failed" "system failure" "persistent connection closed": 10
"Job failed" "system failure" "no such host": 12
"system failure" -"Error: No such image" -"Cannot connect to the Docker daemon" -"invalid reference format" -"Error response from daemon: Conflict" -"executable file not found in" -"Error relabeling upper directory: SELinux relabeling of" -"No such container" -"persistent connection closed" -"no such host": 12
This gives us a system failure rate of (512-134-12)/(95480+27782)*100% ≈ 0.30%.
"No such image" is not counted as a system failure, because the user-supplied image name is not present.
"no such host" is not counted as a system failure, because the user-supplied registry is not available (private networking?).
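The arithmetic above can be reproduced with a small Python sketch. The counts are copied from the Kibana searches listed earlier; the variable names are only illustrative, and treating exactly these two categories as user errors is the assumption stated above, not something the Runner reports itself.

```python
# Counts copied from the Kibana searches above (one day of logs).
succeeded = 95480
failed = 27782
system_failures = 512

# Categories treated here as user errors rather than system failures.
user_errors = {
    "Error: No such image": 134,
    "no such host": 12,
}

total_jobs = succeeded + failed
real_system_failures = system_failures - sum(user_errors.values())
rate_percent = real_system_failures / total_jobs * 100

# Prints: 366 / 123262 = 0.297%
print(f"{real_system_failures} / {total_jobs} = {rate_percent:.3f}%")
```

Adding or removing an entry in `user_errors` shows how sensitive the rate is to which categories are classified as user-caused.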
It is hard to say at this point how many of these failures are fatal, since we retry most of them; the effective failure rate may be much lower, at around 0.01%.
Also, this is a sample of a single day, so the rate should be measured across multiple days.
Edited by Kamil Trzciński