
Retry kubernetes commands when "error dialing backend: EOF" error is hit

Georgi N. Georgiev requested to merge retry_error_dialing_backend_eof into master

What does this MR do?

Adds a retry mechanism for all Kubernetes commands when an "error dialing backend: EOF" error is hit. This error occurs only if the remote host is unreachable, which means the command has not been executed and is therefore safe to retry.

Why was this MR needed?

As outlined in this comment, this currently happens if the connection between the runner and the Kubernetes pod running a command is severed (much like in #4119 (closed)). This then leads to the next command being executed immediately. If the network is intermittently down, the Kubernetes master might be unreachable as well, which leads to the "error dialing backend: EOF" error. This error can also occur during normal command execution. This MR mostly guards against such intermittent failures. When merged, it will be integrated into !1775 (merged), which adds further stability to command execution.

Are there points in the code the reviewer needs to double check?

Some commands are already retried in common.Build.attemptExecuteStage. I opted to retry all commands, regardless of whether the build as a whole considers them retryable, since this error is pretty much Kubernetes-local as of right now. In the future, if other executors require such a local retry mechanism, we could move the abstraction further up towards common.Build.

I am wondering whether we should increase the backoff timings. For example, if a kube system pod fails, maybe we could give it 1 minute to get back up?

How to test

The easiest way to recreate this error is by:

  1. Creating a K8S cluster in GKE
  2. Having a node pool with the size of 1 node
  3. Running a job on this node:
sleep:
    script:
        - date
        - sleep 900
        - echo "done"
        - date
  4. Then, while the job is running, go to the node pool console and downscale it to 0
  5. After a minute or so you should see in the logs: ERROR: Job failed (system failure): error dialing backend: No SSH tunnels currently open. Were the targets able to accept an ssh-key for user "gke-ea4bb2c27101248b1c63"? This error is produced after a few seconds of retrying; by that point k8s is able to give a better error than "error dialing backend: EOF". There are also warning logs for each retry.

Does this MR meet the acceptance criteria?

  • Documentation created/updated
  • Added tests for this feature/bug
  • In case of conflicts with master - branch was rebased

What are the relevant issue numbers?

Related To #3247 (closed)

Edited by Georgi N. Georgiev
