Enabling FF_KUBERNETES_HONOR_ENTRYPOINT in a job causes it to fail immediately.
Steps to reproduce
Any time we set this in a job or in the runner's env vars, the job fails. The tf_deploy image has an entrypoint set in its Dockerfile like so: ENTRYPOINT ["/usr/local/bin/tf_deploy.sh"]
.gitlab-ci.yml
```yaml
# deploy the all/tooling target
deploy_all_tooling:
  extends: .tf_deploy_template
  tags:
    - tooling-sandbox-infra-runner
  environment:
    name: all/tooling
  resource_group: all_tooling
  variables:
    TARGET: all/tooling
    GIT_SUBMODULE_STRATEGY: normal
    EKS_CLUSTER: tooling-sandbox-infra
    FF_KUBERNETES_HONOR_ENTRYPOINT: "true"
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
      when: never
    - if: $CI_SERVER_HOST != "gitlab.login.gov"
      when: never
    - if: $CI_PIPELINE_SOURCE == "schedule"
      when: never
    # XXX change this to main when we are done
    - if: $CI_COMMIT_BRANCH == "tspencer/non_idp_terraform_environments"
      when: always
  script:
    # XXX We want to turn on FF_KUBERNETES_HONOR_ENTRYPOINT in the runner. See infra-runner-values.yaml
    - /usr/local/bin/tf_deploy.sh

# deploy job template for tf-deploy
.tf_deploy_template:
  image:
    name: $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/cd/tf_deploy/blessed@$TF_DEPLOY_IMAGE_DIGEST
  stage: deploy
  artifacts:
    name: "$CI_ENVIRONMENT_NAME-$CI_COMMIT_SHA"
    paths:
      - terraform.plan
      - plan.txt
    expire_in: 1 year
    reports:
      terraform: plan.json
  script:
    - echo "Yay deploys!" ; exit 1
```
Actual behavior
The build container seems to immediately die.
```
Running with gitlab-runner 16.5.0 (853330f9)
  on tooling-sandbox-infra-gitlab-infra-runner-gitlab-runner-6d5g6tf 62zGgYi6, system ID: r_ZZqSDxQWTouy
  feature flags: FF_KUBERNETES_HONOR_ENTRYPOINT:true
Resolving secrets
Preparing the "kubernetes" executor
Using Kubernetes namespace: gitlab
Using Kubernetes executor with image [MASKED].dkr.ecr.[MASKED].amazonaws.com/cd/tf_deploy/blessed@sha256:6488f2c2690c93d9beed80185b94928a5fb848b1ac81636aa441dee7a6702597 ...
Using attach strategy to execute scripts...
Preparing environment
Using FF_USE_POD_ACTIVE_DEADLINE_SECONDS, the Pod activeDeadlineSeconds will be set to the job timeout: 1h0m0s
...
Waiting for pod gitlab/runner-62zggyi6-project-21-concurrent-0-asgpqyn2 to be running, status is Pending
ERROR: Job failed (system failure): prepare environment: setting up trapping scripts on emptyDir: unable to upgrade connection: container not found ("build"). Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
```
Expected behavior
When I set FF_KUBERNETES_HONOR_ENTRYPOINT, I expect the runner to start the container without a command set, and thus to run the entrypoint in the image regardless of what command/args/scripts/entrypoints are set. It should run like it does when we turn off FF_KUBERNETES_HONOR_ENTRYPOINT, as shown below:
```
Running with gitlab-runner 16.5.0 (853330f9)
  on tooling-sandbox-infra-gitlab-infra-runner-gitlab-runner-7dnn4cj is1RksBs, system ID: r_AHy9RejbA9Iq
Resolving secrets                                    00:00
Preparing the "kubernetes" executor                  00:00
Using Kubernetes namespace: gitlab
Using Kubernetes executor with image [MASKED].dkr.ecr.[MASKED].amazonaws.com/cd/tf_deploy/blessed@sha256:6488f2c2690c93d9beed80185b94928a5fb848b1ac81636aa441dee7a6702597 ...
Using attach strategy to execute scripts...
Preparing environment                                00:05
Using FF_USE_POD_ACTIVE_DEADLINE_SECONDS, the Pod activeDeadlineSeconds will be set to the job timeout: 1h0m0s
...
Waiting for pod gitlab/runner-is1rksbs-project-21-concurrent-0-2i8r38eu to be running, status is Pending
  ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
  ContainersNotReady: "containers with unready status: [kuma-sidecar build helper]"
  ContainersNotReady: "containers with unready status: [kuma-sidecar build helper]"
Running on runner-is1rksbs-project-21-concurrent-0-2i8r38eu via tooling-sandbox-infra-gitlab-infra-runner-gitlab-runner-7dnn4cj...
Getting source from Git repository                   00:07
Fetching changes with git depth set to 20...
Initialized empty Git repository in /builds/lg/identity-devops/.git/
Created fresh repository.
Checking out a3e0ae53 as detached HEAD (ref is tspencer/non_idp_terraform_environments)...
Updating/initializing submodules with git depth set to 20...
Submodule 'identity-devops-private' (https://gitlab-ci-token:[MASKED]@gitlab.login.gov/lg/identity-devops-private.git) registered for path 'identity-devops-private'
Synchronizing submodule url for 'identity-devops-private'
Cloning into '/builds/lg/identity-devops/identity-devops-private'...
Submodule path 'identity-devops-private': checked out '479ba929bd4a9ecb27042392f99652b423e98c7c'
Updated submodules
Entering 'identity-devops-private'
Entering 'identity-devops-private'
Executing "step_script" stage of the job script      02:29
$ /usr/local/bin/tf_deploy.sh
TARGET is valid format: all/tooling
...the rest of the properly functioning job output is trimmed here...
```
Relevant logs and/or screenshots
See above for the relevant job logs.
Environment description
We are running a self-hosted GitLab 16.5.0 with k8s runners in a dedicated EKS cluster.
config.toml contents
```yaml
image:
  registry: ${accountid}.dkr.ecr.us-west-2.amazonaws.com
  image: ecr-public/gitlab/gitlab-runner
  tag: ubi-fips-v16.5.0
podAnnotations:
  kuma.io/mesh: gitlab
concurrent: 5
gitlabUrl: https://gitlab.login.gov/
rbac:
  create: true
logLevel: debug
runners:
  config: |
    [[runners]]
      url = "https://gitlab.login.gov"
      # XXX we really want to set this, but this is broken.
      environment = ["FF_KUBERNETES_HONOR_ENTRYPOINT=true"]
      [runners.kubernetes]
        namespace = "gitlab"
        service_account = "${irsa_sa}"
        helper_image = "${accountid}.dkr.ecr.us-west-2.amazonaws.com/ecr-public/gitlab/gitlab-runner-helper:ubi-fips-x86_64-v16.5.0"
  secret: gitlab-runner-secret
  podAnnotations:
    kuma.io/mesh: gitlab
```
Used GitLab Runner version
```
Running with gitlab-runner 16.5.0 (853330f9)
  on tooling-sandbox-infra-gitlab-infra-runner-gitlab-runner-6d5g6tf 62zGgYi6, system ID: r_ZZqSDxQWTouy
  feature flags: FF_KUBERNETES_HONOR_ENTRYPOINT:true
Resolving secrets
Preparing the "kubernetes" executor
```
Possible fixes
I thought that #29172 (comment 1676276649) was the problem, but I created helper and gitlab-runner images with entrypoints using Dockerfiles like so:
```dockerfile
FROM 217680906704.dkr.ecr.us-west-2.amazonaws.com/ecr-public/gitlab/gitlab-runner-helper:ubi-fips-x86_64-v16.5.0
ENTRYPOINT ["/usr/bin/dumb-init", "/entrypoint"]
```
and it still didn't fix it. We tried changing FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY, but that didn't help. I also looked at #30713 (closed), which looks like it might be related, but they also seem to want to change how the feature works, and I'm mystified, because it sounds like their stuff actually works the way we want it to. We really need FF_KUBERNETES_HONOR_ENTRYPOINT to actually force the runner to honor the entrypoint supplied in the image, regardless of what commands are supplied in the job.
@JamesRLopes is the entrypoint issue happening with GitLab Runner v16.6? If yes, it is most likely the one described in #37205 (closed). As mentioned in this comment, we are taking steps to revert the changes made in v16.6 and restore the old v16.5 behaviour.
However, based on @timspencer's issue, we still have a problem with v16.5. I will investigate that too to identify the root cause.
Two different images were used, with the following entrypoints:
entrypoint 1
#!/bin/shecho"Number of arguments: $#"echo"All arguments: \$@ is $@"i=1while[$i-le 10 ];doecho"Iteration $i - Current Time: $(date +"%T")"i=$((i +1))sleep 1done
entrypoint 2
#!/bin/shecho"Number of arguments: $#"echo"All arguments: \$@ is $@"i=1while[$i-le 10 ];doecho"Iteration $i - Current Time: $(date +"%T")"i=$((i +1))sleep 1doneexec"$@"
AFAICT, the only time I was able to reproduce the issue is when the entrypoint lacks the exec "$@" command at the end of the script.
In that situation, what is actually happening is:
entrypoint 1
The entrypoint executes quite rapidly, and by the time we get to the job script execution, the build container already has the Completed status, so the whole job fails.
entrypoint 2
The job script gets executed and passes as expected.
In your case the error is slightly different; I suspect the entrypoint finishes its execution while the Runner is trying to save a stage script on the build container.
Having exec "$@" at the end of the entrypoint allows us to open a shell on the build container and keep it from completing.
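For illustration, a minimal entrypoint following that pattern might look like this (the setup step is a placeholder, not something from a real image):

```sh
#!/bin/sh
# Do whatever initialization the image needs first
# (placeholder -- substitute the image's real setup work).
/usr/local/bin/do-image-setup.sh

# Then hand control to whatever arguments the runner passed in.
# In attach mode these arguments are a shell-detection command, and
# exec-ing them opens the shell that keeps the build container alive
# until the job script has run.
exec "$@"
```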
Do you have such an instruction at the end of your entrypoint?
I do not have an exec at the end of our entrypoint script. Our entrypoint script should take 5-15 minutes to execute, though, so if you are trying to do an exec into the container to do extra things, there should be plenty of time for it to do that. For us, it just looks like it doesn't execute the job at all, or it is exiting immediately before it can check out code or whatever.
But I will try to add an exec at the end of the script and see what happens. Is this a thing that is documented somewhere? I've never seen any mention of us needing to do an exec "$@" at the end of our container scripts.
It basically creates a container with a few aws tools and kubectl, so that we can execute commands against our EKS clusters. The entrypoint.sh script contains the following:
We need to set KUBECONFIG in the .gitlab-ci.yml file, otherwise it will be overridden by some other process we do not control.
We need to set FF_KUBERNETES_HONOR_ENTRYPOINT, otherwise the credentials will not be set and it fails.
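As a rough, hypothetical sketch of that kind of entrypoint (illustrative only, not the actual entrypoint.sh; it assumes the cluster name and region arrive as EKS_CLUSTER and AWS_REGION variables):

```sh
#!/bin/sh
# Hypothetical sketch only -- not the entrypoint.sh from this report.
set -e

# Write a kubeconfig for the target cluster; EKS_CLUSTER and AWS_REGION
# are assumed to be provided as CI/CD variables.
aws eks update-kubeconfig \
  --name "$EKS_CLUSTER" \
  --region "$AWS_REGION" \
  --kubeconfig /root/.kube/config

# Hand control back to the runner so the job script can still run.
exec "$@"
```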
Here's what happens:
Variables set and a 15.8.3 runner:
```
Running with gitlab-runner 15.8.3 (080abeab)
  on runner-15-8-3 xxmtfKew, system ID: r_mh4GD7NqpOpb
  feature flags: FF_KUBERNETES_HONOR_ENTRYPOINT:true
...
$ aws s3 ls
2024-01-08 10:11:23 a05093c145b41f356171cbcb6af40a2a8049e65b
...
$ echo $KUBECONFIG
/root/.kube/config
$ kubectl get pods -n kube-system
NAME                                            READY   STATUS   RESTARTS   AGE
aws-load-balancer-controller-85769fd4d5-7ccvz   1/1
...
Cleaning up project directory and file based variables   00:00
Job succeeded
```
All works well. Now if we do the same thing, but with a 16.8.0 runner:
```
Running with gitlab-runner 16.8.0 (c72a09b6)
  on runner-16-8-0 zsNpnZTV, system ID: r_RqwFr96tELCL
  feature flags: FF_KUBERNETES_HONOR_ENTRYPOINT:true
...
$ aws s3 ls
2024-01-08 10:11:23 a05093c145b41f356171cbcb6af40a2a8049e65b
...
$ echo $KUBECONFIG
/root/.kube/config
$ kubectl get pods -n kube-system
W0208 15:42:24.102200      63 loader.go:222] Config not found: /root/.kube/config
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:gitlab-runner:default" cannot list resource "pods" in API group "" in the namespace "kube-system"
Cleaning up project directory and file based variables   00:01
ERROR: Job failed: command terminated with exit code 1
```
OK. I put an exec "$0" at the end of my script, and it did something different at least. Though it is possible that this new behavior is because we upgraded to 16.8.2 for gitlab, and 16.8.1 for the runner. I wish we could have kept them the same as what we had before, but we are forced to upgrade pretty quickly because of our compliance people.
Now, it just seems to be hanging. It checks everything out, but then the job just waits:
```
Running on runner-k4dznj9z-project-21-concurrent-0-hckfosq8 via tooling-sandbox-infra-gitlab-infra-runner-gitlab-runner-6cnzfr5...
Getting source from Git repository   00:06
Fetching changes with git depth set to 20...
Initialized empty Git repository in /builds/lg/identity-devops/.git/
Created fresh repository.
Checking out 5190a707 as detached HEAD (ref is tspencer/non_idp_terraform_environments)...
Updating/initializing submodules with git depth set to 20...
Submodule 'identity-devops-private' (https://gitlab-ci-token:[MASKED]@[MASKED]/lg/identity-devops-private.git) registered for path 'identity-devops-private'
Synchronizing submodule url for 'identity-devops-private'
Cloning into '/builds/lg/identity-devops/identity-devops-private'...
Submodule path 'identity-devops-private': checked out '979a3094cef9a1945185bf39fded397561936b0a'
Updated submodules
Entering 'identity-devops-private'
Entering 'identity-devops-private'
Executing "step_script" stage of the job script
```
When I look at the pod, the actual build container seems to have already exited.
HOWEVER! I just looked at the logs from the build container, and apparently I did something wrong, because my script exited very quickly with an error message. That's great, because I believe that things are now working. I am willing to bet that if I fix my script's problem, everything will be good.
If that is true, then I'll let you know.
But it seems strange that I had to go do a kubectl logs pod/runner-k4dznj9z-project-21-concurrent-0-hckfosq8 -n gitlab -c build to find that error message from my script rather than seeing it in the job log. And it also seems wrong that it didn't notice that the build container exited with an exit code of 1, and instead it just hangs until the 1h timeout happens.
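For reference, the kind of poking around I had to do looks roughly like this (pod name is from the log above; the jsonpath query is just one way to pull the build container's terminated state):

```sh
# See what the image entrypoint wrote inside the build container
kubectl logs pod/runner-k4dznj9z-project-21-concurrent-0-hckfosq8 -n gitlab -c build

# Check whether the build container has already terminated, and with what exit code
kubectl get pod runner-k4dznj9z-project-21-concurrent-0-hckfosq8 -n gitlab \
  -o jsonpath='{.status.containerStatuses[?(@.name=="build")].state.terminated}'
```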
I will let you know what I find. Thanks for your help so far! I look forward to your thoughts on why the build container that exited quickly seems to not be handled well by the runner.
OK. I put an exec "$0" at the end of my script, and it did something different at least. Though it is possible that this new behavior is because we upgraded to 16.8.2 for gitlab, and 16.8.1 for the runner. I wish we could have kept them the same as what we had before, but we are forced to upgrade pretty quickly because of our compliance people.
I will compare v15.8.3 and v16.8.1 again to see if I can spot any changes that could explain this change of behaviour (I hope I will be more successful this time).
But it seems strange that I had to go do a kubectl logs pod/runner-k4dznj9z-project-21-concurrent-0-hckfosq8 -n gitlab -c build to find that error message from my script rather than seeing it in the job log
When reviewing !4545 (merged), I realized that although we monitor the overall status of the Pod, we do not consider individual container failures. This monitoring was added specifically for service containers.
For the build container, we rely on the trap command to catch any failure and forward the exit code to GitLab Runner. In this particular case, I suspect that the entrypoint failure is not being handled, as we do not receive the exit code from the trap command, and the script never returns due to the failure of the build container.
Have you observed any events related to the failure of the build container? If not, enabling FF_PRINT_POD_EVENTS should forward all job-Pod-related events to the job log. If there are no such events, I believe we may need to explicitly monitor state changes of the build container to catch entrypoint failures and automatically cancel the ongoing job.
I have not observed any "events", just the output from my job exiting right away saying "TARGET not specified: aborting" in the kubectl logs from the pod.
I finally got clear of other obligations, and should have some time to fix my script that the job runs and get you some more info. I will also try FF_PRINT_POD_EVENTS, which really looks like something that I should be using more often when debugging this stuff. :-). Thanks!
We had to upgrade to 16.9.1 last week, and this time around it behaves differently. Instead of hanging for 1h until the job times out, it exits right away.
```
Running on runner-zrohszst-project-21-concurrent-0-mb6tia1k via tooling-sandbox-infra-gitlab-infra-runner-gitlab-runner-6cnzfr5...
Getting source from Git repository   00:07
Fetching changes with git depth set to 20...
Initialized empty Git repository in /builds/lg/identity-devops/.git/
Created fresh repository.
Checking out cbe103b7 as detached HEAD (ref is tspencer/non_idp_terraform_environments)...
Updating/initializing submodules with git depth set to 20...
Submodule 'identity-devops-private' (https://gitlab-ci-token:[MASKED]@gitlab.login.gov/lg/identity-devops-private.git) registered for path 'identity-devops-private'
Synchronizing submodule url for 'identity-devops-private'
Cloning into '/builds/lg/identity-devops/identity-devops-private'...
Submodule path 'identity-devops-private': checked out 'XXX'
Updated submodules
Entering 'identity-devops-private'
Entering 'identity-devops-private'
Normal UpdatedKumaDataplane Updated Kuma Dataplane: runner-zrohszst-project-21-concurrent-0-mb6tia1k
Executing "step_script" stage of the job script   00:00
Uploading artifacts for failed job   00:00
Uploading artifacts...
WARNING: plan.json: no matching files. Ensure that the artifact path is relative to the working directory (/builds/lg/identity-devops)
ERROR: No files to upload
Cleaning up project directory and file based variables   00:01
ERROR: Job failed (system failure): unable to upgrade connection: container not found ("build")
```
So it seems like the problem that we saw in 16.8.2 is gone, because it identified that the container was dead right away and failed the job. Yay! Software is getting better!
Well, I am now confused, because in debugging my entrypoint script, I noticed that $CI_PROJECT_DIR did not exist, which was why my script was aborting right away. So I put a sleep at the top of my script, and it now works:
```sh
until [ -d "$CI_PROJECT_DIR" ]; do
  echo waiting for "$CI_PROJECT_DIR" to appear
  sleep 10
done
```
So there's some sort of race condition where the build is running before the runner (presumably) execs into the container and checks out all the code and so on. This is fine, though. I can handle that. Though if you know if there's a better way to know when the code checkout and other setup is done for me to wait on, I would love to know it.
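In the meantime, the only tighter workaround I can think of (a sketch; it assumes the job script still runs and can drop a marker file as its first command) is something like this:

```sh
# The first command of the job script in .gitlab-ci.yml would create the marker:
#   touch "$CI_PROJECT_DIR/.ci-setup-done"

# The entrypoint then waits on that marker instead of guessing at timing:
until [ -f "$CI_PROJECT_DIR/.ci-setup-done" ]; do
  echo "waiting for checkout/setup to finish in $CI_PROJECT_DIR"
  sleep 10
done
```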
However, I decided to check that the entrypoint was actually being honored, so I told the job to just run echo entrypoint not honored, and instead of running my script, it ran the echo. So it appears as if there's still something wrong, because I was expecting it to honor the entrypoint:
```yaml
# deploy the all/tooling job
deploy_all_tooling:
  extends: .tf_deploy_template
  tags:
    - tooling-sandbox-infra-runner
  environment:
    name: all/tooling
  resource_group: all_tooling
  variables:
    TARGET: all/tooling
    GIT_SUBMODULE_STRATEGY: normal
    EKS_CLUSTER: tooling-sandbox-infra
    FF_PRINT_POD_EVENTS: "true"
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
      when: never
    - if: $CI_SERVER_HOST != "gitlab.login.gov"
      when: never
    - if: $CI_PIPELINE_SOURCE == "schedule"
      when: never
    # XXX change this to main when we are done
    - if: $CI_COMMIT_BRANCH == "tspencer/non_idp_terraform_environments"
      when: always
  script:
    # XXX We need to turn on FF_KUBERNETES_HONOR_ENTRYPOINT in the runner. See infra-runner-values.yaml
    # - /usr/local/bin/tf_deploy.sh
    - echo bad entrypoint
```
What can I do to help figure this out? Thanks again!
I think I figured out what is going on! When the job runs, the build container runs and properly executes its entrypoint. I can see the logs of it running its task through kubectl logs runner-zrohszst-project-21-concurrent-0-imin268i -n gitlab -c build, and it's running the proper entrypoint script specified in the image.
However, it is also running the script specified in the job, and the output of that is what is shown in the job log in GitLab and is what determines whether the job is successful or not. I suspect that this is run with a kubectl exec after it has checked out all the code and so on.
I would have expected a runner with FF_KUBERNETES_HONOR_ENTRYPOINT set to just run the build container and watch the output and exit value from that default entrypoint, and NOT be allowed to exec any user-supplied commands into the container.
Yes, sure enough, when I run kubectl logs -f, I see the default entrypoint script running, but no output on the job page. But at the end, it runs the script specified in the job:
```
<most of my entrypoint script is edited out here>
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
terraform apply completed on Tue Mar 5 00:39:42 UTC 2024
{"script": "/scripts-21-1061417/step_script"}
$ echo "Entrypoint not honored!" ; exit 1
Entrypoint not honored!
{"command_exit_code": 1, "script": "/scripts-21-1061417/step_script"}
```
I do see the step_script output in the job, though, and it is what determines the exit status of the job:
Executing "step_script" stage of the job script$ echo "Entrypoint not honored!" ; exit 1Entrypoint not honored!Uploading artifacts for failed jobUploading artifacts...plan.json: found 1 matching artifact files and directories Uploading artifacts as "terraform" to coordinator... 201 Created id=1061417 responseStatus=201 Created token=glcbt-64Cleaning up project directory and file based variablesERROR: Job failed: command terminated with exit code 1
Any thoughts? This seems like the bug. We should be able to see the logs from the container, and we shouldn't be having the step script run if we have FF_KUBERNETES_HONOR_ENTRYPOINT set, right?
Apologies for the delay in getting back to you. I hope it is still March 29th on your side.
So there's some sort of race condition where the build is running before the runner (presumably) execs into the container and checks out all the code and so on. This is fine, though. I can handle that. Though if you know if there's a better way to know when the code checkout and other setup is done for me to wait on, I would love to know it.
That makes sense: the entrypoint is executed as soon as the init container is completed, so there is no guarantee that the git clone will be done in time if it is needed by the image entrypoint.
I think I figured out what is going on! When the job runs, the build container runs and properly executes it's entrypoint. I can see the logs of it running it's task through kubectl logs runner-zrohszst-project-21-concurrent-0-imin268i -n gitlab -c build, and it's running the proper entrypoint script specified in the image.
However it is also running the script specified in the job, and the output of that is what is shown in the job log in gitlab, and is what determines whether the job is successful or not. I suspect that this is run with a kubectl exec after it's checked out all the code and so on.
So, everything you observed and described above is accurate. Without FF_KUBERNETES_HONOR_ENTRYPOINT, GitLab Runner overrides the image's existing entrypoint and replaces it with detect_script_shell, which opens a shell to keep the container open until the job script is executed.
When FF_KUBERNETES_HONOR_ENTRYPOINT is set, GitLab Runner lets the image entrypoint be executed (expecting it to keep the container running), but the job script is still executed in the step_script stage (in exec or attach mode).
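Concretely, the arguments handed to the build container are a shell-detection snippet along these lines (the same snippet that shows up in the job logs quoted later in this thread):

```sh
sh -c 'if [ -x /usr/local/bin/bash ]; then exec /usr/local/bin/bash
elif [ -x /usr/bin/bash ]; then exec /usr/bin/bash
elif [ -x /bin/bash ]; then exec /bin/bash
elif [ -x /usr/local/bin/sh ]; then exec /usr/local/bin/sh
elif [ -x /usr/bin/sh ]; then exec /usr/bin/sh
elif [ -x /bin/sh ]; then exec /bin/sh
elif [ -x /busybox/sh ]; then exec /busybox/sh
else echo shell not found; exit 1; fi'
```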
I would have expected a runner with FF_KUBERNETES_HONOR_ENTRYPOINT set to just run the build container and watch the output and exit value from that default entrypoint, and NOT be allowed to exec any user-supplied commands into the container.
The behaviour described is not a bug but the expected behaviour.
As the image entrypoint does not write to these logs, you won't see its output in the job log. I don't think redirecting the image entrypoint's output into this file is the way to go. Too many things could go wrong:
Race condition (unless you create the file first, at the right location and with 777 permissions, and even then I am not sure nothing would go wrong)
Log collision: everything will be messed up, potentially exposing some masked variables, if any
Probably other things I can't think of.
My suggestion for your use case, with the existing features of GitLab Runner, would be to:
OK. That's interesting. So I was expecting FF_KUBERNETES_HONOR_ENTRYPOINT to function like disable_entrypoint_overwrite does for the docker runners. Thank you for the info on how this stuff works under the hood. I guess I was expecting your code checkout and other initialization stuff to get done with init containers and then you would get logs from the k8s log endpoint. It's probably too much to ask for you to rewrite things to do it that way rather than your helper exec system, right? :-) That way, you could just make it so that if FF_KUBERNETES_HONOR_ENTRYPOINT was set, you'd just remove the command: from the pod spec, and everything would be easy for me! :-)
Unfortunately, if I follow your suggestion, that would break our security model, which is to have runners that will only run containers with the hardcoded entrypoint (and also to be able to view the logs from that entrypoint in the GitLab job log). We do not want them to be able to run arbitrary commands that the developers can put into the job definition. This works fine with our docker runners right now with disable_entrypoint_overwrite, but we need something like this for Kubernetes.
If not this feature, then is there another way to duplicate the docker runner disable_entrypoint_overwrite functionality in kubernetes?
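For reference, on our docker runners this is just a registration/config option; it looks roughly like this (URL and token are placeholders, other flags omitted):

```sh
gitlab-runner register \
  --executor docker \
  --url https://gitlab.login.gov \
  --registration-token "$REGISTRATION_TOKEN" \
  --docker-image alpine:latest \
  --docker-disable-entrypoint-overwrite
```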
It's probably too much to ask for you to rewrite things to do it that way rather than your helper exec system, right? :-)
It was previously done this way, but there was an issue with kubectl logs where the logs would hang. I don't have all the history, but I have already asked this question in the past.
If not this feature, then is there another way to duplicate the docker runner disable_entrypoint_overwrite functionality in kubernetes?
I am not familiar with the docker executor, so I will have to see how it works before answering. I first need to check how the logs are streamed with the docker executor and see what can be done there.
Thank you for your help! Let me know if there's anything I can do to help figure out a solution for this, and I look forward to hearing what you find out.
I am still trying to figure out David's issue. Once it is done (and fixed), I am going to close this issue and create a new one that reflects our discussion in this thread (allow streaming of the image entrypoint logs in the job log for the ~"executor::kubernetes"). Does that work for you?
I guess that would work. I just need a way to make it so that if FF_KUBERNETES_HONOR_ENTRYPOINT is set, any commands set in the job definition won't get run, and the output and the exit value of the container entrypoint will be visible and used as the overall exit value for the job.
And ideally, you would be running the code checkout and other initialization stuff in an init container too so I wouldn't have to implement a "wait until the code checkout stuff is done" loop in my script, but that's a stretch goal. :-)
I had a look at disable_entrypoint_overwrite with the docker executor, and even there the job script/commands are being executed.
```
Running with gitlab-runner development version (HEAD) on Local GitLab Runner for tests and debugging 7SzKHLyus, system ID: s_b1aacad1f7fa
  feature flags: FF_NETWORK_PER_BUILD:true, FF_SCRIPT_SECTIONS:true, FF_KUBERNETES_HONOR_ENTRYPOINT:true, FF_USE_ADVANCED_POD_SPEC_CONFIGURATION:true, FF_PRINT_POD_EVENTS:true, FF_USE_DUMB_INIT_WITH_KUBERNETES_EXECUTOR:true
Preparing the "docker" executor   00:02
Using Docker executor with image aaro301/alpine:custom-loop-2 ...
Starting service redis:latest ...
Using locally found image version due to "if-not-present" pull policy
Using docker image sha256:f361c7d940d45e7c8101d0cc356c21014b5c954444486b346f16df57cf1bed9e for redis:latest with digest redis@sha256:e647cfe134bf5e8e74e620f66346f93418acfc240b71dd85640325cb7cd01402 ...
Waiting for services to be up and running (timeout 30 seconds)...
[service:redis-redis_service] 2024-04-05T15:00:55.951256259Z 1:C 05 Apr 2024 15:00:55.951 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[service:redis-redis_service] 2024-04-05T15:00:55.953017761Z 1:C 05 Apr 2024 15:00:55.952 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
[service:redis-redis_service] 2024-04-05T15:00:55.953025637Z 1:C 05 Apr 2024 15:00:55.952 * Redis version=7.2.4, bits=64, commit=00000000, modified=0, pid=1, just started
[service:redis-redis_service] 2024-04-05T15:00:55.953027428Z 1:C 05 Apr 2024 15:00:55.952 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
[service:redis-redis_service] 2024-04-05T15:00:55.953496220Z 1:M 05 Apr 2024 15:00:55.953 * monotonic clock: POSIX clock_gettime
[service:redis-redis_service] 2024-04-05T15:00:55.954010305Z 1:M 05 Apr 2024 15:00:55.953 * Running mode=standalone, port=6379.
[service:redis-redis_service] 2024-04-05T15:00:55.954571305Z 1:M 05 Apr 2024 15:00:55.954 * Server initialized
[service:redis-redis_service] 2024-04-05T15:00:55.955080223Z 1:M 05 Apr 2024 15:00:55.954 * Ready to accept connections tcp
Using locally found image version due to "if-not-present" pull policy
Using docker image sha256:adec2dddd12a0e84020f78bc2045c5c9701a9852f2468d4a08af5b2bf6bf02b8 for aaro301/alpine:custom-loop-2 with digest aaro301/alpine@sha256:8ce4fc90ed6c4c4b976d979c83507701a968d19a16bcd0ff8ba4f73fe7a235e9 ...
Preparing environment   00:00
Running on runner-7szkhlyus-project-25452826-concurrent-0 via touni-mbp...
Getting source from Git repository   00:01
Fetching changes with git depth set to 2...
Reinitialized existing Git repository in /builds/ra-group2/playground-bis/.git/
Checking out a4c64cf8 as detached HEAD (ref is master)...
Removing cache.txt
Skipping Git submodules setup
Executing "step_script" stage of the job script   00:21
Using docker image sha256:adec2dddd12a0e84020f78bc2045c5c9701a9852f2468d4a08af5b2bf6bf02b8 for aaro301/alpine:custom-loop-2 with digest aaro301/alpine@sha256:8ce4fc90ed6c4c4b976d979c83507701a968d19a16bcd0ff8ba4f73fe7a235e9 ...
Number of arguments: 3
All arguments: $@ is sh -c if [ -x /usr/local/bin/bash ]; then exec /usr/local/bin/bash elif [ -x /usr/bin/bash ]; then exec /usr/bin/bash elif [ -x /bin/bash ]; then exec /bin/bash elif [ -x /usr/local/bin/sh ]; then exec /usr/local/bin/sh elif [ -x /usr/bin/sh ]; then exec /usr/bin/sh elif [ -x /bin/sh ]; then exec /bin/sh elif [ -x /busybox/sh ]; then exec /busybox/sh else echo shell not found exit 1 fi
Iteration 1 - Current Time: 15:00:57
Iteration 2 - Current Time: 15:00:58
Iteration 3 - Current Time: 15:00:59
Iteration 4 - Current Time: 15:01:00
Iteration 5 - Current Time: 15:01:01
Iteration 6 - Current Time: 15:01:02
Iteration 7 - Current Time: 15:01:03
Iteration 8 - Current Time: 15:01:04
Iteration 9 - Current Time: 15:01:05
Iteration 10 - Current Time: 15:01:06
$ i=1
Iteration for step_script 1 - Current Time: 15:01:07
Iteration for step_script 2 - Current Time: 15:01:08
Iteration for step_script 3 - Current Time: 15:01:09
Iteration for step_script 4 - Current Time: 15:01:10
Iteration for step_script 5 - Current Time: 15:01:11
Iteration for step_script 6 - Current Time: 15:01:12
Iteration for step_script 7 - Current Time: 15:01:13
Iteration for step_script 8 - Current Time: 15:01:14
Iteration for step_script 9 - Current Time: 15:01:15
Iteration for step_script 10 - Current Time: 15:01:16
Cleaning up project directory and file based variables   00:00
Job succeeded
```
As you pointed out, the logs generated by the container's entrypoint are also displayed in the job log. I've opened an issue for this: #37468 (closed). Please feel free to add any additional information you believe is relevant.
Are you sure that the script is run? When I set this up, it does not run the script defined in the job, but instead runs the entrypoint script:
```yaml
run_unallowed_script:
  stage: test
  allow_failure: true
  image:
    name: XXX.dkr.ecr.XXX.amazonaws.com/cd/env_deploy/blessed@sha256:XXX
  script:
    - touch unallowed_script.txt
  artifacts:
    paths:
      - "*.txt"

test_unallowed_script:
  stage: test
  needs:
    - run_unallowed_script
  script:
    - echo looking for unallowed_script.txt
    - test ! -f unallowed_script.txt
```
Here's the run_unallowed_script output:
Executing "step_script" stage of the job script00:00Using docker image sha256:XXX for XXX.dkr.ecr.[MASKED].amazonaws.com/cd/env_deploy/blessed@sha256:XXX with digest XXX.dkr.ecr.[MASKED].amazonaws.com/cd/env_deploy/blessed@sha256:XXX ...+ set -e+ '[' -z /builds/timothy.spencer/test-script ']'+ '[' bravo '!=' '' ']'+ echo 'gitlab is asking us to deploy to , but I am in bravo. Aborting'gitlab is asking us to deploy to , but I am in bravo. Aborting+ exit 2Cleaning up project directory and file based variables00:00ERROR: Job failed: exit code 2
And here's the output of the test_unallowed_script job:
```
Skipping Git submodules setup
Executing "step_script" stage of the job script   00:01
Using docker image sha256:XXX for [MASKED].dkr.ecr.[MASKED].amazonaws.com/ecr-public/docker/library/alpine:latest with digest [MASKED].dkr.ecr.[MASKED].amazonaws.com/ecr-public/docker/library/alpine@sha256:XXX ...
$ echo looking for unallowed_script.txt
looking for unallowed_script.txt
$ test ! -f unallowed_script.txt
Cleaning up project directory and file based variables   00:00
Job succeeded
```
So it looks like the docker runner with --docker-disable-entrypoint-overwrite enabled isn't running the script in the job, but instead is running the entrypoint for the image. @ratchade, I'm not sure what the iteration script thing above is showing, but that is what I'm seeing. Let me know what you think!
After a little bit of fiddling, it looks like the docker executor may be trying to run the job script, but it's doing it by essentially running docker run my/image sh -c 'if [ -x /usr/local/bin/bash ]; then ...', which does not bypass the entrypoint.
If you run this image:
```dockerfile
FROM alpine:latest
COPY fun.sh /fun.sh
ENTRYPOINT ["/fun.sh"]
```
where fun.sh is:
#!/bin/shecho "arguments are: $@"echo "but we are sleeping 10 anyways"sleep 10
with these jobs:
```yaml
run_unallowed_script:
  stage: test
  allow_failure: true
  image:
    name: gsatspencer/tspencersleepten
  script:
    - touch unallowed_script.txt
  artifacts:
    paths:
      - "*.txt"

test_unallowed_script:
  stage: test
  needs:
    - run_unallowed_script
  script:
    - echo looking for unallowed_script.txt
    - test ! -f unallowed_script.txt
```
It will give you this output:
```
Skipping Git submodules setup
Executing "step_script" stage of the job script   00:11
Using docker image sha256:25d057cc32e70847338f5b1f9613e43e4302f7240e173803052412b45b442a7e for gsatspencer/tspencersleepten with digest gsatspencer/tspencersleepten@sha256:c29a87fd74990e3c80e8e21e3017e94a812c74a0531645b840e507b0212ab296 ...
arguments are: sh -c if [ -x /usr/local/bin/bash ]; then exec /usr/local/bin/bash elif [ -x /usr/bin/bash ]; then exec /usr/bin/bash elif [ -x /bin/bash ]; then exec /bin/bash elif [ -x /usr/local/bin/sh ]; then exec /usr/local/bin/sh elif [ -x /usr/bin/sh ]; then exec /usr/bin/sh elif [ -x /bin/sh ]; then exec /bin/sh elif [ -x /busybox/sh ]; then exec /busybox/sh else echo shell not found exit 1 fi
but we are sleeping 10 anyways
Uploading artifacts for successful job   00:00
Uploading artifacts...
WARNING: *.txt: no matching files. Ensure that the artifact path is relative to the working directory (/builds/timothy.spencer/identity-idp-wargames)
ERROR: No files to upload
Cleaning up project directory and file based variables   00:01
Job succeeded
```
Which shows that the entrypoint is not bypassed. Is there a way to do this sort of thing with the k8s runner?
Just tried the 16.9.1 runner... nothing is solved. The last working runner for us is still 15.8.3.
Here's the relevant piece of logging from the pipelines:
With 16.9.1:
```
$ echo $KUBECONFIG
/root/.kube/config
$ kubectl get pods -n kube-system
W0308 07:47:24.827205      64 loader.go:222] Config not found: /root/.kube/config
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:gitlab-runner:default" cannot list resource "pods" in API group "" in the namespace "kube-system"
```
With 15.8.3:
```
$ echo $KUBECONFIG
/root/.kube/config
$ kubectl get pods -n kube-system
NAME                                            READY   STATUS    RESTARTS   AGE
aws-load-balancer-controller-85769fd4d5-7ccvz   1/1     Running   0          59d
```
I don't think we're ever going to be able to move away from 15.8.3.
You know, the more I think about this, I believe that the bug is still a valid bug. The documentation says that when FF_KUBERNETES_HONOR_ENTRYPOINT is set, Kubernetes will run the entrypoint. This implies that the job script will not be executed. Otherwise, what is the point of the feature? Presumably the entrypoint is being honored either to constrain what the job can execute, or perhaps to perform some sort of pod initialization. But relying on the entrypoint to initialize things sets you up for race conditions where the entrypoint is faster or slower than the job script, so things are likely to break. Thus, I think the point of the feature is to constrain what the job can execute, just like the --docker-disable-entrypoint-overwrite option does for the docker runner (#2625 (comment 87272957)).
This also resolves the ambiguity around what logs should be watched: when FF_KUBERNETES_HONOR_ENTRYPOINT is set, you should watch the entrypoint logs, instead of the job script logs. There should be no job script logs.
It may be true that the existing k8s execution system is functioning as you expect, but I think the FF_KUBERNETES_HONOR_ENTRYPOINT feature is still broken, and should mirror --docker-disable-entrypoint-overwrite. This issue should still be open.
You know, the more I think about this, I believe that the bug is still a valid bug. The documentation says that when FF_KUBERNETES_HONOR_ENTRYPOINT is set, kubernetes will run the entrypoint. This implies that the job script will not be executed.
Even though I agree with the points made in the first paragraph WRT race conditions, the conclusion is not correct. When honoring the image entrypoint, our contract with the image is that it always needs to provide a valid shell (before or after some needed initialization, depending on the user's needs) so that we can execute the job script.
I am thinking about updating the documentation to clarify this and avoid any confusion on how the entrypoint itself works
Thus, I think the point of the feature is to constrain what the job can execute, just like the --docker-disable-entrypoint-overwrite option does for the docker runner ( #2625 (comment 87272957)).
Somehow, I am not able to see the comment link. I checked the comment links in this issue and none is the one linked. It should have been one of these two comments, #2625 (comment 81692416) or #2625 (comment 87606183), but I can't see it. Would you mind copying it here?
You know, I can't see it anymore either. Yet I pasted it in somehow. I don't know where it went or in detail what it said anymore. I am sure that you would have been 100% convinced by this comment to my side, though! :-)
So the comment there was basically restating what was said in the discussion in #2625 (comment 53423172) and #2625 (comment 62049700), which outlines the need for us to be able to have a more secure setup where people cannot specify arbitrary commands that the runner will run.
This is a common security practice for allowing users to execute narrowly defined privileged commands, like sudo, where you can specify commands that can be run by users as root, or the ssh command= option in authorized_keys. If the docker runner has this functionality, then the k8s runner should too.
You know, I can't see it anymore either. Yet I pasted it in somehow. I don't know where it went or in detail what it said anymore. I am sure that you would have been 100% convinced by this comment to my side, though! :-)
I really thought something was wrong with me as I couldn't find this comment
Yeah, I would rather they not add those extra arguments in either, but it's a bug that I can live with, because I set my entrypoints to ignore all arguments and just do their job.
I really hesitate to suggest a solution, because I don't know the code well enough yet, but surely it wouldn't be too hard to put a check in when you are creating the script to exec into the pod, and if FF_KUBERNETES_HONOR_ENTRYPOINT is set, to not write the job script in, or to replace it with echo FF_KUBERNETES_HONOR_ENTRYPOINT is set, not executing job script?
It's not as good as using real init containers and removing the command in the pod spec, but I'd imagine that this would be a smaller/easier change for you to write.
The first issue I am seeing here is that we might not be able to know when to stop the job, at least with the kubernetes executor, as:
GitLab Runner is not aware of the entrypoint log
We rely on a specific json message in the log in attach mode to detect the end of an ongoing stage
All this to say that it might not be as easy as it seems.
I also asked other team members during our call today, as there is already some ongoing work that would enable running a job without setting any script.
A workaround was suggested, but I want to see if it would actually work with the Kubernetes executor (especially because of the JSON thing) and get back to you.
Yeah, I guess I forgot that you'd still have to do the work to get the logs from the container and the exit value and so on. I guess there isn't an easy solution. Thanks for listening to my ignorant ideas! :-)
Or actually, I would be totally fine if you created a new FF_KUBERNETES_DISABLE_ENTRYPOINT_OVERWRITE feature that implemented this, if you believe that FF_KUBERNETES_HONOR_ENTRYPOINT is doing what you want it to do.
We just need the same functionality that the docker runner has, as expressed in #2625 (comment 87272957).
Well, as I said before, we just need the same functionality as the docker runner's disable_entrypoint_overwrite feature, for the same reasons that the people in that issue were talking about. The key features being:
the ability for the runner to be configured so that it would run jobs without executing job scripts, no matter what was in the .gitlab-ci.yml file.
To be able to see the output from the entrypoint and use its exit value for the job status.
This is what FF_KUBERNETES_HONOR_ENTRYPOINT looks like it's supposed to do. Updating the documentation may be a way to sidestep this issue, but I would argue that this is still something that needs to be addressed, and should still get some engineering time. The same types of people who are using the disable_entrypoint_overwrite feature are going to want the same functionality in their k8s runners.
Thanks for reading all my words! I know that there is no easy way to make me happy here, but I hope you will consider working at it anyways! :-)
I was going to do a summary of what is expected so thank you for doing it.
The same types of people who are using the disable_entrypoint_overwrite feature are going to want the same functionality in their k8s runners.
That is a good point
I feel like I should also create another issue to keep track of the first point.
The documentation update is mostly to avoid any misunderstanding TBH.
Thanks for reading all my words! I know that there is no easy way to make me happy here, but I hope you will consider working at it anyways! :-)
Thanks too for all the feedback provided. As I said above, point 1 is already being implemented in https://gitlab.com/gitlab-org/step-runner, so it's unlikely to be implemented in Runner in its current form.
I'm confused here. The step-runner just looks like a special container that you can use to run steps on. It doesn't limit what runners will execute. It looks like all the steps that the step-runner executes are, or at least can be, defined in the .gitlab-ci.yml file too. How does this solve point 1?
Maybe I should be more detailed in my definition for point 1:
We need: the ability for the runner to be configured so that it will only execute the entrypoint of the containers it runs. The GitLab server must not be able to specify what script the jobs on that runner will run, whether via job scripts, entrypoint overrides, or any other config in .gitlab-ci.yml. The GitLab server is only trusted to specify what image to run. The runner itself enforces this policy.
So when you say "it's unlikely to be implemented in Runner in its current form", are you saying that you will not fix this bug? Is there a way forward for those of us who use the disable_entrypoint_overwrite feature?
I am referring to the ability to run a job without running the job script, as ongoing work should cover this point.
Is there a way forward for those of us who use the disable_entrypoint_overwrite feature?
I added to my todo list to understand how disable_entrypoint_overwrite actually works and see if it is possible to mimic this with the kubernetes executor. You highlighted a great point previously (below).
The same types of people who are using the disable_entrypoint_overwrite feature are going to want the same functionality in their k8s runners.
@ratchade I wanted to flag this issue for you since @timspencer has some additional topics to work through. Thanks a ton for the engagement so far on this!
@ratchade & @DarrenEastman I wanted to check in with you about this issue. What's our next step here, given that this is a major blocker for our customer? CC: @rcain
@ssharer1 I see there is an open MR to revert the change. Romauld is back from PTO next week so will be able to provide more details as to the next set of engineering tasks.
Hi @DarrenEastman and @ratchade wanted to check in to see what the next steps are here. Sorry to bother you but the customer is interested in getting this resolved as soon as possible. I see that there's an open MR for this but I haven't seen much activity on it lately. CC @rcain
In the following MR !4736, I started working on the ability to introduce a setting similar to DisableEntrypointOverwrite like Docker does (see comment !4736 (comment 1898018093))
I unfortunately got pulled onto another emergency, but I believe I will be able to get back to it during this milestone.
I don't know yet though how this issue #37417 will impact my implementation.
@ratchade @DarrenEastman Just wanted to follow back up on the status of this issue and the associated MR, as the milestone was recently pushed back to %17.6.
The customer has a temporary workaround in place, but it is not ideal to maintain during upgrade cycles and they would like to know when they might be able to expect this fix to be delivered. Thanks!