This meta issue has been closed in favor of the single source of truth in epic &337 (closed) and should only be considered a source of historical discussion.
Problem to solve
CI/CD jobs are executed by runners based on the configuration users provide in their pipeline definition. But this execution is not interactive and, in case of failure, users cannot dig into the details to find the source of the problem.
When a test fails, users may want to inspect files and run commands. Being able to do so would be a huge improvement over the current situation, where the job log is the only feedback they get.
This would be amazing. One of the biggest complaints about any CI system is debugging CI jobs when they break. Pushing garbage commits through the pipeline is less than ideal. Of course you can run GitLab Runner locally, but that takes quite a bit of setup.
Is the idea to allow this for all job types that the runners support? (Docker, autoscaling Docker, SSH, etc.)
Being able to SSH into CI is indeed very valuable. It does require leaving the job container lying around for some period of time (like 30 minutes) just so you can connect to it. Or run it again and connect the terminal to the new container.
Yeah, that sounds feasible. Keep the container around by default for 0 seconds, but allow configuration in both the gitlab-runner config.toml and the .gitlab-ci.yml file to keep it for up to 1 hour? And start and connect to it via the UI?
Would you like me to write up some more specs and scope?
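For illustration only, a sketch of what the config.toml side of this could look like; the retention key is an invented name and did not exist at the time of this discussion:

```toml
# Hypothetical sketch: "debug_retention_timeout" is an illustrative name, not a real
# GitLab Runner option at the time of this discussion.
[[runners]]
  name = "docker-runner"
  executor = "docker"
  # Keep the job's container around after the job ends so a terminal can still attach.
  # Default would be "0s" (no retention); capped at 1h as proposed above.
  debug_retention_timeout = "30m"
```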
It should be possible to stop the container and re-run it on the same runner when a user wants to debug it. In addition, this would require a garbage collector that destroys containers once they are too old.
@alexispires Depends on your runner usage. For GitLab.com, for example, we flush each runner after a single job is run. There's no re-use of runners at all for security purposes.
I'm focusing on the Kubernetes executor because web terminal integration for it already exists in environments and review apps. The way I imagine this will work is similar to what CircleCI does:
By default, web terminal is disabled for builds
Every completed build displays a Retry with Web Terminal Access button
Once the build starts, the user can click the web terminal icon and connect
We'll retain the Kubernetes pod/container for a fixed period (say, 30 minutes) after the build ends, in case they want to inspect the final state
If the user no longer needs the terminal, they can click a button to free up retained resources (what should this be called? A "stop" icon with a tooltip might suffice)
Alternatively, we could try to make it so that any running build on Kubernetes has the web terminal enabled, and we retain the container after build completion as long as the user is connected (up to a maximum retention period of, say, 30 minutes or 1 hour). This is a more flexible approach for the following reasons, although I have yet to evaluate its technical feasibility:
To troubleshoot failed builds (which was the original use-case), the user can simply click the existing Retry button and connect
In addition, it satisfies use-cases where logging into a running build container would be handy even when there are no build failures - concrete examples of such use-cases are welcome
Could someone from the UX or product team weigh in on this?
Proposed implementation
The code changes that I think will be required for this:
Connecting web terminal to CI build pods in Kubernetes via websockets (I've got this working)
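For reference, a minimal sketch of attaching to a build pod over the Kubernetes exec subresource with client-go; this is not the actual MR code, and the namespace, pod, and container names are placeholders. A web terminal would plug the websocket streams in where the local stdin/stdout are used here.

```go
// Sketch only: attach an interactive shell to a CI build pod via the exec subresource.
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Build the exec request against the build pod (names are illustrative).
	req := client.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace("gitlab-ci").
		Name("runner-build-pod").
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: "build",
			Command:   []string{"/bin/sh"},
			Stdin:     true,
			Stdout:    true,
			TTY:       true, // with a TTY, stderr is merged into stdout
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		panic(err)
	}

	// Wire up the local terminal; a web terminal would use the websocket instead.
	if err := exec.Stream(remotecommand.StreamOptions{
		Stdin:  os.Stdin,
		Stdout: os.Stdout,
		Tty:    true,
	}); err != nil {
		panic(err)
	}
}
```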
Well, this gets tricky. I'm not sure we should interfere with GitLab Runner's responsibility for taking care of pods, as it is GitLab Runner that manages the pod lifecycle. If we go this way we might end up with "dual brain" situations.
I would rather see all terminal communication go through GitLab Runner. GitLab Runner can expose a secured communication channel, and GitLab forwards the websocket connection to it. GitLab Runner stays in full control of the process and the lifecycle of the pod. This also makes it possible to work with every type of executor, not only Kubernetes, and it does not require any special labeling, since Runner knows exactly what it created.
The above requires slightly more work to glue all the needed parts together.
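To make the forwarding idea concrete, here is a rough sketch of the GitLab/Workhorse side, assuming Runner exposes a secured session endpoint; the URL, port, and path below are made up, not an existing API:

```go
// Sketch: forward the browser's terminal websocket to the Runner's session endpoint.
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Where the Runner's session API would live; purely illustrative.
	runnerSession, err := url.Parse("https://runner.example.com:8093")
	if err != nil {
		panic(err)
	}

	// httputil.ReverseProxy (Go 1.12+) passes the Upgrade handshake through,
	// so the websocket stream ends up piped transparently to the Runner.
	proxy := httputil.NewSingleHostReverseProxy(runnerSession)

	http.HandleFunc("/terminal/", func(w http.ResponseWriter, r *http.Request) {
		// A real implementation would authorize the user and attach the per-build
		// token here before forwarding; omitted in this sketch.
		proxy.ServeHTTP(w, r)
	})

	http.ListenAndServe(":8080", nil)
}
```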
Well, an even simpler solution would be for Runner to create a secure SSH server for each running build, which the user could access with a login and password provided by Runner. Then external tools could be used for accessing the terminal.
I'm not sure we should interfere with GitLab Runner's responsibility for taking care of pods, as it is GitLab Runner that manages the pod lifecycle
I would rather see all terminal communication go through GitLab Runner.
Currently, websocket connections go through gitlab-workhorse for environments/review apps. If Runner manages its own websocket connections, won't it end up duplicating some logic from there?
I only vaguely understand what you mean by "dual brain". Could you give a concrete example of why that could be a problem? Just so I can understand the nuances involved.
The above requires slightly more work to glue all the needed parts together.
Yes. At the moment it's unclear to me how much gluing your approach would involve. Currently the logic for interacting with Kubernetes is in the model, and I'm not sure how much of that would have to change. Also, the interaction between GitLab and Runner is currently not clear to me - it might be worth adding it to the architecture diagram.
Well, an even simpler solution would be for Runner to create a secure SSH server
Yes, that's an alternative (https://gitlab.com/gitlab-org/gitlab-ce/issues/22319). But the rationale behind providing a web terminal for CI would be the same as the rationale behind providing it for environments/review apps. In other words, if shell access were sufficient for CI builds, it should have been sufficient for environments/review apps. I suspect there are good reasons (either from the user perspective or from the implementation perspective) to favour the web terminal approach, although I'm not aware what they are - I would appreciate some insight on this point.
I think the reason to prefer the web terminal approach, at least for Kubernetes, is simply that setting up ingress from outside the cluster into the running pod is non-trivial (it would require deploying additional components in the cluster that would manage ingress, if I understand correctly). This is also hinted at in the 8.15 release post:
Working together with your container scheduler, GitLab happily spins up several (dynamic) environments on request for your projects. Be that for review apps or a staging or production environment. Traditionally, getting direct access to these environments has been a little painful. And that's a shame: it's very useful to quickly try something in a live environment to debug a problem, or just to experiment.
The Kubernetes approach is definitely the MVP, but it also makes things slightly harder, as we have two sources of truth: GitLab Rails and Runner. I wonder if we can figure out a way to make this agnostic of the executor on the Runner side and effectively make it work with every executor.
Currently we can use the Kubernetes API, but our implementation is very generic and can talk to any endpoint, so technically it is feasible for Runner to expose a websocket for this purpose. If Runner provides a WSS endpoint with some token authentication, GitLab would connect to Runner, and Runner would perform kubectl exec, docker exec, or whatever is appropriate, and would keep the job alive for as long as Rails requested. Otherwise you hit a dual-brain problem: 1. Runner finished the job and is going to delete it, 2. you connected via exec and want to keep it running, 3. we cannot instruct Runner not to delete it without reimplementing the same cleanup logic on the Rails side.
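A hedged sketch of the Runner side of this idea follows; the endpoint path, port, token handling, and container name are all assumptions rather than Runner's real API. It bridges a token-protected websocket to a shell exec'd inside the job's container (here via `docker exec`; `kubectl exec` would work the same way).

```go
// Sketch only: Runner-side websocket-to-exec bridge with per-build token auth.
package main

import (
	"net/http"
	"os/exec"

	"github.com/gorilla/websocket"
)

// Per-build secret that Runner would report to GitLab in the job payload (illustrative).
const buildToken = "example-build-token"

var upgrader = websocket.Upgrader{}

func terminalHandler(w http.ResponseWriter, r *http.Request) {
	if r.Header.Get("Authorization") != "Bearer "+buildToken {
		http.Error(w, "forbidden", http.StatusForbidden)
		return
	}

	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	defer conn.Close()

	// Attach a shell inside the job's container; Runner already knows its ID.
	cmd := exec.Command("docker", "exec", "-i", "job-container", "/bin/sh")
	stdin, _ := cmd.StdinPipe()
	stdout, _ := cmd.StdoutPipe()
	if err := cmd.Start(); err != nil {
		return
	}
	defer cmd.Process.Kill()

	// Pump websocket messages into the shell's stdin...
	go func() {
		for {
			_, msg, err := conn.ReadMessage()
			if err != nil {
				return
			}
			stdin.Write(msg)
		}
	}()

	// ...and the shell's output back to the browser.
	buf := make([]byte, 4096)
	for {
		n, err := stdout.Read(buf)
		if n > 0 {
			conn.WriteMessage(websocket.BinaryMessage, buf[:n])
		}
		if err != nil {
			return
		}
	}
}

func main() {
	http.HandleFunc("/session/terminal", terminalHandler)
	// A real Runner would serve this over TLS (wss://) using its generated certificate.
	http.ListenAndServe(":8093", nil)
}
```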
Thanks @vickyc and @ayufan for the technical evaluation of this scenario.
Regarding the user flow, the "retry with terminal" option unfortunately doesn't cover situations where running the job again will not reproduce the error (flaky tests) or may differ for any other reason. Maybe this can be one possible flow, but we could also provide a way to run any job with a terminal, maybe with a specific keyword in .gitlab-ci.yml.
So, the default is to run without a terminal and allow retrying with a terminal; a keyword allows a job to run with a terminal directly.
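For illustration, this is what such a keyword might look like in .gitlab-ci.yml; neither the keyword name nor the timeout option existed at the time of this discussion.

```yaml
# Hypothetical syntax only: "terminal" and "terminal_timeout" are invented names,
# not real .gitlab-ci.yml keywords at the time of this discussion.
flaky_spec:
  stage: test
  script:
    - bundle exec rspec spec/flaky
  terminal: true          # run this job with an interactive terminal attached
  terminal_timeout: 30m   # optional: keep the environment around after the job ends
```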
@bikebilly Well, how is that different from just running a job? Maybe it is more like: we always allow you to enter the build and keep it around as long as you are attached (but with some time limit, like 30 minutes). Then there is no need for a "Retry with terminal" option, as it is basically the same thing, always.
@ayufan perfect! I was thinking that keeping images for every job could cause performance or infrastructure problems, but if this is not the case we can just keep images for some time after the job ends and provide a way to jump in.
We always allow you to enter the build and keep it around as long as you are attached (but with some time limit, like 30 minutes). But if you don't attach, we just destroy the container as usual. This means that you will never be able to attach after a build completes. So if a user doesn't use this feature, there will be no behaviour change or performance penalty for them.
In your use-case of flaky tests, you can simply run the build and attach to it while it's running. If the build fails, you have 30 minutes to do your debugging. If it succeeds, you retry the job as usual and attach to the new one. Hope that makes sense
Oh, I see. It is valuable for sure, but it still doesn't address the case where you only realize you want to investigate after the job has ended (failed). In that case you have to rerun and attach immediately, which is absolutely awesome, but it doesn't help when jobs are not "replicable".
In your use-case of flaky tests, you can simply run the build and attach to it while it's running. If the build fails, you have 30 minutes to do your debugging.
If you split your tests into, let's say, 50 jobs, it is quite hard to attach to all of them, and you don't know in advance where your flaky test is.
What would be the problems with keeping all jobs "available" for 30 minutes or so, even if not requested, and allowing attaching even after they have ended? If this is not doable, we can proceed as described and not consider the edge cases (like flaky tests).
@bikebilly This implies a very high cost for keeping jobs available. I see value in that. We might delay cleanup to always keep a job "available" for a limited time, like 30s-1m after the failure, but not 30 minutes.
This is also a problem for people who have a runner set to a single concurrent job: we cannot run the next job until this one is destroyed.
@ayufan yes, I see the problem, but I'm not sure one minute is enough to get there and attach.
Let's move forward with the "attach while running" approach; it seems the best first iteration, and we can improve in the future if some scenario becomes important enough to justify the effort.
Update: I've got executor-agnostic terminals working using the following MRs; it currently works with the Shell and Kubernetes executors and can easily be extended to others. @ayufan could you take a look at these MRs when you have some time?
@ayufan Great, thanks! Just a quick note: I have a few hours free this week, but will not have coding time over the next 2 weeks as I'm travelling (I will still be able to respond to comments and make smaller changes like docs though)
We discussed this with @vickyc. It looks great. The biggest changes are:
Move the terminal server out of runner registration and into the build request: Runner creates a session to connect to and sends information about it to GitLab, where it is stored per build,
Instead of referring to this information as terminal_server, it is more like a runner_session_url, and terminal is just one of the endpoints exposed by that Runner-facing API,
We are likely to generate a self-signed certificate on Runner so we can always use wss:// and perform certificate validation, providing a secure communication channel (see the sketch at the end of this comment).
I'm assigning this to myself to provide a working PoC of the Runner changes. @jramsay's Platform team is going to help with the Rails code changes. The needed Workhorse changes are merged.
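For the certificate item above, a rough sketch of generating a short-lived self-signed certificate in Go; this is not the Runner's actual implementation, and the helper name is made up. The certificate (not the key) would be sent to GitLab so it can validate the wss:// channel.

```go
// Sketch only: generate an in-memory self-signed certificate for the session endpoint.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// generateSessionCert is a hypothetical helper; host is the Runner's advertised address.
func generateSessionCert(host string) (certPEM, keyPEM []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}

	template := x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: host},
		DNSNames:     []string{host},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(24 * time.Hour), // short-lived; regenerated on restart
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}

	// Self-signed: the template acts as its own parent.
	der, err := x509.CreateCertificate(rand.Reader, &template, &template, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, nil, err
	}

	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}

func main() {
	certPEM, _, err := generateSessionCert("runner.example.com")
	if err != nil {
		panic(err)
	}
	fmt.Printf("generated %d bytes of certificate for the session endpoint\n", len(certPEM))
}
```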