This meta issue has been closed in favor of the single source of truth in epic &337 (closed) and should only be considered a source of historical discussion.
Problem to solve
CI/CD jobs are executed by runners based on the configuration provided by users in their pipeline definition. But this execution is not interactive and, in case of failure, users cannot dig into details to spot the possible source of the problem.
When a test fails, users may want to inspect files and run commands. This would be a huge improvement over the current situation, where the job output is the only feedback they get.
This would be amazing. One of the biggest complaints with any CI system is debugging the CI jobs when they break. Pushing garbage commits through the pipeline is less than ideal. Of course you can run the GitLab Runner locally, but that takes quite a bit of setup.
Is the idea to allow this for all Job types that the runners support? (Docker, autoscaling docker, SSH, etc.)
Being able to SSH into CI is indeed very valuable. It does require leaving the job container lying around for some period of time (like 30 minutes) just so you can connect to it. Or run it again and connect the terminal to the new container.
Yeah, that sounds feasible. Keep the container around by default for 0 seconds, but allow configuration in both the gitlab-runner config.toml and the CI file to keep it for up to 1 hour? And start and connect to it via the UI?
Would you like me to write up some more specs and scope?
It should be possible to stop the container and re-run it on the same runner when a user wants to debug it. In addition to this, it would take a garbage collector that destroys containers once they are too old.
@alexispires Depends on your runner usage. For GitLab.com, for example, we flush each runner after a single job is run. There's no re-use of runners at all for security purposes.
I'm focusing on the Kubernetes executor because web terminal integration for it already exists in environments and review apps. The way I imagine this will work is similar to what CircleCI does:
By default, web terminal is disabled for builds
Every completed build displays a Retry with Web Terminal Access button
Once the build starts, the user can click the web terminal icon and connect
We'll retain the Kubernetes pod/container for a fixed period (say, 30 minutes) after the build ends, in case they want to inspect the final state
If the user no longer needs the terminal, they can click a button to free up retained resources (what should this be called? A "stop" icon with a tooltip might suffice)
Alternatively, we could try to make it so that any running build on Kubernetes has the web terminal enabled, and we retain the container after build completion as long as the user is connected (up to a max retention period of, say, 30 minutes or 1 hour). This is a more flexible approach for the following reasons, although I'm yet to evaluate its technical feasibility:
To troubleshoot failed builds (which was the original use-case), the user can simply click the existing Retry button and connect
In addition, it satisfies use-cases where logging into a running build container would be handy even though there were no build failures - concrete examples of such use-cases are welcome
Could someone from the UX or product team weigh in on this?
Proposed implementation
The code changes that I think will be required for this:
Connecting web terminal to CI build pods in Kubernetes via websockets (I've got this working)
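For illustration only (this is not the code from the MRs above): a minimal sketch of attaching an interactive shell to a CI build pod through the Kubernetes exec subresource with client-go, which is the streaming connection a web terminal would ultimately be bridged onto. The namespace, pod, container, and shell names are placeholders.

```go
// Minimal sketch: exec an interactive shell in a build pod via client-go.
// Namespace, pod and container names below are placeholders.
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

func main() {
	// Load cluster credentials the same way kubectl does.
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Build the exec request against the build pod.
	req := client.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace("gitlab-ci").
		Name("runner-build-pod").
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: "build",
			Command:   []string{"/bin/sh"},
			Stdin:     true,
			Stdout:    true,
			Stderr:    true,
			TTY:       true,
		}, scheme.ParameterCodec)

	// The executor performs the streaming protocol upgrade; a web terminal
	// would pump websocket frames into these streams instead of os.Stdin/out.
	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		panic(err)
	}
	if err := exec.Stream(remotecommand.StreamOptions{
		Stdin:  os.Stdin,
		Stdout: os.Stdout,
		Stderr: os.Stderr,
		Tty:    true,
	}); err != nil {
		panic(err)
	}
}
```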
Well, this gets tricky. I'm not sure whether we should mess with GitLab Runner's responsibility for taking care of Pods, as it is GitLab Runner that manages the Pods' lifecycle. If we go this way we might end up with a "dual brain" situation.
I would rather see all terminal communication go through GitLab Runner. GitLab Runner can expose a secured communication channel, and GitLab forwards the websocket connection to GitLab Runner. GitLab Runner is in full control of the process and the lifecycle of the pod. It also makes it possible to work with every type of executor, not only Kubernetes, and it does not require any special labeling, as Runner knows exactly what was created.
The above requires slightly more work to glue all the needed parts together.
Well, an even simpler solution would be for Runner to create a secure SSH server for each running build, which the user could access with a login and password provided by Runner for the remote machine. Then you could use external tools for accessing the terminal.
I'm not sure whether we should mess with GitLab Runner's responsibility for taking care of Pods, as it is GitLab Runner that manages the Pods' lifecycle
I would rather see all terminal communication go through GitLab Runner.
Currently, websocket connections go through gitlab-workhorse for environments/review apps. If Runner manages its own websocket connections, won't it end up duplicating some logic from there?
I only vaguely understand what you mean by "dual brain". Could you give a concrete example of why that could be a problem? Just so I can understand the nuances involved.
The above requires creating slightly more to perform gluing all needed parts.
Yes. At the moment it's unclear to me how much gluing your approach would involve. Currently the logic of interacting with Kubernetes is in the model; I'm not sure how much of that would have to change. Also, the interaction between GitLab and Runner is currently not clear to me - it might be worth adding it to the architecture diagram.
Well, an even simpler solution would be for Runner to create a secure SSH server
Yes, that's an alternative (https://gitlab.com/gitlab-org/gitlab-ce/issues/22319). But the rationale behind providing web terminal for CI would be the same as the rationale behind providing it for environments/review apps. In other words, if shell access is sufficient for CI builds, it should have been sufficient for environments/review apps. I suspect there are good reasons (either from the user perspective or from implementation perspective) to favour the web terminal approach although I'm not aware what they are - would appreciate some insight on this point.
I think the reason to prefer the web terminal approach, at least for Kubernetes, is simply that setting up ingress from outside the cluster into the running pod is non-trivial (it would require deploying additional components in the cluster that would manage ingress, if I understand correctly). This is also hinted at in the 8.15 release post:
Working together with your container scheduler, GitLab happily spins up several (dynamic) environments on request for your projects. Be that for review apps or a staging or production environment. Traditionally, getting direct access to these environments has been a little painful. And that's a shame: it's very useful to quickly try something in a live environment to debug a problem, or just to experiment.
The Kubernetes approach is definitely MVP, but it also makes things slightly harder, as we have two sources of truth: GitLab Rails and Runner. I wonder if we can figure out a way to make it executor-agnostic on the Runner side and effectively make it work with every executor.
Currently, we can use the Kubernetes API, but our implementation is very generic and can talk to any endpoint, so technically it is feasible for Runner to expose a websocket which could be used for that purpose. If Runner provides a WSS endpoint with some token authentication, GitLab would connect to Runner, and Runner would perform kubectl exec, docker exec, or whatever, and would be able to keep the job alive for as long as requested by Rails. Otherwise, you hit dual-brain problems: 1. the runner finished and is going to delete the environment; 2. you connected via exec, but you want to keep it running; 3. we cannot instruct the runner not to delete it, as then you have to reimplement the same cleanup logic on the Rails side.
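As a rough sketch of that idea (assumptions only, not GitLab Runner's actual code): a runner-side endpoint that checks a per-job token, upgrades the connection to a websocket, and bridges it to a shell spawned next to the job (here via docker exec; other executors would spawn their own equivalents). The route, token, and container name are made up.

```go
// Sketch of a runner-exposed terminal endpoint bridging a websocket to a
// shell inside the job's container. Token, route and container name are
// illustrative placeholders.
package main

import (
	"net/http"
	"os/exec"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{}

func terminalHandler(w http.ResponseWriter, r *http.Request) {
	// Reject connections that do not carry the job's session token.
	if r.Header.Get("Authorization") != "Bearer job-session-token" {
		http.Error(w, "forbidden", http.StatusForbidden)
		return
	}
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	defer conn.Close()

	// Attach a shell inside the job's container; the Kubernetes executor
	// would use the exec API instead of the docker CLI.
	cmd := exec.Command("docker", "exec", "-i", "job-container", "/bin/sh")
	stdin, _ := cmd.StdinPipe()
	stdout, _ := cmd.StdoutPipe()
	if err := cmd.Start(); err != nil {
		return
	}
	defer cmd.Process.Kill()

	// Pump websocket messages into the shell's stdin...
	go func() {
		for {
			_, msg, err := conn.ReadMessage()
			if err != nil {
				return
			}
			stdin.Write(msg)
		}
	}()

	// ...and stream the shell's output back over the websocket.
	buf := make([]byte, 4096)
	for {
		n, err := stdout.Read(buf)
		if n > 0 {
			if conn.WriteMessage(websocket.BinaryMessage, buf[:n]) != nil {
				return
			}
		}
		if err != nil {
			return
		}
	}
}

func main() {
	http.HandleFunc("/terminal", terminalHandler)
	// In the proposed design this would be served over wss:// with a
	// runner-generated certificate; plain HTTP keeps the sketch short.
	http.ListenAndServe(":8093", nil)
}
```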
Thanks @vickyc and @ayufan for the technical evaluation of this scenario.
Related to the user flow, the "retry with terminal" option unfortunately doesn't cover situations where running the job again will not reproduce the error (flaky tests) or may differ for any other reason. Maybe this can be a possible flow, but we could also provide a way to run any job with a terminal, maybe with a specific keyword in .gitlab-ci.yml.
So, the default is to run without a terminal, and retry with a terminal is allowed. A keyword allows a job to run directly with a terminal.
@bikebilly Well, how is that different from just running a job? Maybe it is more like: we always allow you to enter the build, and keep it as long as you are attached (but with some time limit, like 30 minutes). Then there is no need to have "Retry with terminal", as this is basically the same, always.
@ayufan Perfect! I was thinking that keeping images for every job could cause problems for performance or infrastructure, but if this is not the case we can just keep images for some time after the job ends and provide a way to jump in.
We always allow you to enter the build, and keep it as long as you are attached (but with some time limit, like 30 minutes). But if you don't attach, we just destroy the container as usual. This means that you will never be able to attach after a build completes. So if a user doesn't use this feature, there will be no behaviour change or performance penalty for them.
In your use-case of flaky tests, you can simply run the build and attach to it while it's running. If the build fails, you have 30 minutes to do your debugging. If it succeeds, you retry the job as usual and attach to the new one. Hope that makes sense
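To make the "keep it only while you are attached" behaviour concrete, here is a tiny sketch with made-up names and durations (not Runner internals): cleanup waits until no terminal has been attached for a grace period, and never waits longer than a hard cap after the job finishes.

```go
// Sketch of cleanup that is delayed while a terminal is attached, capped at
// a maximum retention period. Durations and channel protocol are assumptions.
package main

import (
	"fmt"
	"time"
)

const (
	idleGrace = 2 * time.Minute  // how long to wait without an attached terminal
	hardCap   = 30 * time.Minute // absolute limit after the job finishes
)

// waitBeforeCleanup blocks until the job environment may be destroyed.
// events receives true when a terminal attaches and false when it detaches.
func waitBeforeCleanup(events <-chan bool) {
	hardStop := time.After(hardCap)
	idle := time.NewTimer(idleGrace)
	defer idle.Stop()
	attached := false
	for {
		select {
		case <-hardStop:
			return // never keep the environment past the hard cap
		case <-idle.C:
			if !attached {
				return // nobody was connected during the grace period
			}
			idle.Reset(idleGrace) // still attached: check again later
		case attached = <-events:
			// Record the current attachment state.
		}
	}
}

func main() {
	events := make(chan bool)
	go func() {
		events <- true              // user opens the web terminal
		time.Sleep(3 * time.Second) // ...does some debugging...
		events <- false             // ...and closes it again
	}()
	waitBeforeCleanup(events)
	fmt.Println("cleaning up job environment")
}
```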
Oh I see. It is valuable for sure, but it still doesn't address the case where you only realize you want to investigate after the job has ended (failed). In that case you have to rerun and attach immediately, which is absolutely awesome, but not when jobs are not reproducible.
In your use-case of flaky tests, you can simply run the build and attach to it while it's running. If the build fails, you have 30 minutes to do your debugging.
If you split your tests into, let's say, 50 jobs, it is quite hard to attach to all of them, and you don't know in advance where your flaky test is.
What are the possible problems with keeping all the jobs "available" for 30 minutes or so, even if not requested, and allowing attachment even after they have ended? If this is not doable, we can proceed as described and not consider the edge cases (like flaky tests).
@bikebilly This implies a very high cost of keeping jobs around. I see a value in that. We might delay cleanup to always keep the job "available" for a limited time, like 30s-1m after the failure, but not 30 minutes.
This is also a problem for people who have a runner with a single concurrent job configured: we cannot run the next job until this one is destroyed.
@ayufan Yes, I see the problem, but I'm not sure 1 minute is enough to go there and attach.
Let's move forward with the "attach while running" approach; it seems the best first iteration, and we can improve in the future if some scenario becomes important enough to justify the effort.
Update: I've got executor-agnostic terminals working using the following MRs; it currently works with the Shell and Kubernetes executors and can easily be extended to others. @ayufan could you take a look at these MRs when you have some time?
@ayufan Great, thanks! Just a quick note: I have a few hours free this week, but will not have coding time over the next 2 weeks as I'm travelling (I will still be able to respond to comments and make smaller changes like docs though)
We discussed this with @vickyc. It looks great. The biggest changes are:
Move the terminal server out of runner registration and into the build request: Runner creates a session to connect to and sends information about it to GitLab, where it is stored per build,
Instead of referring to this information as terminal_server, it is more like a runner_session_url, and terminal is just one of the endpoints exposed by that runner-facing API,
We are likely to generate a self-signed certificate on Runner in order to always use wss:// and perform certificate validation, to provide a secure communication channel (a rough sketch of this follows).
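As a very rough sketch of that last point (assumed names, not the actual Runner change): generate an in-memory self-signed certificate at startup and serve the session endpoint over TLS, so the advertised runner_session_url can be wss:// and GitLab can validate the certificate it received with the build.

```go
// Sketch: self-signed certificate for the runner's session server so the
// terminal endpoint is served over TLS (wss://). Hostname is a placeholder.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"math/big"
	"net/http"
	"time"
)

func selfSignedCert() (tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, err
	}
	tmpl := x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "gitlab-runner-session"},
		DNSNames:     []string{"runner.example.com"}, // the runner's reachable host
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(365 * 24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, &tmpl, &tmpl, &key.PublicKey, key)
	if err != nil {
		return tls.Certificate{}, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return tls.Certificate{}, err
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	// GitLab would be given certPEM alongside the session URL so it can
	// validate the connection; the key pair stays on the runner.
	return tls.X509KeyPair(certPEM, keyPEM)
}

func main() {
	cert, err := selfSignedCert()
	if err != nil {
		panic(err)
	}
	srv := &http.Server{
		Addr:      ":8093",
		TLSConfig: &tls.Config{Certificates: []tls.Certificate{cert}},
	}
	// Cert and key are already in TLSConfig, so the file paths stay empty.
	panic(srv.ListenAndServeTLS("", ""))
}
```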
I'm assigning this to myself to provide a working PoC of the Runner changes. @jramsay's Platform team is going to help with the Rails code changes. The needed Workhorse changes are merged.
@ayufan Could you please take a look at the updated description to see if it is accurate from an engineering perspective? Feel free to add any technical detail you think is relevant. Thanks!
Implement the ability to connect to a running job session from the GitLab UI. Once opened, the terminal will allow users to interact with the session. The session will stay open after the end of the execution for a specific amount of time, to allow users to do further investigations. Then it will be automatically closed to free up resources.
@bikebilly @ayufan Will the session expire after CI finishes even if I'm in the middle of doing something via the terminal? Or does it wait for the end of CI/CD and the terminal to finish? Does it then stay open a few more minutes in case I want to reconnect? Will all CI/CD jobs (run on k8s) be connectable in this way? Or do I need to start the job with a specific action to be connectable? If I don't connect, will all jobs stay active for X minutes just in case I want to connect, or is that only for jobs that I connect to while the job is running? What if the job is really fast and I can't connect within the Y-second window?
We discussed these possible options to guarantee the availability of jobs, and there were some concerns about balancing resource usage. I'll leave it to @ayufan to comment on the technical aspects of the implementation.
It depends, but the terminal will live as long as the CI job, and no longer than 30 minutes after the job. We might consider giving some controls to extend the terminal session, probably.
This will be needed for the Web IDE context. You don't want your terminal session to be closed early.
Each CI job is going to have terminal controls. No special care will be needed to have a terminal. You will be able to enter every job that is running and that you have access to.
The only requirement is going to be that the Runner Manager has to be accessible from GitLab. In the case of GitLab.com, this means that the Runner Manager has to be exposed on a public IP, which is not a big problem if you run on a VPS or use docker-machine. It might need an extra Service if running on Kubernetes, as we have to expose the runner's IP.
Ideally, we wouldn't waste any compute keeping runners open if jobs finish without anyone connecting via terminal.
To handle short jobs that might finish before you can connect, you could extend the runner to persist for a while, but this has a potentially large cost associated with it. An alternative would be to let people retry jobs with terminal enabled to explicitly stay alive for some number of minutes after the job finishes. This would be valuable functionality anyway, so might be preferable to the wasted compute model.
Ideally, the runner wouldn't die while you're actively doing something with it. But you'll probably need some threshold where you'll kill it regardless. Could be 24 hours though. And ideally you'd detect idle sessions and kill it after say 1 hour of not typing anything. That's what heroku run does, iirc.
Is there any security concern about people being able to do things in terminal that they otherwise wouldn't be able to do? e.g. you lock down production so it can only be deployed to by Masters/Owners, but you allow Developers to deploy to production if it is via a MR that has proper approvals and passing CI/CD. This means that if a Developer put any malicious code in .gitlab-ci.yml, at least it would have had to have been approved. But now we let them "debug" their CI/CD pipeline, and run any command they want, including dropping the production database.
Ideally, we wouldn't waste any compute keeping runners open if jobs finish without anyone connecting via terminal.
Yes, that is the goal of that approach: to not waste resources.
To handle short jobs that might finish before you can connect, you could extend the runner to persist for a while, but this has a potentially large cost associated with it. An alternative would be to let people retry jobs with terminal enabled to explicitly stay alive for some number of minutes after the job finishes. This would be valuable functionality anyway, so might be preferable to the wasted compute model.
This is a very interesting idea. If we gave a 1-minute threshold in case of failure, this would probably help. I expect that it might not be needed: if someone wants to access the terminal, I expect they will retry and immediately click Terminal, so they will reserve the session.
Ideally, the runner wouldn't die while you're actively doing something with it. But you'll probably need some threshold where you'll kill it regardless. Could be 24 hours though. And ideally you'd detect idle sessions and kill it after say 1 hour of not typing anything. That's what heroku run does, iirc.
I'm fine with letting it live, up to some very high timeout. So letting the Frontend decide and show the timeout of your session, and allowing you to extend it automatically or manually, would serve that purpose.
Is there any security concern about people being able to do things in terminal that they otherwise wouldn't be able to do? e.g. you lock down production so it can only be deployed to by Masters/Owners, but you allow Developers to deploy to production if it is via a MR that has proper approvals and passing CI/CD. This means that if a Developer put any malicious code in .gitlab-ci.yml, at least it would have had to have been approved. But now we let them "debug" their CI/CD pipeline, and run any command they want, including dropping the production database.
This is interesting. I guess that we will only allow running the terminal if you would normally have access to trigger that job, so we should support all the access rules that we have in place today. This gets quite complicated, but it is the only safe approach. The first iteration could limit this to Masters, but in the end we should allow Developers to access the terminal of test jobs.
@dimitrieh Nope, because we merged what we planned for %11.1. Yes, we should talk about UX, as the current Frontend (not exposed yet) is very minimalistic. So we are looking at involving some frontend work if needed, as all the backend is done.
@stefan_test It is present. It requires you to provide your own runner built from this MR: gitlab-runner!934 (merged). You can use it, but this is still alpha-quality work.
@dimitrieh @erushton We are going to need a frontend person to finish this work. It is working, but has a bunch of bugs: resizing, losing the connection when changing tabs, and the like. All the problems that are also present on the Environments Terminal.
We might want to follow up after the Web IDE team figures out how they want to use the web terminal, as each CI job web terminal session is time-limited to 30 minutes. This is runner-controlled.
If you switch tabs or windows, the terminal stops responding. It seems that the connection is dropped.
Resizing
I would expect the terminal to take up most of the browser window, or to behave similarly to the build log. I would have to show you how it works. I don't expect that we toggle a fullscreen mode.
This should work for both running jobs and completed jobs, correct?
Should we be able to summon a job without any job log, just to have a live environment to work in? If so, where would the base configuration come from?
If we support finished jobs, we probably need the following options, correct?
Restart without cache
Restart with cache
A way to notify the user when the job process is done and they can begin inspecting what has happened
Will it make use of the existing pipeline scope if a finished job is being debugged, or does a new pipeline start?
Design-wise:
Add an additional button on the sidebar inside of a job
I am considering a tabbed system that only becomes visible after pressing the button to debug. This would let the terminal live in the same place as the job log (just in different tabs), so you can easily switch between them without losing the session.
Some way to indicate to the user that there is a live process and how long it will remain available (probably on the debug tab page or sidebar!)
TODO
Some way to indicate in the jobs/pipeline list that this is a live process (in that case we might want to offer an option to jump right into the job terminal from there)
I don't think it makes too much sense to offer this option from places where you cannot see extended details about that job. Therefore I think it makes most sense to put it in the failed jobs tab on the pipeline detail page.
@dimitrieh We cannot really summon succeeded/failed jobs. Maybe one way would be restarting them, but it will not be that job anymore. We were thinking about something that would keep the job alive for some interval, like 1-5 minutes after failure, and notify you that you can enter it, but we have not implemented that yet. Currently, you can enter a job as long as it is running. So, in such a case, the correct workflow is: you retry, and you enter the terminal session.
So, in such a case, the correct workflow is: you retry, and you enter the terminal session.
@ayufan I had this in mind as well. I am thinking of making the debug button retry the job if it has already finished. That way we can show the button regardless of whether the job is finished or running.
@filipa Although I do not see a scrolling job log in the scope of this iteration, I do think it once again becomes something that we should think about implementing (this time tested correctly and with proper mobile support). cc: @bikebilly
I don't think it makes to much sense to offer this option from places where you cannot see extended details about that job. Therefore I think it makes most sense to put in the failed jobs tab in the pipeline detail page.
I don't think we have enough time to also do this in this iteration
What do you think about that? If that button were pressed on a finished build, it would retry it and move us to the live terminal tab on the job page right away.
Not a lot of technical effort, I just don't have the time: I'm OOO for a week and have 2 more deliverables that require my attention; there are 2 weeks left until code freeze, which means I have a week. I'm not even sure that I can do the rest, tbh.
Maybe we can plan these bigger improvements for %11.3? We could simply finish bug fixes first, and then have enough time to iterate on the terminal frontend. Would that be better? Right now we have a big deliverable, JUnit, and we should get it out the door.