This meta issue has been closed in favor of the single source of truth in epic &337 (closed) and should only be considered a source of historical discussion.
Problem to solve
CI/CD jobs are executed by runners based on the configuration provided by users in their pipeline definition. But this execution is not interactive and, in case of failure, users cannot dig into details to spot the possible source of the problem.
When a test fails, users may want to inspect files and run commands. This would be a huge improvement over the current situation, where the job output is the only feedback they get.
This would be amazing. One of the biggest complaints with any CI system is debugging the CI jobs when they break. Pushing garbage commits through the pipeline is less than ideal. Of course you can run the GitLab Runner locally, but that takes quite a bit of setup.
Is the idea to allow this for all Job types that the runners support? (Docker, autoscaling docker, SSH, etc.)
Being able to SSH into CI is indeed very valuable. It does require leaving the job container lying around for some period of time (like 30 minutes) just so you can connect to it. Or run it again and connect the terminal to the new container.
Yeah, that sounds feasible. Keep the container around by default for 0 seconds, but allow configuration in both the gitlab-runner config.toml and the CI file to keep it for up to 1 hour? And start and connect to it via the UI?
Would you like me to write up some more specs and scope?
It should be possible to stop the container and re-run it on the same runner when a user wants to debug it. In addition to this, it would take a garbage collector that destroys containers once they are too old.
@alexispires Depends on your runner usage. For GitLab.com, for example, we flush each runner after a single job is run. There's no re-use of runners at all for security purposes.
I'm focusing on the Kubernetes executor because web terminal integration for it already exists in environments and review apps. The way I imagine this will work is similar to what CircleCI does:
By default, web terminal is disabled for builds
Every completed build displays a Retry with Web Terminal Access button
Once the build starts, the user can click the web terminal icon and connect
We'll retain the Kubernetes pod/container for a fixed period (say, 30 minutes) after the build ends, in case they want to inspect the final state
If the user no longer needs the terminal, they can click a button to free up retained resources (what should this be called? A "stop" icon with a tooltip might suffice)
Alternatively, we could try to make it so that any running build on Kubernetes has the web terminal enabled, and we retain the container after build completion as long as the user is connected (up to a max retention period of, say, 30 minutes or 1 hour). This is a more flexible approach for the following reasons, although I'm yet to evaluate its technical feasibility:
To troubleshoot failed builds (which was the original use-case), the user can simply click the existing Retry button and connect
In addition, it satisfies use-cases where logging into a running build container would be handy even though there were no build failures - concrete examples of such use-cases are welcome
Could someone from the UX or product team weigh in on this?
Proposed implementation
The code changes that I think will be required for this:
Connecting web terminal to CI build pods in Kubernetes via websockets (I've got this working)
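For illustration only (this is not the code from the MRs above): a minimal sketch of attaching an interactive shell to a CI build pod through the Kubernetes exec subresource with client-go, which is the streaming connection a web terminal would ultimately be bridged onto. The namespace, pod, container, and shell names are placeholders.

```go
// Minimal sketch: exec an interactive shell in a build pod via client-go.
// Namespace, pod and container names below are placeholders.
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

func main() {
	// Load cluster credentials the same way kubectl does.
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Build the exec request against the build pod.
	req := client.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace("gitlab-ci").
		Name("runner-build-pod").
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: "build",
			Command:   []string{"/bin/sh"},
			Stdin:     true,
			Stdout:    true,
			Stderr:    true,
			TTY:       true,
		}, scheme.ParameterCodec)

	// The executor performs the streaming protocol upgrade; a web terminal
	// would pump websocket frames into these streams instead of os.Stdin/out.
	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		panic(err)
	}
	if err := exec.Stream(remotecommand.StreamOptions{
		Stdin:  os.Stdin,
		Stdout: os.Stdout,
		Stderr: os.Stderr,
		Tty:    true,
	}); err != nil {
		panic(err)
	}
}
```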
Well, this gets tricky. I'm not sure whether we should mess with GitLab Runner's responsibility for taking care of Pods, as it is GitLab Runner that manages the Pods' lifecycle. If we go this way we might end up with a "dual brain" situation.
I would rather see all terminal communication go through GitLab Runner. GitLab Runner can expose a secured communication channel, and GitLab forwards the websocket connection to GitLab Runner. GitLab Runner is in full control of the process and the lifecycle of the pod. It also makes it possible to work with every type of executor, not only Kubernetes, and it does not require any special labeling, as Runner knows exactly what was created.
The above requires slightly more work to glue all the needed parts together.
Well, an even simpler solution would be for Runner to create a secure SSH server for each running build, which the user could access with a login and password provided by Runner for the remote machine. Then you could use external tools for accessing the terminal.
I'm not sure whether we should mess with GitLab Runner's responsibility for taking care of Pods, as it is GitLab Runner that manages the Pods' lifecycle
I would rather see all terminal communication go through GitLab Runner.
Currently, websocket connections go through gitlab-workhorse for environments/review apps. If Runner manages its own websocket connections, won't it end up duplicating some logic from there?
I only vaguely understand what you mean by "dual brain". Could you give a concrete example of why that could be a problem? Just so I can understand the nuances involved.
The above requires creating slightly more to perform gluing all needed parts.
Yes. At the moment it's unclear to me how much gluing your approach would involve. Currently the logic of interacting with Kubernetes is in the model; I'm not sure how much of that would have to change. Also, the interaction between GitLab and Runner is currently not clear to me - it might be worth adding it to the architecture diagram.
Well, an even simpler solution would be for Runner to create a secure SSH server
Yes, that's an alternative (https://gitlab.com/gitlab-org/gitlab-ce/issues/22319). But the rationale behind providing web terminal for CI would be the same as the rationale behind providing it for environments/review apps. In other words, if shell access is sufficient for CI builds, it should have been sufficient for environments/review apps. I suspect there are good reasons (either from the user perspective or from implementation perspective) to favour the web terminal approach although I'm not aware what they are - would appreciate some insight on this point.
I think the reason to prefer the web terminal approach, at least for Kubernetes, is simply that setting up ingress from outside the cluster into the running pod is non-trivial (it would require deploying additional components in the cluster that would manage ingress, if I understand correctly). This is also hinted at in the 8.15 release post:
Working together with your container scheduler, GitLab happily spins up several (dynamic) environments on request for your projects. Be that for review apps or a staging or production environment. Traditionally, getting direct access to these environments has been a little painful. And that's a shame: it's very useful to quickly try something in a live environment to debug a problem, or just to experiment.
The Kubernetes approach is definitely MVP, but it also makes things slightly harder, as we have two sources of truth: GitLab Rails and Runner. I wonder if we can figure out a way to make it executor-agnostic on the Runner side and effectively make it work with every executor.
Currently, we can use the Kubernetes API, but our implementation is very generic and can talk to any endpoint, so technically it is feasible for Runner to expose a websocket which could be used for that purpose. If Runner provides a WSS endpoint with some token authentication, GitLab would connect to Runner, and Runner would perform kubectl exec, docker exec, or whatever, and would be able to keep the job alive for as long as requested by Rails. Otherwise, you hit dual-brain problems: 1. the runner finished and is going to delete the environment; 2. you connected via exec, but you want to keep it running; 3. we cannot instruct the runner not to delete it, as then you have to reimplement the same cleanup logic on the Rails side.
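As a rough sketch of that idea (assumptions only, not GitLab Runner's actual code): a runner-side endpoint that checks a per-job token, upgrades the connection to a websocket, and bridges it to a shell spawned next to the job (here via docker exec; other executors would spawn their own equivalents). The route, token, and container name are made up.

```go
// Sketch of a runner-exposed terminal endpoint bridging a websocket to a
// shell inside the job's container. Token, route and container name are
// illustrative placeholders.
package main

import (
	"net/http"
	"os/exec"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{}

func terminalHandler(w http.ResponseWriter, r *http.Request) {
	// Reject connections that do not carry the job's session token.
	if r.Header.Get("Authorization") != "Bearer job-session-token" {
		http.Error(w, "forbidden", http.StatusForbidden)
		return
	}
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	defer conn.Close()

	// Attach a shell inside the job's container; the Kubernetes executor
	// would use the exec API instead of the docker CLI.
	cmd := exec.Command("docker", "exec", "-i", "job-container", "/bin/sh")
	stdin, _ := cmd.StdinPipe()
	stdout, _ := cmd.StdoutPipe()
	if err := cmd.Start(); err != nil {
		return
	}
	defer cmd.Process.Kill()

	// Pump websocket messages into the shell's stdin...
	go func() {
		for {
			_, msg, err := conn.ReadMessage()
			if err != nil {
				return
			}
			stdin.Write(msg)
		}
	}()

	// ...and stream the shell's output back over the websocket.
	buf := make([]byte, 4096)
	for {
		n, err := stdout.Read(buf)
		if n > 0 {
			if conn.WriteMessage(websocket.BinaryMessage, buf[:n]) != nil {
				return
			}
		}
		if err != nil {
			return
		}
	}
}

func main() {
	http.HandleFunc("/terminal", terminalHandler)
	// In the proposed design this would be served over wss:// with a
	// runner-generated certificate; plain HTTP keeps the sketch short.
	http.ListenAndServe(":8093", nil)
}
```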
Thanks @vickyc and @ayufan for the technical evaluation of this scenario.
Related to the user flow, the "retry with terminal" option unfortunately doesn't cover situations where running the job again will not reproduce the error (flaky tests) or may differ for any other reason. Maybe this can be a possible flow, but we could also provide a way to run any job with a terminal, maybe with a specific keyword in .gitlab-ci.yml.
So, the default is to run without a terminal, and retry with a terminal is allowed. A keyword allows a job to run directly with a terminal.
@bikebilly Well, how is that different from just running a job? Maybe it is more like: we always allow you to enter the build, and keep it as long as you are attached (but with some time limit, like 30 minutes). Then there is no need to have "Retry with terminal", as this is basically the same, always.
@ayufan Perfect! I was thinking that keeping images for every job could cause problems for performance or infrastructure, but if this is not the case we can just keep images for some time after the job ends and provide a way to jump in.
We always allow you to enter the build, and keep it as long as you are attached (but with some time limit, like 30 minutes). But if you don't attach, we just destroy the container as usual. This means that you will never be able to attach after a build completes. So if a user doesn't use this feature, there will be no behaviour change or performance penalty for them.
In your use-case of flaky tests, you can simply run the build and attach to it while it's running. If the build fails, you have 30 minutes to do your debugging. If it succeeds, you retry the job as usual and attach to the new one. Hope that makes sense
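To make the "keep it only while you are attached" behaviour concrete, here is a tiny sketch with made-up names and durations (not Runner internals): cleanup waits until no terminal has been attached for a grace period, and never waits longer than a hard cap after the job finishes.

```go
// Sketch of cleanup that is delayed while a terminal is attached, capped at
// a maximum retention period. Durations and channel protocol are assumptions.
package main

import (
	"fmt"
	"time"
)

const (
	idleGrace = 2 * time.Minute  // how long to wait without an attached terminal
	hardCap   = 30 * time.Minute // absolute limit after the job finishes
)

// waitBeforeCleanup blocks until the job environment may be destroyed.
// events receives true when a terminal attaches and false when it detaches.
func waitBeforeCleanup(events <-chan bool) {
	hardStop := time.After(hardCap)
	idle := time.NewTimer(idleGrace)
	defer idle.Stop()
	attached := false
	for {
		select {
		case <-hardStop:
			return // never keep the environment past the hard cap
		case <-idle.C:
			if !attached {
				return // nobody was connected during the grace period
			}
			idle.Reset(idleGrace) // still attached: check again later
		case attached = <-events:
			// Record the current attachment state.
		}
	}
}

func main() {
	events := make(chan bool)
	go func() {
		events <- true              // user opens the web terminal
		time.Sleep(3 * time.Second) // ...does some debugging...
		events <- false             // ...and closes it again
	}()
	waitBeforeCleanup(events)
	fmt.Println("cleaning up job environment")
}
```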
Oh I see. It is valuable for sure, but it still doesn't address the case where you only realize you want to investigate after the job has ended (failed). In that case you have to rerun and attach immediately, which is absolutely awesome, but not when jobs are not reproducible.
In your use-case of flaky tests, you can simply run the build and attach to it while it's running. If the build fails, you have 30 minutes to do your debugging.
If you split your tests into, let's say, 50 jobs, it is quite hard to attach to all of them, and you don't know in advance where your flaky test is.
What are the possible problems with keeping all the jobs "available" for 30 minutes or so, even if not requested, and allowing attachment even after they have ended? If this is not doable, we can proceed as described and not consider the edge cases (like flaky tests).
@bikebilly This implies a very high cost of keeping jobs around. I see a value in that. We might delay cleanup to always keep the job "available" for a limited time, like 30s-1m after the failure, but not 30 minutes.
This is also a problem for people who have a runner with a single concurrent job configured: we cannot run the next job until this one is destroyed.
@ayufan Yes, I see the problem, but I'm not sure 1 minute is enough to go there and attach.
Let's move forward with the "attach while running" approach; it seems the best first iteration, and we can improve in the future if some scenario becomes important enough to justify the effort.
Update: I've got executor-agnostic terminals working using the following MRs; it currently works with the Shell and Kubernetes executors and can easily be extended to others. @ayufan could you take a look at these MRs when you have some time?
@ayufan Great, thanks! Just a quick note: I have a few hours free this week, but will not have coding time over the next 2 weeks as I'm travelling (I will still be able to respond to comments and make smaller changes like docs though)
We discussed this with @vickyc. It looks great. The biggest changes are:
Move the terminal server out of runner registration and into the build request: Runner creates a session to connect to and sends information about it to GitLab, where it is stored per build,
Instead of referring to this information as terminal_server, it is more like a runner_session_url, and terminal is just one of the endpoints exposed by that runner-facing API,
We are likely to generate a self-signed certificate on Runner in order to always use wss:// and perform certificate validation, to provide a secure communication channel (a rough sketch of this follows).
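As a very rough sketch of that last point (assumed names, not the actual Runner change): generate an in-memory self-signed certificate at startup and serve the session endpoint over TLS, so the advertised runner_session_url can be wss:// and GitLab can validate the certificate it received with the build.

```go
// Sketch: self-signed certificate for the runner's session server so the
// terminal endpoint is served over TLS (wss://). Hostname is a placeholder.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"math/big"
	"net/http"
	"time"
)

func selfSignedCert() (tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, err
	}
	tmpl := x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "gitlab-runner-session"},
		DNSNames:     []string{"runner.example.com"}, // the runner's reachable host
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(365 * 24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, &tmpl, &tmpl, &key.PublicKey, key)
	if err != nil {
		return tls.Certificate{}, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return tls.Certificate{}, err
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	// GitLab would be given certPEM alongside the session URL so it can
	// validate the connection; the key pair stays on the runner.
	return tls.X509KeyPair(certPEM, keyPEM)
}

func main() {
	cert, err := selfSignedCert()
	if err != nil {
		panic(err)
	}
	srv := &http.Server{
		Addr:      ":8093",
		TLSConfig: &tls.Config{Certificates: []tls.Certificate{cert}},
	}
	// Cert and key are already in TLSConfig, so the file paths stay empty.
	panic(srv.ListenAndServeTLS("", ""))
}
```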
I'm assigning this to myself to provide a working PoC of the Runner changes. @jramsay's Platform team is going to help with the Rails code changes. The needed Workhorse changes are merged.
@ayufan Could you please take a look at the updated description to see if it is accurate from an engineering perspective? Feel free to add any technical detail you think is relevant. Thanks!
Implement the ability to connect to a running job session from the GitLab UI. Once opened, the terminal will allow users to interact with the session. The session will stay open after the end of the execution for a specific amount of time, to allow users to do further investigations. Then it will be automatically closed to free up resources.
@bikebilly @ayufan Will the session expire after CI finishes even if I'm in the middle of doing something via the terminal? Or does it wait for the end of CI/CD and the terminal to finish? Does it then stay open a few more minutes in case I want to reconnect? Will all CI/CD jobs (run on k8s) be connectable in this way? Or do I need to start the job with a specific action to be connectable? If I don't connect, will all jobs stay active for X minutes just in case I want to connect, or is that only for jobs that I connect to while the job is running? What if the job is really fast and I can't connect within the Y-second window?
We discussed these possible options to guarantee the availability of jobs, and there were some concerns about balancing resource usage. I'll leave it to @ayufan to comment on the technical aspects of the implementation.
It depends, but the terminal will live as long as the CI job, and no longer than 30 minutes after the job. We might consider giving some controls to extend the terminal session, probably.
This will be needed for the Web IDE context. You don't want your terminal session to be closed early.
Each CI job is going to have terminal controls. No special care will be needed to have a terminal. You will be able to enter every job that is running and that you have access to.
The only requirement is going to be that the Runner Manager has to be accessible from GitLab. In the case of GitLab.com, this means that the Runner Manager has to be exposed on a public IP, which is not a big problem if you run on a VPS or use docker-machine. It might need an extra Service if running on Kubernetes, as we have to expose the runner's IP.
Ideally, we wouldn't waste any compute keeping runners open if jobs finish without anyone connecting via terminal.
To handle short jobs that might finish before you can connect, you could extend the runner to persist for a while, but this has a potentially large cost associated with it. An alternative would be to let people retry jobs with terminal enabled to explicitly stay alive for some number of minutes after the job finishes. This would be valuable functionality anyway, so might be preferable to the wasted compute model.
Ideally, the runner wouldn't die while you're actively doing something with it. But you'll probably need some threshold where you'll kill it regardless. Could be 24 hours though. And ideally you'd detect idle sessions and kill it after say 1 hour of not typing anything. That's what heroku run does, iirc.
Is there any security concern about people being able to do things in terminal that they otherwise wouldn't be able to do? e.g. you lock down production so it can only be deployed to by Masters/Owners, but you allow Developers to deploy to production if it is via a MR that has proper approvals and passing CI/CD. This means that if a Developer put any malicious code in .gitlab-ci.yml, at least it would have had to have been approved. But now we let them "debug" their CI/CD pipeline, and run any command they want, including dropping the production database.
Ideally, we wouldn't waste any compute keeping runners open if jobs finish without anyone connecting via terminal.
Yes, that is the goal of that approach: to not waste resources.
To handle short jobs that might finish before you can connect, you could extend the runner to persist for a while, but this has a potentially large cost associated with it. An alternative would be to let people retry jobs with terminal enabled to explicitly stay alive for some number of minutes after the job finishes. This would be valuable functionality anyway, so might be preferable to the wasted compute model.
This is a very interesting idea. If we gave a 1-minute threshold in case of failure, this would probably help. I expect that it might not be needed: if someone wants to access the terminal, I expect they will retry and immediately click Terminal, so they will reserve the session.
Ideally, the runner wouldn't die while you're actively doing something with it. But you'll probably need some threshold where you'll kill it regardless. Could be 24 hours though. And ideally you'd detect idle sessions and kill it after say 1 hour of not typing anything. That's what heroku run does, iirc.
I'm fine with letting it live, up to some very high timeout. So letting the Frontend decide and show the timeout of your session, and allowing you to extend it automatically or manually, would serve that purpose.
Is there any security concern about people being able to do things in terminal that they otherwise wouldn't be able to do? e.g. you lock down production so it can only be deployed to by Masters/Owners, but you allow Developers to deploy to production if it is via a MR that has proper approvals and passing CI/CD. This means that if a Developer put any malicious code in .gitlab-ci.yml, at least it would have had to have been approved. But now we let them "debug" their CI/CD pipeline, and run any command they want, including dropping the production database.
This is interesting. I guess that we will only allow running the terminal if you would normally have access to trigger that job, so we should support all the access rules that we have in place today. This gets quite complicated, but it is the only safe approach. The first iteration could limit this to Masters, but in the end we should allow Developers to access the terminal of test jobs.
@dimitrieh Nope, because we merged what we planned for %11.1. Yes, we should talk about UX, as the current Frontend (not exposed yet) is very minimalistic. So we are looking at involving some frontend work if needed, as all the backend is done.
@stefan_test It is present. It requires you to provide your own runner built from this MR: gitlab-runner!934 (merged). You can use it, but this is still alpha-quality work.
@dimitrieh @erushton We are going to need a frontend person to finish this work. It is working, but has a bunch of bugs: resizing, losing the connection when changing tabs, and the like. All the problems that are also present on the Environments Terminal.
We might want to follow up after the Web IDE team figures out how they want to use the web terminal, as each CI job web terminal session is time-limited to 30 minutes. This is runner-controlled.
If you switch tabs or windows, the terminal stops responding. It seems that the connection is dropped.
Resizing
I would expect the terminal to take up most of the browser window, or to behave similarly to the build log. I would have to show you how it works. I don't expect that we toggle a fullscreen mode.
This should work for both running jobs and completed jobs, correct?
Should we be able to summon a job without any job log, just to have a live environment to work in? If so, where would the base configuration come from?
If we support finished jobs, we probably need the following options, correct?
Restart without cache
Restart with cache
A way to notify the user when the job process is done and they can begin inspecting what has happened
Will it make use of the existing pipeline scope if a finished job is being debugged, or does a new pipeline start?
Design-wise:
Add an additional button on the sidebar inside of a job
I am considering a tabbed system that only becomes visible after pressing the button to debug. This would let the terminal live in the same place as the job log (just in different tabs), so you can easily switch between them without losing the session.
Some way to indicate to the user that there is a live process and how long it will remain available (probably on the debug tab page or sidebar!)
TODO
Some way to indicate in the jobs/pipeline list that this is a live process (in that case we might want to offer an option to jump right into the job terminal from there)
I don't think it makes too much sense to offer this option from places where you cannot see extended details about that job. Therefore I think it makes most sense to put it in the failed jobs tab on the pipeline detail page.
@dimitrieh We cannot really summon succeeded/failed jobs. Maybe one way would be restarting them, but it will not be that job anymore. We were thinking about something that would keep the job alive for some interval, like 1-5 minutes after failure, and notify you that you can enter it, but we have not implemented that yet. Currently, you can enter a job as long as it is running. So, in such a case, the correct workflow is: you retry, and you enter the terminal session.
So, in such a case, the correct workflow is: you retry, and you enter the terminal session.
@ayufan I had this in mind as well. I am thinking of making the debug button retry the job if it has already finished. That way we can show the button regardless of whether the job is finished or running.
@filipa Although I do not see a scrolling job log in the scope of this iteration, I do think it once again becomes something that we should think about implementing (this time tested correctly and with proper mobile support). cc: @bikebilly
I don't think it makes to much sense to offer this option from places where you cannot see extended details about that job. Therefore I think it makes most sense to put in the failed jobs tab in the pipeline detail page.
I don't think we have enough time to also do this in this iteration
What do you think about that? If that button were pressed on a finished build, it would retry it and move us to the live terminal tab on the job page right away.
Not a lot of technical effort, I just don't have the time: I'm OOO for a week and have 2 more deliverables that require my attention; there are 2 weeks left until code freeze, which means I have a week. I'm not even sure that I can do the rest, tbh.
Maybe we can plan these bigger improvements for %11.3? We could simply finish bug fixes first, and then have enough time to iterate on the terminal frontend. Would that be better? Right now we have a big deliverable, JUnit, and we should get it out the door.