Isn't the token field in the API result the only way to truly correlate with the data available locally on a Runner? The runner config.toml does not contain the integer id, the token is the only truly unique identifier present that can be used to guarantee identification, which is my main complaint that the local config.toml file does NOT include the registered Runner Id
@brett_jacobson On the Runner side you have the gitlab-runner verify command, that allows you to check if the runner entity is still present where it was registered. It uses the token (which is stored in config.toml) and works by authenticating the API request that the verify command is making. config.toml doesn't have the ID from GitLab's database, because it's not needed at all - in all Runner -> GitLab communication Runner is properly and uniquely identified by the API.
Could you elaborate more about the reason why you need the Runner ID on the Runner side? What use cases it helps you resolve?
Maybe there is a valid need that I don't see. But still, in that case we should work on allowing to request such ID through the API using the token authentication (maybe by extending the verify command output to print such ID additionally to the check status) and maybe even storing it in config.toml - as totally unused value, useful only for the user.
Because of how the authentication works right now, not showing the token in the API, after it was created and passed to the Runner during registration, is the only proper and secure way to go.
BTW we've started a discussion about changing how the Runner's API authentication is handled, and one of the ideas also proposes the need for using the Runner ID. But in this case the value would be fully usable and required to be stored in the config.toml
Read the runner api auth discussion; I'd agree the runner token being stored in config.toml is like storing a password in a text file.
As for the "use case", right now its the only way to correlate data; For example, we can query the gitlab api to find out things about runners that the server knows about, and id is their key. However, Prometheus metrics for the runner the only correlatable identifier IS the runner token:
gitlab_runner_jobs_total{instance="10.164.57.28:9252",job="windows-gitlab-runners",runner="hyufGzbg"}
Ideally, there would be an id label in the prometheus metrics (but, the runner does not KNOW its id). For now, the "instance" label is uncorrellatable as no gitlab runner API data has that matching value.
I totally agree runner token should be removed from config.toml, for security reasons. But unfortunately, there is a deficiency in prometheus metrics, that will become all but useless if there is no correlateable identifier label. I assume this should spawn an issue.
@brett_jacobson To make it clear - the credentials for Runner's authentication in the GitLab API are and for now will be stored in config.toml. Whether it will be the token, a private key or any other credentials type. Runner must store it somehow to be able to use them. And if someone have access to the Runner host, then this is way bigger security issue (or trust for such person of course!) than only knowing the API.
The problem that we have here is that this token is intended to be used by the Runner only. But it's stored not only in the Runner's configuration file. It's stored also in the DB and with current API version, you can even read it from the API response. The current token is also not easily rotateable - if it was revealed the only way to secure the credentials is to re-register the Runner. And whoever will obtain this token is able to act "as the runner". In many cases it's OK-ish, because access to the API is already limited to people that have access to the project/group/GitLab instance. But what about a simple case when someone notes the value of such token obtained from the API when he is the member of the team and then, for example, leaves the company? Having such token can do anything "impersonating" the Runner, while neither is allowed to anymore nor may have good intentions.
This is why I propose to use an asymmetric keys pair for this - with this only the person who have access to Runner's host (which usually is way more limited than "who is the project maintainer") would know this value. And reading the public key value from GitLab's database (or API response like we have now with the token) would be not a problem - you can't authenticate with a public key. You must have the private one, which is stored only in the config.toml. But, this is the future and it will not happen probably within next few releases. While we need to secure the communication before then.
So, to summary, I propose to go forward with such steps:
We continue to merge this MR. And I don't think we will change our minds about removing the token field from the API.
We should update Runner to store an - unused for now - ID of the Runner in the configuration file during registration. We should also extend the verify command to allow already registered Runners to obtain and save this ID.
Thanks @tmaczukin. I created gitlab-runner#25441 to request this id be added to the Prometheus metrics, since they'll become uncorrelatable without the runner token label.
Should we remove it with a feature flag before deleting the code entirely?
It makes sense to have a FF just in case, we may rollback instantly if we need it. However, here we are introducing a breaking change , so I am still not sure about this.
I have a mixed feelings. The goal is to remove this field entirely and as a breaking change in the API I'd say it should be done in the major release. Which means either now (without a FF) or in 14.0 (which is another year of this field exist in some form with all the consequences that it have).
We've already mentioned this incoming breaking change in the previous release blog post, and the 13.0 release is just a finalization. I'd say that with the major release we expect some things to break (that is why we accept this only on major releases, which is done only once each few months). I'm not sure what purpose FF would have in this case. We want people to stop using this field starting with 13.0 and further.
Reading the discussion at gitlab-com/Product#1127 (comment 333994512) I'd say that FF is the way to handle a breaking change that changes the behavior. Not the one that removes something completely
@marin@joshlambert any advises how to proceed with "code removal" breaking change on a major release?
If we can put removals behind a FF that is ideal, as we can communicate as common date to our users on when things will happen. Right now it's hard, as it's based on when the change lands. This means users start to see breaking changes streaming in at various points in the 13.0 dev cycle, which is not ideal.
So if we can go with something like: gitlab-com/Product#1127 (closed) that would be best. Understand that it may be difficult to do now within the timeframe of 13.0, or at all in some cases depending on the change.
For next year though, I think we need to have a much clearer breaking change messaging to our users. We will have far higher paying customers on GitLab.com.
This breaking change actually broke our fully automated GitLab Runner operator for gitlab.com, our whole CI system was down for 3 days due to that change and I am really not amused! As a SaaS or Cloud service provider one simply does not introduce breaking changes into and existing REST API version. For users of those API(s) it's very hard to follow each single deprecation announcement, but it is relatively easy to track the API version one is using. Personally, I would really appreciate if GitLab were using semantic versioning for its API, only introducing breaking changes with new major versions.
We understand that it is not suitable to make a breaking change in API without a version change. However we needed to remove token attribute from the API as soon as possible because of a security issue. Still, we postponed this removal from %12.10 to %13.0 to show a warning beforehand.
One thing we needed to do was that we should have made this change on the official release of %13.0. To this end, we've just rolled back this, so token is available right now. However, it will not be visible again after 22nd May.
does this breaking change require an upgrade of runners? what is the lowest version runner can work with this breaking change applied (i.e gitlab server is upgraded to 13.x and runner is 11.9)
We have runners running on preemptive GKE nodes using helm chart. When node is replaced, we re-use runner token if it exists for a specific runner, and only re-register if no token.
Without token, each time preemptive node is replaced, a new runner is registered. For locked shared runner the runner becomes open until we reassign projects to the shared runner.
As others have noted, we were similarly leveraging the returned runner_token to automatically provision runners. Manually registering runners is untenable across hundreds of nodes in our center not to mention that they'll be orphaned on stateless nodes where a config.toml will not exist after reboot.
Certainly I understand why this was removed from a security standpoint, but I'm disappointed that there was little thought around mitigation for those who were impacted. I thought I might have a workaround via destroying and re-registering all runners associated with a host, but even that won't work because runner deletion can only be authenticated using a runner_token. At least if I could delete a runner with an admin PAT, I would have some path forward, but that does not work. I'm privileged enough through the UI, why not programmatically?