GitLab CI/CD provides a caching mechanism that can be used to save time when your jobs are running. [...] The most common use case of caching is to avoid downloading content like dependencies or libraries repeatedly between subsequent runs of jobs.
However, the protected status of the branch on which the pipeline is running is not taken into account. This means that a user with developer permissions, or any permissions allowing them to run a pipeline, could poison the cache, backdoor those "dependencies or libraries", and gain code execution in the context of a protected branch, and even in the context of the final build that will be deployed to other environments.
Steps to reproduce
Set up a GitLab project with the following .gitlab-ci.yml file on the master branch, which is protected by default.
```yaml
cache:
  # This is recommended in https://docs.gitlab.com/ee/ci/caching/
  # It is considered "safe from accidentally overwriting the cache".
  key: "$CI_JOB_NAME-$CI_COMMIT_REF_SLUG"
  paths:
    - critical_binary

stages:
  - test

test:
  stage: test
  image: ubuntu
  script:
    - echo "Building critical_binary if it isn't cached."
    - if [ ! -f critical_binary ]; then echo 'echo critical_binary is safe' > critical_binary; else echo "Reusing cached critical_binary."; fi;
    - echo "Executing critical_binary."
    - sh ./critical_binary
```
Observe that the pipeline ran successfully and that its output contains "critical_binary is safe".
Create a new branch feature/cache-poisoning, edit the .gitlab-ci.yml file to the following:
```yaml
cache:
  # If the key is random it would still be possible to poison the cache
  # once the pipeline started because it shows the key. It would
  # become a race condition, it's probably easy to inject in
  # between long running jobs.
  key: "test-master" # Easy to guess this one.
  paths:
    - critical_binary

stages:
  - test

test:
  stage: test
  image: ubuntu
  script:
    - echo "Overwriting critical_binary with evil code."
    - echo 'echo critical_binary is evil' > critical_binary
```
Observe that the pipeline for feature/cache-poisoning ran successfully and that its output contains "Overwriting critical_binary with evil code".
Start a new pipeline for master and observe that its output now contains "critical_binary is evil", showing that its cache has been poisoned and that arbitrary code execution on a protected branch has been achieved by a user with only the privileges to run a pipeline for the unprotected branch feature/cache-poisoning.
Impact
If a user has enough privileges to push code and trigger a pipeline on any branch (e.g. they are a developer), then they can poison the cache for a protected branch and execute arbitrary code. This elevates their privileges considerably. For example:
A developer could leak sensitive information from protected environment variables (deploy tokens, registry access, AWS tokens, etc.).
A developer could alter the build process and backdoor the final product without detection, because they wouldn't be touching the git repo.
Examples
In addition to the previous explanation of how to reproduce this, I set up a demo project on gitlab.com.
Jobs running for an unprotected branch can overwrite the cache for those running on a protected branch.
What is the expected correct behavior?
Jobs running for an unprotected branch should not be able to influence those running on a protected branch.
Cache for protected and unprotected branches should be separate. Maybe cache keys should have a forced prefix indicating the privilege with which the job is running.
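For illustration, a minimal sketch of the suggested separation, assuming it were expressed through the existing predefined variable CI_COMMIT_REF_PROTECTED (the cached path reuses the critical_binary example from the reproduction steps). Note that a key chosen in the project's own YAML is only a convention, not an enforcement: an attacker editing their branch's .gitlab-ci.yml can pick any key they like, so the separation would have to be applied server-side.

```yaml
# Sketch only: CI_COMMIT_REF_PROTECTED is a real predefined variable
# ("true"/"false"), but a user-supplied key like this cannot enforce the
# separation, since a malicious branch can simply declare a different key.
cache:
  key: "$CI_COMMIT_REF_PROTECTED-$CI_JOB_NAME-$CI_COMMIT_REF_SLUG"
  paths:
    - critical_binary
```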
Relevant logs and/or screenshots
N/A
Output of checks
This bug happens on GitLab.com
Impact
(Copied from Impact in summary.)
If a user has enough privileges to push code and trigger a pipeline on any branch (e.g. they are a developer), then they can poison the cache for a protected branch and execute arbitrary code. This elevates their privileges considerably. For example:
A developer could leak sensitive information from protected environment variables (deploy tokens, registry access, AWS tokens, etc.).
A developer could alter the build process and backdoor the final product without detection, because they wouldn't be touching the git repo.
Attachments
Warning: Attachments received through HackerOne, please exercise caution!
Thank you for your submission. I hope you are well. Your report is currently being reviewed and the HackerOne triage team will get back to you once there is additional information to share.
I was able to validate your report, and have submitted it to the appropriate remediation team for review. They will let us know the final ruling on this report, and when/if a fix will be implemented. Please note that the status and severity are subject to change.
@erushton A HackerOne reporter pointed out that the caching mechanism doesn't make any distinction between protected and regular tags/branches and it's really easy for a developer to poison the cache and potentially steal protected variables without needing to merge the malicious branch/MR.
How much of a technical challenge would it be to have a separation of the cache between protected branches/tags and regular branches/tags?
@dcouture this seems kinda bad. I'm surprised you only gave this a severity3
How much of a technical challenge would it be to have a separation of the cache between protected branches/tags and regular branches/tags?
Hopefully not much, but it's also not a trivial change to make.
And would that make our storage costs explode?
If we have a unique cache for each protected branch, and then a generic unprotected cache, then it's roughly a linear increase in cost based on the average number of protected branches. So if the projects we host have on average 1 protected branch, we'd be doubling the cost, and if they have on average 2 protected branches, we'd be tripling our cache storage cost. The average number of protected branches is likely less than 1 (not everyone uses the feature), so it's probably less than doubling the cost.
FYI @ajwalker because you love everything cache and artifact related I'll probably get you to propose a solution here
Heh good call out @erushton, we were having a discussion on Slack about the full impact of this and indeed it is upgraded to severity2. It does "feel" severe but we've been trying to use CVSS internally to normalize our severity ratings and that one was kind of tough to rate properly on that scale.
Cache for protected and unprotected branches should be separate. Maybe cache keys should have a forced prefix indicating the privilege with which the job is running.
I wonder how much this would break if we were to take this approach by default? On the other hand, if everything is world writable by default, that could be worse.
Unless I'm mistaken, there's not really "protected and unprotected branches", right? My understanding is what is and isn't protected is based upon the expression rules provided for the job.
A prefix of the branch alone might not cut it: it might be entirely valid to have different branches, which require different expression rules, wanting to write to the same cache key, while at the same time denying other branches the same privilege.
Thinking out loud, and I might come back to this later and find it sounds silly:
I wonder if we could support arbitrary rules using the expression engine, but have these rules themselves compile down into a prefix key? A single modification to the rules would work exactly as if the key itself had been modified: the cache would invalidate.
GitLab would only allow pushes to the cache if the rules pass.
The cache key would resolve to: sha256(combined_cache_rules_expression) + "test-master".
The only possible way for me to target the same cache key to push to, would be to use an identical set of rules. However, GitLab would execute this and it wouldn't evaluate to true for the branch I am on, so writes to the cache would be denied.
This way, I could write rules that allow multiple branches the ability to write to a cache key, whilst denying others. Or you could only allow certain authors (maybe certain roles?) that ability, or have free-for-all Mondays etc.
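To make the idea concrete, here is a purely hypothetical sketch (this cache-level rules syntax does not exist in GitLab today) of how rules could both gate cache writes and feed into the effective key:

```yaml
# Hypothetical syntax, for illustration of the proposal only.
cache:
  key: "test-master"      # user-visible portion of the key
  rules:                  # hypothetical: evaluated by GitLab before allowing a cache push
    - if: '$CI_COMMIT_BRANCH == "master"'
  paths:
    - critical_binary
# Effective key (conceptually): sha256(<serialized rules>) + "-test-master".
# Changing the rules changes the hash, so the cache is invalidated, and a job
# whose rules don't evaluate to true is denied the push.
```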
I knew we had "protected branches", controlling who can push/merge to a branch, but figured this might be a different context and separate from CI.
I guess they go hand in hand though? if: '$CI_COMMIT_BRANCH == "master"' will only be true for certain jobs because somebody with permission triggered a merge or pushed to master.
I imagined the rules working similarly to that, controlling permission for write access to a cache, in much the same way you might restrict a deployment job to production. But the rules themselves would also act as a key.
Rather than expression rules though, you could use matching branch/wildcard names of relevant branch protections to become part of the key? This might still allow users who have permission to write to the same cache key from several different branches, if permitted. Again, the cache might be invalidated if you modify protection rules.
This was probably more in line with what the author was thinking?
I'm wondering if the logic for something like this would have to happen on the GitLab side? Whilst Runner handles the actual uploads and storage permissions, the cache key is driven by GitLab. I might still be thinking about this all very wrong though :D
I've created a PoC implementation that segregates caches between protected/non-protected refs by adding a mandatory suffix. I guess there might be scenarios which are insecure by nature (e.g. using the shell executor) where a malicious user might be able to side-step this measure by using the gitlab-runner cache-extractor/gitlab-runner cache-archiver commands from inside the job in order to manipulate the cache.
I'll also be exploring tackling this from the Runner side by adding a subdirectory to the tree structure, might be simpler in the end.
This is the MR for the implementation on the Runner side. Although by default it is vulnerable to an override of CI_COMMIT_REF_PROTECTED in the YAML by a malicious user, this can be addressed by doing our variable lookup a bit differently in this situation. The end result is better than the PoC above as it doesn't imply a change to the final key name. The change consists of adding protected/<true or false>/ subdirectories to the existing cache file path, i.e. runner/<token>/protected/<true or false>/project/<project id>/<cache-key>/cache.zip.
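Applied to the reproduction steps above, the two jobs would then resolve to different object paths even though they share the key test-master. A sketch of the layout as described (illustration only, not configuration):

```yaml
# Illustration only: where the two caches from the reproduction steps would
# land under the layout described above (same key, different subdirectories).
master:                  # protected ref
  cache_path: runner/<token>/protected/true/project/<project id>/test-master/cache.zip
feature/cache-poisoning: # unprotected ref
  cache_path: runner/<token>/protected/false/project/<project id>/test-master/cache.zip
```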
Looks like my concern about users side-stepping this measure using gitlab-runner cache-extractor/gitlab-runner cache-archiver has already been captured in another issue by the same author.
Ignoring the local disk/shell executor issue, I think this problem will still exist unless we consider the rules of the protected branch, or protect caches instead with a different set of rules.
CI_COMMIT_REF_PROTECTED tells us very little, a protected branch could just as easily include rules that have entirely the same access as an unprotected branch.
For example, consider that I add two protected branches:
*-developers-only, which only allows developers to push.
*-maintainers-only, which only allows maintainers to push.
Both of these are protected. But unless I'm mistaken, if that's the only information we're using, we've done nothing that could potentially stop a developer poisoning the -maintainers-only caches.
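A hedged sketch of that scenario, assuming both wildcard protections from above and a shared cache key (the concrete branch names are chosen only for illustration): both refs count as protected, so a split based solely on CI_COMMIT_REF_PROTECTED puts them in the same cache namespace.

```yaml
# Both example branches below match a protected-branch wildcard, so with only
# a protected/<true|false> split they end up sharing the same cache namespace.
# Hypothetical branches: release-developers-only and release-maintainers-only.
cache:
  key: "build-$CI_JOB_NAME"   # same key resolved on both protected branches
  paths:
    - critical_binary
# A developer who can push to release-developers-only could therefore still
# poison the cache consumed by jobs on release-maintainers-only.
```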
Protected branches seem to mostly revolve around who can write to a certain branch. Personally, I think the cache functionality is mostly unrelated to this and should be controlled by the rules expressions. They seem to usually dictate what can occur at the CI level, and are able to use variables such as CI_COMMIT_REF_PROTECTED to make decisions. However, having said that, I suppose even if this were supported, a good default would be to have it influenced by protected branch rules if possible, so going this route, if only to start with, probably still makes sense.
A simple method to provide greater protection than the "protected/true" and "protected/false" in the key name would be to instead use the name (or a hash of the name) of the rule that made the branch protected. Unfortunately, this is made more complicated because rules are able to overlap one another. I sort of touched on this in my comments above, but I think the solution would be to make the key a combination of all the rules that can make the branch protected. This does mean you would suffer cache invalidations should the branch rules be updated, but I don't think that's really too bad a situation.
@ajwalker I'm not sure that the scenario you presented is something that is usual in the real world. If it is, I would say it is the responsibility of the project's administrator to understand the consequences of allowing users to push to a protected branch (assuming we document it).
I think that leveraging the rules to automatically segregate the cache/build directories would be much more involved than the current solution. At that point, wouldn't having 2 jobs (one that runs for maintainers and another that runs for developers) that specify different cache keys address your concern? It wouldn't be automatic like your suggestion, but the resulting implementation would be much simpler. Currently, I have no data to justify investing a lot more time in implementing a completely automated solution that works on all scenarios.
It feels different to the other security implications there. Protected variables, for example, are read-only and administrators have the option to further segregate them based on the environment scope.
Caches here will allow writes and can affect a completely different branch.
At that point, wouldn't having 2 jobs (one that runs for maintainers and another that runs for developers) that specify different cache keys address your concern?
Yeah, I guess a simpler solution of putting the role name in the key would work. It might be useful to expose this as a predefined variable: GITLAB_USER_ROLE, which could be owner, maintainer or developer.
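For example, if such a variable existed (GITLAB_USER_ROLE is only the proposal above, not an existing predefined variable), it could be folded into the key like this:

```yaml
# Hypothetical: GITLAB_USER_ROLE does not exist today; it is the variable
# proposed in the comment above. Folding it into the key would give each
# role (owner / maintainer / developer) its own cache namespace.
cache:
  key: "$GITLAB_USER_ROLE-$CI_JOB_NAME-$CI_COMMIT_REF_SLUG"
  paths:
    - critical_binary
```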
Our documentation does say:
Runners marked as protected can run jobs only on protected branches, preventing untrusted code from executing on the protected runner and preserving deployment keys and other credentials from being unintentionally accessed. In order to ensure that jobs intended to be executed on protected runners do not use regular runners, they must be tagged accordingly.
This seems to be directed at the shell executor. If we extend this to the use of local caches, it also makes gitlab-runner#28025 (closed) disappear. In that, we put it on the administrator to ensure that you don't use local caches (shell executor, docker and Kubernetes where we place the cache into a volume) for a runner running both protected and unprotected branches.
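As a concrete illustration of the documented setup, a minimal job sketch assuming a runner has been registered as protected and given a hypothetical tag named protected-runner (the deploy script name is also hypothetical); the tag keeps the job off regular runners, and the rule keeps the job to protected refs:

```yaml
# Sketch assuming a protected runner registered with the hypothetical tag
# "protected-runner". The tag pins the job to that runner; the rule ensures
# the job only runs for protected refs.
deploy:
  stage: deploy
  tags:
    - protected-runner
  rules:
    - if: '$CI_COMMIT_REF_PROTECTED == "true"'
  script:
    - ./deploy.sh   # hypothetical deployment step
```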
This seems to be directed at the shell executor. If we extend this to the use of local caches, it also makes gitlab-runner#28025 (closed) disappear. In that, we put it on the administrator to ensure that you don't use local caches (shell executor, docker and Kubernetes where we place the cache into a volume) for a runner running both protected and unprotected branches.
@ajwalker Why do you say it seems to be directed at the shell executor? I didn't get that impression, did I miss something?
I'm starting to wonder if this is more a question of improving the documentation regarding how to correctly segregate jobs across runners rather than actually making changes to the code. I'd be happy to jump on a quick call tomorrow to discuss this.
@DarrenEastman As mentioned above, the documentation already covers some security aspects of protected runners and protected jobs. I'm starting to think that we shouldn't extend the product to handle this scenario, but instead focus on documenting the risk of cache/build directory poisoning so that project owners can make an informed decision and leverage runner labels to segregate jobs that they want to be protected to a protected runner, so as to not run the risk of mixing both environments and allow regular developers to poison the build.
The alternative of making this completely automated and hands-off, as mentioned above by Arran, will be pretty complicated. WDYT?
@ajwalker Why do you say it seems to be directed at the shell executor? I didn't get that impression, did I miss something?
@pedropombeiro I guess it's not just the shell executor, but any setup that isn't an isolated environment. So for example, docker-machine, where machine re-use is enabled.
In some environments (like GitLab.com shared runners), it doesn't matter whether you're running mixed unprotected and protected branch workloads, because everything is done in a brand new VM and is isolated. Where this now appears to not be true is with caches.
we document the importance of not sharing build directories of protected refs with build directories that can be triggered by untrusted parties, as this can lead to the reuse of a local cache.zip as long as it is not older than the remote version. Case in point: an attacker triggers a build and poisons a cache that gets uploaded to the non-protected remote cache directory. Just afterward, a protected build is triggered by a trusted project member, and this job happens to share the same build directory. Since the local cache.zip is deemed up to date, it is used instead of the non-poisoned protected remote cache.
we document the importance of not sharing build directories of protected refs with build directories that can be triggered by untrusted parties
I agree. Maybe we can do more in the future though? It would be nice to create a checksum of the cache, provide it to GitLab, and then have the GitLab instance provide it to Runners on subsequent jobs. This way, if your cache becomes corrupt or is tampered with, it would basically be thrown away as if the cache didn't exist.
This would require GitLab storing a small amount of data for each cache though, whereas right now it plays no part in the caching functionality. Might be worth proposing though?
I agree. Maybe we can do more in the future though? It would be nice to create a checksum of the cache, provide it to GitLab, and then have the GitLab instance provide it to Runners on subsequent jobs. This way, if your cache becomes corrupt or is tampered with, it would basically be thrown away as if the cache didn't exist.
@ajwalker This would guard against someone corrupting the file on the external cache provider (or perhaps on the local disk for a follow-up job), however, I'm not sure how much of a priority that would be.
We currently support something similar, but only for 1-2 files (normally the package manager files), in the form of cache:key:files. I presume that the limitation is due to performance concerns around calculating the checksum of what might be GBs of data?
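For reference, the existing cache:key:files form mentioned above looks like this (the lockfile names and cached paths below are only examples):

```yaml
# Existing GitLab CI feature (cache:key:files): the key is derived from the
# checksum of up to two files, so the cache rolls over when they change.
# File and path names here are examples only.
cache:
  key:
    files:
      - Gemfile.lock
      - package-lock.json
  paths:
    - vendor/ruby
    - node_modules
```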
Have we considered updating the cache key on the GitLab side to separate protected from non-protected branches? It would work in the same way as the cache index (the thing that handles our "clear cache" feature). Added by GitLab and not overridable by the user, it would remove the problem. I just don't see any need to update Runner code for that.
@pedropombeiro I'm assigning this to you well ahead of its intended milestone to start thinking about how to implement it. See @ajwalker's comment above for some more insight.
There are concerns about the best approach to follow. Given the current workload I have until the end of the week (including #329438 (closed) plus outstanding MRs) and the upcoming 2 weeks of vacation, it is becoming very likely that this won't make it in %14.2. cc @DarrenEastman @erushton
We are going to meet with @tmaczukin, @erushton, @dcouture to discuss the general approach to these security issues that fall into the category of "someone I don't trust ended up messing with my build because my builds are not isolated"
We're going to prioritize a documentation solution to raise awareness for users of the dangers of running builds that lack isolation gitlab-runner#28211 (closed)
I see that there's a certain amount of work that was done here, my understanding is that there were some concerns because the runner can't identify all the conditions in which someone should or shouldn't be able to run a protected job. If we have a less-than-perfect MR that's still an improvement we could go ahead with it but if it's still a long way from being merge-able we can abandon it.
@dcouture @jcaigitlab the current status is mentioned in #330047 (comment 633179548). I feel that splitting the caches by protected status is still a worthwhile improvement so that a random contributor can't touch the cache of an "official" build. The MR is pretty much complete for that, AFAIR. The rest of the work is to document that maintainers need to segregate build environments so that local caches aren't reused and don't become a vector for attack.
This grouprunner bug has at most 25% of the SLO duration remaining and is ~"approaching-SLO" breach. Please consider taking action before this becomes a ~"missed-SLO" in -163 days (2021-07-06).
@DarrenEastman I was reviewing this issue before making it public and I'm realizing we (or maybe just I) might have gotten confused with all the issues that were going to be addressed by a documentation update; this one is actually still not addressed. I can reproduce the issue using our shared runners (and not the ones with shared docker volumes) and, assuming they are configured correctly, this doesn't appear to be a configuration problem.
@pedropombeiro apologies for a months-late context switch but you were working on a couple of MRs to address this here, were they abandoned because of the decision to go forward with a documentation change or were there problems with the implementations?
@dcouture AFAIR this was the latest state when I stopped working on this. As you mentioned, there were two MRs that tried to mitigate this by ensuring that caches for protected branches were placed in a special location separate from the usual location (one MR did that from the GitLab side, and the other did the same, but from the GitLab Runner side - we just have to choose which approach makes more sense). I guess they were abandoned because I moved to the Fleet team, and the engineer who took over also ended up moving to another team. There hasn't been any movement on this since.
@erushton the MR should be pretty close to complete, assuming that no new concern is found during the reviews. I also need to recreate the testing environment to determine what changes are required to the documentation.
By the way, I just re-read things and noticed that the CACHE_FALLBACK_KEY logic happens on the Runner side, so if we go ahead with the Rails MR approach, then the CACHE_FALLBACK_KEY logic will remain as-is, which means that caches would need to be shared in that scenario.
Reading through the MR that introduced CACHE_FALLBACK_KEY, it looks like the intention is for CACHE_FALLBACK_KEY to represent the main branch, while the other branches will use their own cache name (e.g. something with $CI_COMMIT_REF_SLUG), so it should not be an issue to not add a suffix (i.e. segregate) to the cache represented by CACHE_FALLBACK_KEY.
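For context, a minimal sketch of the CACHE_FALLBACK_KEY pattern described in that MR (the key name and cached path are examples only): branch pipelines use their own per-ref key and fall back to the main branch's cache when their key has no entry yet.

```yaml
# Example of the CACHE_FALLBACK_KEY pattern referenced above. Key and path
# names are illustrative; CACHE_FALLBACK_KEY itself is consumed by the Runner.
variables:
  CACHE_FALLBACK_KEY: master-cache   # cache populated by the main branch
cache:
  key: "$CI_COMMIT_REF_SLUG"         # per-branch cache key
  paths:
    - node_modules/
```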
Since MR 1551 was published, Support has received a number of tickets from customers asking why we broke their caching and wondering where this breaking change was announced. I realize it wasn't supposed to be a breaking change.
@duncan_harris this wasn't announced before deployment because it was an exploitable security risk. Unfortunately, the documentation changes still don't seem to have gone live after a couple of days, adding to the confusion.
Not sure if we should remove the confidential flag from this issue to provide more transparency to users, now that the feature has been released. cc @erushton @DarrenEastman
Not sure if we should remove the confidential flag from this issue to provide more transparency to users, now that the feature has been released.
We normally wait 30 days after a fix has been released for the issues to be made public. I haven't had time to look through this issue and re-familiarize myself with it, but we should not make this issue public. This is particularly true if it impacts our self-managed customers, who don't have an upgrade process yet (as the release has not happened yet).
We can add additional information in the security release blog post and (if not done already) as part of the version-specific upgrade instructions, which should all be public when the security release goes out (scheduled for 2022-05-02). If this is having a significant impact on our SaaS customers and our support teams, we can consider working with Comms to get a carefully crafted communication available for Support to respond with.
@duncan_harris please let us know if the ticket volume is burdening your team or if you need any assistance with crafting a response. Please note that we should not be revealing details about the vulnerability.
I appreciate the need for a delay before issues are made public. Is there a way we could have coordinated to ensure SaaS customers wouldn't be surprised by a change like this?
I guess what precipitated this was the release to .com, so I wonder if it would make sense in situations like this to coordinate the .com release with the security release so that the window is reduced, and documentation is quickly available once it hits .com.
it would make sense in situations like this to coordinate the .com release with the security release so that the window is reduced
This is a discussion that needs to involve our Delivery team. I've created an issue in their retro repository and will let them know in Slack about it. Pedro (or anyone else), please feel free to add any detail to that issue description or otherwise offer your thoughts in the comments!