Upstream pipeline execution can be controlled by users with CI permissions in a downstream project

added devopsverify grouppipeline execution sectionops security labels

@dcouture Would you mind weighing in on the severity of this?

Nice find @drew! There's definitely a violation of our permission model that needs to be fixed, but I'm not sure if there's a way an attacker can actually leverage this for something other than an unexpected deployment.

Mitigating factors:

anything using the CI_JOB_TOKEN will fail because the user who owns the job won't have access to the project - the user here can't actually extract any information from the pipelines they triggered (correct me if I'm wrong here!)
the user cannot modify the CI config used in the project they don't have access to

On the security side I would rate it severity3

Thanks @dcouture! So both of your mitigating factors are correct. There's a flip side to the first point though: when the identity of the job owner has changed, any owner-identity-based actions (e.g. external reporting) can be automatically redirected to the user who stole the job. This is something I could see being configured under the presumption that only project members with CI permissions would ever be able to own jobs.

If a job is supposed to send detailed reporting about infrastructure configuration to an SRE after they trigger a deployment, I can steal that report because I own the job now.
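A minimal sketch of that redirection, assuming a hypothetical integration that addresses its report to whoever currently owns the job. The `Job` class, `retry`, `report_recipient`, the usernames, and the `example.com` addresses are all illustrative and are not GitLab code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    owner: str  # the user the job runs as


def retry(job: Job, retried_by: str) -> Job:
    # Assumption taken from this thread: a retried job is owned by
    # the user who performed the retry, not the original owner.
    return Job(name=job.name, owner=retried_by)


def report_recipient(job: Job) -> str:
    # Hypothetical owner-identity-based action: the report goes to
    # the job owner, whoever that happens to be.
    return f"{job.owner}@example.com"


original = Job(name="deploy", owner="sre_user")
stolen = retry(original, retried_by="downstream_user")
```

Here the report for `original` would go to the SRE, while the report for `stolen` goes to the downstream user, even though nothing in the job script changed.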

Thanks @drew! Good call out, I think that scenario is enough of an edge case to keep the severity at severity3 priority3. @marknuzzo I see you had set severity2 priority1, if you have other motivations for those labels I certainly have no objections to this but you can lower it to S3/P3 from the security point of view.


Hi @dcouture - thanks for the ping here and the additional context around the S/P values. In regards to the origins of the priority1 and severity2 labels here, I mirrored what we originally set in #348465 (closed) when the infradev labeling was applied. Because the workaround by groupdelivery was intended to be a short-term fix, the severity could go either way here (I updated to severity3 now) but I wanted to make sure that we were helping unblock as quickly as possible given the broader impact to delivery and it's also an OKR for FY23Q2.

Thanks everyone.

mentioned in issue #348465 (closed)

changed the description

added Category:Continuous Integration label

Please add the bugvulnerability label to this issue if appropriate. Regardless, please add a comment to indicate you have seen this message.

, please can you add a type label to this issue to help with issue discovery in issue reports.


added auto updated label

added infradev priority1 severity2 labels

added typebug label

added VerifyP1 label

changed milestone to %15.2

added OKR-FY24Q2 backend labels

added workflowplanning breakdown label

added 1 deleted label

added needs weight label

Hi @drew - can you please apply a weight to this issue when you have a chance? Thanks!

/cc @jheimbuck_gl

I bumped this up to a 5 as a worst-case estimate because of the unknown around integrating with &6947 (closed) and inventing a new kind of notification.

@f_caplette Could you have a read through this issue and mark this as blocked by the correct issue from &6947 (closed)? If we do this without bridge-retry being fully rolled out, it will be impossible to restart the upstream pipeline execution and that's no good.

Thanks @drew! The thing is, to have the retry functionality, all issues listed in &6947 (closed) need to be done. We are currently at step 4, which is to try out the UI with a basic retry functionality and iron out the problems that we see. Currently, the biggest issue is that retrying a bridge leads to seeing all the spawned downstream pipelines. So if bridge_job_1 is retried 4 times, we are going to show 4 downstream cards for bridge_job_1. We are looking into making the API return only the latest one so that we are not showing them all in the graph and pipeline mini graph.

After that, we will implement steps 5 and 6, which is essentially doing the same as step 4, but clean and with tests.

@lauraX in case you have more to add here.

cc @bsandlin Since you might take this over.

Thanks for the updates @drew and @f_caplette - from a timing standpoint, does it still seem feasible for this issue to be in %15.2 given the upstream dependencies at the moment? For now, I'll mark this as blocked.

Definitely blocked, but I'm not allowed to mark it as blocked by the whole epic so that can get added once the final issues are created

Of everyone in this issue, I probably know the least about the timing of &6947 (closed) relative to %15.2

@f_caplette @drew @marknuzzo I understand that the functionality behind the &6947 (closed) epic will be a good way of allowing users to retry a bridge job - but I don't think it should be marked as a blocker for this issue, since the feature doesn't really touch permissions or make them any better. The proposal in this issue also mentions not resetting the source bridge/trigger job, which can be worked on in parallel. wdyt?

@lauraX We can definitely work on this in parallel, but I'm not sure about releasing it before &6947 (closed). If we do, it will become impossible to restart upstream pipeline execution after a downstream failure. The only way to do that today is automatically via downstream retry, which we're explicitly going to take away here.

So, &6947 (closed) doesn't have to block this proposal, but we'll be disabling a feature by delivering this fix first. Product and AppSec can duel over this, what do you think?

@drew - got it. Sure, that works for me! From my side, I will start cleaning up the backend MR to get it ready for review. Although the spike is currently in frontend's hands, I don't think there will be much change to the backend so it's probably safe to do this.

%15.2 might be a little tight for this functionality, but it all depends on how complex it will be to handle a trigger job spawning multiple downstreams. Essentially, up until now, one trigger job could only spawn one downstream. Now that we allow the retry action on the trigger job, it means spawning multiple downstreams, which all show up in the graph.

Ideally, the filtering of these "unnecessary" downstream pipelines would happen on the API side so that the client doesn't have to do it everywhere.
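A rough sketch of what that filtering could look like, assuming each downstream pipeline records the trigger job that spawned it and that a higher pipeline id means a more recent retry. `DownstreamPipeline`, `source_job_name`, and `latest_downstream_per_trigger_job` are hypothetical names, not the actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DownstreamPipeline:
    id: int               # assumed to increase with each retry
    source_job_name: str  # the trigger (bridge) job that spawned it


def latest_downstream_per_trigger_job(pipelines):
    """Keep only the most recent downstream pipeline per trigger job."""
    latest = {}
    for pipeline in pipelines:
        current = latest.get(pipeline.source_job_name)
        if current is None or pipeline.id > current.id:
            latest[pipeline.source_job_name] = pipeline
    return sorted(latest.values(), key=lambda p: p.id)


# bridge_job_1 retried 4 times, bridge_job_2 run once.
spawned = [
    DownstreamPipeline(10, "bridge_job_1"),
    DownstreamPipeline(11, "bridge_job_1"),
    DownstreamPipeline(12, "bridge_job_1"),
    DownstreamPipeline(13, "bridge_job_1"),
    DownstreamPipeline(7, "bridge_job_2"),
]
visible = latest_downstream_per_trigger_job(spawned)
```

With this input, `visible` holds one card per trigger job instead of five cards in total, which is the behavior the graph and the pipeline mini graph would both rely on.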

@f_caplette @lauraX @drew - is there a single issue we can use as a blocker for this to indicate what needs to be shipped before this does? I see quite a few unscheduled issues in &6947 (closed) and I'm unsure which we need to wait on before delivering this issue. In the meantime since this is blocked and hasn't started I'm moving it to %15.3 and we'll triage against other issues in that milestone.

@dcouture - making you aware this will miss the due date/SLO for sure.

cc @marknuzzo for awareness, no action needed.

@jheimbuck_gl Technically the entire epic needs to be completed before we move on to the next step. The reason why there are unscheduled issues is that the spike we are working on for the PoC was hard to promise at a specific date, so once the spike is done, other issues would automatically get top priority. If we are able to filter downstream pipelines successfully in 15.2, then the rest should come in 15.3 pretty quickly.

@f_caplette is there a last issue in the epic that will mark it as being closed we could use then? Is there a "good enough" point maybe?

@jheimbuck_gl That's a fair point! I've created #367547 (closed) which represents the last step we need to do, which comes after the spike (essentially, just implementing what we tried out in the spike, but well done). I am filling it out right now, but it should be an easier point to reference.

thanks @f_caplette much appreciated.

@marknuzzo seeing as this is blocked can we move this to the %Backlog OR is there an alternative approach we can take that meets the security needs that would not be blocked in the short term and then cleanup that tech debt?

cc @dcouture

Hi @jheimbuck_gl - Given the timing that @f_caplette noted above, I would feel more inclined to put this to the %Backlog for now and then bring it forward once that work is complete.

/cc @samdbeckham

Thanks @marknuzzo - Unassigning @drew since this isn't actively going to be worked.

marked this issue as related to #348465 (closed)

assigned to @drew

changed due date to July 26, 2022

mentioned in issue gitlab-org/ci-cd/pipeline-execution#101 (closed)

set weight to 3

changed the description

set weight to 5

removed needs weight label

added workflowblocked label and removed workflowplanning breakdown label

added security-awardsnomination label


removed infradev label

removed the relation with #348465 (closed)

changed the description

added severity3 label and removed severity2 label

added bugvulnerability label

@jheimbuck_gl @samdbeckham @marknuzzo @dcouture This issue is ready for triage as per HackerOne process.

If this vulnerability is for a featureflagdisabled issue, regular SLOs don't apply and it simply should be scheduled to be fixed before the feature is made generally available.

About this automation: AppSec Escalation Engine

The bot picked it up because I applied the bugvulnerability label. Triage has already been done here.

added priority2 label and removed priority1 label

changed milestone to %15.3

mentioned in issue gitlab-org/ci-cd/pipeline-execution#104 (closed)

marked this issue as related to #367547 (closed)

changed milestone to %Backlog

unassigned @drew

added missed:15.2 label

@f_caplette @jheimbuck_gl Have we thought about the mechanics of deploying this, which should technically be backported, along with &6947 (closed), which (I think) will be difficult to backport? @f_caplette Let me know if I'm overestimating the effort there.

To backport this security check without adding bridge-job retry, we'll be breaking some child pipelines without a solution for those past versions.

@drew - those jobs are executed with the credentials of the user who does the retry though, yeah, not the original executor's?

If so, the jobs may fail again but for a different reason, no access or whatever, I think. @fabiopitino could probably confirm or correct my thinking here.

Yes, but the ability to click retry is the work that's hard to backport. If we backport the permissions check, we're going to fail more pipelines (correctly, but potentially unexpectedly) without the retry being available to fix it.

Thanks @drew, I'm looping in some more folks.

@dcouture what are the options for NOT back porting this?

@v_mishra - I'd like your opinion about back porting the security fix which will impact job retry in those versions. We could include a note about that behavior in the security release post. Doing this (backport, note about broken experience) would be my first choice for how to deal with this.

@jheimbuck_gl @drew sorry for a very basic question but what is the security check we're talking about backporting here if not permission check?

@jheimbuck_gl we have this process for risky fixes or breaking changes.

Is there a workaround for the broken retry that could be mentioned in the blog post?

Thanks @dcouture

@drew

Is there a workaround for the broken retry that could be mentioned in the blog post?

I think the workaround is re-running the entire pipeline, not just the trigger job, correct?

what is the security check we're talking about backporting here if not permission check?

Apologies for the confusion, there's only one check we're talking about adding and it's the one in the proposal.

I think the best way to explain this is with the two kinds of failures created by the retry function that we saw in #348465 (closed):

Scenario 1: A trigger job fails immediately, and says the downstream pipeline could not be created because of a permissions problem. It cannot be retried, so the whole upstream pipeline needs to be restarted from the beginning. This is bad UX and wasteful of CI resources, but ultimately not a security problem. This is being addressed in &6947 (closed), which will not be backported because there are no security problems.

Scenario 2: A trigger job successfully creates a pipeline, but a handful of the jobs in the downstream pipeline have been reassigned to other users because of (hand wave) our retry logic. In this scenario, there are two possibilities:

Scenario 2-A: All the downstream jobs complete successfully, nobody really notices that the jobs have been reassigned, and everyone goes on their way. They don't notice whether or not the jobs are owned by people with sufficient permissions, but assume everything is fine because nothing failed. This is potentially a security problem, but we don't know for sure because we don't check.
Scenario 2-B: Some of the downstream jobs fail, for somewhat unexpected reasons. Upon investigation, the person responsible for the set of pipelines looks at the failed jobs and notices they're owned by individuals who don't even have access to the project. That unexpected individual, who represents the "User" for the CI job, is causing some kind of problem with authentication or an identity-based integration in the job script. This is a clear, noticeable, and quite confusing security problem.

Our proposal here plans to close the security hole by turning both scenario 2-A (success) and scenario 2-B (failure) into scenario 1, which is a clear failure, but with bad UX that wastes CI resources. Going forward, this will be fine because we fixed the UX in &6947 (closed). But if we backport the security fix, which is the proposed added permissions check, the versions that receive the backport will have turned scenario 2-A (good UX, but insecure) into scenario 1 (bad UX, secure).
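To make that mapping concrete, here is a toy model of the outcomes with and without the added check. It is only a sketch of the reasoning in this comment: the function name, its parameters, and the "ok" label are made up for illustration, and it does not reflect how the real permissions check is implemented.

```python
def pipeline_outcome(owner_has_access: bool,
                     jobs_succeed: bool,
                     permission_check_enabled: bool) -> str:
    """Classify what the person responsible for the pipelines observes."""
    if owner_has_access:
        # Nothing unusual: the job owner is allowed to run these jobs.
        return "ok"
    if permission_check_enabled:
        # With the proposed check, an owner without access always
        # produces an immediate, clear failure.
        return "scenario 1"
    # Without the check, the problem is either silent or confusing.
    return "scenario 2-A" if jobs_succeed else "scenario 2-B"


before_fix = [pipeline_outcome(False, succeeded, False)
              for succeeded in (True, False)]
after_fix = [pipeline_outcome(False, succeeded, True)
             for succeeded in (True, False)]
```

In this model `before_fix` contains both 2-A and 2-B, while `after_fix` collapses to scenario 1 in both cases, which is the trade being discussed for backported versions.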

Folks in 2-B on backported versions go from strange/bad (2-B) to clear/bad (1), which is an improvement but not a fix. I'm okay just telling them to upgrade. Don't worry about them.

If and when Scenario 2-A people start complaining that they're stuck in scenario 1, we'll tell them that they have two options:

Restart the parent pipeline from the top, and don't let people in downstream pipelines retry any jobs.
Upgrade to a newer version of GitLab that includes &6947 (closed).

@dcouture @v_mishra @jheimbuck_gl Are we okay with presenting annoyed people with those two options, justified by the increased security?

@drew great writeup thanks for articulating all that.

I am OK with backporting the fix without the retry. The users who will notice it most are subset 2-A and we are clearing up confusion/UX for the users in scenario 2-B. So net net we're in a better spot for users.

If we do the security release blog post with a note that trigger job retry is available in %15.x so users do not have to re-run the whole pipeline, I think we are in good shape.

@dcouture @v_mishra is that approach reasonable to you?

Very reasonable! Thanks a lot @drew and @jheimbuck_gl

(hands down the best breakdown of such a technically complex problem I've read in a long time)

Are we okay with presenting annoyed people with those two options, justified by the increased security?

@drew @jheimbuck_gl Yes I think the two options are a reasonable choice compared to serving any kind of security vulnerability.

thanks @v_mishra and @dcouture - @drew I think that helps us unblock this issue then, yeah?

cc @marknuzzo for awareness, no action

mentioned in issue #367547 (closed)

changed the description

