In order to block a user from using Shared Runners an account level block needs to be placed. In the event that a user is not intentionally abusive with our Shared Runners it would be great to only remove access to our Shared Runners and not the entire account.
Intended users
Infra, Production, GitLab admins
Further details
User inadvertently causes behavior that negatively impacts our Runners/Availability
Proposal
Provide the ability to block a users access to the Shared Runners
Available via the Users API to add it as a mitigation option in Scrubber (our mitigation tool) - This issue
This would lower the impact to a user as they would still have access to their projects/content if we need to mitigate against inadvertent excessive CI Usage.
What does success look like, and how can we measure that?
Blocking access to CI without blocking access to anything else on the account
This page may contain information related to upcoming products, features and functionality.
It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes.
Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
@dewit - is there some documentation for Scrubber we can link to to give context about what needs to be provided in the API to remove user access to shared runners?
@drew - we talked about this issue previously in Slack, we have a repo for scrubber now but it's probably best to just think about the UI requirements for now given a new tool is in the works.
Mark Nuzzochanged title from Removing user access to Shared Runners through the REST API to Backend: Removing user access to Shared Runners through the REST API
changed title from Removing user access to Shared Runners through the REST API to Backend: Removing user access to Shared Runners through the REST API
I'm going to weight this as a pessimistic 5, because I'm not entirely sure of the implementation plan and have a few questions. It's possible for this to turn out to be a 3 after refinement.
Let's do some refinement!
I'm looking for the smallest change we can make to get this working, and it seems like we could make a relatively small change to Ci::PendingBuild:
def args_from_build(build) project = build.project args = { build: build, project: project, protected: build.protected?, namespace: project.namespace, tag_ids: build.tags_ids,- instance_runners_enabled: shared_runners_enabled?(project)+ instance_runners_enabled: shared_runners_enabled?(project) && shared_runners_allowed?(build) } if group_runners_enabled?(project) args.store(:namespace_traversal_ids, project.namespace.traversal_ids) end args end def shared_runners_enabled?(project) builds_enabled?(project) && project.shared_runners_enabled? end++ def shared_runners_allowed?(build)+ # Something like+ AbusiveUserList.where(user_id: build.user_id).none?++ # Or maybe this, but the User model is already gigantic+ !build.user.abusive?+ end
As far as the Builds themselves go
Because of the extremely sensitive performance requirements around Builds queueing, I'm not inclined to start adding columns and putting new queries into that workflow. Getting in front of the creation of PendingBuild records and adding logic to the calculation of an existing value won't affect the performance of the Runner's picking process.
Important to note: As a PendingBuild record does not have a directly associated User, we would have no way of going back and changing the PendingBuild so it could be picked up by a shared Runner. The Builds would need to be cancelled and recreated.
Those pieces about abusive Users are completely imaginary
I don't believe we have any kind of designation today that allows a User account to continue accessing the application but have their Builds not get picked up by shared Runners. We would have to invent this; create either another column on User (yikes) or a separate, new model like AbusiveUser to represent a User that's been designated in this way. And then it needs API.
@dewit @gitlab-com/ci-backend Do we have any existing code that we would use for designating users as blocked from shared runners? I'm not aware of any so it seems like we have to invent something.
we could make a relatively small change to Ci::PendingBuild:
@drew this could work, but it will add one more database query to each job that we process, we'll have to deal with a lot of stuck jobs, and update all pending entries when(if?) we remove the lock later on. I'm wondering if there is another less expensive option, since the user is already on the way of being considered abusive maybe we could block them from creating pipelines? We could add this as a check in Gitlab::Ci::Pipeline::Chain::Validate::Abilities or maybe even as a policy condition to disable the :create_pipeline permission.
@drew - The requirement says that the user should be able to use the rest of the features and that the block should only affect the given user. Also if the user is not using shared runners they should still be able to create pipelines - cc @furkanayhan@mbobin
Another angle, to add on top of what @drew proposed could be to prevent pipelines from running using the same mechanism as the CI minutes availability.
When the API is used to block shared runners access we need to update Ci::PendingBuilds.where(user_id: user.id).update_all(instance_runners_available: false) but when the block is removed we have to recalculate instance_runners_enabled: shared_runners_enabled?(project) for each project involved
it will add one more database query to each job that we process, we'll have to deal with a lot of stuck jobs, and update all pending entries when(if?) we remove the lock later on.
@mbobin So I thought about the types of new overhead here:
Adding a query to the process of creating a PendingJob seems like a less expensive option than adding a query to the picking of a job from pending and moving it to running. That's the part I was really trying to avoid. But tbh I'm less familiar with the current bottlenecks since we switched to this multi-table queue design, so let me know if I'm out of date on this.
Dealing with a lot of stuck jobs: I think the pending builds sit around for a maximum of one hour before they get dropped, don't they? This seems like a fine outcome to me, for projects that don't have non-shared runners available.
Updating Pending entries: Yeah I flat out decided they'd have to be cancelled/dropped and recreated. Maybe that's not a good product choice. But, I think performing operations to manage builds within the queue on the pending builds table might also be a bad engineering choice. Letting the queue table just be a queue feels like a safer way to get what we really want from that architecture.
since the user is already on the way of being considered abusive maybe we could block them from creating pipelines?
So, from what I understand we specifically want these designated users to be able to continue using non-shared runners. I think referring to them as "abusive" might have been too mean. Maybe they're... high risk users? Suspicious users? Users to be checked up on real quick? This isn't a hard block from CI functionality AFAIK so I think we have to stick with a light touch here.
But tbh I'm less familiar with the current bottlenecks since we switched to this multi-table queue design, so let me know if I'm out of date on this.
It should be able to handle stuck jobs, but high churn affects it's performance since it needs to be constantly vacuumed so we should avoid updates that are not really necessary.
@fabiopitinoThis seems like a decent way to not have the builds hang around until they get dropped by the StuckBuilds worker. My understanding is that we'd have to do this on top of what I'd suggested, because once we decide to not drop the Build because a specific runner is available, we'd still need to mark the build as not pickable by a shared runner. Is that what you're intending?
@drew With #350654 (comment 1065964898) we are preventing new builds that would use shared runners from being processed. Otherwise we would have a user that would push new jobs to the queue continuously. With #350654 (comment 1058800817) we are leaving jobs in the queue which would impact availability even more.
The actual problem statement in the issue description is:
User inadvertently causes behavior that negatively impacts our Runners/Availability
Although not very explicit, I'm interpreting it as "We want to prevent a user from flooding the CI process". Since shared runners is what we control most, we can cut access to them. The best way to prevent this problem would be:
When a user is blocked from accessing Shared Runners, drop jobs in ci_pending_builds immediately.
Any jobs that make it to the queue should be those that can be processed by specific runners - #350654 (comment 1058800817) ensures shared runners ignore these jobs until a specific runner picks them.
Stuck builds will be dropped at some point - impact here should be small since we would drop jobs that specific runners didn't pick on time.
When a block is removed we do nothing since jobs in the queue (ci_pending_builds) are those that can be processed by specific runners. New jobs that would use shared runners will make it to the processing.