Sometimes there are so many jobs that a long queue forms, and there needs to be the ability to prioritize certain pipelines' jobs so that they can skip the queue and run before other pipelines' jobs.
Users want to be able to specify the priority of jobs.
Then, when we query for pending builds in the register job service, we could order the builds by priority, where lower numbers are higher priority and all jobs start with a certain default priority if not specified.
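For illustration only, a minimal sketch of what this might look like in .gitlab-ci.yml; the priority: keyword here is hypothetical and does not exist in GitLab today:

```yaml
# Hypothetical syntax: `priority` is not an existing GitLab CI/CD keyword.
# Lower numbers would be picked first; jobs that omit it would get a default.
deploy-production:
  stage: deploy
  priority: 1          # hypothetical: jump ahead of queued test jobs
  script:
    - ./deploy.sh

unit-tests:
  stage: test
  script:              # no priority given, so it would use the default
    - ./run_tests.sh
```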
That would only work for specific (project) runners and group runners.
With regards to where to store the priority: once the ci_pending_builds migration of the queue is completed, it will be denormalized into ci_pending_builds, so it could initially be stored in the build metadata table.
Acceptance criteria would be to initially support:
Private runners
Project-level
Permissions and Security
I suspect that, to avoid abuse of this power, it should be restricted to Owners, Maintainers and Admins.
It may be fair to call this a duplicate of #33401 - both issues speak to the need for customized prioritization of "jobs to be done".
Edit: Duplicate isn't the correct designation. Rather than tying into an external third-party system for prioritization of jobs-to-be-done [as #33401 proposes], this issue proposes a GitLab-native implementation for jobs to be prioritized.
To give this a bit more colour; I could envision an optional way to prioritize pipelines based on Sub-group structure (i.e. a sortable list of sub-groups which defines their priority in the jobs queue). We could iterate upon this to also support QoS-like rules where one group can occupy up to x% of available runner capacity.
Labeling could be another approach that provides similar results. group::gitaly has a similar-ish proposal of "labeled repos" in gitaly#1803, where a project's assignment to different classes of repo storage is controlled by a tag.
The concept of Quality of Service is quite common in networking contexts and may be useful to refer to as we consider what we could implement in GitLab.
Concepts I'd suggest we consider:
Prioritization
Max Throughput
Wait time
Prioritization
In a prioritized FIFO queue, concepts such as maximum throughput or wait time aren't required to ensure that more important work will be picked up before other work in a queue. For this reason, I'd suggest starting with the ability to prioritize jobs based on group or project.
Max Throughput
This can be thought of in multiple contexts globally (i.e. "no one user or project shall consume more than 80% of available resources"), and specifically (i.e. "this project shall be permitted to consume 100% of available resources").
This gets tricky if we don't have a good way to assess the total availability of runner resources, e.g. with an autoscaling runner in Kubernetes, how does Rails understand the total capacity of the runners?
Perhaps given knowledge of all running and waiting jobs, we could infer maximum capacity when the number of running jobs plateaus and the number of waiting jobs begins rising. From there, we could prioritize the queue in such a way that ensures max throughput is respected.
Further complicated as the number of machine types grows.
Wait time
"john doe's jobs are low priority and can wait up to 24hrs before being picked up by a runner" vs. "Jane's are highest priority and should wait a maximum of 1hr, or be cancelled (i.e. their output wont matter past that point)"
This also seems tricky as wait time is hard to estimate without foreknowledge of how long other jobs will take to execute.
Starting with prioritization (and perhaps ending there) makes sense to me.
A way to statically define relative priority would be a good start; offering a programmatic interface to adjust this prioritization (ideally with an industry standard protocol) would be a good future iteration too.
@cheryl.li - I am adding this to the 14.2 needs-weight issue; we need this to get weighed and a technical proposal before the end of the month. I have a call with a customer on 2021-07-28 and I need some technical implementation detail for this by then.
@jwoods06 @jrreid The description seems to use both jobs and pipelines interchangeably, but it seems like the ask is to have pipeline priority, is that correct? Or is it both jobs and whole pipelines?
@jreporter Ahh, it was the first sentence that confused me: "Sometimes there can be so many pipelines that there is a long queue and there needs to be the ability to prioritize certain pipelines so that they can skip the queue and run before other pipelines." It only talks about pipelines here, skipping lower-priority pipelines.
I suppose the net effect is that it is the jobs within the pipeline that get priority, but I guess it's worth detailing whether this is something that would be defined at the pipeline level and apply to all jobs in one pipeline (like added to workflow:, for example), or defined as a job-level keyword, so you can have a subset of jobs in each pipeline be priority jobs. Both sound useful! I was just wondering what the specific request is for the first MVC step.
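To make the two options concrete, here is a rough sketch; both placements of the keyword are hypothetical:

```yaml
# Option A (hypothetical): pipeline-level priority, applied to every job
workflow:
  priority: 10

# Option B (hypothetical): job-level priority, so only some jobs jump the queue
deploy:
  stage: deploy
  priority: 1
  script:
    - ./deploy.sh
```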
Then, when we query for pending builds in the register job service, we could order the builds by priority, where lower numbers are higher priority and all jobs start with a certain default priority if not specified.
That would only work for specific (project) runners and group runners. Maybe also for private shared runners (if the admin allows it).
For shared runners on GitLab.com this would allow users to skip ahead of users from other companies, which we wouldn't want.
@allison.browne I think this makes sense. The fair scheduling algorithm is implemented for shared runners only and I agree we shouldn't touch it for now.
For group and project runners we order builds by ID ASC. Perhaps ordering by priority and then by build ID would allow this to work.
With regards to where to store the priority: once the ci_pending_builds migration of the queue is completed, it will be denormalized into ci_pending_builds, so it could initially be stored in the build metadata table.
This is a great question actually. I wonder whether the legacy queuing queries (which are still the default) will become less efficient when we join with yet another table. We will also need to ensure parity between ci_pending_builds.priority and table_we_will_store_it_in.priority.
@jrreid - from our call on 2021-07-29, I don't have a clear idea of whether this proposal will meet the needs of your customer. Can you help refine the use case further?
Why interested:
In GitLab, the customer has no queue prioritization mechanism, which means they need at least 10x more VMs to be able to deliver critical items before less relevant ones. In TeamCity, the customer could move items to the top of the queue while leaving less important jobs for overnight runs. At the moment, the customer can do nothing like this with GitLab (in theory custom runners or a mix of labels could help a bit, but that is not an option for shared machines and it would decrease overall throughput).
Current solution for this problem: No solution available
How important to customer: Important, because this will heavily support the planned migration from TeamCity CI to GitLab.
Questions: In which version of GitLab can the customer expect build priority or queueing mechanism implemented?
PM to mention: @jreporter - If you would like to talk with the customer about that request please feel free to ping me at any time. Customer will be happy to give further insights into their requirements/needs.
Why we are interested: It would be nice to have the possibility to set a priority at the job/pipeline level. Getting high-priority jobs triggered fast would help to utilize the build farm capacity as effectively as possible (use the capacity where it is needed most). Super urgent bugfixes or high-priority releases could get their pipelines/jobs crunched as fast as possible, instead of just putting these jobs at the end of the normal job queue.
Current solution for this problem: No Solution Available
How important to us: Important, productivity improvement
Questions: Is this being considered, and do we have an indication on when?
Adding a label for FY23Q1 to indicate this is a forward-looking priority for group::pipeline execution delivery. Currently, we have the next ~3 months planned toward reliability initiatives.
We are also interested in this functionality (on-prem installation, 55 billable users, license id 146961).
Currently we do pipeline prioritization through tags (i.e. releases before regular builds/MRs), but this requires separate runner configuration stacks, which is quite cumbersome. It would be great if we could prioritize deployment steps above less important testing and the like.
Furthermore, would it be possible to make this priority variable also usable in workflow:rules?
This would allow for things like:
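For example, something along these lines (a sketch only; priority under workflow:rules is not an existing keyword):

```yaml
# Hypothetical: `priority` is not a supported key under workflow:rules today.
workflow:
  rules:
    - if: '$CI_COMMIT_TAG'
      priority: 1          # release pipelines jump the queue
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
      priority: 100        # scheduled pipelines yield to everything else
    - when: always
```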
@avielle @samdbeckham @v_mishra - I want to get your eyes on this to see if there are smaller iterations we can work on within the quarter and without group::runner support. Let's also think about how else we might be able to solve the problem "when there is a large queue of jobs, I want to prioritize high-priority jobs, so that I can reduce the feedback loop time to developers running pipelines".
@jheimbuck_gl looking at how we use the needs: keyword, a solution that exists today (not necessarily to prioritize, but to start a job the moment a pipeline is created) is using needs: []. Can we promote the same or make some changes to it to fit our use case better? (I don't have very concrete ideas here.)
From the description it appears that the problem specified is for users with permission to write/edit the configuration file for the pipeline. In that case the solution can very well be within the YAML definition. But IMO this will span beyond the DevOps engineer and team lead personas to developers who might want to prioritize a job post pipeline creation, when they are low on confidence about the performance of the pipeline. That is when we can consider bringing this functionality to the UI (even the pipeline graph).
Thanks for the insights @v_mishra. I wonder if needs: solves the problems as described well enough (e.g. I can tell this job to run ASAP in a pipeline) OR if we still need to prioritize those jobs in the context of all the pipelines queued up on an instance.
cc @jrreid you mentioned below that Job priority was more important than pipeline priority. I'm curious about your take on that nuance.
Looks like this one didn't post. Here's what @samdbeckham had to say two days ago:
@jheimbuck_gl @v_mishra This is interesting. I've done the needs: [] trick a few times in the past to get a job to run independently of the other jobs in the pipeline. It feels like a bit of a hack though, as you're not really saying "prioritise this job" but more "this job doesn't need anything so start whenever you're able".
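For reference, the trick looks roughly like this (the job name is made up; needs: [] itself is real GitLab syntax):

```yaml
# `needs: []` tells GitLab this job depends on nothing, so it can start as
# soon as the pipeline is created, regardless of stage ordering.
lint:
  stage: test
  needs: []
  script:
    - ./run_lint.sh
```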
The description states "run before other pipelines' jobs", which is interesting. This suggests a way of prioritising a job not only over the other jobs in the same pipeline, but over other jobs on the same runner(s) / group / project / whatever. This seems like a much harder problem to solve.
@samdbeckham totally agree, that's a hard problem to solve. Prioritizing within a project might be a good first step, but I'm not sure it solves the problems described ("I don't want jobs that deploy code to wait for jobs running tests"), since many projects may be using the same runner pool. I'd also expect this wouldn't apply to Runner SaaS, or would potentially even be a Self-Managed-only feature.
I agree with starting within a single project scope. It's also not necessarily "deploy vs tests", but rather "fast unit tests vs slow integration tests". At times, I see all our fast tests for a project getting scheduled first, and our slow integration tests get scheduled last. It would drastically reduce our pipeline time if we could ensure that our slow tests are always scheduled first and the fast tests get scheduled last. In our case, this would still allow all our other tests to finish before our integration tests finish.
In short, being able to prioritise within a single stage within a project would be a great first step.
@bernhard.breytenbach this is exactly my issue. I have an XMOS simulation test job that is by far the longest running job, but as is, also ends up being one of the last jobs to get picked up, meaning the pipeline ends up running for about 15 minutes with just that job running.
I've currently resorted to creating a "high-priority" runner which can pick just that job up if it doesn't get picked up by the standard one I've set up.
@drew @fabiopitino @mbobin do you know how large a task it would be to add priority to the job queue so that some jobs are picked up by a runner first? Can this be done without any changes to runner code?
@avielle yes, it can be done without changing the runner, but it's a strange ask because in the description the priority is added to the YAML, yet it states that it should be restricted to Owners, Maintainers and Admins. That doesn't work because developers can change the YAML, and it could be abused because a project within a group could set a lower priority number (i.e. a higher priority) for all of its jobs.
I don't know if adding it to the YAML is the best idea, because you can't modify a job's priority after it has been created.
Prioritizing jobs in the same pipeline can be done with stages, and within the same stage by the order in which they are created, so changing the job definition order in the YAML should affect the priority.
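A small sketch of that ordering trick (job names assumed; a later comment in this thread notes it may not behave this way for jobs pulled in via include:):

```yaml
stages:
  - test

# Defined first, so it is created first; for project and group runners, which
# pick pending builds in ascending ID order, it would tend to be picked first.
slow-integration-tests:
  stage: test
  script:
    - ./run_integration_tests.sh

fast-unit-tests:
  stage: test
  script:
    - ./run_unit_tests.sh
```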
@mbobin great call-outs and I agree - I don't think YAML is the only way to achieve this. I actually think some mechanism in the UI might achieve this better, especially in relation to permissions around the action. cc @jheimbuck_gl
@mbobin @jreporter Thanks for pointing this out, Marius. Looks like we have some more solution validation to do before we work on this.
@jheimbuck_gl we also need to figure out the details for prioritizing jobs. For example, in the current proposal, projects with lots of pipelines might find themselves in a situation where pipelines never complete, because new pipelines are created at a rate such that the queue of priority 1 jobs never finishes, so lower-priority jobs are never picked up. Maybe we can do something like calculating an effective priority by comparing the assigned job priority with the length of time the job has been waiting.
in the same stage by the order in which they are created so changing the jobs definition order in the yaml should affect the priority
@mbobin This doesn't seem to hold when I tried it, at least not with includes. I ran a pipeline, and jobs defined in includes further down the list started on the same runners as jobs declared much higher up.
@david.lowndes1 that's an interesting case. I wonder if the priority only applies to jobs defined within the .gitlab-ci.yml and not the includes that fall into the same stage?
@jwoods06 I wonder if solving the problem for includes would help the original use case, WDYT?
@avielle I've been reading this as prioritizing jobs within a single pipeline, not reordering jobs within different pipelines, so I think we would avoid that situation. But the confusion says to me we really need more problem validation. Which use cases need solving, which are the most widespread, and which are the biggest pain points?
I'm not sure if solving for includes would help necessarily. My customers simply want the ability to say "This job is more important than others and should always be run first regardless of the queue."
Though I do acknowledge the comment above where if there are too many priority 1 jobs that the queue will never allow any other jobs to run. But that would seem to be a case of over-prioritization and the customer would then have to re-evaluate their priorities and/or their Runner fleet capabilities.
I've been reading this as prioritizing jobs within a single pipeline, not reordering jobs within different pipelines so I think we would avoid that situation
Though I do acknowledge the comment above where if there are too many priority 1 jobs that the queue will never allow any other jobs to run. But that would seem to be a case of over-prioritization and the customer would then have to re-evaluate their priorities and/or their Runner fleet capabilities
This seems fair to me. If we moved forward knowing that this situation could create an issue for some users, I think we should surface some sort of notification if their queue gets saturated
How would priority per project help with getting those jobs prioritized?
I was thinking about prioritizing within pipelines for a single project not across projects. So pipelines would still be FIFO and then within the pipelines for Project A jobs could shuffle around.
This seems fair to me. If we moved forward knowing that this situation could create an issue for some users, I think we should surface some sort of notification if their queue gets saturated
So truly knowing if users are getting the needed data sooner implies we need some telemetry on the jobs first, like Job duration over time. With that in hand users could optimize the jobs or parallelize them potentially.
I was thinking about prioritizing within pipelines for a single project not across projects. So pipelines would still be FIFO and then within the pipelines for Project A jobs could shuffle around.
Yes, that makes sense, my question was a little different though. It was more about having multiple runners per project - would you say all jobs would get assigned to a runner just based on their priority?
@zengqingfu thanks for the ping! We are hoping to work on solving the problem described in this issue sometime between Feb - Apr 2022 but may find the effort is too large to solve during that timeline.
Having another use case is always helpful if you can give us a quick one or two sentence description of your problem if it differs? Thanks!
OK. We have many jobs in our pipeline and use the same runner with the same tag, so we need to set the priority of each job so that the jobs with higher priority run first. It would be very useful when our runners are limited.
I have a use case where I think job priority defined in .gitlab-ci.yml would be useful. There may be other ways of doing it, but I'm not aware of any good (aka deterministic and predictable) ones. I also have another use case that's more of a hack.
To schedule bottleneck jobs first in a needs graph
I have a set of software components A, B, and C, and an infrastructure component Inf.
Each can be built separately, but A, B, and C need Inf when they're tested.
If I have more than one runner available to process, building Inf will be a bottleneck. If it's built last, no testing will happen until it's done.
I'd like to bump its priority so that it runs before the other jobs. On average, we should see better job throughput with this. My actual pipeline is a good deal more complicated than this, and should see quite significant improvements.
In this case, priority would be something between jobs in a single pipeline, not across the entire GitLab instance. This should avoid the starvation issue mentioned above. This might complicate your runner job query though.
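A sketch of the shape of that pipeline (job names and scripts are assumed, and the priority bump on the bottleneck job is hypothetical):

```yaml
# Job and file names are assumed; `priority` is a hypothetical keyword.
build-inf:
  stage: build
  priority: 1              # hypothetical: schedule the bottleneck job first
  script:
    - ./build.sh inf

build-a:
  stage: build
  script:
    - ./build.sh a

# Tests for A can only start once both A and Inf have been built.
test-a:
  stage: test
  needs: [build-a, build-inf]
  script:
    - ./test.sh a

# build-b / build-c and test-b / test-c follow the same pattern.
```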
A hack to efficiently utilise specialised runners
A subset of runners in my fleet have GPUs attached. These runners can run regular build and test jobs just fine, plus can also run specialised GPU jobs.
The problem I'm seeing is that these runners often get clogged up with regular jobs (because it picks anything that matches any of its tags) and GPU jobs back up. There might be some fancy footwork with multiple runners throttled by the runner's concurrency option that might get us closer, but afaik nothing that does quite what I'm after.
If I had a priority setting on each job, I would bump the priority of the GPU jobs so that they are favoured by the runners that can process them. This is probably not a good reason to use priority, and a better "ideal world" solution would configure the runners themselves to favour certain tags. But you gotta use the tools you've got... and it won't be the worst GitLab CI hackery I've done.
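Roughly what that would look like (again, priority is hypothetical; tags: is real syntax and the job names are made up):

```yaml
# The GPU-tagged job would be preferred by GPU-equipped runners over the
# generic work they could also pick up.
train-model:
  tags:
    - gpu
  priority: 1              # hypothetical keyword
  script:
    - ./train.sh

unit-tests:
  script:                  # untagged: any runner, default priority
    - ./run_tests.sh
```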
Best case would allow you to deploy in 9 minutes:
starting with build-test and randomly selecting another build
followed by integration-test and the remaining build/tests.
Worst case would allow you to deploy in 15 minutes, 6 minutes longer:
starting with build-staging and build-production
continuing to build-test while the second worker waits
running second-test and third-test
running first-test after second-test finishes
running integration-test last
Our actual pipeline is a bit more complicated than the one above. But in the above example, the order can really speed up the pipeline for a single project. Since this is known by the developers, adding priority to the gitlab-ci.yml file would address this problem.
@bernhard.breytenbach thanks for this example it's really great! I have a follow-up question
Do you know which job is the long running one today? - in this example it's integration-test but is it as easy to spot in your actual pipeline?
Could you get the same result by making the integration-test job run in its own stage, ordered before the current test stage? If you're using needs: the jobs in test would still run alongside it, but it could run first.
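Something like this, presumably (stage and job names assumed):

```yaml
stages:
  - integration
  - test

integration-test:
  stage: integration
  script:
    - ./run_integration_tests.sh

unit-tests:
  stage: test
  needs: []                # real syntax: don't wait for the integration stage
  script:
    - ./run_unit_tests.sh
```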
Yes, we know ahead of time which jobs take the longest.
No. I don't think that runners prefer jobs in earlier stages. They will pick up any job that is ready. I've seen jobs in later stages complete before jobs in earlier stages have even been picked up, even if they all became "ready" at the same time. And if you didn't use needs:, it would have to wait for integration-test to finish before starting the other tests, while some workers might be idle, creating an even worse situation than we have right now.
@bernhard.breytenbach If you know the job durations and have a specific number of runners, would it make sense to configure as many execution paths as you have runners?
While understanding that this change is simpler to make on your simplified configuration, and may be somewhat more complicated in your real configuration, that level of complexity might be required for configuration writers in general to make this kind of solution work.
I'm worried about defining some kind of priority keyword and trying to make it "just work" across other use cases. The configuration has no idea how many runners or jobs there are, so I'm not sure there's a reliable way to make this kind of tradeoff prioritization decision automatically under the hood.
thanks @bernhard.breytenbach - I'm playing around with using needs: with a job in the same stage here, but I'm not sure I'm making any progress on your problem.
Another case where job prioritization would really be useful:
We are using Renovate to make sure our package dependencies are up to date. But a Renovate run can easily create hundreds of update MRs in a very short time, and when this happens, ordinary developer MR pipelines get pushed back in the queue and my developer colleagues are blocked while the CI runners are busy with the (low-prio) update pipelines.
We (FreeDesktop.org) are now also starting to need this, in order to prioritize pipelines merging MRs over other pipelines.
As others have mentioned, a static priority: wouldn't work though, we need it to be part of rules: to be usable.
I would also suggest inverting the logic: 0 would be the default priority if not specified (and the minimum value), and anything above that would be reducing the job's priority: this means you can't go and say "my job is more important than everyone else's" and set it to some huge value, but instead you can yield and let others go first.
That means the worst case when using priority will be the current behaviour before priority; otherwise, the worst case is one user setting a priority higher than everyone else and starving the entire system, which is unacceptable for us (we already have enough issues with that kind of behaviour).
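A sketch of that inverted model (the keyword is still hypothetical and the job names are made up):

```yaml
# Hypothetical inverted semantics: 0 is the implicit default and the best
# possible priority; higher numbers can only make a job *less* urgent.
nightly-full-regression:
  priority: 1000           # happily waits behind everything at priority 0
  script:
    - ./run_full_regression.sh

merge-pipeline-build:
  script:                  # no priority key: default 0, the highest possible
    - ./build.sh
```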
We, as Premium customers, also need this. We have jobs which run in parallel, and we might hit the slowest job at the end of the chain. We would like to prioritize this job so that the whole pipeline's total elapsed time would be about the same as the elapsed time of the prioritized job.
For example, we have 8 jobs for 4 runners, where Job 5 is the slowest one; Job 1, Job 2, Job 3 and Job 4 get started and Job 5 is enqueued.
Job 1 - 19 minutes
Job 2 - 6 minutes
Job 3 - 4 minutes
Job 4 - 2 minutes
Job 5 - 20 minutes
Job 6 - 5 minutes
Job 7 - 7 minutes
Job 8 - 9 minutes
As you can see, the pipeline could take 20 minutes in total if we prioritize some jobs; without this, the pipeline could take up to 39 minutes in the worst case (Job 5 queued behind Job 1 on the same runner: 19 + 20 minutes).
So up to 19 extra minutes to delay a deployment and to block other pipelines because runners are unnecessarily in use.
@gdoyle / @jheimbuck_gl (you likely have new responsibilities now but have been tagged in similar issues) - I was wondering, and hoping, that one of you might have some insight into the plans in this general problem space. I've got a customer looking to move their jenkins processes into GitLab and the ability to allow for prioritization of jobs is an important aspect of their build process. Any insight you might be able to share would be most welcome.
For the record, our use case as freedesktop.org is that we have blocking pre-merge CI pipelines which test on real hardware devices. There's no way around this - we develop the Linux kernel and hardware-specific GPU drivers, so to know if they're any good we have to run them on the hardware. Obviously there's only a limited number of hardware devices we can have, unlike generic runners which can be provisioned on demand.
Our model is that we want to allow anyone to run these jobs at any time: we support dozens of different hardware models from almost a dozen different vendors, so it's not feasible for everyone to have them on their desk. But it's not feasible for us to have 100 of each racked up either.
So this hurts us when we want to support rapid pre-merge CI (because the throughput gives us a strict limit on the number of MRs we can put through), whilst also allowing non-blocking CI use cases like targeted developer testing ('why doesn't this code work on this exotic device I don't have?') and advisory post-merge CI.
Having a strict prioritisation would really help us, by making sure that we can guarantee throughput for the time-critical MR CI, whilst leaving others as best-effort.
Why interested: Ability to prioritize concurrent pipelines to distinguish runner resource priorities to focus on prod jobs and deprioritize / hold testing jobs until prod jobs are completed. Alternatively, test jobs should not be able to overshadow resources that should go to prod job runs.
Workaround: Suggestions? Do we have a prescriptive way to use runner tags / job tags for this?
We do not have a prescription today @cupini, and this is becoming a common request, so we are looking at this on the technical side now. As a note, this is a hard problem to solve for our system.
Based on pipeline rules and variables in the job tags (e.g. commit author is a bot, schedules, or using git push -o ci.variable="BULK_TAG=bulk"; use an empty string to not add the tag), we then enforce that such pipelines only ever run on the bulk runner, always leaving the other runner available should a time-sensitive pipeline need to be run. If there is leftover capacity on the bulk runner it will still help with the normal jobs (a sketch of this setup is below).
You can use this if:
you control the CI/CD config (you need to add the bulk tag under certain conditions)
You get a little bit of duplicated config in the workflow:rules section if you use it heavily, since you need to add the bot and schedule checks with variables separately.
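A sketch of how that setup might be wired together (BULK_TAG and the bulk/shared tag names are just the examples used above, the bot check is an assumption, and variables inside tags: require a reasonably recent GitLab version):

```yaml
# Route bulk pipelines (bot authors, schedules, or pushes that set BULK_TAG via
# `git push -o ci.variable="BULK_TAG=bulk"`) to the bulk runner only.
variables:
  BULK_TAG: ""                                # empty by default: tag not added

workflow:
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
      variables:
        BULK_TAG: bulk
    - if: '$GITLAB_USER_LOGIN =~ /bot/'       # assumed bot-author convention
      variables:
        BULK_TAG: bulk
    - when: always

build:
  tags:
    - shared
    - $BULK_TAG              # "bulk" pipelines can only land on the bulk runner
  script:
    - ./build.sh
```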
We use a technique like this at gitlab.freedesktop.org for Mesa CI (and other projects), but this still isn't sufficient for our needs for multiple reasons:
We associate GitLab runners with real-world machines, and we can't afford to have some machines idle when the high-priority bot isn't using them... but we also can't expose the same DUT via 2 different GitLab runners since both could pick up jobs at the same time (unless you create a wrapper around gitlab-runner to only check one "queue" at a time, like done here: https://gitlab.freedesktop.org/eric/gitlab-runner-priority)
Any fork is free to change the list of tags to run on the non-bulk runners
Ultimately, I believe we need two mechanisms:
Make every pipeline faster by allowing users to increase the priority of jobs that are in the critical chain (nice to have, but non-critical since users could already abuse needs to enforce the order of execution of jobs)
Allow runners to choose which jobs they consider more important, so that the priority policy could be implemented on a per-runner basis without the possibility of tampering by malicious users.
The latter could be done relatively simply by adding 2 REST endpoints to GitLab:
List of runnable jobs that could be run on the runner making the query
Acquire a job by ID
And then gitlab-runner could simply allow users to specify a script to execute every time there is more than one available job, and let the script specify which job ID to execute next (or none).
Problem they are trying to solve: The customer has a limited number of runners and can't autoscale. So sometimes they have jobs in the queue, and some jobs are more important than others and should run sooner.
One classic scenario we see ourselves in is this: we release a new version of our software through a new tag on the trunk, and the CI/CD runs just fine. Then there is an important regression in production, so we want to fix it really quickly. We create a new merge request with the fix, the CI/CD runs and validates things, we merge, the trunk runs another pipeline, and then we tag again, which runs the deployment pipeline; all of this happens while the rest of the team is working normally in their MRs, triggering their own pipelines for development purposes.
So, what happens is that the hotfix and the associated deployment pipelines see their jobs constantly waiting for runners to be available without special prioritization.
Yes, I suppose the workaround could be to have dedicated runners for the deployment pipeline, but it seems overkill, and in our case it would generate new waiting time because it would need to provision brand new runners (we run our own runners leveraging AWS EC2 machines). Also, having more concurrent runners doesn't seem like a good idea either, as this is a prioritization issue.
If at least we could tell the deployment pipeline (those triggered from a tag) to be prioritized, it would help.