I am running my own GitLab instance with a single runner and a CI file containing 7 jobs split across 5 stages. At random times, the runner seems to grab jobs from subsequent pending builds instead of fully finishing the jobs of the current one. Since my builds depend on a 3rd-party resource lock, this mixing leads to build failures (CI YAML attached). I can confirm I only have one runner registered in the system (its config is attached below).
I don't know whether this is due to a bug in GitLab or in the Runner, but the behaviour was present in various 8.13.x versions as well as 8.14 of GitLab, and in 1.5.2, 1.7.2, and 1.8 of the Runner. I know because I kept upgrading in the hope of fixing the issue :-) Cancelling and restarting pipeline builds is tiring.
Steps to reproduce
Create a new project
Make sure one runner is available
Add the .gitlab-ci.yml (attached below)
Create a batch of pending jobs (I ran a one-line change, commit, push sequence a few times)
Watch the pipeline page (see screenshot attached)
Actual behavior
The runner randomly takes jobs from the next pending build instead of finishing the jobs in the current build first
Expected behavior
The runner is expected to finish the jobs in the current build before starting on the next one
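For illustration, a hypothetical .gitlab-ci.yml with the same shape as the setup described above (7 jobs split across 5 stages, with the middle stages holding a lock on a shared 3rd-party resource) might look like this. This is not the attached file; all stage, job, and script names are invented:

```yaml
# Illustrative sketch only; not the reporter's attached config.
stages:
  - build
  - lock
  - deploy
  - test
  - unlock

compile:
  stage: build
  script: make build

unit_tests:
  stage: build
  script: make unit-tests

acquire_lock:
  stage: lock
  script: ./scripts/acquire-lock.sh   # hypothetical helper locking the shared 3rd-party resource

deploy_staging:
  stage: deploy
  script: ./scripts/deploy.sh         # hypothetical deploy step

integration_tests:
  stage: test
  script: make integration-tests

ui_tests:
  stage: test
  script: make ui-tests

release_lock:
  stage: unlock
  script: ./scripts/release-lock.sh   # hypothetical helper releasing the lock
  when: always
```

With a layout like this, a single job from a newer pipeline slipping in between acquire_lock and release_lock is enough to break the build, which matches the failures described above.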
@alex-ufim thanks for opening this issue. Job scheduling is actually a gitlab-ce concern - the runner isn't in control of what jobs it receives - so I'll move this issue onto the gitlab-ce tracker for you where it might get more attention.
Certainly within a branch, I'd say it makes sense to schedule all jobs in pipeline N before considering pipeline N+1, at least in scenarios where you want every pipeline to run (an alternative is that you'd want a new pipeline appearing on the same branch to cancel any existing pipelines).
Nick Thomas changed title: Runner mixes jobs from different builds → Jobs from later pipelines for a branch are sometimes scheduled before earlier pipelines are completed
Nick Thomas moved from gitlab-org/gitlab-ci-multi-runner#1910
Thanks for the swift answer, clarification, and moving the issue to the appropriate place! I have a couple of thoughts regarding your comment:
Certainly within a branch, I'd say it makes sense to schedule all jobs in pipeline N before considering pipeline N+1, at least in scenarios where you want every pipeline to run (an alternative is that you'd want a new pipeline appearing on the same branch to cancel any existing pipelines).
I reckon ordering jobs within builds (pipelines) makes sense not just for a branch but for the project overall, since pipelines are shared per project.
As for pipeline scenarios (every push gets processed by pipelines vs new pipeline cancelling old ones), I think the former should be the default as it is more likely to be what GitLab users want. For example, anything that gets merged into master should be compiled, deployed to staging, and tested (or packaged and tagged), while if someone wants to drop everything and just build the latest push, they can cancel currently running and pending builds manually.
I'm facing the same issue, where the first stage of a pipeline from a newer push sometimes gets executed before the pipeline from a previous push has finished.
That's dangerous for my builds since I cache some stuff between stages which should not be shared between pipelines.
We've been bitten by the current behavior as well. In our case the tests expect a certain revision of a database schema, and the schema update is done in an early stage before the test jobs. Unfortunately we have only one (shared) database, so we need all tests from a given pipeline to finish before a pipeline for another commit starts. Since GitLab tests multiple commits simultaneously, we can't guarantee that only one commit is tested at a time... Any ideas?
I'm deploying to a shared staging environment before running system tests, so this issue also affects my projects. It is made worse by the fact that I cannot deploy to staging and run the tests in a single GitLab CI job: doing so would wrongly report which commit has been deployed to the staging environment (talking about GitLab > Project > Pipelines > Environments) in case the test stage fails.
I feel that, for simplicity, it should by default run all stages of a commit before picking up another commit of the same project. People who need it to scale could enable parallel builds of different commits. If I recall correctly, that's what Jenkins does too.
A more granular option would be to allow linking jobs in the YAML, so as to say that job X must have run on this commit before the current job can run. This would be a lot more complex but more powerful, as one may want per-branch, per-project, or global enforcement of the rule. I'm more a fan of simplicity over crazy features in this case.
Hi, we're facing similar issues: the shell executor builds some Docker containers in the first stage, starts them in the second, and tests them in the third. Often a second active pipeline jumps in, and the earlier pipeline ends up testing a different container/code base. We need one pipeline to finish before another starts in order to have consistency in our CI. Thanks
This is a problem for my pipelines too! I have the stages build, deploy, and trigger next. I need to be sure that build and deploy run before the pipeline of another project starts, as otherwise the other project will not use the artifacts provided by the prior project (which are only available after the deploy)!
I have the same issue with the shell runner and resources shared between pipelines. I have a clean-up stage which cleans up any potential effects of failed tests, but it is not guaranteed to run before tests from the next pipeline start.
@mapeltauer the only 'fix' we were able to find was creating one giant job per pipeline by merging the multiple jobs above into one (a sketch follows below). It is not an ideal solution, but it works for us at the moment.
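Roughly, that workaround collapses the separate stages into a single job, so no job from another pipeline can be scheduled in between the steps (job and script names here are invented):

```yaml
stages:
  - all

everything:
  stage: all
  script:
    - make build                     # former build stage
    - ./scripts/deploy-staging.sh    # former deploy stage (hypothetical script)
    - make test                      # former test stage
    - ./scripts/cleanup.sh           # former cleanup stage (hypothetical script)
```

As pointed out further down in the thread, this still breaks down once more than one job can run concurrently.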
I don't think we have any other workaround unless we change the job dispatcher in GitLab. The current model has some drawbacks, but implementing a better one is not simple either.
This is not really about workarounds, it is more about how the system is designed today. Jobs are processed as soon as they are marked as pending; a pipeline does not block subsequent pipelines from being picked. Also, there is no guarantee that all jobs of a pipeline will run on one runner; the work is meant to be spread across multiple machines.
We are missing a feature to lock a pipeline to a given runner until it finishes, before that runner picks another job. There is a long-outstanding issue tracking that: https://gitlab.com/gitlab-org/gitlab-ce/issues/19772. It would likely be introduced as a way to indicate that you want your next job to continue on the same runner as the previous one, forcing a chain of operations. However, this badly breaks retry, as it would then require retrying the whole dependency chain up to the first job.
This would allow you to chain jobs, but it would force the parallelism down to 1, as you can follow up after at most a single job at a time.
Retrying push or test3 would then also force a retry of build, as that is the job required to prepare the environment.
This is actually great, as it would be very fast due to the caching of the environment.
This is the idea I have in mind for improving that workflow, and the concept already seems very promising.
The default would probably stay as it is today: high parallelism.
We're experiencing this issue as well. This is really disappointing. The workaround we've settled on is to put everything in one stage (the build stage). So now the build stage contains the build process, the tests, and spawning a review app...
If you have more than one concurrent runner, the one fat job consisting of all the build steps is not even a workaround. There could still be situations where the later job completes sooner than the earlier one. This is harmful when you build Docker images and tag them with latest or $CI_COMMIT_REF_SLUG.
I intend to use the pipelines API as a workaround for this:
The first thing a pipeline will do is check whether another pipeline is running for this branch (and not in its last phase, see point 2); if so, bail out.
The last thing a pipeline will do is check whether any pipelines bailed out and trigger the first of them to start up.
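For what it's worth, a minimal sketch of the first of those two checks as a CI job. It assumes a CI/CD variable API_READ_TOKEN holding a token with read_api scope and uses the documented GET /projects/:id/pipelines endpoint; the exact bail-out behaviour (here: fail the job) is an assumption:

```yaml
check_for_running_pipelines:
  stage: build            # or whatever your first stage is called
  script:
    - |
      # Count running pipelines on this branch other than the current one
      # (a real setup might also want to check status=pending).
      OTHERS=$(curl --silent --header "PRIVATE-TOKEN: ${API_READ_TOKEN}" \
        "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/pipelines?ref=${CI_COMMIT_REF_NAME}&status=running" \
        | grep -o '"id":[0-9]*' | grep -vc "\"id\":${CI_PIPELINE_ID}$" || true)
      if [ "${OTHERS}" -gt 0 ]; then
        echo "Another pipeline is still running on ${CI_COMMIT_REF_NAME}, bailing out"
        exit 1
      fi
```

The second half (re-triggering pipelines that bailed out) could presumably be done with the pipeline retry endpoint, POST /projects/:id/pipelines/:pipeline_id/retry, but it is not sketched here.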
Currently our workflow is that we only want to trigger specific jobs (slow tests) on an MR and not on random branches people push. Cancelling older pipelines is also on my todo list, but unfortunately I haven't had much time. Maybe this project helps you though (and please share any updates if you can).
I have also run into this issue. My pipeline builds up the environment in a start stage and tears it down in the last stage. When multiple pipelines are scheduled, I run into the situation where one of the later pipelines has already started before the previous one ends. In one of the middle stages the environment is then broken, because the last stage of the previous pipeline tore it down.
Why has this issue been open for such a long time?
We pay a lot of money for GitLab Enterprise, and my experience so far is that issues take too long to fix.
I am also running into this issue. For cloud deployment testing in stages, it is problematic if a later pipeline can run jobs before the previous pipeline has completed (or failed). It would help if we could enforce a strict FIFO for all pipelines in a branch or group of branches; some people may even have scenarios covering the whole project.
We're having the same issue, though with a slightly different outcome: sometimes two MRs are merged into master within a few seconds of each other, and since jobs are grabbed in no specific order, the result can be that pipeline #1 finishes after pipeline #2. The problem is that at the end of the pipeline we deploy our apps to the server, so we end up with a deployed version that is not from the latest triggered pipeline.
Thank you so much for hopping in here @darbyfrey. I don't think Sticky Runners would fully mitigate this bug, although resource_groups might; I hadn't thought of that! I will wait for them to respond and then add that as a workaround to the bug. Thanks again for triaging.
Why interested: The customer (1900 Premium Seats, Enterprise in EMEA/GERMANY) is affected by this issue in its daily workflows and raised this topic in the last TAM cadence call this week (17th March)
Current solution for this problem: No workaround at the moment. They could imagine a custom script in the pipeline which uses the GitLab API to check for an active pipeline state.
How important to customer: As the customer is facing daily problems due to this issue, it was communicated as important.
Questions: When is this issue getting reviewed for a possible release milestone within the next couple of releases?
Hey @cpritchard1, sorry I hadn't seen your note until now. This issue is blocked by work on the scheduling algorithm, which we hope to pick up later this year (FY23) after our current priority work.
This has been pending for a long time.
This needs to have a solution implemented.
I wonder how such a basic thing can wait for 5 years and be lowered to priority/severity 4?
Also, nobody is assigned to work on it or even think about it.
In this particular case, perhaps we could design a new feature that prioritizes jobs from earlier pipelines, so that these are picked first by the runners. That is not how job picking is implemented today: a pipeline's age does not matter when composing the queue of jobs for processing by runners.
@grzesiek thanks for the note here. It seems like this is working as intended if the scheduling algorithm does not take pipeline order into account, so a feature improvement is in order. If the new scheduling algorithm can take this into account, great; we can then work on Add configurable pipeline priority (#14559) as a follow-up for users that do not want pipelines to always run in order.
It's not a complete solution to this problem, but you can cancel previous jobs when the new pipeline runs. This solves the case where older jobs could affect the environment and you simply don't need them to finish once the new pipeline is triggered.
To do this you need to set interruptible: true at the job level.
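For reference, a minimal sketch of what that looks like (job names are invented). Note that besides the keyword, the project's Auto-cancel redundant pipelines setting (Settings > CI/CD > General pipelines) has to be enabled for the cancellation to happen:

```yaml
build:
  stage: build
  interruptible: true    # may be auto-cancelled if a newer pipeline starts on this branch
  script: make build

deploy:
  stage: deploy
  interruptible: false   # the default; once this job has started, the pipeline is not auto-cancelled
  script: ./deploy.sh    # hypothetical deploy script
```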
So kill the old pipeline when the new one starts? That is no solution, mister. We need the environment from the original pipeline to stay, so killing old jobs solves nothing, and the new job now runs in the wrong environment.
Why interested: The customer wants to control the order of deployment/data migration jobs in their pipelines where order matters; they do not want newer jobs executing before older ones.
Current solution for this problem: None.
Impact to the customer of not having this:
Questions: I had an idea/suggestion that the customer try using resource_groups; however, I see a report earlier in this thread that it does not solve this issue. Do we have an understanding of why setting a resource_group on a concurrency-sensitive job, with the process mode set to oldest_first, would not solve this?
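For context, a sketch of the configuration in question (job and script names are invented): the concurrency-sensitive jobs share a resource_group, and the group's process_mode is then switched to oldest_first through the resource group API (PUT /projects/:id/resource_groups/:key), since as far as I know the process mode cannot be set in the YAML itself.

```yaml
deploy_staging:
  stage: deploy
  resource_group: staging       # only one job in this group runs at a time
  script: ./deploy.sh staging   # hypothetical deploy script

system_tests:
  stage: test
  resource_group: staging       # shares the same lock as the deploy job
  script: make system-tests
```

The open question above is whether oldest_first on that group actually guarantees pipeline-order execution of these jobs, which the earlier report in this thread disputes.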
@mikko7, it is possible today, at the instance level, to disable fair scheduling for shared runners via the ci_queueing_disaster_recovery_disable_fair_scheduling feature flag. Note: it is intended to be used in situations where fair scheduling causes downtime.
This causes serious problems with pipelines that have a deployment step.
Someone merges to master, which starts a pipeline that includes a build step and a deploy step.
After the master branch build step but before the deploy step, someone else pushes up some_branch, which is missing a package the master branch has (this branch will also run a build but not a deploy).
The build for master runs, then the build for some_branch runs (instead of waiting for the master pipeline to finish).
The deploy step for the master pipeline either fails because it uses the build from some_branch (the wrong build), which is missing one of the packages master needs, OR the deployment succeeds but the build causes 500 errors site-wide and customers freak out.