Remove the feature flag depend_on_persistent_pipeline_ref
Status
This feature is currently behind the depend_on_persistent_pipeline_ref feature flag, which is disabled by default.
We'll evaluate this feature in %12.4 and remove the flag once the feature is confirmed stable.
| Timeline | Action |
| --- | --- |
| Oct 9th 3:00 AM UTC | The feature flag depend_on_persistent_pipeline_ref is enabled for all projects on production |
| Fri Oct 4 07:15:43 UTC 2019 | The feature flag depend_on_persistent_pipeline_ref is enabled for gitlab-com/www-gitlab-com and gitlab-org/gitlab on production |
Problem
When a user uses Pipelines for merged results, pipelines run on a (prospective) merge commit instead of the feature branch. This temporary merge commit is generated at refs/merge-requests/:iid/merge and is overwritten whenever 1. the source branch advances or 2. the target branch advances. This causes the following problem in a development flow, especially in a busy repository:
1. A developer pushes a new commit to a merge request that targets master.
2. The prospective merged result is regenerated. Its SHA is SHA-A.
3. A pipeline for merged results is created and runs on the merged result (SHA-A).
4. Another MR is merged into master, i.e. the master branch advances.
5. The prospective merged result is regenerated. Its SHA is now SHA-B.
6. The pipeline that ran on SHA-A is invalidated (the user sees fatal: reference is not a tree: in the job log) and fails.
7. The developer cannot test the code in the merge request.
This is a crucial problem: the developer cannot see whether the new code passes the tests or not.
Workaround
- Mark the MR as WIP to trigger detached merge request pipelines. These pipelines run on the source branch and therefore finish correctly even when the target branch advances.
- Use a merge train. However, merge trains come into play only once a maintainer has decided to merge something, so they cannot be used during the development flow.
Proposal
- Create a refs/pipelines/:iid ref per pipeline.
- The system ensures a corresponding pipeline ref exists when the build status transitions to running.
- Runners fetch code from the pipeline ref instead of the source branch (refs/heads or refs/merge).
- The system cleans up the pipeline ref when the pipeline finishes, i.e. when there are no more active jobs. (A rough sketch of the runner-side change follows below.)
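As a rough illustration only: under this proposal the runner would fetch the pipeline ref rather than the merge ref, so the persisted SHA stays reachable until cleanup. The pipeline id and the use of CI_COMMIT_SHA below are illustrative assumptions, not the final design.

```sh
# Hypothetical runner-side sketch: fetch the per-pipeline ref instead of refs/merge-requests/:iid/merge.
git fetch origin "+refs/pipelines/1234:refs/pipelines/1234"
# The pipeline's persisted SHA stays reachable via refs/pipelines/1234 until the ref is cleaned up,
# so checking it out no longer races with merge-ref rewrites.
git checkout -q "$CI_COMMIT_SHA"
```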
@ogolowinski This is an important problem to fix for continued dogfooding. Can you take a look and prioritize? Let me know if you need further explanation.
> Mark the MR as WIP to trigger detached merge request pipelines. These pipelines run on the source branch and therefore finish correctly even when the target branch advances.
Would it make sense to fall back to the detached pipeline whenever we can't find the SHA we were expecting? That feels like a workaround as well, given we might fall back to the detached pipeline even on the first pipeline if we're dealing with a busy repository.
@oswaldo The tricky thing is that this happens during pipeline execution: the first job of the pipeline runs on the merge commit, but a later job of the pipeline could fall back to the source ref, which would be very confusing.
```
Fetching changes with git depth set to 10...
Initialized empty Git repository in /builds/gitlab-com/www-gitlab-com/.git/
Created fresh repository.
From https://gitlab.com/gitlab-com/www-gitlab-com
 * [new ref]         refs/merge-requests/29549/merge -> refs/merge-requests/29549/merge
Checking out dcfcbe39 as refs/merge-requests/29549/merge...
fatal: reference is not a tree: dcfcbe393f25412b41798cf6b518512c2aca8401
```
Actually, I think the current situation is not as bad as I thought. Here is what's happening on www-gitlab-com now:
1. A developer pushes a new commit to an MR.
2. The MR creates a merge ref (SHA: AAA).
3. The MR creates a pipeline that runs on the merge ref.
4. The runner clones a fresh repository because project.build_allow_git_fetch == false. This wipes out all previously fetched merge SHAs in the build dir, so only SHA: AAA is available.
5. The target branch advances and the MR creates a new merge ref (SHA: BBB).
6. The developer retries an existing pipeline that ran on SHA: AAA.
7. The runner clones a fresh repository because project.build_allow_git_fetch == false. This wipes out all previously fetched merge SHAs in the build dir, so only SHA: BBB is available, and the retry fails.
So the point is that the runner removes the existing repository every time a job runs. However, if we use GIT_STRATEGY: fetch, or make "git fetch" the default strategy in the project settings (i.e. project.build_allow_git_fetch == true), the runner keeps the existing repository, so SHA: AAA remains available even after the merge ref is rewritten on the Rails/Gitaly side.
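To make that concrete, here is a rough sketch of why the fetch strategy keeps the old SHA usable; the MR iid and the AAA/BBB placeholders are illustrative.

```sh
# First job: fetching the merge ref brings SHA AAA into the local object store.
git fetch origin refs/merge-requests/29549/merge
git checkout -q AAA                # works: AAA was just fetched
# ...the merge ref is rewritten server-side to SHA BBB...
# Retried job in the same build dir: the fetch adds BBB, but AAA's objects remain locally.
git fetch origin refs/merge-requests/29549/merge
git checkout -q AAA                # still works, until a local git gc prunes unreachable objects
```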
So it seems to me that our current solution is to enable the following option in www-gitlab-com.
Remember, this option makes git fetch the default, so it would not affect other projects/organizations. I assume www-gitlab-com has it disabled because it's a very old project.
I'd assume the checked-out repo on the runner side is a best-effort cache and could be wiped at some point, but I think there is a large enough window to ensure the merge ref exists for a reasonable period.
This is a risky assumption. The Runner by default falls back to clone if it doesn't have local sources, and this may happen in many cases. I'd not rely on the fetch strategy for this problem.
> The Runner by default falls back to clone if it doesn't have local sources, and this may happen in many cases
Can you elaborate on when this happens? Could the local repo be removed by any chance aside from the git strategy? How do shared runners share the checked-out repo (if the strategy is fetch), or is it deleted every time a job finishes?
> Could the local repo be removed by any chance aside from the git strategy?
Yes. A simple example: a user manually logged in to a machine and deleted the directory. It can always happen that the directory is removed, for any reason. In that case the script executed by the Runner will detect that the directory does not exist, re-create it, and switch to cloning.
> How do shared runners share the checked-out repo (if the strategy is fetch), or is it deleted every time a job finishes?
It's not shared between different Runners (understood as a [[runners]] entry, so the granularity here is smaller than the Runner host!). Within the same [[runners]] worker on the same Runner host it may be preserved, but this depends highly on the executor and configuration used.
Look at the GitLab.com example:
Jobs for GitLab CE/EE use Shared Runners that re-use autoscaled VMs for up to several jobs. This means that after the first clone, the code is stored on the VM (in a special Docker volume, in our case). But the jobs are distributed across four Shared Runner Managers, and on each manager the machines are autoscaled. Your job may hit a manager and a VM that already contain the code, but it may also get a newly created, fresh machine. In that case the Runner must force a clone; otherwise we would be unable to use autoscaled runners.
Other jobs using the general Shared Runners are distributed across four Shared Runner Managers configured to drop the autoscaled VM right after one job is handled (for security reasons). This means the fetch strategy will never be usable on these Runners. If someone uses our Shared Runners together with Pipelines for Merge Requests, then using fetch is not a solution at all; this workaround is not available for such users.
Yes, I agree with @tmaczukin; it is a very strong assumption.
It seems the problem is that, since the target branch advances often, we can never finish the merge train. But that should never happen if only the merge train controls when changes are merged into master. If different actors advance master, it will behave exactly as described by @dosuken123.
Isn't the problem that we use a hybrid approach: merge train plus other actors advancing master?
Maybe the solution is not to advance automatically on a target branch change, but rather: when we complete the current merge train and check whether we can merge, we find the merge cannot happen, and we restart the train.
As a result we will discover the need to refresh the merge train when the pipeline changes, but it will finish in the end.
@ayufan are you suggesting that the default behavior should be to rerun the merge train in the case where the target branch has advanced while the merge train pipeline is running?
This seems like it could result in a situation where the merge train could rerun many times before it can successfully merge, as you say, because we are using the hybrid approach.
Even given that, this does seem like the correct behavior for this situation. Instead of failing and leaving the process in an unusable state, the system will attempt to recover and the process will succeed eventually. Then, we could potentially address the downsides of the hybrid approach in another issue.
@tmaczukin Thank you for the detailed answer. So, simply put, the probability of a job landing on the same machine is 1 / runner_managers.sum(&:number_of_machines) (excluding the general one-off cases). With four managers, even if the number of machines per manager is 1, the chance of avoiding a clone is 1/4 = 25% at most. This would be unacceptable for www-gitlab-com usage. We'd need a more robust solution than that.
@ayufan @darbyfrey First of all, we're talking about Pipelines for merged results, not Merge Train. Merge Train works fine. The problem is that developers cannot verify their code before it gets on the merge train, because pipelines cannot finish successfully while refs/merge is updated during pipeline execution.
Given that the fetch option is not a viable solution, as discussed above, the next solution would be using refs/train for pipelines for merged results. refs/train is not affected when the target branch advances, because it's outside the GitLab refs/merge lifecycle. The ref is not used until the MR gets on the merge train, and while the MR is on the train, users basically do not create pipelines for merged results. So there are no conflicts from a functionality perspective.
However, there is one limitation: users cannot create multiple MR pipelines in a short interval, because every push rewrites refs/train. For example:
1. A developer pushes a new commit to a merge request. refs/train is rewritten.
2. A pipeline for merged results starts its jobs (Pipeline-A).
3. A developer pushes another commit to the merge request. refs/train is rewritten again.
4. A pipeline for merged results starts its jobs (Pipeline-B). Pipeline-A starts failing because it cannot find the associated SHA on the previous refs/train.
Actually, this makes me think that we don't even need to run pipelines for merged results, because the merged result is evaluated on the merge train anyway. So running pipelines on the source ref is totally fine during development.
So I can think of the following solutions today. Let me know what you think.
Solution 1: Create merge refs per pipeline
Proposal:
The problem is that the system has only one merge ref, refs/merge, and it can be overwritten at an unexpected time. We should create a dedicated merge ref per pipeline to ensure that the pipeline can run to completion. For example, today there is one merge ref, refs/merge-requests/1/merge, but we could create a ref per pipeline, formatted as refs/merge-requests/<mr-iid>/pipelines/<pipeline-id>/merge.
Thoughts:
- This is the safest option. Pipelines are not disturbed when the target/source branch advances.
- Users can retry after the pipeline finishes.
- Creating refs is not a cheap operation. This could greatly impact Gitaly.
- It's hard to determine when the system should garbage collect the old refs.
- As the system accumulates more refs, fetch/clone could become slower.
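A minimal sketch of what creating and cleaning up such a ref could look like, using plain git plumbing for illustration (the iid and pipeline id are made up; in reality this would go through Gitaly):

```sh
# Snapshot the current merged result under a pipeline-scoped ref name.
MERGE_SHA=$(git rev-parse refs/merge-requests/1/merge)
git update-ref refs/merge-requests/1/pipelines/42/merge "$MERGE_SHA"
# Later, whenever we decide to garbage collect the old ref:
git update-ref -d refs/merge-requests/1/pipelines/42/merge
```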
Solution 2: Always check out the latest SHA of the merge ref (Runner)
Proposal:
Pipelines are persisted with the SHA of the ref. This ensures that all pipeline jobs check out the same SHA (i.e. the same code). However, this doesn't work well for dynamic refs such as refs/merge, because they are updated at unexpected times.
Give runners an option so that, when a job runs on refs/merge, the runner always checks out the latest SHA instead of the persisted SHA. That is, we'd replace git checkout <SHA> with git checkout refs/merge-requests/1/merge.
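For illustration, the runner-side difference would look roughly like this, assuming the runner fetches the merge ref right before checking out (the SHA is the one from the job log above):

```sh
# Today: the runner checks out the SHA persisted on the pipeline record.
git checkout -q dcfcbe393f25412b41798cf6b518512c2aca8401
# Solution 2: always check out whatever the merge ref currently points at.
git fetch origin refs/merge-requests/1/merge
git checkout -q FETCH_HEAD
```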
Thoughts:
- This would be one of the cheapest solutions from a performance standpoint. We can use refs/merge directly.
- The SHA shown on the pipeline UI cannot be trusted; the actual SHA used by the runner could be different.
Solution 3: Disable Pipelines for merged results by default and make it opt-in
Proposal:
The system creates detached pipelines during development and a merge train pipeline when the MR is merged. Pipelines for merged results do not run unless users explicitly enable the project-level option (which should be turned on only for non-busy repos).
Thoughts:
- This would be the simplest solution (or, more accurately, a workaround).
- Users cannot use pipelines for merged results by default.
- Interestingly, GitHub does the same: they have a single /merge ref that represents the latest version.
My proposal
I really wonder if it is possible for the runner to simply request the given SHA instead of a refspec, and to say that for using merged results you simply need the latest Git. These loose refs will be recycled after some time. This seems to work:
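Roughly along these lines, assuming a server with uploadpack.allowAnySHA1InWant enabled; the repository URL and SHA below are illustrative:

```sh
git init repo && cd repo
git remote add origin https://gitlab.com/gitlab-com/www-gitlab-com.git
# Request the exact commit by SHA instead of a ref name.
git fetch origin dcfcbe393f25412b41798cf6b518512c2aca8401
git checkout -q FETCH_HEAD
```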
So, we would use a single refs/merge-requests/:iid/merge, but instead of requesting the ref, we could send the exact SHA to request in the refspec.
There's a small window in which these revs could be recycled by GC, but maybe we can create some internal tmp references for them and remove them after some time.
Shinya's proposed solutions
> Solution 1: Create merge refs per pipeline
This is the only option that seems to make sense to me if the above doesn't work. One thing to add: we need to ensure that we clean up after ourselves. If we keep dangling refs for every pipeline and merge request, the whole system will struggle.
> Creating refs is not a cheap operation. This could greatly impact Gitaly.
We are already doing that, so the cost would be identical; we would only store it elsewhere.
> It's hard to determine when the system should garbage collect the old refs.
> As the system accumulates more refs, fetch/clone could become slower.
It should be fine as long as we keep the number of refs minimal, i.e. no dangling refs.
I know that Git has an option to fetch a specific SHA instead of a ref. Maybe this could be used on newer versions, if possible.
Maybe a simple solution would be to degenerate pipelines after some time, like 7 days, and remove these refs at the same time. We already have the option to archive(?) builds, but maybe actually executing this now would make sense. Likely something to think about for the future.
So, maybe just pushing a Sidekiq job scheduled for 7 days from now to remove the given ref would be the way forward to fix the explosion of refs? I don't know if this is a good idea, but some cleanup needs to be there.
Proposal 3 is a workaround and will not work for heavily used repos where people want (or need) to use Pipelines for MRs.
Proposal 2 breaks our design of having all jobs in the pipeline use exactly the same code. If I have jobs tests 1 and tests 2 with all tests distributed across these two jobs, and they use two different codebases (from two different versions of the merge ref), then how can I say that I've fully tested the code?
Proposal 1 seems to be the only proper solution here. Having a defined time for job/pipeline archival connected with the removal of such references should not create that big an impact on the whole system.
I also like the idea of fetching a specific SHA. In most cases the SHA should still be present in GitLab. @ayufan, do you know which version of Git is required to use this?
> I also like the idea of fetching a specific SHA. In most cases the SHA should still be present in GitLab. @ayufan, do you know which version of Git is required to use this?
Something like 2.x. I don't know. We can easily test that.
Not sure if this is at all possible, but is there any way to know when a pipeline starts and stops? If we knew that, then it seems like we could create the ref when the pipeline starts and remove it when it finishes. That way we could always start with a current SHA and wouldn't have to worry about dangling refs.
@darbyfrey It's also not as easy as "remove the ref after the pipeline is finished", because retrying a job would then be impossible. The pipeline has its ref and SHA stored forever, and any retry will try to use this pair forever.
Using a ref dedicated to a pipeline that is removed after some time, or using SHA fetching (where the SHA is available until the next GC run), gives the user some additional time to retry the job if wanted/needed. Using a ref that is removed the moment the pipeline first transitions to a finished state means that any further retry will fail.
That makes sense. I can see why we would want to support the ability to retry. I guess what I was thinking was that if it were possible to create a new ref at the start of the pipeline run and then destroy it at the end, then it could isolate that branch to the specific pipeline run. In the case of a retry, it would just start the process over again by creating a new ref.
However, if creating a new ref for each run is prohibitively expensive, then this probably wouldn't be an option.
@ayufan @tmaczukin @darbyfrey Thank you for the input. I also agree that fetching the SHA instead of fetching the ref would be the best option. I'll take a closer look at it today.
Regarding the ref cleanup timing, we can basically clean it up when the MR is closed/merged. That gives users a large enough window to retry pipelines.
> Regarding the ref cleanup timing, we can basically clean it up when the MR is closed/merged. That gives users a large enough window to retry pipelines.
We do not control that, but we could create bogus links that would prevent some SHAs from being removed for some time. We use that with tmp/ refs today, I think.
> However, if creating a new ref for each run is prohibitively expensive, then this probably wouldn't be an option.
@darbyfrey It is not only expensive; it is practically impossible to reproduce the exact same SHA. You would need to ensure the exact same timing of the operation (a merge commit's SHA depends on its timestamps), and likely some other factors as well. This means it would be flaky to ensure that we always get the same SHA. A pipeline is created for a given SHA, so we cannot change the SHA mid-way.
I was investigating @ayufan's proposal to fetch a specific SHA without specifying refspecs. This is not currently possible because gitlab.com doesn't allow users to set the uploadpack.allowAnySHA1InWant option today, which must be true if we want to accomplish that.
However, there are ongoing discussions about enabling the option on gitlab.com, which seems to be happening for the %12.3 release. If we can ride that wave, this issue becomes basically an easy fix: we'd only change the Rails side to send +<SHA>:<merge-ref> as the refspec to runners. One note: we wouldn't enable it for all projects/users immediately, because it's a fairly expensive operation and there are performance concerns. It could take a long time in the worst-case scenario.
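For illustration, the Rails-to-runner change would amount to a refspec along these lines (the SHA and iid are illustrative, and the server must have uploadpack.allowAnySHA1InWant enabled):

```sh
# Fetch the exact persisted SHA into the local merge ref, instead of asking for the ref by name.
git fetch origin "+dcfcbe393f25412b41798cf6b518512c2aca8401:refs/merge-requests/29549/merge"
git checkout -q refs/merge-requests/29549/merge
```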
If we want to stick with a simple and performant architecture, we'd want to wait for the allowAnySHA1InWant option to be rolled out; otherwise, we'd have to go with the solution 1 I proposed, which creates merge refs per pipeline.
I personally would wait for uploadpack.allowAnySHA1InWant.
Maybe, if this is going to be a feature flag on the project side, we could detect whether allowAnySHA1InWant is available and simply use it, or otherwise fall back to the existing refs/.../merge (which is somewhat broken). If we follow that, the problem solves itself for all projects that have the option enabled.
There is an alternative approach: downloading the source code via the archive API. Something like:
```sh
# In runner
wget -O archive.zip https://gitlab.com/gitlab-com/www-gitlab-com/-/archive/3674ffc3967c73bfd160d640be151ab590d59e2a
tar -xvf archive.zip
mv archive <builds-dir>
```
This way, we don't need to specify refs at all. However, since the archive is just a set of files, users cannot use git commands in the builds dir, which could be a significant limitation, so the effort wouldn't pay off.
I wonder. Maybe we could still "link" the relevant refs to the pipeline lifecycle. This would just generate quite a bit of pressure on info/refs, which would be discarded often if we cache it. It should be possible: since the SHA should be held by a keep-around ref, it should work to dynamically link it to the expected exposed ref and remove that once the pipeline transitions to the success state.
Maybe even do it on the build transition to pending, and clean it up on the pipeline transition to a finished state.
It seems that we could hook into after_transition pending => :running and ensure that the given ref exists for the build. If not, link it to :sha. If we cannot link :sha, we could fail the build with an appropriate failure_reason.
Then, when the pipeline transitions to a finished state, we could remove this auto-generated ref.
This should handle retries pretty well too, and it doesn't require using git fetch <SHA>.
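A rough shell-level sketch of that lifecycle, assuming an illustrative refs/pipelines/:id layout; in reality both steps would be Gitaly operations driven by the state machine hooks above:

```sh
PIPELINE_REF="refs/pipelines/$PIPELINE_ID"
# On after_transition pending => :running: ensure the ref exists for the build.
if ! git show-ref --verify --quiet "$PIPELINE_REF"; then
  if git cat-file -e "$PIPELINE_SHA^{commit}"; then
    git update-ref "$PIPELINE_REF" "$PIPELINE_SHA"   # link the ref to the pipeline's persisted SHA
  else
    echo "SHA no longer exists: fail the build with an appropriate failure_reason" >&2
  fi
fi
# On pipeline transition to a finished state: remove the auto-generated ref.
git update-ref -d "$PIPELINE_REF"
```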
@ayufan Actually, I came up with yet another approach.
Solution 5: Retain history on the ref
We use refs/merge-requests/:iid/retain instead of refs/merge-requests/:iid/merge (a rough sketch follows the steps below):
1. When a user pushes a new commit or creates a pipeline for merged results, the system generates refs/retain from the HEADs of the target and source branches.
2. If the ref already exists, it continues with the next steps instead.
3. It merges the source branch into refs/retain (or cherry-picks the diff commits), if the source has advanced.
4. It merges the target branch into refs/retain (or cherry-picks the diff commits), if the target has advanced.
5. It creates a pipeline on refs/retain with the latest SHA.
6. If a force-push happens on the source or target branch, it regenerates refs/retain.
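A rough sketch of that maintenance in plain git, with illustrative branch names (the real logic would live in Gitaly and would choose between merge and cherry-pick as described above):

```sh
RETAIN="refs/merge-requests/1/retain"
if git show-ref --verify --quiet "$RETAIN"; then
  git checkout --detach "$RETAIN"          # ref exists: continue from the retained history
else
  git checkout --detach origin/feature     # first creation: start from the source HEAD
fi
git merge --no-edit origin/feature         # fold in the source branch if it advanced
git merge --no-edit origin/master          # fold in the target branch if it advanced
git update-ref "$RETAIN" HEAD              # the next pipeline runs on this new HEAD
```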
Thoughts:
- The number of refs will not increase significantly, compared to the create-ref-per-pipeline approach.
- The additional git operations wouldn't be cheap.
- If a force-push/squash happens, the previous pipelines would fail. (But this is already the case with branch pipelines.)
I think it is in general a nice feature that "we control" the force-push behavior; still, I think dynamically creating refs would likely be the better solution.
You still have the problem with git_depth, as a lot of new commits can prevent us from fetching older revs.
I'm also uncertain whether we might hit merge conflicts even without a force push, just a regular push, in cases where we would otherwise merge things just fine (unlikely, but who knows).
In general this seems interesting, but I still wonder whether creating refs dynamically wouldn't simply be better.
Here is a rough estimate if we go with creating a ref per pipeline and advertising it, e.g. refs/pipelines/:iid. The number of refs increases by around 1,000 per day on the gitlab-org/gitlab project, for instance.
If we retain the dedicated refs for 7 days (mainly for retry purposes), the number of additional advertised refs would stay at around 7,000.
Here is a comparison with the other ref types:
```
$ grep -a 'refs/merge-requests/' /home/shinya/Downloads/refs | wc -l
14272
$ grep -a 'refs/environments/' /home/shinya/Downloads/refs | wc -l
32578
$ grep -a 'refs/heads/' /home/shinya/Downloads/refs | wc -l
2120
```
I wonder if we should schedule #26201 (closed) as well, to reduce the total advertised ref size and make some room for the new pipeline-dedicated refs.
Well, actually I think refs/pipelines/:sha might be better in this case. That way, pipelines that run on the same SHA share the same ref, so the total number of refs would be smaller.
The expiration policy is simple: hook into PipelineFinishedWorker as described in #14863 (comment 217553367).
The simplest and the most stable option. Consider, though, that if we hook into the pipeline lifecycle and use a shared ref like refs/pipelines/:sha, multiple pipelines end up managing the same ref, which will result in race conditions (e.g. one pipeline finishing and deleting the ref while another pipeline on the same SHA is still running). Using refs/pipelines/:iid would be more stable in such cases, as you are sure you are controlling only your own pipeline.