Proposal to improve unlocking pipelines and associated artifacts
Why?
We have multiple existing issues related to unlocking pipelines and their associated artifacts. These include both bugs and performance issues:
- Bugs:
- Performance issues:
The previous attempt to fix the bugs resulted in a severity 1 incident.
From the incident, we discovered that the current process for unlocking a pipeline and its associated artifacts can generate a very large number of updates in a very short time. When this happens, it puts an enormous load on the database replicas, leading to wide-ranging performance degradation.
Proposal
The proposed approach can be broken into 2 phases:
- Break down existing unlocking process into smaller scope.
- Reintroduce the fix to unlock pipelines in failed and blocked states.
Break down existing unlocking process into smaller scope
Previously, a single execution of Ci::UnlockArtifactsService would perform the following in a loop:
- Selecting pipeline IDs that need to be unlocked
- Unlocking these pipeline IDs
- Unlocking job artifacts associated with these pipeline IDs
- Unlocking pipeline artifacts associated with these pipeline IDs
If there are X pipeline IDs, and each pipeline has Y job and pipeline artifacts in total, this service would issue roughly X * Y updates against the tables almost immediately.
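To make the X * Y estimate concrete, the following is a minimal sketch that counts the row updates a single run would issue. The method name, the hash keys, and the sample numbers are hypothetical and chosen only for illustration; this is not the service's actual code.

```ruby
# Hypothetical in-memory stand-in for one run of the existing unlocking loop.
# Each artifact row (job artifact or pipeline artifact) counts as one update.
def updates_in_single_run(pipelines)
  pipelines.sum { |pipeline| pipeline[:job_artifacts] + pipeline[:pipeline_artifacts] }
end

# Assumed sample data: X = 1_000 pipelines, Y = 48 + 2 = 50 artifacts each.
pipelines = Array.new(1_000) { { job_artifacts: 48, pipeline_artifacts: 2 } }

puts updates_in_single_run(pipelines) # prints 50000
```

With these assumed numbers, one run produces 50,000 updates in a burst, which is the kind of spike the proposal aims to spread out.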
To mitigate this impact, we could use a limited capacity worker that is rate limited. Each worker would perform updates only on a single pipeline ID and its associated artifacts. The overall rate of change would then be limited by the concurrency of this worker.
When it is determined that a pipeline needs to be unlocked, its ID is put onto a queue. The above worker would consume this queue, unlocking one pipeline at a time.
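The queue-and-worker idea above can be sketched in plain Ruby. This is a simulation under stated assumptions, not the proposed implementation: it uses an in-process Queue and threads instead of a database-backed queue and background jobs, and MAX_RUNNING_JOBS, drain, and unlock_pipeline are hypothetical names.

```ruby
MAX_RUNNING_JOBS = 2 # hypothetical concurrency cap

# Hypothetical stand-in for unlocking one pipeline and its artifacts.
def unlock_pipeline(pipeline_id, unlocked, lock)
  lock.synchronize { unlocked << pipeline_id }
end

# Drains the queue with at most max_running_jobs consumers.
# Each consumer handles exactly one pipeline ID at a time.
def drain(queue, max_running_jobs)
  unlocked = []
  lock = Mutex.new
  running = 0
  peak = 0

  workers = Array.new(max_running_jobs) do
    Thread.new do
      loop do
        pipeline_id = begin
          queue.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break
        end

        # Track how many pipelines are being processed at once.
        lock.synchronize do
          running += 1
          peak = [peak, running].max
        end
        unlock_pipeline(pipeline_id, unlocked, lock)
        lock.synchronize { running -= 1 }
      end
    end
  end

  workers.each(&:join)
  { unlocked: unlocked, peak: peak }
end

queue = Queue.new
(1..10).each { |id| queue << id }

result = drain(queue, MAX_RUNNING_JOBS)
puts "unlocked #{result[:unlocked].size} pipelines, peak concurrency #{result[:peak]}"
```

However many pipeline IDs are enqueued, the number being unlocked at any moment never exceeds the cap, so the update rate is bounded by the concurrency setting rather than by the size of the backlog.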
Reintroduce the fix to unlock pipelines in failed and blocked states
In the subsequent step, we can reintroduce the fix to unlock the failed and blocked pipeline states. Having done the previous step, the potential spike of pipelines that need unlocking would no longer be a concern, because the actual updates would be done in a controlled manner.
The changes needed to fix the bugs would be limited to identifying the pipeline IDs that need to be unlocked. The fix would then put these pipeline IDs into the same queue implemented in the first phase.
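A minimal sketch of this second phase follows. It assumes a placeholder selection rule, because the proposal does not spell out the actual one: the status strings are illustrative labels taken from the wording above rather than real status values, and UNLOCKABLE_STATUSES, pipeline_ids_to_unlock, and enqueue_unlock_requests are hypothetical names. The queue is modelled as a plain array.

```ruby
# Hypothetical status labels; the real selection rule is not specified here.
UNLOCKABLE_STATUSES = %w[success failed blocked].freeze

# Phase two only decides WHICH pipeline IDs need unlocking.
def pipeline_ids_to_unlock(pipelines)
  pipelines
    .select { |pipeline| pipeline[:locked] && UNLOCKABLE_STATUSES.include?(pipeline[:status]) }
    .map { |pipeline| pipeline[:id] }
end

# The IDs go onto the same queue as in phase one; IDs already queued are skipped.
# Returns the number of newly enqueued IDs.
def enqueue_unlock_requests(queue, pipeline_ids)
  new_ids = pipeline_ids.uniq - queue
  queue.concat(new_ids)
  new_ids.size
end

pipelines = [
  { id: 1, status: "success", locked: true },
  { id: 2, status: "failed",  locked: true },
  { id: 3, status: "blocked", locked: true },
  { id: 4, status: "running", locked: true },
  { id: 5, status: "failed",  locked: false }
]

queue = [1] # pipeline 1 was already enqueued by the phase one path
added = enqueue_unlock_requests(queue, pipeline_ids_to_unlock(pipelines))

puts added         # prints 2
puts queue.inspect # prints [1, 2, 3]
```

Because the fix only adds IDs to the queue, a large number of failed or blocked pipelines lengthens the backlog without increasing the rate at which updates reach the database.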
Implementation detail
The following diagram illustrates the proposed 2-phase change.
The existing behaviour is shown in black: every time Ci::UnlockArtifactsService is called, it queries for all pipeline IDs that need to be unlocked, then unlocks the job and pipeline artifacts of all of these pipelines.
The first phase to break down the unlocking workload is shown in green. Ci::UnlockArtifactsService would enqueue the list of pipeline IDs into a database-backed queue, Ci::UnlockPipelineRequest. A new worker, Ci::UnlockPipelineAndArtifactsWorker, will pick a single pipeline off the queue in order to unlock the pipeline and its associated artifacts. This worker uses LimitedCapacity::Worker to give us the ability to control the rate at which the pipelines are unlocked. This spike MR illustrates this change.
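The shape of such a worker might look like the sketch below. It assumes the three hooks that GitLab's limited capacity worker documentation describes (perform_work, remaining_work_count, and max_running_jobs). To stay runnable on its own, the sketch does not include LimitedCapacity::Worker; the queue class, the Sketch suffix, and the cap of 10 are hypothetical stand-ins, and the queue is held in memory rather than in the database.

```ruby
# Hypothetical in-memory stand-in for the database-backed queue.
class UnlockPipelineRequestQueue
  def initialize(ids = [])
    @ids = ids.dup
  end

  # A pipeline ID is queued at most once.
  def enqueue(id)
    @ids << id unless @ids.include?(id)
  end

  # Removes and returns the oldest queued pipeline ID, or nil when empty.
  def shift_next
    @ids.shift
  end

  def size
    @ids.size
  end
end

# Sketch of the worker's shape; the real worker would include LimitedCapacity::Worker.
class UnlockPipelineAndArtifactsWorkerSketch
  MAX_RUNNING_JOBS = 10 # hypothetical cap

  attr_reader :unlocked

  def initialize(queue)
    @queue = queue
    @unlocked = []
  end

  # One job execution unlocks exactly one pipeline and its artifacts.
  def perform_work(*)
    pipeline_id = @queue.shift_next
    return if pipeline_id.nil?

    @unlocked << pipeline_id # placeholder for the actual unlock updates
  end

  # Tells the scheduler how much work is left on the queue.
  def remaining_work_count(*)
    @queue.size
  end

  # Upper bound on concurrently running jobs.
  def max_running_jobs
    MAX_RUNNING_JOBS
  end
end

queue = UnlockPipelineRequestQueue.new([11, 12, 13])
queue.enqueue(12) # already queued, ignored

worker = UnlockPipelineAndArtifactsWorkerSketch.new(queue)
worker.perform_work while worker.remaining_work_count.positive?

puts worker.unlocked.inspect # prints [11, 12, 13]
```

The loop at the end runs the jobs one after another for simplicity; in the proposed design the limited capacity mechanism would decide how many such jobs run at once, up to the configured maximum.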
The second phase to fix the bug is shown in blue. With the queue in place, the fix to these issues can directly insert the pipeline IDs into Ci::UnlockPipelineRequest and reuse the same unlocking process for the artifacts.
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.