Proposal: Detect merge-train failures on canonical merge requests

Context

During a security release, mirroring between the GitLab security and canonical repositories is turned off to avoid revealing vulnerabilities ahead of the release. During this period, changes are merged into both GitLab security and GitLab canonical, creating a disparity between the two repositories. To work around this, the merge train keeps the security repository updated by continuously merging canonical code into it.

If the canonical and security code conflict, the merge train execution fails. These conflict failures block the deployment of regular code changes to GitLab.com and interfere with the release schedule. Often, the conflicts are discovered close to the security release date and require release managers' involvement to resolve. This strategy has multiple disadvantages:

Specific domain knowledge required: Solving the conflicts requires knowledge of specific areas of the GitLab codebase. This is outside the scope of the release manager role, and it is incompatible with the dev-escalation process, where an engineer with no knowledge of the conflicting code might be chosen to help.
Incompatible with current customer needs: Delivery wants to support multiple and combined (security and patch) releases per month to address customer needs. Considering the workload it takes to fix the merge conflicts, dealing with these failures on a weekly basis interferes with this plan.
Interferes with GitLab.com deployments: Deployments to GitLab.com use the GitLab security repository as the SSOT. During a security release, this repository is updated with the GitLab canonical code via the merge train. When the merge train fails, the GitLab security repository stops being updated, which prevents regular code changes from being deployed to GitLab.com until the merge train is unblocked.
Time-consuming process: Fixing the merge conflicts is a lengthy task. Time is spent finding a domain expert, creating a merge request, and having it approved and ready to be merged (previous attempts have taken around 19hrs to mitigate).

Security release	# conflicts	GitLab.com deployments blocked for
https://gitlab.com/gitlab-org/release/tasks/-/issues/6098	2	10hrs
https://gitlab.com/gitlab-org/release/tasks/-/issues/5911	3	17hrs
https://gitlab.com/gitlab-org/release/tasks/-/issues/5743	1	6hrs
https://gitlab.com/gitlab-org/release/tasks/-/issues/6330	1	6hrs
production#17703 (closed)	1	14hrs

Breakdown

Security release	MR	GitLab.com deployments blocked for
https://gitlab.com/gitlab-org/release/tasks/-/issues/6098		10hrs
	https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/3452	4hrs
	https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/3460	3hrs
	https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/3463	3hrs
https://gitlab.com/gitlab-org/release/tasks/-/issues/5911		17hrs
	https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/3371	8hrs
	https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/3370	6hrs
	https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/3369	3hrs
https://gitlab.com/gitlab-org/release/tasks/-/issues/5743		6hrs
	https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/3176	6hrs
https://gitlab.com/gitlab-org/release/tasks/-/issues/6330		6hrs
	https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/3523	6hrs
production#17703 (closed)		14hrs
	https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/3920	14hrs

Risks the release and deployment schedule: The security release is a time-sensitive process. The additional time release managers spend fixing conflicts puts the release of security fixes to self-managed customers at risk.

To reduce merge-train failures, code incompatibilities should be discovered during the development stage.

Proposal: Detect code conflicts during the development cycle of canonical merge requests

Detect code conflicts during the development cycle of canonical merge requests by adding a dedicated CI job to merge request pipelines. The job verifies the compatibility between the GitLab canonical branch and the code on the GitLab security default branch. If an incompatibility is found, the canonical merge request should be prevented from merging.

Engineer workflow:

During a security release, an engineer creates a merge request on GitLab canonical.
The merge request follows the code review process and is approved by a GitLab maintainer.
A pipeline is triggered on the merge request as part of the development cycle.
A job on this pipeline triggers a pipeline on the GitLab Security repository.
A pipeline is triggered on GitLab security to verify:
- Whether there is a security release in progress, and
- Whether the code from the canonical branch is compatible with the security code.
There are two options to determine whether a security release is in progress:
- Check the repository mirroring status, or
- Check whether the repositories are out of sync via Git.
To verify whether the canonical code is compatible with the security code:
1. The canonical branch will be merged into the security default branch.
2. If the merge succeeds, the execution is stopped.
3. If the merge fails, proceed to the next step.
If an incompatibility is found:
1. A merge request is opened on GitLab security and assigned to the author of the canonical merge request. By default, this merge request surfaces the incompatibility through its merge conflicts.
2. The job fails, and its log tells the user that the merge request can't be merged.
At this point, to move the canonical merge request forward, the engineer has two options:
1. Wait for the security release to be completed and the repositories to be synced, or
2. Take the extra step of fixing the conflict on the security merge request and getting it merged.
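
The compatibility check in the workflow above could be sketched as follows. This is a minimal illustration in Python, not the actual job. The function names (`run_git`, `merges_cleanly`) and the example ref `canonical/master` are hypothetical, and the sketch assumes the job runs inside a checkout of the security default branch with the canonical branch already fetched. It relies only on `git merge --no-commit --no-ff` and `git merge --abort`.

```python
import subprocess
from typing import Callable, List

# A callable that runs one git command and returns its exit code.
GitRunner = Callable[[List[str]], int]


def run_git(args: List[str]) -> int:
    """Run a git command in the current directory and return its exit code."""
    return subprocess.run(["git", *args], check=False).returncode


def merges_cleanly(canonical_ref: str, git: GitRunner = run_git) -> bool:
    """Trial-merge canonical_ref into the currently checked-out security branch.

    The merge is never committed, and the working tree is restored afterwards.
    Returns True when git reports no conflicts.
    """
    exit_code = git(["merge", "--no-commit", "--no-ff", canonical_ref])
    # Undo the trial merge whether or not it conflicted. `git merge --abort`
    # exits non-zero when there is nothing to abort, which is ignored here.
    git(["merge", "--abort"])
    return exit_code == 0


if __name__ == "__main__":
    # Hypothetical ref name; a real job would take it from the pipeline.
    if merges_cleanly("canonical/master"):
        print("Canonical branch merges cleanly into security.")
    else:
        raise SystemExit("Conflict found: the canonical merge request can't be merged.")
```

Passing the runner as a parameter keeps the decision logic separate from the git invocation, so the check can be exercised without a real repository.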

Considerations to prevent leaking any security detail:

The code verification happens in a private repository limited to GitLab team members.
The verification should only happen if a security release is in progress.
The job should only be triggered when a GitLab team member has approved the merge request (not on forks).
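
The gating conditions above could be expressed as a small predicate. The sketch below is illustrative only. It assumes the "out of sync via Git" option is implemented with `git rev-list --left-right --count <canonical>...<security>`, which prints two tab-separated counts, and that commits present only on the security side indicate a security release in progress. It also reads "not on forks" as "the pipeline does not run in a fork". The function and parameter names are hypothetical.

```python
from typing import Tuple


def parse_divergence(rev_list_output: str) -> Tuple[int, int]:
    """Parse the output of `git rev-list --left-right --count A...B`.

    Git prints "<commits only in A>\t<commits only in B>".
    """
    only_in_a, only_in_b = rev_list_output.split()
    return int(only_in_a), int(only_in_b)


def should_run_check(
    security_only_commits: int,
    approved_by_team_member: bool,
    pipeline_runs_in_fork: bool,
) -> bool:
    """Decide whether the compatibility job should run.

    Assumption: any commit that exists only on the security default branch
    means a security release is in progress.
    """
    release_in_progress = security_only_commits > 0
    return (
        release_in_progress
        and approved_by_team_member
        and not pipeline_runs_in_fork
    )


if __name__ == "__main__":
    # Example: A is the canonical default branch, B the security default branch.
    _, security_only = parse_divergence("12\t3\n")
    print(should_run_check(security_only, approved_by_team_member=True,
                           pipeline_runs_in_fork=False))
```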

Pros

Reduces merge train failures by detecting code incompatibilities during the development cycle.
Supports FedRAMP SLAs for security fixes by prioritizing security fixes over canonical changes.
Shifts the responsibility of fixing code conflicts from release managers to engineers with domain expertise in the GitLab codebase.
Supports the update of the GitLab security repository with minimum intervention.
Prevents halting GitLab.com deployments and allows the release schedule to stay on course.
Dogfoods GitLab's merge conflict resolution tool.

Cons

It could impact the velocity of canonical changes. This risk should be mitigated once we have multiple and combined releases per month.
Since community contributors don’t have access to the GitLab security repository, a conflict found in a community merge request could slow down community contributions.

An alternative proposal to detect conflicts on security merge requests was also analyzed, but it was discarded because:

It prioritizes canonical fixes over security ones, impacting the FedRAMP SLA.
It is not efficient given the high volume of Git traffic the GitLab canonical repository receives: a security merge request can pass the check at merge time and still conflict by the time the merge train runs.

Details of this proposal are below.

Proposal: Detect code conflicts on security merge requests

A strategy similar to the proposal above can be used to detect conflicts during the development cycle of security merge requests: a job can be added to security merge request pipelines to verify the compatibility between the code in a security branch and the existing code on the GitLab canonical default branch.

Developer workflow:

An engineer works on a security fix in the GitLab security repository.
A pipeline is triggered on the merge request as part of the development cycle.
A job that is not allowed to fail on this pipeline verifies whether the security code is compatible with the code on the GitLab canonical default branch by:
1. Cloning the GitLab canonical repository.
2. Attempting to merge the security branch into the canonical default branch.
3. If the merge succeeds, the job succeeds.
4. If the merge fails, proceed to the next step.
If the code is incompatible:
1. The CI job fails
2. A note is created on the merge request alerting the author.
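
The note in the last step could be created through the GitLab REST API endpoint `POST /projects/:id/merge_requests/:merge_request_iid/notes`. The sketch below only builds the request and is not the actual job. The helper names, the token, and the example values are hypothetical. In a real pipeline, the API URL, project ID, and merge request IID could come from the predefined CI variables `CI_API_V4_URL`, `CI_PROJECT_ID`, and `CI_MERGE_REQUEST_IID`, and the list of conflicting files from `git diff --name-only --diff-filter=U` after the failed merge.

```python
import json
import urllib.request
from typing import List


def format_conflict_note(conflicting_files: List[str]) -> str:
    """Build the Markdown body of the note that alerts the author."""
    lines = [
        "This security branch does not merge cleanly into the canonical default branch.",
        "",
        "Conflicting files:",
    ]
    lines += [f"- `{path}`" for path in conflicting_files]
    return "\n".join(lines)


def build_note_request(
    api_url: str,
    project_id: str,
    mr_iid: str,
    token: str,
    conflicting_files: List[str],
) -> urllib.request.Request:
    """Prepare (but do not send) the request that creates the merge request note."""
    url = f"{api_url}/projects/{project_id}/merge_requests/{mr_iid}/notes"
    payload = json.dumps({"body": format_conflict_note(conflicting_files)})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        method="POST",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    request = build_note_request(
        "https://gitlab.example.com/api/v4", "42", "7", "<token>",
        ["app/models/user.rb"],
    )
    print(request.get_method(), request.full_url)
    # Sending it would be: urllib.request.urlopen(request)
```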

Pros

Similar to the first proposal:
- Reduces merge train failures by detecting code incompatibilities during the development cycle.
- Shifts the responsibility of fixing code conflicts from release managers to engineers with domain expertise in the GitLab codebase.
- Supports the update of the GitLab security repository with minimum intervention.
- Prevents halting GitLab.com deployments and allows the release schedule to stay on course.
Safe from a security perspective: all the logic and involvement happens within the security release process, which is limited to GitLab team members.

Cons

It could delay the release of a high-priority security fix if it is incompatible with the GitLab canonical code.
Impacts the FedRAMP SLA by prioritizing the canonical changes instead of security fixes.
It is not efficient: Git traffic on GitLab canonical is higher than on GitLab security, so the check catches conflicts that exist at merge time but not those that appear later, when the merge train runs.
Increases administrative security tasks for GitLab engineers: Fixing a job failure on a security merge request requires requesting additional code owner approvals and waiting for long pipelines.

Edited Mar 05, 2024 by Mayra Cabrera