Temporarily bypass CI_PRE_CLONE_SCRIPT for gitlab-org/gitlab
Production Change
Change Summary
Part of scalability#931 (closed).
As discussed in scalability#931 (closed), we have a packfile cache for gitlab-org/gitlab. The ultimate goal of &372 (closed) is to handle gitlab-org/gitlab CI clone traffic without the existing CI_PRE_CLONE_SCRIPT. We are still experimenting with how the system reacts to the cache, and we would like to test if and how the current iteration of the cache can handle gitlab-org/gitlab CI clone traffic.
To do this we will temporarily set the CI "git strategy" of gitlab-org/gitlab to "clone", which causes the runner to bypass the optimization provided by CI_PRE_CLONE_SCRIPT and perform a regular clone instead.
Change Details
- Services Impacted - Service::Git, Service::Gitaly
- Change Technician - @jacobvosmaer-gitlab
- Change Criticality - C3
- Change Type - changeunscheduled, changescheduled
- Change Reviewer - DRI for the review of this change
- Due Date - Date and time (in UTC) for the execution of the change
- Time tracking - Time, in minutes, needed to execute all change steps, including rollback
- Downtime Component - If there is a need for downtime, include downtime estimate here
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5
- Ensure the cache is enabled in gprd. Both queries should return the same number.
- Notify #quality and #g_delivery that we are experimenting with the pre-clone script.
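The enablement check above can be sketched as follows. This is a minimal, hypothetical example: the two actual queries are not reproduced here, and the response values are fabricated; it only assumes the standard Prometheus/Thanos instant-query response shape.

```python
# Hypothetical sketch: compare the results of the two enablement queries.
# Assumes the standard Prometheus/Thanos instant-query JSON shape; the
# example responses below are fabricated, not real production numbers.

def scalar_result(resp: dict) -> float:
    """Extract the single value from an instant-query response."""
    [series] = resp["data"]["result"]
    return float(series["value"][1])

# Fabricated example: both queries report the same count, so the
# packfile cache is considered enabled everywhere.
query_a = {"data": {"result": [{"value": [1620000000, "120"]}]}}
query_b = {"data": {"result": [{"value": [1620000000, "120"]}]}}
assert scalar_result(query_a) == scalar_result(query_b)
```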
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 45
- Set the CI "git strategy" to "git clone" in the project settings.
- Roll back to "git fetch".
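For reference, the strategy toggle (and its rollback) can also be driven via the GitLab Projects API rather than the settings UI. A minimal sketch, assuming the `build_git_strategy` project attribute (values `fetch` or `clone`); the request is only built here, not sent, and authentication is left to your HTTP client.

```python
# Hypothetical sketch: build the API request that changes the CI git
# strategy. Assumes the Projects API `build_git_strategy` attribute.
import urllib.parse

API = "https://gitlab.com/api/v4"

def git_strategy_request(project_path: str, strategy: str):
    """Return the PUT request (url, params) for changing the CI git strategy.

    Sketch only: send it with your HTTP client of choice and a token with
    sufficient access, e.g. requests.put(url, params=params, headers=...).
    """
    if strategy not in ("fetch", "clone"):
        raise ValueError("strategy must be 'fetch' or 'clone'")
    project_id = urllib.parse.quote(project_path, safe="")  # URL-encode the path
    return f"{API}/projects/{project_id}", {"build_git_strategy": strategy}

url, params = git_strategy_request("gitlab-org/gitlab", "clone")
# Rolling back is the same call with strategy="fetch".
```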
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 20
- Observe metrics
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5
- Roll back to "git fetch" in the project settings
Monitoring
Key metrics to observe
- Metric: Gitaly Apdex, Errors and Saturation
  - Location: https://dashboards.gitlab.net/d/gitaly-main/gitaly-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
  - What changes to this metric should prompt a rollback: apdex decrease, error increase, saturation increase
- Metric: Gitaly PackObjectsHook request rate and error rate
  - Location: https://dashboards.gitlab.net/d/000000199/gitaly-feature-status?orgId=1&refresh=5s&var-environment=gprd&var-method=PackObjectsHook&var-prometheus=prometheus-01-inf-gprd
  - What changes to this metric should prompt a rollback: non-zero error rate
- Metric: Disk write throughput
  - Location: custom query
  - What changes to this metric should prompt a rollback: notable increase
- Metric: packfile cache hit rate
  - Location: Thanos
- Metric: packfile cache disk usage
  - Location: Thanos
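The rollback triggers above can be summarized in a small check. This is an illustrative sketch only: the 0.75 hit-rate floor is an assumed placeholder rather than an agreed threshold, and the apdex/saturation signals from the dashboards are not modeled here.

```python
def cache_hit_rate(hits: float, misses: float) -> float:
    """Fraction of packfile cache lookups served from the cache."""
    total = hits + misses
    return hits / total if total else 0.0

# Assumed placeholder threshold for illustration; tune against real data.
HIT_RATE_FLOOR = 0.75

def should_roll_back(pack_objects_error_rate: float, hits: float, misses: float) -> bool:
    # A non-zero PackObjectsHook error rate always triggers rollback;
    # a collapsing hit rate is the assumed secondary signal.
    return pack_objects_error_rate > 0 or cache_hit_rate(hits, misses) < HIT_RATE_FLOOR
```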
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.