Temporarily bypass CI_PRE_CLONE_SCRIPT for gitlab-org/gitlab
Production Change
Change Summary
Part of scalability#931 (closed).
As discussed in scalability#931 (closed), we have a packfile cache for gitlab-org/gitlab. The ultimate goal of &372 (closed) is to handle gitlab-org/gitlab CI clone traffic without the existing CI_PRE_CLONE_SCRIPT. We are still experimenting with how the system reacts to the cache, and we would like to test if and how the current iteration of the cache can handle gitlab-org/gitlab CI clone traffic.
To do this we will temporarily set the CI "git strategy" of gitlab-org/gitlab to "clone", which causes the runner to bypass the optimization provided by CI_PRE_CLONE_SCRIPT and perform a regular clone instead.
Change Details
- Services Impacted - Service::Git, Service::Gitaly
- Change Technician - @jacobvosmaer-gitlab
- Change Criticality - C3
- Change Type - changeunscheduled, changescheduled
- Change Reviewer - DRI for the review of this change
- Due Date - Date and time (in UTC) for the execution of the change
- Time tracking - Time, in minutes, needed to execute all change steps, including rollback
- Downtime Component - If there is a need for downtime, include downtime estimate here
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5
- Ensure the cache is enabled in gprd. Both queries should return the same number.
- Notify #quality and #g_delivery that we are experimenting with the pre-clone script.
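The enablement check above can be sketched as follows. This is a minimal, hypothetical example: the two actual queries are not reproduced here, and the response values are fabricated; it only assumes the standard Prometheus/Thanos instant-query response shape.

```python
# Hypothetical sketch: compare the results of the two enablement queries.
# Assumes the standard Prometheus/Thanos instant-query JSON shape; the
# example responses below are fabricated, not real production numbers.

def scalar_result(resp: dict) -> float:
    """Extract the single value from an instant-query response."""
    [series] = resp["data"]["result"]
    return float(series["value"][1])

# Fabricated example: both queries report the same count, so the
# packfile cache is considered enabled everywhere.
query_a = {"data": {"result": [{"value": [1620000000, "120"]}]}}
query_b = {"data": {"result": [{"value": [1620000000, "120"]}]}}
assert scalar_result(query_a) == scalar_result(query_b)
```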
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 45
- Set the CI "git strategy" to "git clone" in the project settings.
- Roll back to "git fetch".
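For reference, the strategy toggle (and its rollback) can also be driven via the GitLab Projects API rather than the settings UI. A minimal sketch, assuming the `build_git_strategy` project attribute (values `fetch` or `clone`); the request is only built here, not sent, and authentication is left to your HTTP client.

```python
# Hypothetical sketch: build the API request that changes the CI git
# strategy. Assumes the Projects API `build_git_strategy` attribute.
import urllib.parse

API = "https://gitlab.com/api/v4"

def git_strategy_request(project_path: str, strategy: str):
    """Return the PUT request (url, params) for changing the CI git strategy.

    Sketch only: send it with your HTTP client of choice and a token with
    sufficient access, e.g. requests.put(url, params=params, headers=...).
    """
    if strategy not in ("fetch", "clone"):
        raise ValueError("strategy must be 'fetch' or 'clone'")
    project_id = urllib.parse.quote(project_path, safe="")  # URL-encode the path
    return f"{API}/projects/{project_id}", {"build_git_strategy": strategy}

url, params = git_strategy_request("gitlab-org/gitlab", "clone")
# Rolling back is the same call with strategy="fetch".
```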
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 20
- Observe metrics
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5
- Roll back to "git fetch" in the project settings
Monitoring
Key metrics to observe
- Metric: Gitaly Apdex, Errors and Saturation
  - Location: https://dashboards.gitlab.net/d/gitaly-main/gitaly-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
  - What changes to this metric should prompt a rollback: apdex decrease, error increase, saturation increase
- Metric: Gitaly PackObjectsHook request rate and error rate
  - Location: https://dashboards.gitlab.net/d/000000199/gitaly-feature-status?orgId=1&refresh=5s&var-environment=gprd&var-method=PackObjectsHook&var-prometheus=prometheus-01-inf-gprd
  - What changes to this metric should prompt a rollback: non-zero error rate
- Metric: Disk write throughput
  - Location: custom query
  - What changes to this metric should prompt a rollback: notable increase
- Metric: packfile cache hit rate
  - Location: Thanos
- Metric: packfile cache disk usage
  - Location: Thanos
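The rollback triggers above can be summarized in a small check. This is an illustrative sketch only: the 0.75 hit-rate floor is an assumed placeholder rather than an agreed threshold, and the apdex/saturation signals from the dashboards are not modeled here.

```python
def cache_hit_rate(hits: float, misses: float) -> float:
    """Fraction of packfile cache lookups served from the cache."""
    total = hits + misses
    return hits / total if total else 0.0

# Assumed placeholder threshold for illustration; tune against real data.
HIT_RATE_FLOOR = 0.75

def should_roll_back(pack_objects_error_rate: float, hits: float, misses: float) -> bool:
    # A non-zero PackObjectsHook error rate always triggers rollback;
    # a collapsing hit rate is the assumed secondary signal.
    return pack_objects_error_rate > 0 or cache_hit_rate(hits, misses) < HIT_RATE_FLOOR
```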
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.