Enable Gitaly packfile cache for gitlab-com/www-gitlab-com
Production Change
Change Summary
Part of scalability#931 (closed).
In &372 (closed) we are developing a caching mechanism in Gitaly meant to reduce the CPU and RAM consumption caused by massively parallel CI Git fetch workloads. A first iteration of this cache has been deployed behind a feature flag. The next step is to investigate the real-world impact of the cache and to see whether its design needs further iteration.
As discussed in scalability#931 (closed), we have already turned this flag on for various projects. We now want to turn it on for gitlab-com/www-gitlab-com.
Change Details
- Services Impacted - ServiceGit, ServiceGitaly
- Change Technician - @jacobvosmaer-gitlab
- Change Criticality - C3
- Change Type - changeunscheduled, changescheduled
- Change Reviewer - DRI for the review of this change
- Due Date - Date and time (in UTC) for the execution of the change
- Time tracking - Time, in minutes, needed to execute all change steps, including rollback
- Downtime Component - If there is a need for downtime, include downtime estimate here
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5
- Ensure the cache is enabled in gprd. Both queries should return the same number.
- Notify #whats-happening-at-gitlab that we are toggling a feature flag on gitlab-com/www-gitlab-com.
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 20
- /chatops run feature set gitaly_upload_pack_gitaly_hooks true --project gitlab-com/www-gitlab-com
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 20
- git clone --bare --depth=1 https://gitlab.com/gitlab-com/www-gitlab-com.git
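The verification clone above can be repeated and timed. Below is a minimal Python sketch, assuming `git` is on the PATH and the repository is reachable. The helper names `build_clone_command` and `timed_clones` are hypothetical and not part of any GitLab tooling. A successful clone only shows that the fetch path still works with the flag on; whether a given request was actually served from the cache has to be read from the metrics under Monitoring.

```python
import subprocess
import tempfile
import time
from pathlib import Path

# URL taken from the post-change step above.
REPO_URL = "https://gitlab.com/gitlab-com/www-gitlab-com.git"


def build_clone_command(repo_url: str, dest: str) -> list[str]:
    """Return the same shallow bare clone used in the post-change step."""
    return ["git", "clone", "--bare", "--depth=1", repo_url, dest]


def timed_clones(repo_url: str = REPO_URL, attempts: int = 3,
                 run=subprocess.run) -> list[float]:
    """Clone `attempts` times into fresh temporary directories.

    Returns the wall-clock duration of each attempt in seconds.
    `run` is injectable so the function can be exercised without network access.
    """
    durations = []
    for i in range(attempts):
        with tempfile.TemporaryDirectory() as tmp:
            dest = str(Path(tmp) / f"clone-{i}.git")
            start = time.monotonic()
            # check=True raises CalledProcessError if git exits non-zero.
            run(build_clone_command(repo_url, dest), check=True)
            durations.append(time.monotonic() - start)
    return durations
```

Calling `timed_clones()` with the defaults performs three real clones of gitlab-com/www-gitlab-com. Any exception means the post-change verification failed and the rollback step should be considered.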
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5
- /chatops run feature delete gitaly_upload_pack_gitaly_hooks
Monitoring
Key metrics to observe
- Metric: Gitaly Apdex, Errors and Saturation
  - Location: https://dashboards.gitlab.net/d/gitaly-main/gitaly-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
  - What changes to this metric should prompt a rollback: apdex decrease, error increase, saturation increase
- Metric: Gitaly PackObjectsHook request rate and error rate
  - Location: https://dashboards.gitlab.net/d/000000199/gitaly-feature-status?orgId=1&refresh=5s&var-environment=gprd&var-method=PackObjectsHook&var-prometheus=prometheus-01-inf-gprd
  - What changes to this metric should prompt a rollback: non zero error rate
- Metric: Disk space utilization
  - Location: custom query
  - What changes to this metric should prompt a rollback: notable increase
- Metric: Disk write throughput
  - Location: custom query
  - What changes to this metric should prompt a rollback: ??
- Metric: packfile cache hit rate
  - Location: Thanos
- Metric: packfile cache disk usage
  - Location: Thanos
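The rollback criteria listed above can be written down as a small check. This is a sketch only, in Python: the `Snapshot` fields, the `tolerance` parameter, and both function names are hypothetical, and the values are assumed to be read manually from the dashboards or Thanos rather than fetched by this code. The hit ratio is taken to be hits divided by total lookups, which is an assumption about how the cache counters are exposed. The two disk metrics are left out because their thresholds ("notable increase", "??") are not quantified.

```python
from dataclasses import dataclass


def cache_hit_ratio(hits: float, misses: float) -> float:
    """Fraction of cache lookups that were hits; 0.0 when there was no traffic."""
    total = hits + misses
    return hits / total if total > 0 else 0.0


@dataclass
class Snapshot:
    """Hypothetical values read off the dashboards at one point in time."""
    apdex: float            # 0.0 to 1.0, higher is better
    error_rate: float       # Gitaly errors per second
    saturation: float       # 0.0 to 1.0, higher is worse
    hook_error_rate: float  # PackObjectsHook errors per second


def rollback_reasons(before: Snapshot, after: Snapshot,
                     tolerance: float = 0.0) -> list[str]:
    """Return the listed rollback criteria that `after` violates.

    `tolerance` is an assumed allowance for normal noise; the change
    request itself does not define one.
    """
    reasons = []
    if after.apdex < before.apdex - tolerance:
        reasons.append("apdex decrease")
    if after.error_rate > before.error_rate + tolerance:
        reasons.append("error increase")
    if after.saturation > before.saturation + tolerance:
        reasons.append("saturation increase")
    # The PackObjectsHook criterion is absolute: any errors at all.
    if after.hook_error_rate > 0:
        reasons.append("non zero PackObjectsHook error rate")
    return reasons
```

An empty list means none of the quantified criteria were hit. It does not replace watching disk space utilization and disk write throughput by hand.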
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.