
Remove --no-tags from GIT_FETCH_EXTRA_FLAGS in gitlab-org/gitlab CI settings

Production Change

Change Summary

We want to make a CI configuration change in the project settings of gitlab-org/gitlab. The goal is to revert #3278 (closed) by removing the --no-tags option from GIT_FETCH_EXTRA_FLAGS.

As part of &400 (closed) we discovered an inefficiency in the way Git serves fetch traffic. It is particularly visible on gitlab-org/gitlab, because that repo has many refs and sees many fetches from CI. In #3278 (closed) we made a CI config change that caused CI runners to run a different git fetch command, which reduced pressure on the Gitaly server file-cny-01 that hosts gitlab-org/gitlab.

Since then, we have identified the cause of the performance problem and worked with the Git maintainers to fix it in Git itself. These Git performance fixes are in production now, so we should no longer need --no-tags. Removing that config setting would confirm that this is indeed the case.

Change Details

  1. Services Impacted - ServiceGit, ServiceGitaly
  2. Change Technician - @jacobvosmaer-gitlab
  3. Change Criticality - C3
  4. Change Type - changescheduled
  5. Change Reviewer - @ahmadsherif
  6. Due Date - 2020-03-03 16:00 UTC
  7. Time tracking - 15 minutes
  8. Downtime Component - N/A

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 10 minutes

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 1 minute

  • As an admin, browse to https://gitlab.com/gitlab-org/gitlab/-/settings/ci_cd
  • Expand the Variables section, and click the edit button on GIT_FETCH_EXTRA_FLAGS
  • Change the value from --prune --progress --no-tags to --prune --progress
    • If the current value is not exactly --prune --progress --no-tags before the change, verify whether this change is still valid (and consider stopping)
  • Click "Update variable"
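The guard in the change steps above (only proceed if the current value is exactly what we expect) can be sketched as a small helper. This is an illustrative sketch, not part of the change procedure: the function name plan_new_value and the constant EXPECTED_CURRENT are hypothetical, and the actual change is still made by hand in the settings UI.

```python
# Hypothetical helper: names are illustrative, not part of any GitLab tooling.
EXPECTED_CURRENT = "--prune --progress --no-tags"


def plan_new_value(current: str,
                   flag: str = "--no-tags",
                   expected: str = EXPECTED_CURRENT) -> str:
    """Return `current` with `flag` removed.

    Raises ValueError if `current` is not exactly the expected value,
    mirroring the "verify and consider stopping" step.
    """
    if current != expected:
        raise ValueError(
            f"unexpected current value {current!r}; "
            "verify whether this change is still valid"
        )
    # Drop the flag and rejoin the remaining flags with single spaces.
    return " ".join(part for part in current.split() if part != flag)


if __name__ == "__main__":
    print(plan_new_value("--prune --progress --no-tags"))
```

Calling the helper with any other current value raises instead of returning a new value, which is the point at which the technician would stop and re-check the change.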

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 10 minutes

  • Monitor the metrics listed in the Monitoring section; the change is expected to improve performance, with no other change in behavior.

Rollback

Rollback steps - steps to take if this change needs to be rolled back

Estimated Time to Complete (mins) - 1 minute

  • Repeat the change, reverting to the original value of GIT_FETCH_EXTRA_FLAGS

Monitoring

Key metrics to observe

  • Metric: gitlab_runner_failed_jobs_total
    • Location: Thanos
    • What changes to this metric should prompt a rollback: A noticeable rise. There is always a low rate of failed jobs, but a noticeable increase may suggest this change was problematic.
  • Metric: CPU usage on file-cny-01
    • Location: Host stats dashboard
    • What changes to this metric should prompt a rollback: A noticeable rise. Unfortunately the base metric is bursty, so it would require a substantial and obvious change. We're hoping this will go down, but it may be hard to identify this in the short term; it will be more visible over several hours, so we're just initially monitoring for degradation.
  • Metric: gitaly apdex/errors
  • Metric: git-upload-pack CPU usage
    • Location: git-upload-pack CPU usage on file-cny-01.
    • What changes to this metric should prompt a rollback: A rise. We actually expect it to go down, although failing to go down would not be sufficient reason to roll back immediately.
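The "noticeable rise" criterion above is a judgment call, but it can be approximated by comparing samples taken before and after the change. The following is a minimal sketch under stated assumptions: the function name should_roll_back and the factor of 2.0 are hypothetical choices for illustration, not thresholds defined by this change request, and the samples would have to be read from Thanos or the host dashboard separately.

```python
from statistics import mean
from typing import Sequence


def should_roll_back(baseline: Sequence[float],
                     after: Sequence[float],
                     factor: float = 2.0) -> bool:
    """Return True if the post-change mean exceeds the baseline mean by `factor`.

    `factor` is an assumed threshold; bursty metrics (such as CPU usage)
    may need a larger factor or a longer sampling window.
    """
    if not baseline or not after:
        raise ValueError("both sample windows must be non-empty")
    return mean(after) > mean(baseline) * factor


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    print(should_roll_back([1, 2, 1], [5, 6, 5]))
    print(should_roll_back([1, 2, 1], [1, 1, 2]))
```

A check like this only flags degradation. A metric that merely fails to go down does not trigger it, which matches the guidance that a lack of improvement is not by itself a reason to roll back.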

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.
Edited by Jacob Vosmaer