Create new gitaly storage shard nodes file-85-stor-gprd to file-89-stor-gprd for storing new projects

Production Change - Criticality 3 C3

Change Objective: Increase capacity for new project repository storage
Change Type: Add additional infrastructure instances
Services Impacted: ~Service::gitaly
Change Team Members: Username of the engineers involved in the change
Change Severity: ~C3
Buddy check or tested in staging: Username of a colleague who will review the change, or evidence that the change was tested in the staging environment
Schedule of the change: Date and time in UTC for the execution of the change; if possible, add the local timezone of the engineer executing the change
Duration of the change: 2 hours
Detailed steps for the change: See the Summary section below

Meta

  • Replace all occurrences of "XX" with the new gitaly shard node number when executing commands.
  • Set the title of this production change issue to: Create new gitaly storage shard node file-XX-stor-gprd for storing new projects
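
The "XX" substitution can also be scripted so that every name derives from one shard number. The following is a minimal sketch, not part of any GitLab tooling: the function names `gitaly_node_fqdn` and `gitaly_storage_name` are hypothetical, and the naming patterns are copied from the commands used later in this issue.

```shell
# Hypothetical helpers: derive the names used throughout this issue from the
# shard number, instead of hand-editing every "XX".

# Print the node FQDN for a shard number, e.g. 85.
gitaly_node_fqdn() {
  printf 'file-%s-stor-gprd.c.gitlab-production.internal\n' "$1"
}

# Print the storage name for a shard number, as it appears in the chef role.
gitaly_storage_name() {
  printf 'nfs-file%s\n' "$1"
}

# Example for shard 85:
shard=85
node="$(gitaly_node_fqdn "$shard")"
destination_storage_name="$(gitaly_storage_name "$shard")"
echo "$node"
echo "$destination_storage_name"
```

Setting `node` and `destination_storage_name` this way matches the variables that the later steps export by hand.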

Summary

Detailed steps for the change

The following are the detailed steps for the change.

Note: These steps do not apply to Praefect systems.

Build the new VM instance

  • pre-conditions for execution of the step
  • execution commands for the step
    • Optionally, check quotas before applying the terraform changes. You can check with:
    gcloud --project='gitlab-production' compute regions describe us-east1 --format=json | jq -c '.quotas[] | select(.limit > 0) | select(.usage / .limit > 0.5) | { metric, limit, usage }'
    • Merge the MR.
    • Click the play button on the gprd apply pipeline stage.
  • post-execution validation for the step
    • Examine the gprd apply pipeline stage output and confirm the absence of relevant errors.
  • rollback of the step
    • Revert the MR.

Ensure the creation of the storage directory

Once the gitaly node is created, it will take a few minutes for chef to run on the system, so it may not be immediately available.

  • pre-conditions for execution of the step
    • Make sure chef-client runs without any errors.
    export node='file-XX-stor-gprd.c.gitlab-production.internal'
    bundle exec knife ssh "fqdn:$node" "sudo grep 'Chef Client finished' /var/log/syslog | tail -n 1"
  • execution commands for the step
    • If chef does not converge after 10 minutes or so, invoke it manually. If chef refuses to run, something is wrong and this procedure should be rolled back.
    bundle exec knife ssh "fqdn:$node" "sudo chef-client"
    • Confirm storage directory /var/opt/gitlab/git-data/repositories exists on the file system of the new node.
    bundle exec knife ssh "fqdn:$node" "sudo df -hT /var/opt/gitlab/git-data/repositories && sudo ls -la /var/opt/gitlab/git-data/ && sudo ls -la /var/opt/gitlab/git-data/repositories | head"
  • post-execution validation for the step
    • Confirm that the gitaly service is running
    bundle exec knife ssh "fqdn:$node" "sudo gitlab-ctl status gitaly"
    • Confirm that there are no relevant errors in the logs.
    bundle exec knife ssh "fqdn:$node" "sudo grep -i 'error' /var/log/gitlab/gitaly/current | tail"
  • rollback of the step
    • No rollback procedure for this step is necessary.
    • This step only confirms and verifies steps taken so far.
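
Waiting "10 minutes or so" for chef to converge can be expressed as a bounded polling loop. This is a sketch under assumptions: `retry_until` and `chef_finished` are hypothetical helper names, and the check reuses the syslog grep from the pre-conditions above, matching on the local side so that it does not depend on how knife ssh reports remote exit codes.

```shell
# Hypothetical helper: run a command repeatedly until it succeeds or the
# attempts run out. Exits 0 on success, 1 if every attempt failed.
# usage: retry_until <attempts> <sleep_seconds> <command...>
retry_until() (
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      exit 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  exit 1
)

# Succeeds once syslog on the new node shows a finished chef run.
# Expects $node to be exported as in the pre-conditions.
chef_finished() {
  bundle exec knife ssh "fqdn:$node" \
    "sudo grep 'Chef Client finished' /var/log/syslog | tail -n 1" \
    | grep -q 'Chef Client finished'
}

# Poll once a minute, 10 attempts (roughly 10 minutes):
# retry_until 10 60 chef_finished || echo "chef has not converged; invoke it manually"
```

The final call is left commented out so that the snippet can be sourced without contacting the node.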

Configure the GitLab application so that it is aware of the new node

Configure the GitLab application to include the new node. Note: The GitLab application will consider the new node to be disabled by default.

  • pre-conditions for execution of the step

  • execution commands for the step

    • Notify the Engineer On-call about the planned change.
    • Create a silence for the GitalyServiceGoserverTrafficAbsentSingleNode alert, which is raised if the new Gitaly server(s) do not receive enough traffic for 30 minutes. Reference: an instance of this alert raised in the past.
    • Merge the MR.
    • Examine the output of the apply_to_prod job in the ops.gitlab.net pipeline to verify that the change was applied successfully and without errors.
  • post-execution validation for the step

    • Verify that the chef role contains the change:
    $ bundle exec knife role show gprd-base-stor-gitaly-common | grep -A1 'nfs-fileXX'
            name: nfs-fileXX
            path: /var/opt/gitlab/git-data/repositories
    • Wait 30-35 minutes for the nodes to converge naturally. Under normal circumstances, chef-client runs every 30 minutes (plus up to 5 minutes). Verify by checking node status (ignore patroni/postgres servers in the list):
    bundle exec knife status "roles:gprd-base-stor-gitaly-common" --run-list
    • Optionally, if you cannot wait for natural convergence, force chef-client to run on the relevant nodes. This takes a very long time, so waiting for natural convergence is usually the better choice:
    bundle exec knife ssh -C 3 "roles:gprd-base-stor-gitaly-common" "sudo chef-client"
  • rollback of the step

    • Revert the MR.
    • Check the apply_to_prod ops.gitlab.net pipeline to see if the change successfully applied.
    • Re-run the commands in the post-execution validation for the step

Add the new Gitaly node to all our Kubernetes container configuration

  • pre-conditions for execution of the step

      - hostname: gitaly-01-sv-pre.c.gitlab-pre.internal
        name: default
        port: "9999"
        tlsEnabled: false
  • execution commands for the step

    • Notify the Engineer On-call about the planned change and seek approval, to ensure that no other deployment (as announced in #announcements) is ongoing at the time.
    • Merge the MR.
    • Examine the pipeline stage output to verify that there were no errors.
  • rollback of the step

    • Revert the MR.
    • Re-run the execution step for a roll-back.

Test the new node

Confirm that the new storage node is operational.

  • pre-conditions for execution of the step
    • Export your gitlab.com user auth token as an environment variable in your shell session.
    export GITLAB_COM_API_PRIVATE_TOKEN='CHANGEME'
    • Also export your gitlab.com admin user auth token as an environment variable in your shell session.
    export GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN='CHANGEME'
  • execution commands for the step
    • Create a new project:
    export project_name='nfs-fileXX-test'
    rm -f "/tmp/project-${project_name}.json"
    curl --silent --show-error --request POST "https://gitlab.com/api/v4/projects?name=${project_name}&default_branch=main" --header "Private-Token: ${GITLAB_COM_API_PRIVATE_TOKEN}" > "/tmp/project-${project_name}.json"
    export project_id=$(cat "/tmp/project-${project_name}.json" | jq -r '.id')
    export ssh_url_to_repo=$(cat "/tmp/project-${project_name}.json" | jq -r '.ssh_url_to_repo')
    • Clone the project.
    git clone "${ssh_url_to_repo}" "/tmp/${project_name}"
    • Add, commit, and push a README file to the project repository.
    echo "# ${project_name}" > "/tmp/${project_name}/README.md"
    pushd "/tmp/${project_name}" && git add "/tmp/${project_name}/README.md" && git commit -am "Add README" && git push origin main && popd
    • Move the project to the new storage shard:
    export destination_storage_name='nfs-fileXX'
    export move_id=$(curl --silent --show-error --request POST "https://gitlab.com/api/v4/projects/${project_id}/repository_storage_moves" --data "{\"destination_storage_name\": \"${destination_storage_name}\"}" --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" --header 'Content-Type: application/json' | jq -r '.id')
    • Optionally poll the api to monitor the state of the move:
    curl --silent --show-error "https://gitlab.com/api/v4/projects/${project_id}/repository_storage_moves/${move_id}" --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" | jq -r '.state'
    • Optionally confirm the new location:
    curl --silent --show-error "https://gitlab.com/api/v4/projects/${project_id}" --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" | jq -r '.repository_storage'
    • Once the project has finished being moved to the new shard, proceed to add, commit, and push an update to the README:
    echo -e "\n\ntest" >> "/tmp/${project_name}/README.md"
    pushd "/tmp/${project_name}" && git add "/tmp/${project_name}/README.md" && git commit -am "Update README to test nfs-fileXX" && git push origin main && popd
    • Verify that the changes were persisted as expected:
    rm -rf "/tmp/${project_name}"
    git clone "${ssh_url_to_repo}" "/tmp/${project_name}"
    grep -x 'test' "/tmp/${project_name}/README.md"
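
The optional polling of the move state can be wrapped in a loop that stops on a terminal state. This is a sketch, not an official client: `wait_for_move` and `move_state` are hypothetical names, and the terminal state names `finished`, `failed`, and `cleanup failed` are assumptions that should be confirmed against the repository storage moves API documentation.

```shell
# Hypothetical helper: poll a command that prints the move state until the
# move reaches a terminal state or the attempts run out.
# usage: wait_for_move <attempts> <sleep_seconds> <state_command...>
# Exits 0 when finished, 1 when the move failed, 2 when giving up.
wait_for_move() (
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    state="$("$@")"
    case "$state" in
      finished)
        echo "move finished"; exit 0 ;;
      failed|"cleanup failed")   # assumed terminal failure states
        echo "move ended in state: $state" >&2; exit 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "gave up waiting; last state: $state" >&2
  exit 2
)

# Prints the current state, using the same request as the polling step above.
# Expects project_id, move_id and the admin token to be exported already.
move_state() {
  curl --silent --show-error \
    "https://gitlab.com/api/v4/projects/${project_id}/repository_storage_moves/${move_id}" \
    --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" | jq -r '.state'
}

# Poll every 10 seconds, up to 30 times:
# wait_for_move 30 10 move_state
```

Any other state is treated as "still in progress", so an unexpected state name ends in the give-up branch instead of a false success.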

Enable the new node in GitLab

Enabling new nodes in the GitLab admin console requires an admin account, because it changes where new projects are stored. Under Admin Area > Settings > Repository > Repository storage > Expand, you will see a list of storage nodes. The checked nodes are the ones that will receive new projects. For more information, see the GitLab docs.

  • execution commands for the step

  • post-execution validation for the step

    • Take a count of how many projects are being created on the new shard:
    export node='file-XX-stor-gprd.c.gitlab-production.internal'
    bundle exec knife ssh "fqdn:$node" "sudo find /var/opt/gitlab/git-data/repositories/@hashed -mindepth 2 -maxdepth 3 -name '*.git' | wc -l"
    • Observe that this number goes up over time.
  • post-execution validation for the step

    • Take a count of how many projects are being created on the old shard:
    export node='file-YY-stor-gprd.c.gitlab-production.internal'
    bundle exec knife ssh "fqdn:$node" "sudo find /var/opt/gitlab/git-data/repositories/@hashed -mindepth 2 -maxdepth 3 -name '*.git' | wc -l"
    • Observe that this number never goes up over time. (Either goes down or does not change.)
    • Delete the silence created for the GitalyServiceGoserverTrafficAbsentSingleNode alert in the steps above.
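
Observing whether a repository count goes up over time amounts to taking two samples and comparing them. A minimal sketch, with hypothetical helper names `sample_delta` and `repo_count`; it assumes knife ssh prefixes each output line with the host name, so the count is taken from the last field.

```shell
# Hypothetical helper: run a counting command twice, <interval_seconds>
# apart, and print the difference (second sample minus first).
# usage: sample_delta <interval_seconds> <count_command...>
sample_delta() (
  interval="$1"; shift
  before="$("$@")"
  sleep "$interval"
  after="$("$@")"
  echo "$((after - before))"
)

# Prints the number of *.git directories on $node.
# Assumes the count is the last whitespace-separated field of the output.
repo_count() {
  bundle exec knife ssh "fqdn:$node" \
    "sudo find /var/opt/gitlab/git-data/repositories/@hashed -mindepth 2 -maxdepth 3 -name '*.git' | wc -l" \
    | awk '{print $NF}'
}

# Sample ten minutes apart. Expect a positive number on the new shard and
# zero or a negative number on the old shard:
# sample_delta 600 repo_count
```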
Edited by Furhan Shabir