Create new gitaly storage shard nodes `file-69-stor-gprd` and `file-70-stor-gprd`

Production Change - Criticality 3 C3

Change Objective: Increase capacity for new project repository storage
Change Type: Add additional infrastructure instances
Services Impacted: ~"Service::Gitaly"
Change Team Members: @craig
Change Severity: ~C3
Buddy check or tested in staging: @nnelson @devin
Schedule of the change: 2022-03-24 2100UTC/1400PDT
Duration of the change: 2 hours
Detailed steps for the change. Each step must include: See below: Summary
Production change requires commented manager approval: N/A (C3 change)

Meta

  • Replace all occurrences of "XX" with the new gitaly shard node number (69).
  • Set the title of this production change issue to: Create new gitaly storage shard nodes file-69-stor-gprd and file-70-stor-gprd
  • Add labels by adding a comment with the following command: /label ~Infrastructure ~C3 ~change ~"requires production access" ~"Service::Gitaly"
  • Replace the first line of the Summary section below as directed.
  • Acquire commented manager approval.

Summary

Originating incident: 2022-03-18: Number of Gitaly shards (for new re... (#6649 - closed)

Detailed steps for the change

The following are the detailed steps for the change.

Note: These steps do not apply to Praefect systems.

Build the new VM instance

  • pre-conditions for execution of the step
  • execution commands for the step
    • Optionally, check quotas before applying the terraform changes.
    % gcloud --project='gitlab-production' compute regions describe us-east1 --format=json | jq -c '.quotas[] | select(.limit > 0) | select(.usage / .limit > 0.5) | { metric, limit, usage }'
    {"metric":"CPUS","limit":15000,"usage":9034}
    {"metric":"DISKS_TOTAL_GB","limit":350000,"usage":253190}
    {"metric":"STATIC_ADDRESSES","limit":400,"usage":282}
    {"metric":"SSD_TOTAL_GB","limit":2500000,"usage":1922534}
    {"metric":"INTERNAL_ADDRESSES","limit":250,"usage":186}
    • Merge the MR.
    • Notify the Engineer On-call about the planned change.
    • Click the apply-to-prod pipeline stage play button.
  • post-execution validation for the step
    • Examine the gprd apply pipeline stage output and confirm the absence of relevant errors.
  • rollback of the step
    • Revert the MR.
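
The optional quota check above only lists metrics that are more than half used. As a rough sketch, the same data can be summarized as a percentage per metric so that the remaining headroom is easier to judge before the apply. The helper name `quota_report` and the 75% warning threshold are assumptions made for illustration and are not part of this change. The helper expects tab-separated `metric`, `limit`, `usage` rows on stdin.

```shell
#!/bin/sh
# Hypothetical helper: summarize regional quota usage as percentages.
# Input: tab-separated "metric<TAB>limit<TAB>usage" rows on stdin.
# The warning threshold (default 75) is an illustrative assumption.
quota_report() {
  threshold="${1:-75}"
  awk -F '\t' -v t="$threshold" '
    $2 > 0 {
      pct = 100 * $3 / $2
      flag = (pct >= t) ? "WARN" : "ok"
      printf "%s %.1f%% %s\n", $1, pct, flag
    }'
}

# Example usage (requires gcloud and jq, as in the step above):
# gcloud --project='gitlab-production' compute regions describe us-east1 --format=json \
#   | jq -r '.quotas[] | select(.limit > 0) | [.metric, .limit, .usage] | @tsv' \
#   | quota_report 75
```

With the sample figures shown in the step, `CPUS` would be reported at roughly 60% and `SSD_TOTAL_GB` at roughly 77%.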

Ensure the creation of the storage directory

Once the Gitaly nodes are created, Chef takes a few minutes to run on them, so they may not be available immediately.

  • pre-conditions for execution of the step
    • Temporarily override the node attribute omnibus-gitlab.package.enable on each of the new nodes so that the gitlab-ee package can be installed by Chef
    # Set `omnibus-gitlab.package.enable` to be `true` for the new nodes 
    for i in 69 70; do
      bundle exec knife node edit file-${i}-stor-gprd.c.gitlab-production.internal
    done
    • Make sure chef-client runs without any errors.
    # file-69
    export node='file-69-stor-gprd.c.gitlab-production.internal'
    bundle exec knife ssh "fqdn:$node" "sudo grep 'Chef Client finished' /var/log/syslog | tail -n 1"
    
    # file-70
    export node='file-70-stor-gprd.c.gitlab-production.internal'
    bundle exec knife ssh "fqdn:$node" "sudo grep 'Chef Client finished' /var/log/syslog | tail -n 1"
  • execution commands for the step
    • If chef does not converge after 5 minutes or so, invoke it manually. If chef refuses to run, something is wrong and this procedure should be rolled back.
    bundle exec knife ssh "fqdn:$node" "sudo chef-client"
    • Confirm storage directory /var/opt/gitlab/git-data/repositories exists on the file system of the new node.
    bundle exec knife ssh "fqdn:$node" "sudo df -hT /var/opt/gitlab/git-data/repositories && sudo ls -la /var/opt/gitlab/git-data/ && sudo ls -la /var/opt/gitlab/git-data/repositories | head"
    • Remove node attribute overrides once gitlab-ee package has been installed
    # Remove `omnibus-gitlab.package.enable` override on the new nodes 
    for i in 69 70; do
      bundle exec knife node edit file-${i}-stor-gprd.c.gitlab-production.internal
    done
  • post-execution validation for the step
    • Confirm that the gitaly service is running
    bundle exec knife ssh "fqdn:$node" "sudo gitlab-ctl status gitaly"
    • Confirm that there are no relevant errors in the logs.
    bundle exec knife ssh "fqdn:$node" "sudo grep -i 'error' /var/log/gitlab/gitaly/current | tail"
  • rollback of the step
    • No rollback procedure for this step is necessary.
    • This step only confirms and verifies steps taken so far.
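
The commands in this step are run once per node by re-exporting `node`. A minimal sketch of a wrapper that repeats the validation checks for both nodes is shown below. The function names `gitaly_fqdn`, `run_on_node`, and `validate_nodes`, and the `DRY_RUN` switch, are hypothetical and only illustrate the pattern; with `DRY_RUN=1` the knife commands are printed rather than executed.

```shell
#!/bin/sh
# Hypothetical helpers: build the node FQDN and run one command on a node.
# With DRY_RUN=1 the knife command is printed instead of executed.
gitaly_fqdn() {
  printf 'file-%s-stor-gprd.c.gitlab-production.internal\n' "$1"
}

run_on_node() {
  node="$(gitaly_fqdn "$1")"
  shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'bundle exec knife ssh "fqdn:%s" "%s"\n' "$node" "$*"
  else
    bundle exec knife ssh "fqdn:$node" "$*"
  fi
}

# Run the post-execution validation checks from this step on both new nodes.
validate_nodes() {
  for i in 69 70; do
    run_on_node "$i" "sudo gitlab-ctl status gitaly"
    run_on_node "$i" "sudo grep -i 'error' /var/log/gitlab/gitaly/current | tail"
  done
}

# Example: DRY_RUN=1 validate_nodes
```

Printing the commands first makes it easy to review exactly what would run on each node before touching production.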

Configure the GitLab application so that it is aware of the new nodes

Configure the GitLab application to include the new nodes. Note: The GitLab application will consider the new nodes to be disabled by default.

  • pre-conditions for execution of the step
              "nfs-file69": {
                "path": "/var/opt/gitlab/git-data-file69",
                "gitaly_address": "tcp://file-69-stor-gprd.c.gitlab-production.internal:9999"
              },
              "nfs-file70": {
                "path": "/var/opt/gitlab/git-data-file70",
                "gitaly_address": "tcp://file-70-stor-gprd.c.gitlab-production.internal:9999"
              },
    • Have the MR reviewed by a colleague.
  • execution commands for the step
    • Merge the MR.
    • Notify the Engineer On-call about the planned change.
    • Check the Apply_to_prod ops.gitlab.net pipeline to see if the change successfully applied.
    • Examine the pipeline stage output to verify that there were no errors.
  • post-execution validation for the step
    • Force chef-client to run on the relevant nodes:
    bundle exec knife ssh -C 3 "roles:gprd-base-stor-gitaly-common" "sudo chef-client"
    • Optionally, in another shell session, also force chef-client to run on the nodes in the gprd-base-fe and gprd-base-be roles, or just wait for those nodes to converge naturally.
    bundle exec knife ssh -C 3 "roles:gprd-base-fe OR roles:gprd-base-be" "sudo chef-client"
    • Optionally have chef check for the change:
    $ bundle exec knife role show gprd-base-stor-gitaly-common | egrep -A1 'nfs-file(69|70)'
            name: nfs-file69
            path: /var/opt/gitlab/git-data/repositories
    ---
            name: nfs-file70
            path: /var/opt/gitlab/git-data/repositories
  • rollback of the step
    • Revert the MR.
    • Check the Apply_to_prod ops.gitlab.net pipeline to see if the change successfully applied.
    • Re-run the commands in the post-execution validation for the step
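
The optional role check above relies on reading the `egrep` output by eye. As a sketch under stated assumptions, the same check can be made to fail loudly when a storage is missing. The helper name `check_storages` is hypothetical, and it assumes the role output contains `name: <storage>` lines as in the sample shown in the step.

```shell
#!/bin/sh
# Hypothetical helper: confirm every expected storage name appears in the
# role output read from stdin. Returns non-zero if any name is missing.
check_storages() {
  role_output="$(cat)"
  missing=0
  for name in "$@"; do
    if printf '%s\n' "$role_output" | grep -Eq "name:[[:space:]]*${name}[[:space:]]*\$"; then
      echo "found: ${name}"
    else
      echo "MISSING: ${name}"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
# bundle exec knife role show gprd-base-stor-gitaly-common \
#   | check_storages nfs-file69 nfs-file70
```

The same call can be re-run after a rollback, where a `MISSING` result for both names would be the expected outcome.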

Add the new Gitaly nodes to all of our Kubernetes container configurations

  • pre-conditions for execution of the step

      - hostname: gitaly-01-sv-pre.c.gitlab-pre.internal
        name: default
        port: "9999"
        tlsEnabled: false
    • Have the MR reviewed by a colleague in Delivery
  • execution commands for the step

    • Merge the MR.
    • Examine the pipeline stage output to verify that there were no errors.
  • rollback of the step

    • Completing the execution tasks for this step will suffice as a rollback.
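
The entry shown in the pre-conditions is taken from the pre environment. A hedged sketch of what the equivalent entries for the new nodes might look like is generated below. The helper name `gitaly_k8s_entry` is hypothetical. The `name` value assumes the storage names used elsewhere in this change (`nfs-file69`, `nfs-file70`), the port is the one from the `gitaly_address` values above, and `tlsEnabled: false` is simply copied from the pre example, so confirm each value against the actual MR.

```shell
#!/bin/sh
# Hypothetical helper: print a Gitaly entry for one new node, following the
# shape of the pre example. All field values are assumptions to be checked.
gitaly_k8s_entry() {
  cat <<EOF
- hostname: file-$1-stor-gprd.c.gitlab-production.internal
  name: nfs-file$1
  port: "9999"
  tlsEnabled: false
EOF
}

# Example: print entries for both new nodes.
# for i in 69 70; do gitaly_k8s_entry "$i"; done
```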

Test the new nodes

Confirm that the new storage nodes are operational.

  • pre-conditions for execution of the step
    • Export your gitlab.com user auth token as an environment variable in your shell session.
    export GITLAB_COM_API_PRIVATE_TOKEN='CHANGEME'
    • Also export your gitlab.com admin user auth token as an environment variable in your shell session.
    export GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN='CHANGEME'
file-69-stor-gprd

  • execution commands for the step

    • Set up the environment:
    export project_name='nfs-file69-test'
    export destination_storage_name='nfs-file69'
    • Create a new project:
    rm -f "/tmp/project-${project_name}.json"
    curl --silent --show-error --request POST "https://gitlab.com/api/v4/projects?name=${project_name}&default_branch=main" --header "Private-Token: ${GITLAB_COM_API_PRIVATE_TOKEN}" > "/tmp/project-${project_name}.json"
    export project_id=$(cat "/tmp/project-${project_name}.json" | jq -r '.id')
    export ssh_url_to_repo=$(cat "/tmp/project-${project_name}.json" | jq -r '.ssh_url_to_repo')
    • Clone the project.
    git clone "${ssh_url_to_repo}" "/tmp/${project_name}"
    • Add, commit, and push a README file to the project repository.
    echo "# ${project_name}" > "/tmp/${project_name}/README.md"
    pushd "/tmp/${project_name}" && git add "/tmp/${project_name}/README.md" && git commit -am "Add README" && git push origin main && popd
    • Schedule a move of the project repository to the new shard:
    export move_id=$(curl --silent --show-error --request POST "https://gitlab.com/api/v4/projects/${project_id}/repository_storage_moves" --data "{\"destination_storage_name\": \"${destination_storage_name}\"}" --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" --header 'Content-Type: application/json' | jq -r '.id')
    • Optionally poll the api to monitor the state of the move:
    curl --silent --show-error "https://gitlab.com/api/v4/projects/${project_id}/repository_storage_moves/${move_id}" --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" | jq -r '.state'
    • Optionally confirm the new location:
    curl --silent --show-error "https://gitlab.com/api/v4/projects/${project_id}" --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" | jq -r '.repository_storage'
    • Once the project has finished being moved to the new shard, proceed to add, commit, and push an update to the README:
    echo -e "\n\ntest" >> "/tmp/${project_name}/README.md"
    pushd "/tmp/${project_name}" && git add "/tmp/${project_name}/README.md" && git commit -am "Update README to test ${destination_storage_name}" && git push origin main && popd
    • Verify that the changes were persisted as expected:
    rm -rf "/tmp/${project_name}"
    git clone "${ssh_url_to_repo}" "/tmp/${project_name}"
    grep 'test' "/tmp/${project_name}/README.md"
    • Once all tests have been completed, delete the test project
    curl --silent --show-error --request DELETE "https://gitlab.com/api/v4/projects/${project_id}" --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}"
file-70-stor-gprd
  • execution commands for the step

    • Set up the environment:
    export project_name='nfs-file70-test'
    export destination_storage_name='nfs-file70'
    • Create a new project:
    rm -f "/tmp/project-${project_name}.json"
    curl --silent --show-error --request POST "https://gitlab.com/api/v4/projects?name=${project_name}&default_branch=main" --header "Private-Token: ${GITLAB_COM_API_PRIVATE_TOKEN}" > "/tmp/project-${project_name}.json"
    export project_id=$(cat "/tmp/project-${project_name}.json" | jq -r '.id')
    export ssh_url_to_repo=$(cat "/tmp/project-${project_name}.json" | jq -r '.ssh_url_to_repo')
    • Clone the project.
    git clone "${ssh_url_to_repo}" "/tmp/${project_name}"
    • Add, commit, and push a README file to the project repository.
    echo "# ${project_name}" > "/tmp/${project_name}/README.md"
    pushd "/tmp/${project_name}" && git add "/tmp/${project_name}/README.md" && git commit -am "Add README" && git push origin main && popd
    • Schedule a move of the project repository to the new shard:
    export move_id=$(curl --silent --show-error --request POST "https://gitlab.com/api/v4/projects/${project_id}/repository_storage_moves" --data "{\"destination_storage_name\": \"${destination_storage_name}\"}" --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" --header 'Content-Type: application/json' | jq -r '.id')
    • Optionally poll the api to monitor the state of the move:
    curl --silent --show-error "https://gitlab.com/api/v4/projects/${project_id}/repository_storage_moves/${move_id}" --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" | jq -r '.state'
    • Optionally confirm the new location:
    curl --silent --show-error "https://gitlab.com/api/v4/projects/${project_id}" --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" | jq -r '.repository_storage'
    • Once the project has finished being moved to the new shard, proceed to add, commit, and push an update to the README:
    echo -e "\n\ntest" >> "/tmp/${project_name}/README.md"
    pushd "/tmp/${project_name}" && git add "/tmp/${project_name}/README.md" && git commit -am "Update README to test ${destination_storage_name}" && git push origin main && popd
    • Verify that the changes were persisted as expected:
    rm -rf "/tmp/${project_name}"
    git clone "${ssh_url_to_repo}" "/tmp/${project_name}"
    grep 'test' "/tmp/${project_name}/README.md"
    • Once all tests have been completed, delete the test project
    curl --silent --show-error --request DELETE "https://gitlab.com/api/v4/projects/${project_id}" --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}"
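
The tests above poll the move state by hand. A minimal sketch of a polling loop is shown below. The function names `wait_for_move` and `fetch_move_state` are hypothetical, and the loop assumes that `finished` and `failed` are the terminal values of `.state`; any other value keeps the loop polling until the attempt limit is reached. The state-fetching command is passed in as an argument so that it can be replaced with a stub.

```shell
#!/bin/sh
# Hypothetical helper: poll a state-reporting command until the storage move
# reaches a terminal state or the attempt limit is hit.
# Usage: wait_for_move <max_attempts> <sleep_seconds> <command...>
wait_for_move() {
  max_attempts="$1"
  sleep_seconds="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    state="$("$@")"
    echo "attempt ${attempt}: ${state}"
    case "$state" in
      finished) return 0 ;;   # assumed success state
      failed) return 1 ;;     # assumed failure state
    esac
    attempt=$((attempt + 1))
    sleep "$sleep_seconds"
  done
  echo "gave up after ${max_attempts} attempts"
  return 2
}

# Fetch the current move state, using the same request as the step above.
fetch_move_state() {
  curl --silent --show-error \
    "https://gitlab.com/api/v4/projects/${project_id}/repository_storage_moves/${move_id}" \
    --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" | jq -r '.state'
}

# Example: wait_for_move 30 10 fetch_move_state
```

Only continue with the README update once the loop returns successfully; a non-zero exit means the move failed or did not complete within the allotted attempts.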

Enable the new nodes in GitLab

Enabling the new nodes requires an admin account, because it changes where new projects are stored. In Admin Area > Settings > Repository > Repository storage > Expand, you will see a list of storage nodes. The checked nodes are the ones that will receive new projects. For more information, see the GitLab docs.

  • execution commands for the step
  • post-execution validation for the step
    • Take a count of how many projects are being created on the new shards:
    export node='file-69-stor-gprd.c.gitlab-production.internal'
    bundle exec knife ssh "fqdn:$node" "sudo find /var/opt/gitlab/git-data/repositories/@hashed -mindepth 2 -maxdepth 3 -name '*.git' | wc -l"
    
    export node='file-70-stor-gprd.c.gitlab-production.internal'
    bundle exec knife ssh "fqdn:$node" "sudo find /var/opt/gitlab/git-data/repositories/@hashed -mindepth 2 -maxdepth 3 -name '*.git' | wc -l"
    • Observe that this number goes up over time.
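
To make "goes up over time" concrete, two samples can be compared. The sketch below uses the same `find` expression as the validation step, with the pattern quoted so the shell does not expand it. The helper names `count_repos` and `count_growth` are hypothetical. `count_repos` runs against a local directory, so on a real node it would need to run there with sufficient privileges.

```shell
#!/bin/sh
# Hypothetical helper: count repositories under a hashed-storage root,
# using the same find expression as the validation step.
count_repos() {
  find "$1/@hashed" -mindepth 2 -maxdepth 3 -name '*.git' | wc -l | tr -d ' '
}

# Hypothetical helper: report whether a later count exceeds an earlier one.
count_growth() {
  before="$1"
  after="$2"
  if [ "$after" -gt "$before" ]; then
    echo "growing: +$((after - before))"
  else
    echo "not growing"
  fi
}

# Example, run on the node itself:
# before="$(count_repos /var/opt/gitlab/git-data/repositories)"
# sleep 600
# after="$(count_repos /var/opt/gitlab/git-data/repositories)"
# count_growth "$before" "$after"
```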
Edited by Craig Barrett