# Create new gitaly storage shard nodes file-85-stor-gprd to file-89-stor-gprd for storing new projects

Production Change - Criticality 3 ~C3
| Change Objective | Increase capacity for new project repository storage |
|---|---|
| Change Type | Add additional infrastructure instances |
| Services Impacted | ~Service::gitaly |
| Change Team Members | Username of the engineers involved in the change |
| Change Severity | ~C3 |
| Buddy check or tested in staging | Username of a colleague who will review the change or evidence the change was tested on staging environment |
| Schedule of the change | Date and time in UTC timezone for the execution of the change, if possible add the local timezone of the engineer executing the change |
| Duration of the change | 2 hours |
| Detailed steps for the change. Each step must include: | See below: Summary |
## Meta

- Replace all occurrences of "XX" with the new gitaly shard node number when executing commands.
- Set the title of this production change issue to: `Create new gitaly storage shard node file-XX-stor-gprd for storing new projects`
## Summary

- Detailed steps for the change:
  - Build the new VM instance
  - Ensure the creation of the storage directory
  - Tell the GitLab application about the new node
  - Roll out the new configurations
  - Test the new node
  - Enable the new node in GitLab
  - Disable the old node in GitLab
## Detailed steps for the change

The following are the detailed steps for the change.

Note: These steps do not apply to Praefect systems.
### Build the new VM instance

Pre-conditions for execution of the step:

- Create a new MR in the `config-mgmt` project.
  - The commit should increment the `"node_count" -> "default" -> "multizone-stor"` variable setting by the number of new gitaly shards being added, around line 533 of the file `environments/gprd/variables.tf`.
  - Using the new value of the `multizone-stor` field, change the MR title to: `Increment multi-zone storage nodes by [number of new gitaly shards] to [the new total]`
  - Here is an example MR to use for the title and description: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/4689
- Have the MR reviewed by a colleague.
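The `node_count` edit is a one-line change. As a rough sketch only (the exact variable layout in `environments/gprd/variables.tf` may differ; the numbers are illustrative for adding shards 85 through 89), the structure implied by the `"node_count" -> "default" -> "multizone-stor"` path would look something like:

```hcl
# Illustrative sketch, not the actual file contents.
variable "node_count" {
  type = map(map(number))
  default = {
    "default" = {
      # ... other storage classes ...
      "multizone-stor" = 89 # was 84; incremented by 5 for the new shards
    }
  }
}
```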
Execution commands for the step:

- Optionally, check quotas before applying the terraform changes:

  ```
  gcloud --project='gitlab-production' compute regions describe us-east1 --format=json \
    | jq -c '.quotas[] | select(.limit > 0) | select(.usage / .limit > 0.5) | { metric, limit, usage }'
  ```

- Merge the MR.
- Click the play button on the `apply gprd` pipeline stage.

Post-execution validation for the step:

- Examine the `gprd apply` pipeline stage output and confirm the absence of relevant errors.

Rollback of the step:

- Revert the MR.
### Ensure the creation of the storage directory

Once the gitaly node is created, it will take a few minutes for chef to run on the system, so it may not be immediately available.

Pre-conditions for execution of the step:

- Make sure `chef-client` runs without any errors:

  ```
  export node='file-XX-stor-gprd.c.gitlab-production.internal'
  bundle exec knife ssh "fqdn:$node" "sudo grep 'Chef Client finished' /var/log/syslog | tail -n 1"
  ```

Execution commands for the step:

- If chef does not converge after 10 minutes or so, invoke it manually. If chef refuses to run, something is wrong, and this procedure should be rolled back:

  ```
  bundle exec knife ssh "fqdn:$node" "sudo chef-client"
  ```

- Confirm that the storage directory `/var/opt/gitlab/git-data/repositories` exists on the file system of the new node:

  ```
  bundle exec knife ssh "fqdn:$node" "sudo df -hT /var/opt/gitlab/git-data/repositories && sudo ls -la /var/opt/gitlab/git-data/ && sudo ls -la /var/opt/gitlab/git-data/repositories | head"
  ```
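Rather than re-running the grep by hand, the convergence check can be wrapped in a small polling loop. This is only a sketch, not part of the runbook: `wait_for` is a hypothetical helper, and the knife invocation in the usage comment is the same one used in the pre-conditions above.

```shell
# Hypothetical helper: repeatedly run a command until its output contains a
# pattern, giving up after max_tries attempts spaced 30 seconds apart.
wait_for() {
  local pattern="$1"; shift
  local max_tries=20 tries=0
  until "$@" 2>/dev/null | grep -q "$pattern"; do
    tries=$((tries + 1))
    [ "$tries" -ge "$max_tries" ] && return 1
    sleep 30
  done
}

# Usage (same knife command as in the pre-conditions above):
# wait_for 'Chef Client finished' \
#   bundle exec knife ssh "fqdn:$node" \
#   "sudo grep 'Chef Client finished' /var/log/syslog | tail -n 1"
```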
Post-execution validation for the step:

- Confirm that the gitaly service is running:

  ```
  bundle exec knife ssh "fqdn:$node" "sudo gitlab-ctl status gitaly"
  ```

- Confirm that there are no relevant errors in the logs:

  ```
  bundle exec knife ssh "fqdn:$node" "sudo grep -i 'error' /var/log/gitlab/gitaly/current | tail"
  ```

Rollback of the step:

- No rollback procedure for this step is necessary; this step only confirms and verifies steps taken so far.
### Configure the GitLab application so that it is aware of the new node

Configure the GitLab application to include the new node. Note: the GitLab application will consider the new node to be disabled by default.

Pre-conditions for execution of the step:

- Create a new MR in the `chef-repo` project. The commit should consist of the following changes:
  - Update the `override_attributes.omnibus-gitlab.gitaly.storage` list items of file `roles/gprd-base-stor-gitaly-common.json`, adding item(s) similar to:

    ```
    { "name": "nfs-fileXX", "path": "/var/opt/gitlab/git-data/repositories" },
    ```

  - Update the `default_attributes.omnibus-gitlab.gitlab_rb.git_data_dirs` map entry of file `roles/gprd-base.json`, adding an entry similar to:

    ```
    "nfs-fileXX": { "path": "/var/opt/gitlab/git-data-fileXX", "gitaly_address": "tcp://file-XX-stor-gprd.c.gitlab-production.internal:9999" },
    ```

  - Here is an example MR to use for the title and description: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2654
- Have the MR reviewed by a colleague.
Execution commands for the step:

- Notify the Engineer On-call about the planned change.
- Create a silence for the `GitalyServiceGoserverTrafficAbsentSingleNode` alert, which will be raised if the new Gitaly server(s) do not receive enough traffic for 30 minutes. Reference of alert raised in the past.
- Merge the MR.
- Examine the pipeline stage output for the `apply_to_prod` job on the ops.gitlab.net pipeline to verify that the change was applied successfully and there were no errors.

Post-execution validation for the step:

- Verify the chef role to check for the change:

  ```
  bundle exec knife role show gprd-base-stor-gitaly-common | grep -A1 'nfs-fileXX'
  ```

  Expected output:

  ```
  name: nfs-fileXX
  path: /var/opt/gitlab/git-data/repositories
  ```

- Wait 30-35 minutes for the nodes to converge naturally. Under normal circumstances, chef-client runs periodically every 30 (plus up to 5) minutes. Verify by checking node status (ignore patroni/postgres servers in the list):

  ```
  bundle exec knife status "roles:gprd-base-stor-gitaly-common" --run-list
  ```

- Optionally, force `chef-client` to run on the relevant nodes. This takes a long time, so it is usually better to wait for natural convergence:

  ```
  bundle exec knife ssh -C 3 "roles:gprd-base-stor-gitaly-common" "sudo chef-client"
  ```

Rollback of the step:

- Revert the MR.
- Check the `apply_to_prod` job on the ops.gitlab.net pipeline to see if the change successfully applied.
- Re-run the commands in the post-execution validation for the step.
### Add the new Gitaly node to all our Kubernetes container configuration

Pre-conditions for execution of the step:

- Create a new MR in the `gl-infra/k8s-workloads/gitlab-com` project.
- In the MR, update the file `releases/gitlab/values/${environment}.yaml.gotmpl` and add the new node to the `global.gitaly.external` yaml list. Typically the data looks like:

  ```
  - hostname: gitaly-01-sv-pre.c.gitlab-pre.internal
    name: default
    port: "9999"
    tlsEnabled: false
  ```

- Here is an example MR: gitlab-com/gl-infra/k8s-workloads/gitlab-com!2408
- Have the MR reviewed by a colleague in Delivery.

Execution commands for the step:

- Notify the Engineer On-call about the planned change and seek approval, to ensure that no other deployment (from #announcements) is ongoing at the time.
- Merge the MR.
- Examine the pipeline stage output to verify that there were no errors.

Rollback of the step:

- Revert the MR.
- Re-run the execution commands above to roll back.
### Test the new node

Confirm that the new storage node is operational.

Pre-conditions for execution of the step:

- Export your gitlab.com user auth token as an environment variable in your shell session:

  ```
  export GITLAB_COM_API_PRIVATE_TOKEN='CHANGEME'
  ```

- Also export your gitlab.com admin user auth token as an environment variable in your shell session:

  ```
  export GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN='CHANGEME'
  ```

Execution commands for the step:

- Create a new project:

  ```
  export project_name='nfs-fileXX-test'
  rm -f "/tmp/project-${project_name}.json"
  curl --silent --show-error --request POST \
    "https://gitlab.com/api/v4/projects?name=${project_name}&default_branch=main" \
    --header "Private-Token: ${GITLAB_COM_API_PRIVATE_TOKEN}" \
    > "/tmp/project-${project_name}.json"
  export project_id=$(jq -r '.id' "/tmp/project-${project_name}.json")
  export ssh_url_to_repo=$(jq -r '.ssh_url_to_repo' "/tmp/project-${project_name}.json")
  ```

- Clone the project:

  ```
  git clone "${ssh_url_to_repo}" "/tmp/${project_name}"
  ```

- Add, commit, and push a `README` file to the project repository:

  ```
  echo "# ${project_name}" > "/tmp/${project_name}/README.md"
  pushd "/tmp/${project_name}" && git add README.md && git commit -am "Add README" && git push origin main && popd
  ```

- Use the API to move it to the new storage server:

  ```
  export destination_storage_name='nfs-fileXX'
  export move_id=$(curl --silent --show-error --request POST \
    "https://gitlab.com/api/v4/projects/${project_id}/repository_storage_moves" \
    --data "{\"destination_storage_name\": \"${destination_storage_name}\"}" \
    --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" \
    --header 'Content-Type: application/json' | jq -r '.id')
  ```

- Optionally, poll the API to monitor the state of the move:

  ```
  curl --silent --show-error \
    "https://gitlab.com/api/v4/projects/${project_id}/repository_storage_moves/${move_id}" \
    --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" | jq -r '.state'
  ```

- Optionally, confirm the new location:

  ```
  curl --silent --show-error "https://gitlab.com/api/v4/projects/${project_id}" \
    --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" | jq -r '.repository_storage'
  ```

- Once the project has finished moving to the new shard, add, commit, and push an update to the README:

  ```
  echo -e "\n\ntest" >> "/tmp/${project_name}/README.md"
  pushd "/tmp/${project_name}" && git add README.md && git commit -am "Update README to test nfs-fileXX" && git push origin main && popd
  ```

- Verify that the changes were persisted as expected:

  ```
  rm -rf "/tmp/${project_name}"
  git clone "${ssh_url_to_repo}" "/tmp/${project_name}"
  grep 'test' "/tmp/${project_name}/README.md"
  ```
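The optional state poll above can be wrapped in a loop that blocks until the move reaches a terminal state. A sketch only: the `poll_move_state` name is made up, it assumes `project_id`, `move_id`, and the admin token are exported as in the steps above, and it treats `finished` and `failed` as the terminal states of interest.

```shell
# Sketch: poll the repository_storage_moves API until the move finishes or
# fails. Assumes project_id, move_id, and the admin token are already exported.
poll_move_state() {
  local state
  while true; do
    state=$(curl --silent --show-error \
      "https://gitlab.com/api/v4/projects/${project_id}/repository_storage_moves/${move_id}" \
      --header "Private-Token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" \
      | jq -r '.state')
    case "$state" in
      finished|failed) echo "$state"; return ;;
    esac
    sleep 10
  done
}
```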
### Enable the new node in GitLab

Enabling new nodes in the GitLab admin console requires using an admin account to change where new projects are stored. Under Admin Area > Settings > Repository > Repository storage > Expand, you will see a list of storage nodes. The nodes that are checked are the ones that will receive new projects. For more information, see the GitLab docs.

Execution commands for the step:

- Open a private browser window or tab and navigate to: https://gitlab.com/admin/application_settings/repository
- Click the Expand button next to Repository storage.
- Click the Save changes button. (You did not make any changes; trust the process and click the button.)
- Click play on the Production `gitaly-shard-weights-assigner` job to assign a weight.

Post-execution validation for the step:

- Take a count of how many projects are being created on the new shard:

  ```
  export node='file-XX-stor-gprd.c.gitlab-production.internal'
  bundle exec knife ssh "fqdn:$node" "sudo find /var/opt/gitlab/git-data/repositories/@hashed -mindepth 2 -maxdepth 3 -name '*.git' | wc -l"
  ```

- Observe that this number goes up over time.
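The `@hashed` directory layout being counted here comes from GitLab's hashed-storage scheme: the repository path is derived from the SHA-256 of the project's decimal ID, with the first two hex pairs used as directory levels. A quick sketch (the `hashed_repo_path` helper name is made up) to compute where a given project ID should land on disk:

```shell
# Compute the hashed-storage repository path for a numeric project ID:
# sha256 of the decimal ID; the first two hex pairs form the directory levels.
hashed_repo_path() {
  local h
  h=$(printf '%s' "$1" | sha256sum | awk '{print $1}')
  printf '@hashed/%s/%s/%s.git\n' "${h:0:2}" "${h:2:2}" "$h"
}

hashed_repo_path 1
# → @hashed/6b/86/6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b.git
```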
### Disable the old node in GitLab

Execution commands for the step:

- In the same Admin Area > Settings > Repository > Repository storage settings, uncheck the old node so that it no longer receives new projects, then click the Save changes button.

Post-execution validation for the step:

- Take a count of how many projects have been created on the old shard:

  ```
  export node='file-YY-stor-gprd.c.gitlab-production.internal'
  bundle exec knife ssh "fqdn:$node" "sudo find /var/opt/gitlab/git-data/repositories/@hashed -mindepth 2 -maxdepth 3 -name '*.git' | wc -l"
  ```

- Observe that this number never goes up over time. (It either goes down or does not change.)
- Delete the silence created for the `GitalyServiceGoserverTrafficAbsentSingleNode` alert in the steps above.