
Tune wal-g upload settings

Production Change

Change Summary

Per #3962 (comment 529876880), there is a lot of room for tuning the wal-g upload settings.

We have already (2021-03-16 00:12 UTC) changed:

  • walg_upload_disk_concurrency: 8 => 1
  • walg_upload_concurrency: 50 => 8
  • total_bg_uploaded_limit: undefined (default 32) => 128
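
The change steps further down edit these as one file per variable under /etc/wal-g.d/env/ (envdir style). A minimal sketch for confirming the values currently in effect on the host, assuming that layout and read access to the directory:

```python
#!/usr/bin/env python3
"""Print the wal-g upload tuning values currently set on this host.

Assumes the envdir-style layout referenced in the change steps:
one file per variable under /etc/wal-g.d/env/, containing just the value.
"""
from pathlib import Path

ENV_DIR = Path("/etc/wal-g.d/env")
SETTINGS = [
    "WALG_UPLOAD_DISK_CONCURRENCY",
    "WALG_UPLOAD_CONCURRENCY",
    "TOTAL_BG_UPLOADED_LIMIT",
]

for name in SETTINGS:
    path = ENV_DIR / name
    value = path.read_text().strip() if path.exists() else "<unset>"
    print(f"{name}={value}")
```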

This change issue covers some additional tuning of these numbers in an attempt to get our wal-g upload rate up a bit, without affecting the rest of the system.

Changes: 2021-03-16

  • 02:15 UTC: Increased WALG_UPLOAD_CONCURRENCY from 8 to 16, hoping to raise throughput above what otherwise appears to be a lower ceiling than we had earlier.
  • 02:50 UTC: Increased WALG_UPLOAD_CONCURRENCY from 16 to 24; the previous change improved the upload catch-up rate, and this may help further while load is quiet.
  • 03:15 UTC: Decreased WALG_UPLOAD_CONCURRENCY from 24 to 16, as the last increase had not helped. Also increased WALG_UPLOAD_DISK_CONCURRENCY from 1 to 2 (taking note of #3881 (comment 529876728); we may want this higher still, and will do more in further steps).
  • 03:29 UTC: Increased WALG_UPLOAD_DISK_CONCURRENCY from 2 to 4; the previous step from 1 to 2 helped throughput.
  • 03:41 UTC: Increased WALG_UPLOAD_DISK_CONCURRENCY from 4 to 8 (it appears to be helping the delayed replica).
  • 04:05 UTC: Tuned TOTAL_BG_UPLOADED_LIMIT from 128 back to the original 32, to confirm its impact on both wal-g upload catch-up and replay on the delayed replica.
  • 04:22 UTC: Restored the chef-managed configuration of 1 / 8 / 128 (disk concurrency / upload concurrency / background upload limit, as set at 00:12 UTC).
  • 23:20 UTC: Tuned WALG_UPLOAD_DISK_CONCURRENCY back to 8 (permanently, in chef).
  • 23:22 UTC: Enabled debug-level logging to see if there are any clues.
  • 23:28 UTC: Disabled debug logging; analysis to follow.
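
For the follow-up analysis, the same timeline can be kept as data and overlaid on the monitoring graphs to compare catch-up rates before and after each step. A minimal sketch, with the values transcribed from the bullets above (wal-g tuning changes only; all timestamps 2021-03-16 UTC):

```python
from datetime import datetime, timezone

# (timestamp, variable, new value) -- transcribed from the timeline above
CHANGES = [
    ("2021-03-16T02:15", "WALG_UPLOAD_CONCURRENCY", 16),
    ("2021-03-16T02:50", "WALG_UPLOAD_CONCURRENCY", 24),
    ("2021-03-16T03:15", "WALG_UPLOAD_CONCURRENCY", 16),
    ("2021-03-16T03:15", "WALG_UPLOAD_DISK_CONCURRENCY", 2),
    ("2021-03-16T03:29", "WALG_UPLOAD_DISK_CONCURRENCY", 4),
    ("2021-03-16T03:41", "WALG_UPLOAD_DISK_CONCURRENCY", 8),
    ("2021-03-16T04:05", "TOTAL_BG_UPLOADED_LIMIT", 32),
    ("2021-03-16T04:22", "WALG_UPLOAD_DISK_CONCURRENCY", 1),
    ("2021-03-16T04:22", "WALG_UPLOAD_CONCURRENCY", 8),
    ("2021-03-16T04:22", "TOTAL_BG_UPLOADED_LIMIT", 128),
    ("2021-03-16T23:20", "WALG_UPLOAD_DISK_CONCURRENCY", 8),
]

for ts, var, value in CHANGES:
    when = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    print(f"{when.isoformat()}  {var} -> {value}")
```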

Change Details

  1. Services Impacted - ServicePostgres
  2. Change Technician - @cmiskell
  3. Change Criticality - C2
  4. Change Type - changescheduled
  5. Change Reviewer - @dawsmith
  6. Due Date - 2021-03-16
  7. Time tracking - 30 minutes
  8. Downtime Component - 0

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30 minutes

  • Disable chef on patroni-03-db-gprd
  • Edit /etc/wal-g.d/env/WALG_UPLOAD_CONCURRENCY and change the value to 16 (see the sketch after this list)
  • Observe results
  • Formalize in chef if required, or revert
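
A minimal sketch of the edit step above, assuming the envdir layout (one file per variable, containing just the value) and that wal-g reads these files on each invocation; run it as root on patroni-03-db-gprd only after chef has been disabled, otherwise chef will revert the value:

```python
#!/usr/bin/env python3
"""Set one wal-g tuning variable, recording the previous value.

Sketch only: assumes chef-client is already disabled on the host and that
wal-g picks up the new value on its next run via the envdir files.
"""
import sys
from pathlib import Path

ENV_DIR = Path("/etc/wal-g.d/env")

def set_value(name: str, value: str) -> None:
    path = ENV_DIR / name
    old = path.read_text().strip() if path.exists() else "<unset>"
    path.write_text(value + "\n")
    print(f"{name}: {old} -> {value}")

if __name__ == "__main__":
    # e.g. ./set_walg_tuning.py WALG_UPLOAD_CONCURRENCY 16
    set_value(sys.argv[1], sys.argv[2])
```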

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • Re-enable and run chef manually.

Monitoring

Key metrics to observe

https://thanos.gitlab.net/graph?g0.range_input=6h&g0.max_source_resolution=0s&g0.expr=delta(pg_archiver_pending_wal_count%7Benv%3D%22gprd%22%2C%20fqdn%3D%22patroni-03-db-gprd.c.gitlab-production.internal%22%7D%5B15m%5D)&g0.tab=0&g1.range_input=2h&g1.max_source_resolution=0s&g1.expr=pg_archiver_pending_wal_count%7Benv%3D%22gprd%22%7D%20%3E0&g1.tab=0

Expecting/hoping that the delta becomes more negative, more like it was between 23:30 and 00:00 UTC (closer to 2K). If it becomes less negative (i.e. we are catching up more slowly), revert immediately.
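
The same delta can also be checked from a terminal against the Prometheus-compatible query API that Thanos exposes. A minimal sketch, assuming network access to thanos.gitlab.net and whatever authentication the endpoint requires; the expression is the first panel of the graph link above:

```python
import requests

THANOS = "https://thanos.gitlab.net/api/v1/query"
QUERY = (
    'delta(pg_archiver_pending_wal_count{env="gprd", '
    'fqdn="patroni-03-db-gprd.c.gitlab-production.internal"}[15m])'
)

# A negative delta means the pending-WAL backlog is shrinking (we are catching up).
# We want it to become more negative after tuning; revert if it gets less negative.
resp = requests.get(THANOS, params={"query": QUERY}, timeout=30)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    timestamp, value = result["value"]
    print(f"{result['metric'].get('fqdn', '')}: delta over 15m = {value}")
```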

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.