Tune walg upload settings
Production Change
Change Summary
Per #3962 (comment 529876880) there is a lot of room for tuning the walg upload settings.
We have already (2021-03-16 00:12 UTC) changed:
- walg_upload_disk_concurrency: 8 => 1
- walg_upload_concurrency: 50 => 8
- total_bg_uploaded_limit: undefined (default 32) => 128
This change issue covers some additional tuning of these numbers in an attempt to get our wal-g upload rate up a bit, without affecting the rest of the system.
Changes: 2021-03-16
- 02:15 UTC: Increase WALG_UPLOAD_CONCURRENCY from 8 to 16, hoping to raise throughput again; it currently appears to be capped below the rate we had earlier.
- 02:50 UTC: Increase WALG_UPLOAD_CONCURRENCY from 16 to 24 - previous change helped catch-up rate on the uploads, this may help more while load is quiet.
- 03:15 UTC: Decrease WALG_UPLOAD_CONCURRENCY from 24 to 16; it had not helped. Also WALG_UPLOAD_DISK_CONCURRENCY from 1 to 2 (taking note of #3881 (comment 529876728) and that we may want this higher still; we'll do more in further steps)
- 03:29 UTC: WALG_UPLOAD_DISK_CONCURRENCY from 2 to 4; previous step from 1 to 2 helped throughput.
- 03:41 UTC: WALG_UPLOAD_DISK_CONCURRENCY from 4 to 8 (it seems to be helping the delayed replica)
- 04:05 UTC: Tuned TOTAL_BG_UPLOADED_LIMIT from 128 back to the original 32, to confirm what impact this has on both wal-g upload catch up, and replay on the delayed replica.
- 04:22 UTC: Restored the chef configuration of WALG_UPLOAD_DISK_CONCURRENCY 1, WALG_UPLOAD_CONCURRENCY 8, TOTAL_BG_UPLOADED_LIMIT 128 (as set back at 00:12 UTC)
- 23:20 UTC: Tuned WALG_UPLOAD_DISK_CONCURRENCY back to 8 (permanently, in chef)
- 23:22 UTC: Enabled debug level logging to see if there are any clues.
- 23:28 UTC: Disabled debug logging; analysis to follow
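Because the settings were changed many times in one day, it can help to replay the log and see the effective values after each step. The following is a minimal sketch in Python, with the values transcribed from the list above; the `replay` helper and the `CHANGES` table are illustrative names, not part of any tooling used for this change. The debug-logging toggles are left out because they do not touch the three tuning values.

```python
from typing import Dict, List, Tuple

Settings = Dict[str, int]

# (time in UTC on 2021-03-16, settings changed at that time),
# transcribed from the change log above.
CHANGES: List[Tuple[str, Settings]] = [
    ("00:12", {"WALG_UPLOAD_DISK_CONCURRENCY": 1,
               "WALG_UPLOAD_CONCURRENCY": 8,
               "TOTAL_BG_UPLOADED_LIMIT": 128}),
    ("02:15", {"WALG_UPLOAD_CONCURRENCY": 16}),
    ("02:50", {"WALG_UPLOAD_CONCURRENCY": 24}),
    ("03:15", {"WALG_UPLOAD_CONCURRENCY": 16,
               "WALG_UPLOAD_DISK_CONCURRENCY": 2}),
    ("03:29", {"WALG_UPLOAD_DISK_CONCURRENCY": 4}),
    ("03:41", {"WALG_UPLOAD_DISK_CONCURRENCY": 8}),
    ("04:05", {"TOTAL_BG_UPLOADED_LIMIT": 32}),
    ("04:22", {"WALG_UPLOAD_DISK_CONCURRENCY": 1,
               "WALG_UPLOAD_CONCURRENCY": 8,
               "TOTAL_BG_UPLOADED_LIMIT": 128}),
    ("23:20", {"WALG_UPLOAD_DISK_CONCURRENCY": 8}),
]


def replay(changes: List[Tuple[str, Settings]]) -> List[Tuple[str, Settings]]:
    """Return the full effective settings after each change is applied."""
    state: Settings = {}
    history: List[Tuple[str, Settings]] = []
    for when, delta in changes:
        state = {**state, **delta}  # later values override earlier ones
        history.append((when, dict(state)))
    return history


if __name__ == "__main__":
    for when, state in replay(CHANGES):
        print(when, state)
```

The last entry of the replay gives the configuration left in place at the end of the day: disk concurrency 8, upload concurrency 8, and a background upload limit of 128.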
Change Details
- Services Impacted - ServicePostgres
- Change Technician - @cmiskell
- Change Criticality - C2
- Change Type - changescheduled
- Change Reviewer - @dawsmith
- Due Date - 2021-03-16
- Time tracking - 30 minutes
- Downtime Component - 0
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 30 minutes
- Disable chef on patroni-03-db-gprd
- Edit /etc/wal-g.d/env/WALG_UPLOAD_CONCURRENCY and change the value to 16
- Observe results
- Formalize in chef if required, or revert
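The edit step above could be scripted so that the previous value is captured for a quick revert. This is a minimal sketch, assuming the directory is read envdir-style (one file per variable, holding only the value); `set_envdir_value` is a hypothetical helper, and the demo runs against a temporary directory rather than /etc/wal-g.d/env. The write is not atomic, and chef must already be disabled or it will overwrite the file on its next run.

```python
import tempfile
from pathlib import Path


def set_envdir_value(env_dir: str, name: str, value: int) -> str:
    """Write `value` to the file env_dir/name.

    Returns the previous contents (stripped), or "" if the file did not
    exist, so the caller can restore it later.
    """
    target = Path(env_dir) / name
    previous = target.read_text().strip() if target.exists() else ""
    target.write_text(f"{value}\n")
    return previous


if __name__ == "__main__":
    # Demo against a scratch directory; on the real host the directory
    # would be /etc/wal-g.d/env (requires appropriate permissions).
    with tempfile.TemporaryDirectory() as scratch:
        set_envdir_value(scratch, "WALG_UPLOAD_CONCURRENCY", 8)
        old = set_envdir_value(scratch, "WALG_UPLOAD_CONCURRENCY", 16)
        print("previous value:", old)
```

Keeping the returned previous value around gives a manual revert path in addition to the chef-based rollback described below.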
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Re-enable and run chef manually.
Monitoring
Key metrics to observe
We expect (or at least hope) that the delta becomes more negative, closer to what it was between 23:30 and 00:00 UTC (around 2K). If it becomes less negative (that is, we get worse at catching up), revert immediately.
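The revert rule above can be written down as a simple comparison. This is only a sketch: it assumes the delta is sampled at regular intervals before and after the change, and the function name, the `tolerance` parameter, and the sample numbers are all made up for illustration rather than taken from the real dashboards.

```python
from statistics import mean
from typing import Sequence


def should_revert(before: Sequence[float], after: Sequence[float],
                  tolerance: float = 0.0) -> bool:
    """Return True if the delta became less negative after the change.

    A more negative delta means the upload backlog is shrinking faster,
    so a higher (less negative) average afterwards is a regression.
    """
    return mean(after) > mean(before) + tolerance


if __name__ == "__main__":
    # Invented sample values, not real measurements.
    print(should_revert([-1000, -1200], [-1900, -2100]))  # catching up faster
    print(should_revert([-1000, -1200], [-500, -600]))    # catching up slower
```

A non-zero `tolerance` would avoid reverting on noise alone; how large it should be depends on how much the metric normally fluctuates.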
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Craig Miskell