
Tune wal-g upload settings

Production Change

Change Summary

Per #3962 (comment 529876880), there is a lot of room for tuning the wal-g upload settings.

We have already (2021-03-16 00:12 UTC) changed:

  • walg_upload_disk_concurrency: 8 => 1
  • walg_upload_concurrency: 50 => 8
  • total_bg_uploaded_limit: undefined (default 32) => 128
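
The change steps further down edit these as one file per variable under /etc/wal-g.d/env/ (envdir style). A minimal sketch for confirming the values currently in effect on the host, assuming that layout and read access to the directory:

```python
#!/usr/bin/env python3
"""Print the wal-g upload tuning values currently set on this host.

Assumes the envdir-style layout referenced in the change steps:
one file per variable under /etc/wal-g.d/env/, containing just the value.
"""
from pathlib import Path

ENV_DIR = Path("/etc/wal-g.d/env")
SETTINGS = [
    "WALG_UPLOAD_DISK_CONCURRENCY",
    "WALG_UPLOAD_CONCURRENCY",
    "TOTAL_BG_UPLOADED_LIMIT",
]

for name in SETTINGS:
    path = ENV_DIR / name
    value = path.read_text().strip() if path.exists() else "<unset>"
    print(f"{name}={value}")
```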

This change issue covers some additional tuning of these numbers in an attempt to get our wal-g upload rate up a bit, without affecting the rest of the system.

Changes: 2021-03-16

  • 02:15 UTC: Increased WALG_UPLOAD_CONCURRENCY from 8 to 16, hoping to raise throughput above what otherwise appears to be a lower ceiling than we had earlier.
  • 02:50 UTC: Increased WALG_UPLOAD_CONCURRENCY from 16 to 24; the previous change improved the upload catch-up rate, and this may help further while load is quiet.
  • 03:15 UTC: Decreased WALG_UPLOAD_CONCURRENCY from 24 to 16, as the last increase had not helped. Also increased WALG_UPLOAD_DISK_CONCURRENCY from 1 to 2 (taking note of #3881 (comment 529876728); we may want this higher still, and will do more in further steps).
  • 03:29 UTC: Increased WALG_UPLOAD_DISK_CONCURRENCY from 2 to 4; the previous step from 1 to 2 helped throughput.
  • 03:41 UTC: Increased WALG_UPLOAD_DISK_CONCURRENCY from 4 to 8 (it appears to be helping the delayed replica).
  • 04:05 UTC: Tuned TOTAL_BG_UPLOADED_LIMIT from 128 back to the original 32, to confirm its impact on both wal-g upload catch-up and replay on the delayed replica.
  • 04:22 UTC: Restored the chef-managed configuration of 1 / 8 / 128 (disk concurrency / upload concurrency / background upload limit, as set at 00:12 UTC).
  • 23:20 UTC: Tuned WALG_UPLOAD_DISK_CONCURRENCY back to 8 (permanently, in chef).
  • 23:22 UTC: Enabled debug-level logging to see if there are any clues.
  • 23:28 UTC: Disabled debug logging; analysis to follow.
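
For the follow-up analysis, the same timeline can be kept as data and overlaid on the monitoring graphs to compare catch-up rates before and after each step. A minimal sketch, with the values transcribed from the bullets above (wal-g tuning changes only; all timestamps 2021-03-16 UTC):

```python
from datetime import datetime, timezone

# (timestamp, variable, new value) -- transcribed from the timeline above
CHANGES = [
    ("2021-03-16T02:15", "WALG_UPLOAD_CONCURRENCY", 16),
    ("2021-03-16T02:50", "WALG_UPLOAD_CONCURRENCY", 24),
    ("2021-03-16T03:15", "WALG_UPLOAD_CONCURRENCY", 16),
    ("2021-03-16T03:15", "WALG_UPLOAD_DISK_CONCURRENCY", 2),
    ("2021-03-16T03:29", "WALG_UPLOAD_DISK_CONCURRENCY", 4),
    ("2021-03-16T03:41", "WALG_UPLOAD_DISK_CONCURRENCY", 8),
    ("2021-03-16T04:05", "TOTAL_BG_UPLOADED_LIMIT", 32),
    ("2021-03-16T04:22", "WALG_UPLOAD_DISK_CONCURRENCY", 1),
    ("2021-03-16T04:22", "WALG_UPLOAD_CONCURRENCY", 8),
    ("2021-03-16T04:22", "TOTAL_BG_UPLOADED_LIMIT", 128),
    ("2021-03-16T23:20", "WALG_UPLOAD_DISK_CONCURRENCY", 8),
]

for ts, var, value in CHANGES:
    when = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    print(f"{when.isoformat()}  {var} -> {value}")
```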

Change Details

  1. Services Impacted - ServicePostgres
  2. Change Technician - @cmiskell
  3. Change Criticality - C2
  4. Change Type - changescheduled
  5. Change Reviewer - @dawsmith
  6. Due Date - 2021-03-16
  7. Time tracking - 30 minutes
  8. Downtime Component - 0

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30 minutes

  • Disable chef on patroni-03-db-gprd
  • Edit /etc/wal-g.d/env/WALG_UPLOAD_CONCURRENCY and change the value to 16 (see the sketch after this list)
  • Observe results
  • Formalize in chef if required, or revert
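
A minimal sketch of the edit step above, assuming the envdir layout (one file per variable, containing just the value) and that wal-g reads these files on each invocation; run it as root on patroni-03-db-gprd only after chef has been disabled, otherwise chef will revert the value:

```python
#!/usr/bin/env python3
"""Set one wal-g tuning variable, recording the previous value.

Sketch only: assumes chef-client is already disabled on the host and that
wal-g picks up the new value on its next run via the envdir files.
"""
import sys
from pathlib import Path

ENV_DIR = Path("/etc/wal-g.d/env")

def set_value(name: str, value: str) -> None:
    path = ENV_DIR / name
    old = path.read_text().strip() if path.exists() else "<unset>"
    path.write_text(value + "\n")
    print(f"{name}: {old} -> {value}")

if __name__ == "__main__":
    # e.g. ./set_walg_tuning.py WALG_UPLOAD_CONCURRENCY 16
    set_value(sys.argv[1], sys.argv[2])
```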

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • Re-enable and run chef manually.

Monitoring

Key metrics to observe

https://thanos.gitlab.net/graph?g0.range_input=6h&g0.max_source_resolution=0s&g0.expr=delta(pg_archiver_pending_wal_count%7Benv%3D%22gprd%22%2C%20fqdn%3D%22patroni-03-db-gprd.c.gitlab-production.internal%22%7D%5B15m%5D)&g0.tab=0&g1.range_input=2h&g1.max_source_resolution=0s&g1.expr=pg_archiver_pending_wal_count%7Benv%3D%22gprd%22%7D%20%3E0&g1.tab=0

Expecting/hoping that the delta becomes more negative, more like it was between 23:30 and 00:00 UTC (closer to 2K). If it becomes less negative (i.e. we are catching up more slowly), revert immediately.
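
The same delta can also be checked from a terminal against the Prometheus-compatible query API that Thanos exposes. A minimal sketch, assuming network access to thanos.gitlab.net and whatever authentication the endpoint requires; the expression is the first panel of the graph link above:

```python
import requests

THANOS = "https://thanos.gitlab.net/api/v1/query"
QUERY = (
    'delta(pg_archiver_pending_wal_count{env="gprd", '
    'fqdn="patroni-03-db-gprd.c.gitlab-production.internal"}[15m])'
)

# A negative delta means the pending-WAL backlog is shrinking (we are catching up).
# We want it to become more negative after tuning; revert if it gets less negative.
resp = requests.get(THANOS, params={"query": QUERY}, timeout=30)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    timestamp, value = result["value"]
    print(f"{result['metric'].get('fqdn', '')}: delta over 15m = {value}")
```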

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.