Tune vfs_cache_pressure tunable on PostgreSQL
Production Change
Change Summary
Lower the vm.vfs_cache_pressure sysctl from its default of 100 to 90 on the gprd Patroni PostgreSQL hosts. Values below 100 make the kernel more reluctant to reclaim dentry and inode caches, which should reduce filesystem metadata churn and the associated disk IO. The change is rolled out host by host: a non-traffic read replica first, then the traffic-serving read replicas, and finally the production primary (stepping through an intermediate value of 95).
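As a quick orientation before any step is run, the current value of the tunable can be read on a host; a minimal sketch using standard tooling only:

```bash
# Read the current value of the tunable (default is 100 on stock kernels).
sysctl vm.vfs_cache_pressure
# Equivalent read via procfs, matching the paths used in the steps below.
cat /proc/sys/vm/vfs_cache_pressure
```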
Change Details
- Services Impacted - ServicePostgres
- Change Technician - @T4cC0re
- Change Criticality - C1
- Change Type - changeunscheduled, changescheduled
- Change Reviewer - @msmiley
- Due Date - 2021-03-15 20:30 UTC
- Time tracking - Time, in minutes, needed to execute all change steps, including rollback
- Downtime Component - None expected
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) -
- Make dir to capture metadata: mkdir /tmp/vfs_cache_pressure_change/
- Capture slab space usage: sudo slabtop --once > /tmp/vfs_cache_pressure_change/slabtop.$( hostname -s ).$( date +%Y%m%d_%H%M%S_%Z ).out
- Capture histogram of durations for filesystem operations: sudo /usr/share/bcc/tools/ext4dist 60 1 2> /dev/null > /tmp/vfs_cache_pressure_change/ext4dist.$( hostname -s ).$( date +%Y%m%d_%H%M%S_%Z ).out (a sketch for running these captures on each host follows this list)
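The capture commands above run per host. A minimal sketch of looping them over the affected hosts from a workstation, assuming ssh access to the same short hostnames used in the change steps; note ext4dist traces for 60s per host, so the sequential loop takes roughly 8 minutes:

```bash
#!/usr/bin/env bash
# Capture baseline slab usage and ext4 latency histograms on each Patroni host.
# Host list taken from the change steps below; adjust if the fleet differs.
set -euo pipefail

for h in patroni-0{1..8}-db-gprd; do
  echo "=== ${h} ==="
  ssh "$h" 'mkdir -p /tmp/vfs_cache_pressure_change/ &&
    sudo slabtop --once > /tmp/vfs_cache_pressure_change/slabtop.$( hostname -s ).$( date +%Y%m%d_%H%M%S_%Z ).out &&
    sudo /usr/share/bcc/tools/ext4dist 60 1 2> /dev/null > /tmp/vfs_cache_pressure_change/ext4dist.$( hostname -s ).$( date +%Y%m%d_%H%M%S_%Z ).out'
done
```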
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) -
Read-only, non-traffic read replica - 5 min
- ssh patroni-08-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- Monitor the metrics for 5m
Traffic read-only replicas - 15 min
- ssh patroni-01-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- Monitor the metrics for 5m
- ssh patroni-01-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-02-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-04-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-05-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-06-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-07-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- Monitor the metrics for a further 10m
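Each ssh invocation above can be paired with an immediate read-back to confirm the new value took effect; a minimal sketch for one host (patroni-01 shown; substitute the host being changed):

```bash
# Apply the new value and read it back on the same host.
ssh patroni-01-db-gprd '
  echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure
  cat /proc/sys/vm/vfs_cache_pressure   # expect: 90
'
```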
Production primary
- ssh patroni-03-db-gprd 'echo 95 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- Monitor the metrics for 5m
- ssh patroni-03-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- Monitor the metrics for 5m
We might consider tuning this further to 85, and then 80, later on.
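Once all steps are complete, a quick sweep confirms every node reports the expected value; a minimal sketch reusing the host list from the steps above:

```bash
# Read back vfs_cache_pressure on every changed host; all should report 90
# once the final step on the primary has been applied.
for h in patroni-0{1..8}-db-gprd; do
  printf '%s: ' "$h"
  ssh "$h" 'cat /proc/sys/vm/vfs_cache_pressure'
done
```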
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) -
- Capture slab space usage: sudo slabtop --once > /tmp/vfs_cache_pressure_change/slabtop.$( hostname -s ).$( date +%Y%m%d_%H%M%S_%Z ).out
- Capture histogram of durations for filesystem operations: sudo /usr/share/bcc/tools/ext4dist 60 1 2> /dev/null > /tmp/vfs_cache_pressure_change/ext4dist.$( hostname -s ).$( date +%Y%m%d_%H%M%S_%Z ).out
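To see whether the kernel is now retaining more filesystem metadata, the pre- and post-change slabtop captures can be compared side by side; a minimal sketch that pulls out the dentry and ext4 inode cache lines from the snapshots written earlier (assumes both captures exist in the capture directory):

```bash
# Compare dentry and ext4 inode slab usage across all slabtop captures on this host.
grep -E '^\s*OBJS|dentry|ext4_inode_cache' /tmp/vfs_cache_pressure_change/slabtop.*.out
```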
- Create a Chef MR to make these tunables persistent
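The Chef MR itself is not sketched here; what it ultimately needs to persist is equivalent to a sysctl drop-in. A minimal sketch of the intended end state, with an illustrative file name of our choosing (the real change should land through the cookbook, not by hand-editing hosts):

```bash
# Illustrative only: persist the tunable across reboots via a sysctl drop-in.
echo 'vm.vfs_cache_pressure = 90' | sudo tee /etc/sysctl.d/90-vfs-cache-pressure.conf
sudo sysctl --system   # reload all sysctl configuration files
```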
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) -
Read-only, non-traffic read replica
- ssh patroni-08-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
Traffic read-only replicas
- ssh patroni-01-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-01-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-02-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-04-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-05-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-06-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-07-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
Production primary
- ssh patroni-03-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
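If a fleet-wide rollback is needed quickly, the per-host commands above can be wrapped in a loop; a minimal sketch resetting every host to the kernel default of 100:

```bash
# Reset vfs_cache_pressure to the default of 100 on all Patroni hosts.
for h in patroni-0{1..8}-db-gprd; do
  ssh "$h" 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
done
```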
Monitoring
Key metrics to observe
- Metric: raw disk IO (adjust for the host you change)
- Location: https://dashboards.gitlab.net/d/pEfSMUhmy/postgresql-disk-io?viewPanel=11&orgId=1&var-environment=gprd&var-prometheus=Global&var-type=&var-node=patroni-03-db-gprd.c.gitlab-production.internal
- What changes to this metric should prompt a rollback: significantly higher disk IO latency on the changed host
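In addition to the dashboard, disk IO can be cross-checked directly on the host being changed; a minimal sketch using iostat from the sysstat package (assumed to be installed on the hosts), here against the primary as an example:

```bash
# Watch extended per-device IO statistics (including await latency) every 5 seconds
# for about a minute; adjust the host to the one being changed.
ssh patroni-03-db-gprd 'iostat -xz 5 12'
```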
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.