Tune vfs_cache_pressure tunable on PostgreSQL
Production Change
Change Summary
Lower the vm.vfs_cache_pressure sysctl from its default of 100 to 90 on the gprd Patroni PostgreSQL hosts. Values below 100 make the kernel more reluctant to reclaim dentry and inode caches, which should reduce filesystem metadata churn and the associated disk IO. The change is rolled out host by host: a non-traffic read replica first, then the traffic-serving read replicas, and finally the production primary (stepping through an intermediate value of 95).
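As a quick orientation before any step is run, the current value of the tunable can be read on a host; a minimal sketch using standard tooling only:

```bash
# Read the current value of the tunable (default is 100 on stock kernels).
sysctl vm.vfs_cache_pressure
# Equivalent read via procfs, matching the paths used in the steps below.
cat /proc/sys/vm/vfs_cache_pressure
```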
Change Details
- Services Impacted - ServicePostgres
- Change Technician - @T4cC0re
- Change Criticality - C1
- Change Type - changeunscheduled, changescheduled
- Change Reviewer - @msmiley
- Due Date - 2021-03-15 20:30 UTC
- Time tracking - Time, in minutes, needed to execute all change steps, including rollback
- Downtime Component - None expected
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) -
- Make dir to capture metadata: mkdir /tmp/vfs_cache_pressure_change/
- Capture slab space usage: sudo slabtop --once > /tmp/vfs_cache_pressure_change/slabtop.$( hostname -s ).$( date +%Y%m%d_%H%M%S_%Z ).out
- Capture histogram of durations for filesystem operations: sudo /usr/share/bcc/tools/ext4dist 60 1 2> /dev/null > /tmp/vfs_cache_pressure_change/ext4dist.$( hostname -s ).$( date +%Y%m%d_%H%M%S_%Z ).out (a sketch for running these captures on each host follows this list)
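The capture commands above run per host. A minimal sketch of looping them over the affected hosts from a workstation, assuming ssh access to the same short hostnames used in the change steps; note ext4dist traces for 60s per host, so the sequential loop takes roughly 8 minutes:

```bash
#!/usr/bin/env bash
# Capture baseline slab usage and ext4 latency histograms on each Patroni host.
# Host list taken from the change steps below; adjust if the fleet differs.
set -euo pipefail

for h in patroni-0{1..8}-db-gprd; do
  echo "=== ${h} ==="
  ssh "$h" 'mkdir -p /tmp/vfs_cache_pressure_change/ &&
    sudo slabtop --once > /tmp/vfs_cache_pressure_change/slabtop.$( hostname -s ).$( date +%Y%m%d_%H%M%S_%Z ).out &&
    sudo /usr/share/bcc/tools/ext4dist 60 1 2> /dev/null > /tmp/vfs_cache_pressure_change/ext4dist.$( hostname -s ).$( date +%Y%m%d_%H%M%S_%Z ).out'
done
```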
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) -
Read-only, non-traffic read replica - 5 min
- ssh patroni-08-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- Monitor the metrics for 5m
Traffic read-only replicas - 15 min
- ssh patroni-01-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- Monitor the metrics for 5m
- ssh patroni-01-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-02-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-04-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-05-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-06-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-07-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- Monitor the metrics for a further 10m
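Each ssh invocation above can be paired with an immediate read-back to confirm the new value took effect; a minimal sketch for one host (patroni-01 shown; substitute the host being changed):

```bash
# Apply the new value and read it back on the same host.
ssh patroni-01-db-gprd '
  echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure
  cat /proc/sys/vm/vfs_cache_pressure   # expect: 90
'
```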
Production primary
- ssh patroni-03-db-gprd 'echo 95 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- Monitor the metrics for 5m
- ssh patroni-03-db-gprd 'echo 90 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- Monitor the metrics for 5m
We might consider tuning this further to 85, and then 80, later on.
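Once all steps are complete, a quick sweep confirms every node reports the expected value; a minimal sketch reusing the host list from the steps above:

```bash
# Read back vfs_cache_pressure on every changed host; all should report 90
# once the final step on the primary has been applied.
for h in patroni-0{1..8}-db-gprd; do
  printf '%s: ' "$h"
  ssh "$h" 'cat /proc/sys/vm/vfs_cache_pressure'
done
```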
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) -
- Capture slab space usage: sudo slabtop --once > /tmp/vfs_cache_pressure_change/slabtop.$( hostname -s ).$( date +%Y%m%d_%H%M%S_%Z ).out
- Capture histogram of durations for filesystem operations: sudo /usr/share/bcc/tools/ext4dist 60 1 2> /dev/null > /tmp/vfs_cache_pressure_change/ext4dist.$( hostname -s ).$( date +%Y%m%d_%H%M%S_%Z ).out
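To see whether the kernel is now retaining more filesystem metadata, the pre- and post-change slabtop captures can be compared side by side; a minimal sketch that pulls out the dentry and ext4 inode cache lines from the snapshots written earlier (assumes both captures exist in the capture directory):

```bash
# Compare dentry and ext4 inode slab usage across all slabtop captures on this host.
grep -E '^\s*OBJS|dentry|ext4_inode_cache' /tmp/vfs_cache_pressure_change/slabtop.*.out
```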
- Create a Chef MR to make these tunables persistent
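The Chef MR itself is not sketched here; what it ultimately needs to persist is equivalent to a sysctl drop-in. A minimal sketch of the intended end state, with an illustrative file name of our choosing (the real change should land through the cookbook, not by hand-editing hosts):

```bash
# Illustrative only: persist the tunable across reboots via a sysctl drop-in.
echo 'vm.vfs_cache_pressure = 90' | sudo tee /etc/sysctl.d/90-vfs-cache-pressure.conf
sudo sysctl --system   # reload all sysctl configuration files
```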
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) -
Read-only, non-traffic read replica
- ssh patroni-08-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
Traffic read-only replicas
- ssh patroni-01-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-01-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-02-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-04-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-05-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-06-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
- ssh patroni-07-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
Production primary
- ssh patroni-03-db-gprd 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
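If a fleet-wide rollback is needed quickly, the per-host commands above can be wrapped in a loop; a minimal sketch resetting every host to the kernel default of 100:

```bash
# Reset vfs_cache_pressure to the default of 100 on all Patroni hosts.
for h in patroni-0{1..8}-db-gprd; do
  ssh "$h" 'echo 100 | sudo tee /proc/sys/vm/vfs_cache_pressure'
done
```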
Monitoring
Key metrics to observe
- Metric: raw disk IO (adjust for the host you change)
- Location: https://dashboards.gitlab.net/d/pEfSMUhmy/postgresql-disk-io?viewPanel=11&orgId=1&var-environment=gprd&var-prometheus=Global&var-type=&var-node=patroni-03-db-gprd.c.gitlab-production.internal
- What changes to this metric should prompt a rollback: significantly higher disk IO latency on the changed host
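In addition to the dashboard, disk IO can be cross-checked directly on the host being changed; a minimal sketch using iostat from the sysstat package (assumed to be installed on the hosts), here against the primary as an example:

```bash
# Watch extended per-device IO statistics (including await latency) every 5 seconds
# for about a minute; adjust the host to the one being changed.
ssh patroni-03-db-gprd 'iostat -xz 5 12'
```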
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.