[gprd] Grant `pg_read_all_stats` permission to `postgres_exporter` role
Production Change
Change Summary
[gprd] Grant `pg_read_all_stats` permission to the `postgres_exporter` role.
Fulfills: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14405
Change Details
- Services Impacted - ServicePostgres
- Change Technician - @nnelson
- Change Reviewer - @cmcfarland
- Time tracking - 15 minutes
- Downtime Component - No downtime
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1 hour
- Prepare a script with the PSQL command: `GRANT pg_read_all_stats TO postgres_exporter;`
- Test in staging.
- Record results.
- Create an MR to add the script, as well as its roll-back script, to https://gitlab.com/gitlab-com/gl-infra/production/-/tree/master/src/gl-infra-5726-grant-pg_read_all_stats-to-postgres_exporter-role
- Add link to MR here: https://gitlab.com/gitlab-com/gl-infra/production/-/merge_requests/92
- Have MR reviewed.
- Determine the leader patroni node:

      # export GITLAB_ENVIRONMENT='gstg' # For staging testing
      # export CONSUL_NODE='consul-01-inf-gstg.c.gitlab-staging-1.internal' # For staging testing
      export GITLAB_ENVIRONMENT='gprd'
      export CONSUL_NODE='consul-01-inf-gprd.c.gitlab-production.internal'
      export patroni_node=$(ssh "${CONSUL_NODE}" -- consul catalog nodes -service=patroni -detailed |
        grep "environment=${GITLAB_ENVIRONMENT}" | sed -n 2p |
        awk -F' ' '{print $6}' | awk -F', ' '{print $3}' | awk -F'=' '{print $2}')
      export leader_patroni_node=$(ssh "${patroni_node}" 'sudo /usr/local/bin/gitlab-patronictl list --format json 2>/dev/null' |
        jq --raw-output '.[] | select(.Role=="Leader").Member')
      echo "leader: ${leader_patroni_node}"
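The script referenced in the first pre-change step lives in the linked MR and is not reproduced in this issue. As a rough illustration only, a script that honors the `DRY_RUN` variable used in the change steps might look like the sketch below. The `run_sql` helper, the `PSQL_CMD` override, and the choice to default to dry-run when `DRY_RUN` is unset are all assumptions made for this example, not details taken from the MR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; NOT the script from the MR.
set -eu

ROLE='postgres_exporter'
SQL="GRANT pg_read_all_stats TO ${ROLE};"

# run_sql prints the statement in dry-run mode and executes it otherwise.
# PSQL_CMD is an assumed override hook (for example a site-specific wrapper);
# it defaults to plain psql. Anything other than DRY_RUN=0 is treated as dry-run.
run_sql() {
  local sql="$1"
  if [ "${DRY_RUN:-1}" != "0" ]; then
    echo "[dry-run] would execute: ${sql}"
  else
    ${PSQL_CMD:-psql} --no-psqlrc --command="${sql}"
  fi
}

run_sql "${SQL}"
```

Defaulting to dry-run means an accidental invocation without `DRY_RUN=0` changes nothing. A roll-back script could follow the same shape with `REVOKE pg_read_all_stats FROM postgres_exporter;` as the statement.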
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 5 minutes
- Set label changein-progress on this issue
- Clone the production project in the /tmp directory on the file system of the leader patroni node:

      ssh "${leader_patroni_node}" '[ -e /tmp/production ] && rm -rf /tmp/production; git clone https://gitlab.com/gitlab-com/gl-infra/production.git /tmp/production'

- Execute the script in dry-run mode on the leader:

      ssh "${leader_patroni_node}" "sudo su --command='DRY_RUN=1 /tmp/production/src/gl-infra-5726-grant-pg_read_all_stats-to-postgres_exporter-role/grant-pg_read_all_stats-to-postgres_exporter-role.sh' root"

- Record the output of the script as a comment on this issue.
- Verify that there are no errors or other unexpected output.
- Execute the script in wet-run mode on the leader:

      ssh "${leader_patroni_node}" "sudo su --command='DRY_RUN=0 /tmp/production/src/gl-infra-5726-grant-pg_read_all_stats-to-postgres_exporter-role/grant-pg_read_all_stats-to-postgres_exporter-role.sh' root"
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5 minutes
- Verify that the postgres_exporter role is now permitted to operate as expected, and that, over multiple refreshes across a few minutes in Kibana, the values for unparseable decline rapidly while the values for parseable begin to increase rapidly.
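In addition to watching Kibana, the grant itself can be checked directly in the database. The sketch below is one possible way to do that, using PostgreSQL's built-in `pg_has_role()` function. The helper names `membership_query` and `is_member` and the `PSQL_CMD` override are hypothetical, and connection settings are assumed to come from the environment.

```shell
#!/usr/bin/env bash
# Hypothetical verification helper; not part of the change's MR.
set -eu

# Build a query asking whether role $1 is a direct or indirect member of role $2.
membership_query() {
  printf "SELECT pg_has_role('%s', '%s', 'member');" "$1" "$2"
}

# Return 0 if the membership exists, non-zero otherwise.
# PSQL_CMD is an assumed override hook; it defaults to plain psql.
# --tuples-only and --no-align reduce the output to a bare "t" or "f".
is_member() {
  local result
  result=$(${PSQL_CMD:-psql} --no-psqlrc --tuples-only --no-align \
    --command="$(membership_query "$1" "$2")")
  [ "${result}" = "t" ]
}
```

On the leader, something like `if is_member postgres_exporter pg_read_all_stats; then echo granted; fi` would confirm the membership. This complements the Kibana check rather than replacing it, since it only shows that the grant exists, not that the exporter's output has recovered.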
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5 minutes
- Execute the roll-back script in dry-run mode on the leader:

      ssh "${leader_patroni_node}" "sudo su --command='DRY_RUN=1 /tmp/production/src/gl-infra-5726-grant-pg_read_all_stats-to-postgres_exporter-role/revoke-pg_read_all_stats-from-postgres_exporter-role.sh' root"

- Record the output of the script as a comment on this issue.
- Verify that there are no errors or other unexpected output.
- Execute the roll-back script in wet-run mode on the leader:

      ssh "${leader_patroni_node}" "sudo su --command='DRY_RUN=0 /tmp/production/src/gl-infra-5726-grant-pg_read_all_stats-to-postgres_exporter-role/revoke-pg_read_all_stats-from-postgres_exporter-role.sh' root"
Monitoring
Key metrics to observe
- Metric: patroni Service Apdex
- Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?viewPanel=3543037459&orgId=1&from=now-3h&to=now&refresh=5s
- What changes to this metric should prompt a rollback: Any sustained (more than 2-3 minutes) and significant reduction below 1 hour SLO.
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Nels Nelson