[gprd] Enable `pg_stat_statements` and `pg_stat_activity` `fluentd` plugin on a single `patroni` node (the backup node)
Production Change
Change Summary
Enable the `pg_stat_statements` and `pg_stat_activity` `fluentd` plugins on a single `patroni` node (the backup node).
Fulfills: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13697
Change Details
- Services Impacted - Service::Patroni Service::Monitoring
- Change Technician - @nnelson
- Change Reviewer - @cmcfarland
- Time tracking - ~30 minutes
- Downtime Component - No downtime
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5 minutes
- Have the script to update the node attributes reviewed: !85 (closed)
- Create a change issue to revert this node-level change, and record it here: #5394 (closed)
- Clone the project containing the support script:
      mkdir -p gl-infra
      cd gl-infra
      [[ -d production ]] || git clone https://gitlab.com/gitlab-com/gl-infra/production.git
      cd -
      cd gl-infra/production
      git fetch origin
      git stash
      git checkout master
      git pull origin master
      cd -
      cd chef-repo
      mkdir -p tmp
      ln -s ../../gl-infra/production tmp/
- Prepare some environment variables:
      export GITLAB_ENVIRONMENT='gprd'
      export patroni_node=$(bundle exec knife search node "roles:${GITLAB_ENVIRONMENT}-base-db-patroni" --format=json | jq --raw-output '.rows|=sort_by(.name)|.rows[0]|.name')
      export backup_patroni_node=$(ssh "${patroni_node}" 'sudo /usr/local/bin/gitlab-patronictl list --format json 2>/dev/null' | jq --raw-output '[.[] | select((.Tags["nofailover"]==true) and (.Tags["noloadbalance"]==true))] | sort_by(.Member) | .[-1].Member')
      echo "backup patroni node: ${backup_patroni_node}"
      export TARGET_NODE="${backup_patroni_node}" # Required environmental parameter for tmp/production/src/gl-infra-5393-node-level-attribute-modifier-chef-scripts/*.rb
      echo $TARGET_NODE
      export DRY_RUN="yes" # Optional environmental parameter for tmp/production/src/gl-infra-5393-node-level-attribute-modifier-chef-scripts/delete_node_level_attribute.rb*.rb
      echo $DRY_RUN
- The patroni-v12-10-db-gprd.c.gitlab-production.internal node should be configured as the stenographer of the primary, but such a configuration is subject to change and may not be the case at the time of invocation. Double-check the value of $backup_patroni_node. Ensure that this value is recorded as a comment for later reference.
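The node lookup above can fail quietly. If no cluster member carries both the nofailover and noloadbalance tags, the jq filter selects from an empty list and `jq --raw-output` prints the literal string null; if the ssh call fails, the result is an empty string. Either value would then be exported as TARGET_NODE. The sketch below is a hypothetical guard (the function name `require_node_name` is an assumption, not part of the support scripts) that rejects both cases before the change proceeds.

```shell
# Hypothetical helper, not part of the original plan: refuse to continue when
# the resolved node name is empty or the literal string "null".
require_node_name() {
  if [ -z "${1:-}" ] || [ "${1}" = "null" ]; then
    echo "refusing to continue: no backup patroni node was resolved" >&2
    return 1
  fi
  # Echo the accepted name so the caller can record it in a comment.
  echo "${1}"
}

# Example call using the node named in this plan. In a real session the
# argument would be "${backup_patroni_node}".
require_node_name 'patroni-v12-10-db-gprd.c.gitlab-production.internal'
```

In the session described above, this would run as `require_node_name "${backup_patroni_node}" || return` (or `|| exit 1` in a script) right before exporting TARGET_NODE.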
Change Steps - steps to take to execute the change
Estimated Time to Complete - 10 minutes
- Set label change::in-progress on this issue
- Dry-run persist the changes to the node-level attribute state:
      bundle exec knife exec --verbose --verbose tmp/production/src/gl-infra-5393-node-level-attribute-modifier-chef-scripts/set_node_level_attribute.rb
- Wet-run persist the changes to the node-level attribute state:
      export DRY_RUN='no'
      bundle exec knife exec --verbose --verbose tmp/production/src/gl-infra-5393-node-level-attribute-modifier-chef-scripts/set_node_level_attribute.rb
- Verify that the new node-level attribute state includes the expected node-level normal attribute section:
      bundle exec knife node show "${backup_patroni_node}" --long --format=json | jq '.["normal"]["gitlab_fluentd"]' | grep --color=always --extended-regexp '(postgres_pg_stat_statements_enable|postgres_pg_stat_activity_enable)'
- Confirm that this resembles:
      "postgres_pg_stat_statements_enable": true,
      "postgres_pg_stat_activity_enable": true,
- Run chef-client on the node and capture its output:
      bundle exec knife ssh "fqdn:${backup_patroni_node}" 'sudo chef-client 2>&1 | tee /dev/tty | logger' | grep --ignore-case 'error'
- Capture and record any chef-client errors in a comment within a <details>...</details> HTML block.
- Verify that the new node-level attribute state still includes the expected node-level normal attribute section:
      bundle exec knife node show "${backup_patroni_node}" --long --format=json | jq '.["normal"]["gitlab_fluentd"]' | grep --color=always --extended-regexp '(postgres_pg_stat_statements_enable|postgres_pg_stat_activity_enable)'
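The step above that asks for chef-client errors to be recorded inside a <details> block can be scripted. The following is a minimal sketch, assuming the chef-client output has been saved or piped to stdin; the function name `details_block`, the summary text, and the use of <pre> are illustrative choices, not something this plan prescribes.

```shell
# Hypothetical helper: read command output on stdin, keep only the lines that
# contain "error" (case-insensitive), and wrap them in a <details> block that
# can be pasted into an issue comment.
details_block() {
  # grep exits non-zero when nothing matches; "|| true" keeps that from
  # being treated as a failure of the helper itself.
  matches="$(grep -i 'error' || true)"
  if [ -z "${matches}" ]; then
    echo "No chef-client errors captured."
    return 0
  fi
  printf '<details>\n<summary>chef-client errors</summary>\n\n<pre>\n%s\n</pre>\n\n</details>\n' "${matches}"
}

# Example with made-up input lines (not real chef-client output).
printf '%s\n' 'line one' 'sample ERROR line' 'line three' | details_block
```

Only the matching lines are kept, so a clean run produces the single sentence "No chef-client errors captured." rather than an empty block.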
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete - 10 minutes
- Execute:
      bundle exec knife node show "${backup_patroni_node}" --long --format=json | jq '.["normal"]["gitlab_fluentd"]' | grep --color=always --extended-regexp '(postgres_pg_stat_statements_enable|postgres_pg_stat_activity_enable)'
- Verify that the output resembles:
      "postgres_pg_stat_statements_enable": true,
      "postgres_pg_stat_activity_enable": true,
      },
- Inspect the td-agent postgres source configuration by invoking the following command:
      bundle exec knife ssh "roles:${GITLAB_ENVIRONMENT}-base-db-patroni" 'grep --before=1 --after=9 "@type pg_stat_statements" /etc/td-agent/conf.d/postgres.conf'
- Verify that the output resembles:
      patroni-v12-10-db-gprd.c.gitlab-production.internal <source>
      patroni-v12-10-db-gprd.c.gitlab-production.internal @type pg_stat_statements
      patroni-v12-10-db-gprd.c.gitlab-production.internal tag postgres.pg_stat_statements
      patroni-v12-10-db-gprd.c.gitlab-production.internal host localhost
      patroni-v12-10-db-gprd.c.gitlab-production.internal port 5432
      patroni-v12-10-db-gprd.c.gitlab-production.internal username "#{ENV['FLUENTD_POSTGRES_INPUT_USERNAME']}"
      patroni-v12-10-db-gprd.c.gitlab-production.internal password "#{ENV['FLUENTD_POSTGRES_INPUT_PASSWORD']}"
      patroni-v12-10-db-gprd.c.gitlab-production.internal dbname gitlabhq_production
      patroni-v12-10-db-gprd.c.gitlab-production.internal sslmode prefer
      patroni-v12-10-db-gprd.c.gitlab-production.internal interval 300
      patroni-v12-10-db-gprd.c.gitlab-production.internal </source>
- Set a reasonable due date on the production change issue created to track the restoration of this patroni node's node-level attributes from the pet state to the cattle state.
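The attribute verification above relies on reading colorized grep output by eye. As a complement, the sketch below turns the same check into a pass/fail result. It is a hypothetical helper (the name `flags_enabled` is an assumption) that reads text on stdin and succeeds only when both keys appear with the value true.

```shell
# Hypothetical helper: succeed only if both fluentd flags appear as true in
# the text read from stdin.
flags_enabled() {
  input="$(cat)"
  for key in postgres_pg_stat_statements_enable postgres_pg_stat_activity_enable; do
    # Match "<key>": true, allowing any whitespace after the colon.
    printf '%s\n' "${input}" | grep -q -E "\"${key}\":[[:space:]]*true" || {
      echo "missing or not true: ${key}" >&2
      return 1
    }
  done
  echo "both flags enabled"
}

# Example input mirroring the expected output listed in this plan.
flags_enabled <<'EOF'
"postgres_pg_stat_statements_enable": true,
"postgres_pg_stat_activity_enable": true,
EOF
```

Against the live node, the input would come from the command already used in this plan: `bundle exec knife node show "${backup_patroni_node}" --long --format=json | jq '.["normal"]["gitlab_fluentd"]' | flags_enabled`.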
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete - 2 minutes
- Simply restore the node's pet state to its cattle state:
      bundle exec knife exec --verbose --verbose tmp/production/src/gl-infra-5393-node-level-attribute-modifier-chef-scripts/delete_node_level_attribute.rb
- Run chef-client on the node:
      bundle exec knife ssh "fqdn:${backup_patroni_node}" 'sudo chef-client 2>&1 | tee /dev/tty | logger' | grep --ignore-case 'error'
- Verify that the postgres_pg_stat_statements_enable attribute is absent from the node-level state:
      bundle exec knife node show "${backup_patroni_node}" --format=json | grep 'postgres_pg_stat_statements_enable'
- Verify that the postgres_pg_stat_activity_enable attribute is absent from the node-level state:
      bundle exec knife node show "${backup_patroni_node}" --format=json | grep 'postgres_pg_stat_activity_enable'
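In the two rollback checks above, success is signalled by grep printing nothing and exiting non-zero, which is easy to misread. The sketch below inverts that logic so that absence is reported explicitly. The helper name `attribute_absent` is hypothetical and the sample input is made up for illustration.

```shell
# Hypothetical helper: read node JSON on stdin and succeed only when the
# given attribute name does not occur anywhere in it.
attribute_absent() {
  key="$1"
  # -F treats the key as a fixed string rather than a regular expression.
  if grep -q -F "${key}"; then
    echo "still present: ${key}" >&2
    return 1
  fi
  echo "absent: ${key}"
}

# Example with made-up node JSON that lacks the attribute.
printf '%s\n' '{"name": "example-node"}' | attribute_absent 'postgres_pg_stat_statements_enable'
```

For the rollback, each of the two knife commands above would be piped into `attribute_absent` with the matching attribute name in place of the trailing grep.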
Monitoring
Key metrics to observe
- Metric: The patroni Service Apdex
- Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?viewPanel=3543037459&orgId=1
- What changes to this metric should prompt a rollback: A sustained (longer than one or two minutes) reduction in apdex of any degree.
- Metric: Node CPU
- Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?viewPanel=60&orgId=1
- What changes to this metric should prompt a rollback: A sustained elevation (longer than one or two minutes) of Node CPU on any patroni host.
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.
Edited by Nels Nelson