crunchy-postgres-health-check
Deliverables
Health Check
-
Analyze existing configuration and architecture, provide recommendations - https://gitlab.com/gitlab-com/infrastructure/issues/1552
- https://gitlab.com/gitlab-com/infrastructure/issues/1553
- https://gitlab.com/gitlab-com/infrastructure/issues/1554
- https://gitlab.com/gitlab-com/infrastructure/issues/1555
- https://gitlab.com/gitlab-com/infrastructure/issues/1556
- https://gitlab.com/gitlab-com/infrastructure/issues/1557
- https://gitlab.com/gitlab-com/infrastructure/issues/1558
- https://gitlab.com/gitlab-com/infrastructure/issues/1559
- https://gitlab.com/gitlab-com/infrastructure/issues/1561
- https://gitlab.com/gitlab-com/infrastructure/issues/1587
- https://gitlab.com/gitlab-com/infrastructure/issues/1588
- https://gitlab.com/gitlab-com/infrastructure/issues/1589
- https://gitlab.com/gitlab-com/infrastructure/issues/1630
- https://gitlab.com/gitlab-com/infrastructure/issues/1652
-
Review system statistics and identify any areas of concern - pending some monitoring changes listed above
-
Review schema and provide recommendations -
Review PostgreSQL log files to identify any issues and recommendations
Backup and Restore
-
Review existing backup methodologies and provide recommendations:
PGBouncer
-
Review pgbouncer.ini and provide recommendations -
Discuss and provide recommendations regarding PGBouncer locations/architecture
Read replicas
-
Review replica stats and log files -
Determine cause of hot_standby causing table bloat: - likely related to issue addressed in 9.6.2
- https://gitlab.com/gitlab-com/infrastructure/issues/1501, decision is to upgrade to 9.6.3: https://gitlab.com/gitlab-com/infrastructure/issues/1158
HA / Failover
-
Review corosync/pacemaker configuration and provide recommendations - depends on https://gitlab.com/gitlab-com/infrastructure/issues/1460; closed in favor of gitlab-org/omnibus-gitlab#1807 (closed) as decided to move away from pacemaker/corosync
-
Assist with understanding of corosync/pacemaker, current cluster status, managing failover - depends on https://gitlab.com/gitlab-com/infrastructure/issues/1460; closed in favor of gitlab-org/omnibus-gitlab#1807 (closed) as decided to move away from pacemaker/corosync
Monitoring
-
Review existing monitoring, provide recommendations for additional monitoring
Tuning / Settings
-
Review postgresql.conf and provide recommendations: -
Review autovacuum runs - requires log files with autovacuum information
-
Review autovacuum settings and provide recommendations
Application binding to replicas
-
We discussed this during our calls but unclear if further discussion or review necessary.
Application improvements
-
Review slow queries for possible optimization - slow queries identified, representative data sample and/or system access would allow testing alternative query structures and assist in identifying possible changes in indexing strategy, see https://gitlab.com/gitlab-com/infrastructure/issues/1786)
Staging with realistic load by running tests
-
Review possibility to enable full logging, to allow collection of all queries to enable replaying on staging
Capacity planning
-
Implementation of pg_bouncer and other improvements will impact capacity planning and scalability -
Review shared buffer utilization through pg_buffercache (requires installation) to determine memory utilization and working set size -
Provide recommendations regarding changes to shared_buffers and system memory amounts -
Recommendations regarding metrics for scalability and capacity planning