GitLab Database SOS
# What
Create a tool that self-managed users can run against their database to collect detailed and standardized troubleshooting information.
# Why
\~"group::database" often receives RFH requests that involve a lot of database back-and-forth to collect troubleshooting data. Automating and standardizing this data collection will have several benefits:
1. Reduced round-trip data gathering will improve time-to-analysis and thus end-to-end ticket resolution times.
2. A standardized format for data enables writing automated analysis tooling against the gathered data, improving speed of response and empowering support engineers to run that tooling themselves. Eventually this could pull \~"group::database" out of the loop for simple issues entirely.
3. Automated tooling can allow much more data collection. Most manual data provide a single snapshot of the postgres system. With an automated tool, we could collect data over a period of a few minutes and get a much more accurate picture of what's happening in the database.
# Implementation Sketch
1. Collect data for a variety of metrics on every database (primary and replica, main, ci, etc), every 5 seconds in a loop, for 5 minutes or until cancelled:
* [x] All data from `pg_stat_statements` (if it's installed)
* [x] Table sizes for all relations
* [x] Snapshots of `pg_stat_activity`
* [x] Snapshots of `pg_locks`, taken at the same time as `pg_stat_activity` so data can be correlated.
* [x] Snapshots of `pg_stat_user_tables`, `pg_stat_user_indexes`, `pg_statio_user_tables`, `pg_statio_user_indexes`
2. Collect one-off data about database setup:
* [x] Database version and platform (omnibus, rds, gcp, etc.)
* [x] Number of read-replicas in each cluster
* [x] All settings (`SHOW ALL` output)
* [x] User-specific settings about search paths (visible in `pg_role_db_settings`)
* [x] Settings for tables that have custom stats or autovacuum settings
* [x] A full dump of all schemas (default, partitioned schemas, any unknown schemas on the same database) (so that inconsistencies can be noticed)
* [x] Information about which constraints and indexes are `INVALID` (from `pg_class` and `pg_constraint`)
* [x] [x] Status of batched background migrations
* [x] Collation checking for every database (https://docs.gitlab.com/ee/administration/postgresql/upgrading_os.html#matching-collation-example )
* [x] 4 . documentation (https://docs.gitlab.com/administration/raketasks/maintenance/#collect-information-and-statistics-about-the-database)
* [x] 5\. Release announcements
* [x] 6\. Support docs (https://docs.google.com/document/d/153aP5n0f4KvyoJTIM4pP_LBc_Mx87Dkvb1Y2VsJMnHI/edit?tab=t.0)
<!-- STATUS NOTE START -->
## Status 2025-04-15
:tada: **achievements**
* Finished q&a doc for support explaining db sos data.
_Copied from https://gitlab.com/groups/gitlab-com/-/epics/2456#note_2450978325_
<!-- STATUS NOTE END -->
epic