GitLab Database SOS
# What Create a tool that self-managed users can run against their database to collect detailed and standardized troubleshooting information. # Why \~"group::database" often receives RFH requests that involve a lot of database back-and-forth to collect troubleshooting data. Automating and standardizing this data collection will have several benefits: 1. Reduced round-trip data gathering will improve time-to-analysis and thus end-to-end ticket resolution times. 2. A standardized format for data enables writing automated analysis tooling against the gathered data, improving speed of response and empowering support engineers to run that tooling themselves. Eventually this could pull \~"group::database" out of the loop for simple issues entirely. 3. Automated tooling can allow much more data collection. Most manual data provide a single snapshot of the postgres system. With an automated tool, we could collect data over a period of a few minutes and get a much more accurate picture of what's happening in the database. # Implementation Sketch 1. Collect data for a variety of metrics on every database (primary and replica, main, ci, etc), every 5 seconds in a loop, for 5 minutes or until cancelled: * [x] All data from `pg_stat_statements` (if it's installed) * [x] Table sizes for all relations * [x] Snapshots of `pg_stat_activity` * [x] Snapshots of `pg_locks`, taken at the same time as `pg_stat_activity` so data can be correlated. * [x] Snapshots of `pg_stat_user_tables`, `pg_stat_user_indexes`, `pg_statio_user_tables`, `pg_statio_user_indexes` 2. Collect one-off data about database setup: * [x] Database version and platform (omnibus, rds, gcp, etc.) * [x] Number of read-replicas in each cluster * [x] All settings (`SHOW ALL` output) * [x] User-specific settings about search paths (visible in `pg_role_db_settings`) * [x] Settings for tables that have custom stats or autovacuum settings * [x] A full dump of all schemas (default, partitioned schemas, any unknown schemas on the same database) (so that inconsistencies can be noticed) * [x] Information about which constraints and indexes are `INVALID` (from `pg_class` and `pg_constraint`) * [x] [x] Status of batched background migrations * [x] Collation checking for every database (https://docs.gitlab.com/ee/administration/postgresql/upgrading_os.html#matching-collation-example ) * [x] 4 . documentation (https://docs.gitlab.com/administration/raketasks/maintenance/#collect-information-and-statistics-about-the-database) * [x] 5\. Release announcements * [x] 6\. Support docs (https://docs.google.com/document/d/153aP5n0f4KvyoJTIM4pP_LBc_Mx87Dkvb1Y2VsJMnHI/edit?tab=t.0) <!-- STATUS NOTE START --> ## Status 2025-04-15 :tada: **achievements** * Finished q&a doc for support explaining db sos data. _Copied from https://gitlab.com/groups/gitlab-com/-/epics/2456#note_2450978325_ <!-- STATUS NOTE END -->
epic