[meta] Listing all issues related to Jan 31st outage to track their progress
Tracking all the issues that were spawned from or referenced in the Jan 31st outage blog post-mortem: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
Mostly doing this for my own benefit, but feel free to update or recommend a better way of tracking this.
Issues and their updates:
- 🔶 Removal of users by spam should not hard delete: https://gitlab.com/gitlab-org/gitlab-ce/issues/27581
- ✅ Update PS1 across all hosts to more clearly differentiate between hosts and environments (#1094 (moved))
- ✅ Prometheus monitoring for backups (#1095 (closed))
  - Moved to zero-loss continuous streaming to an S3 bucket and Azure blob using WAL-E (#1152 (closed)).
  - WAL-E implemented per #1152 (closed), but no monitoring is available yet, so #1095 (closed) remains open.
- ✅ Set PostgreSQL's max_connections to a sane value (#1096 (moved))
- ✅ Investigate Point in time recovery & continuous archiving for PostgreSQL (#1097 (closed))
  - Closed based on this comment (using WAL-E instead of PITR): https://gitlab.com/gitlab-com/infrastructure/issues/494#note_23009747
- ✅ Hourly LVM snapshots of the production databases (#1098 (moved))
- ✅ Azure disk snapshots of production databases (#1099 (moved))
  - Superseded by "Fix Azure snapshots" (#1606 (closed))
  - #1606 (closed) now dependent on "Convert GitLab ARM Hosts to "Managed Disk"" (#1649 (closed))
- ✅ Move staging to the ARM environment (#1100 (moved))
- ✅ Recover production replica(s) (#1101 (closed))
- 🔶 Automated testing of recovering PostgreSQL database backups (#1102 (closed))
  - Superseded by "Automate restoring a database with Wal-E" (#1265 (moved))
- ✅ Improve PostgreSQL replication documentation/runbooks (#1103 (closed))
- ✅ Investigate pgbarman for creating PostgreSQL backups (#1105 (closed))
  - Closed by decision to use WAL-E (https://gitlab.com/gitlab-com/infrastructure/issues/494#note_23009747)
- ✅ Investigate using WAL-E as a means of Database Backup and Realtime Replication (#494 (closed))
- ✅ Build Streaming Database Backup (#1152 (closed))
- ✅ Assign an owner for data durability (#1163 (closed))
- ✅ Merge Request: Bundle pgpool-II 3.6.1 (gitlab-org/omnibus-gitlab!1251 (closed))
  - Closed in favor of "Adding pgbouncer as EE specific dependency" (gitlab-org/omnibus-gitlab!1345 (merged))
- ✅ Connection pooling/load balancing for PostgreSQL (#259 (closed))
  - Superseded by setting up pgbouncer (#1440 (closed))
- 🔶 Tool for executing and reverting Rails migrations on staging (#811 (closed))
  - Lower priority than other tasks, and may be superseded by #1504 (closed).
- 🔶 Disaster recovery for everything that is not the database (#1161 (closed))
  - This is itself a meta issue, with various linked issues that may still take quite a while.
Edited by Ernst van Nierop