[meta] Listing all issues related to Jan 31st outage to track their progress

Tracking all the issues that were spawned from or referenced in the Jan 31st outage blog post-mortem: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

Mostly doing this for my own benefit, but feel free to update or recommend a better way of tracking this.

Issues and their updates:

  • 🔶 Removal of users by spam should not hard delete https://gitlab.com/gitlab-org/gitlab-ce/issues/27581
  • ✅ Update PS1 across all hosts to more clearly differentiate between hosts and environments (#1094 (moved))
  • ✅ Prometheus monitoring for backups (#1095 (closed))
    • Backups have since moved to zero-loss continuous streaming to an S3 bucket and Azure blob storage using WAL-E (#1152 (closed))
    • WAL-E was implemented per #1152 (closed), but no monitoring is available yet, so #1095 (closed) remains open.
  • ✅ Set PostgreSQL's max_connections to a sane value (#1096 (moved))
  • ✅ Investigate Point in time recovery & continuous archiving for PostgreSQL (#1097 (closed))
    • Closed based on the comment at https://gitlab.com/gitlab-com/infrastructure/issues/494#note_23009747 (decision to use WAL-E instead of PITR).
  • ✅ Hourly LVM snapshots of the production databases (#1098 (moved))
  • ✅ Azure disk snapshots of production databases (#1099 (moved))
    • Superseded by "Fix Azure snapshots" (#1606 (closed))
    • #1606 (closed) now depends on "Convert GitLab ARM Hosts to 'Managed Disk'" (#1649 (closed))
  • ✅ Move staging to the ARM environment (#1100 (moved))
  • ✅ Recover production replica(s) (#1101 (closed))
  • 🔶 Automated testing of recovering PostgreSQL database backups (#1102 (closed))
    • Superseded by "Automate restoring a database with Wal-E" (#1265 (moved))
  • ✅ Improve PostgreSQL replication documentation/runbooks (#1103 (closed))
  • ✅ Investigate pgbarman for creating PostgreSQL backups (#1105 (closed))
    • Closed following the decision to use WAL-E (https://gitlab.com/gitlab-com/infrastructure/issues/494#note_23009747)
  • ✅ Investigate using WAL-E as a means of Database Backup and Realtime Replication (#494 (closed))
  • ✅ Build Streaming Database Backup (#1152 (closed))
  • ✅ Assign an owner for data durability (#1163 (closed))
  • ✅ Merge Request: Bundle pgpool-II 3.6.1 (gitlab-org/omnibus-gitlab!1251 (closed))
    • Closed in favor of "Adding pgbouncer as EE specific dependency" (gitlab-org/omnibus-gitlab!1345 (merged))
  • ✅ Connection pooling/load balancing for PostgreSQL (#259 (closed))
    • Superseded by setting up pgbouncer (#1440 (closed))
  • 🔶 Tool for executing and reverting Rails migrations on staging (#811 (closed))
    • Lower priority than other tasks, and may be superseded by #1504 (closed).
  • 🔶 Disaster recovery for everything that is not the database (#1161 (closed))
    • This is itself a meta issue, with various linked issues that may still take quite a while.
Edited Jul 18, 2017 by Ernst van Nierop