2019-11-28 GitLab.com down
Summary
GitLab.com is up
A change intended to roll out iptables rules to other, non-gitlab.com hosts was inadvertently applied to the database hosts. The resulting host firewall rules caused all web and API hosts to lose connectivity to the database. The change has been rolled back and we are now restarting host processes.
Timeline
All times UTC.
- 11:18 Status.io posted
- 11:25 Andrew discovers PgBouncer connections are unavailable from the Rails console
- 11:29 Jose Finitto confirms patroni-06 is the leader
- 11:30 Anthony Sandoval pages OnGres support
- 11:42 Álvaro Hernández confirms that there is no TCP connectivity from the pgbouncer nodes to the Postgres master node; port 5432 is filtered
- 11:45 Craig Furman identifies an applied Chef change that added iptables filters: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/2237/diffs
- 11:48 Change rolled back manually on the pg nodes. Unicorns not coming up
- 11:49 Álvaro Hernández verifies pgbouncers can now connect to the PG master
- 11:54 Jarv HUPs web01 in an attempt to reestablish connectivity to the database.
- 12:02 Closed the front door - all nodes in maint mode in LBs
- 12:07 Readiness checks flapping.
- 12:11 Jarv Stopped HAProxy to
- 12:22 SIGKILL sent to applications
- 12:25 HAProxy restarted, “front door opened”
- 12:27 Anthony hands off IMOC to Dave Smith
- 12:29 Web is still not healthy
- 12:35 Looking into latency problems on redis-cache
- 12:36 Flushing iptables on patroni-11 and patroni-12
- 12:39 After a HUP on unicorns, things appear to be recovering
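
The 11:42 finding (port 5432 "filtered" rather than refused) can be reproduced with a small TCP probe. The following is a minimal sketch, not the tooling used during the incident; the function name probe_tcp and the host name in the usage note are hypothetical. It treats a connection timeout as "filtered", which is how a silently dropped packet typically looks to a client.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP port as 'open', 'refused', or 'filtered'.

    'filtered' here means the connection attempt timed out, which is the
    usual symptom of a firewall rule that silently drops packets.
    """
    try:
        # create_connection performs the full TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or a rejecting firewall) answered immediately.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all within the timeout.
        return "filtered"
    except OSError as exc:
        # Anything else, e.g. no route to host or name resolution failure.
        return f"error: {exc}"
```

For example, probe_tcp("patroni-06.example.internal", 5432) run from a pgbouncer node (hypothetical host name) would show whether the Postgres port is reachable before and after a firewall change.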
Corrective actions noted
- We should move ops.gitlab.net to use a non-gitlab.com registry - pipelines could not run because they were trying to pull from the prod registry
- For unicorn restarts, given the blackout period, we should just do a stop/start instead of a reload
- We should implement a quick way to put all the nodes in maint mode in HAProxy to “close the front door”
  - Or just close at the Google LB?
- Install psql on the web hosts so database connectivity can be checked from them
- Set up a 400/500 page in Cloudflare to display a response when HAProxy is down.
- We need to investigate why losing just 2 patroni nodes affected us
- iptables rules shouldn’t DROP, we should REJECT, so that blocked connections fail immediately instead of hanging until they time out
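
The "close the front door" action above could be scripted against the HAProxy Runtime API. This is a rough sketch under stated assumptions, not the team's actual tooling: the socket path /run/haproxy/admin.sock, the helper names maint_commands and send_command, and the backend/server names in the usage note are all hypothetical, and the stats socket is assumed to be configured with admin level. The "set server <backend>/<server> state maint" command itself is a standard Runtime API command.

```python
import socket


def maint_commands(servers, state="maint"):
    """Build one Runtime API command per (backend, server) pair."""
    if state not in ("maint", "ready", "drain"):
        raise ValueError(f"unsupported server state: {state}")
    return [
        f"set server {backend}/{server} state {state}"
        for backend, server in servers
    ]


def send_command(command, socket_path="/run/haproxy/admin.sock"):
    """Send a single command to the HAProxy admin socket and return the reply.

    socket_path is an assumption; use the path from the 'stats socket'
    line of the local HAProxy configuration.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode() + b"\n")
        chunks = []
        # In non-interactive mode HAProxy closes the connection after
        # answering one command, so read until end of stream.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()
```

Closing the front door would then be a loop such as: for cmd in maint_commands([("web", "web-01"), ("web", "web-02")]): send_command(cmd). Passing state="ready" builds the commands that reopen it.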