2022-10-06: Blackbox probes for https://staging.gitlab.com/users/sign_in are failing

The DRI for this incident is the incident issue assignee; see roles and responsibilities.

For the assigned roles when the incident was declared, see the Timelines tab. For timeline feedback, see the dogfooding issue. To save time entering timeline events, use the quick action /timeline.

Current Status

patroni-main-2004-03-db-gstg was the leader of the main DB cluster on GSTG until its /var/opt/gitlab disk filled up at 06:15 UTC due to accumulating wal-g files. The postgresql service then crashed and did not recover until manual intervention.

At 07:42 UTC @f_santos restarted both patroni-main-2004-02-db-gstg (which became the leader) and patroni-main-2004-03-db-gstg (which became a replica). However, 03 kept lagging behind because of the full disk, which wasn't noticed until around 14:00 UTC by @mchacon3.

The incident was finally resolved when @mchacon3, @gsgl, and @bshah11 worked together to free space in pg_wal and truncate /var/log/gitlab/postgresql/postgresql.csv, after which disk usage recovered and 03 stopped lagging behind the leader.

Our main hypothesis for why wal-g files had been accumulating on 03 since Sep 16th is an indeterminate issue with node 05 on staging.

Alert details

  • Blackbox probes for https://staging.gitlab.com/users/sign_in are failing.
  • blackbox probe availability https://staging.gitlab.com/users/sign_in is less than 70.00% for the last 10 minutes.

📚 References and helpful links

Recent Events (available internally only):

  • Deployments ❙ Feature Flag Changes ❙ Gitlab.com Latest Updates
  • Infrastructure Configurations
  • GCP Events (e.g. host failure)

Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:

  • Corrective action ❙ Infradev
  • Infra investigation followup
  • Confidential / Support contact ❙ QA investigation

Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.



Incident Review

This incident came in as a Blackbox alert for the GitLab staging login. Upon investigation in the staging environment, the UI was up but returning 500 errors. Investigation of the Kubernetes cluster showed the web service pods in CrashLoopBackOff across all 3 nodes.
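For future responders, a minimal sketch of how the CrashLoops could be confirmed and the failing pods' logs pulled. The gitlab namespace, app=webservice label, and webservice container name are assumptions based on GitLab Helm chart defaults, not confirmed from this incident:

# List web pods and their restart counts; CrashLoopBackOff shows in the STATUS column
kubectl -n gitlab get pods -l app=webservice

# Logs from the previously crashed container of one of the pods
kubectl -n gitlab logs <pod-name> -c webservice --previous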

From 'Current Status': with the assistance of @f_santos, logs were pulled that indicated a problem with the pods connecting to Patroni:

Oct 06 07:21:07 pgbouncer-03-db-gstg pgbouncer[1972365]: C-0x562c3c48c600: gitlabhq_production/gitlab-consul@127.0.0.1:58096 pooler error: pgbouncer cannot connect to server

patroni-main-2004-03-db-gstg was the leader of the main DB cluster on GSTG until its /var/opt/gitlab disk filled up at 06:15 UTC due to accumulating wal-g files. The postgresql service then crashed and did not recover until manual intervention.
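For reference, a minimal sketch of the checks that expose this state on the node. Paths assume Omnibus GitLab defaults; the PostgreSQL data directory location is an assumption, while the CSV log path comes from this incident:

# Usage of the mount that filled up
df -h /var/opt/gitlab

# Space held by WAL segments and by the PostgreSQL CSV logs
sudo du -sh /var/opt/gitlab/postgresql/data/pg_wal /var/log/gitlab/postgresql

# Age of the oldest WAL segments, to see how long they have been accumulating
sudo ls -ltr /var/opt/gitlab/postgresql/data/pg_wal | head -n 5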

At 07:42 UTC @f_santos restarted both patroni-main-2004-02-db-gstg (which became the leader) and patroni-main-2004-03-db-gstg (which became a replica). However, 03 kept lagging behind because of the full disk, which wasn't noticed until around 14:00 UTC by @mchacon3.
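A sketch of how cluster roles and replication lag can be checked and a member restarted. gitlab-patronictl is assumed to be the wrapper available on the Patroni nodes, and the cluster name "main" passed to restart is an assumption; confirm both from the list output first:

# Show members, their roles (Leader/Replica), state, and replication lag
sudo gitlab-patronictl list

# Restart the Patroni-managed PostgreSQL on a single member
sudo gitlab-patronictl restart main patroni-main-2004-03-db-gstg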

The incident was finally resolved when @mchacon3, @gsgl, and @bshah11 worked together to free space in pg_wal and truncate /var/log/gitlab/postgresql/postgresql.csv, after which disk usage recovered and 03 stopped lagging behind the leader.
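A hedged sketch of one way to reclaim the space. The truncate target comes from this incident; the pg_archivecleanup invocation and the WAL segment name are illustrative placeholders, and WAL segments should only be removed by hand when the instance is down and will be re-synced as a replica:

# Truncate the oversized CSV log in place
sudo truncate -s 0 /var/log/gitlab/postgresql/postgresql.csv

# If WAL segments must be removed manually, pg_archivecleanup (which keeps the
# named segment and everything newer) is less error-prone than rm; the segment
# name below is a placeholder, normally taken from pg_controldata's latest
# checkpoint REDO WAL file
sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_archivecleanup \
  /var/opt/gitlab/postgresql/data/pg_wal 000000010000A3E800000042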

  • Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated and relevant graphs are included in the summary
  • If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Action" section
  • Fill out relevant sections below or link to the meeting review notes that cover these topics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. Internal customers (developers, etc.). This also blocked deploys to production.
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. Unable to log into staging.gitlab.com, unable to deploy updates to staging.
  3. How many customers were affected?
    1. ...
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. ...

What were the root causes?

  • ...

Incident Response Analysis

  1. How was the incident detected?
    1. Blackbox alert for the inability to hit the staging.gitlab.com/users/sign_in endpoint
  2. How could detection time be improved?
We could enable alerting on staging Patroni instances for impending full disks (see the query sketch after this list).
  3. How was the root cause diagnosed?
    1. DBREs noticed the full disk on the patroni node and cleared it up.
  4. How could time to diagnosis be improved?
Time to diagnosis could be improved by adding disk-usage alerts on staging Patroni nodes.
  5. How did we reach the point where we knew how to mitigate the impact?
    1. ...
  6. How could time to mitigation be improved?
    1. ...
  7. What went well?
    1. ...
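As an illustration of the detection improvement mentioned above, this is roughly the condition a disk-usage alert would encode, checked by hand against the Prometheus API. The Prometheus URL, the fqdn label matcher, the mountpoint, and the 10% threshold are all illustrative assumptions:

# Fire when less than 10% of /var/opt/gitlab is free on the staging Patroni nodes
curl -sG 'http://prometheus.gstg.example.internal/api/v1/query' \
  --data-urlencode 'query=(
    node_filesystem_avail_bytes{fqdn=~"patroni-main.*db-gstg.*", mountpoint="/var/opt/gitlab"}
  / node_filesystem_size_bytes{fqdn=~"patroni-main.*db-gstg.*", mountpoint="/var/opt/gitlab"}
  ) < 0.10'

In practice this would live as a Prometheus alerting rule rather than an ad-hoc query.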

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. ...
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    1. ...
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. ...

What went well?

  • ...

Guidelines

  • Blameless RCA Guideline

Resources

  1. If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)