Root cause: A SELECT query on the primary database was inefficient under load (the query generated 85K IDs in its IN clause), and the worker ended up saturating the Sidekiq queue while trying to process the resulting backlog of jobs.
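For illustration, a minimal sketch of the kind of pattern that can produce such a query. The model and column names are borrowed from the logged SQL later in this issue; the actual application code is not shown here, so treat the shape of this snippet as an assumption:

```ruby
# Hypothetical sketch, not the actual worker code: an unbounded list of IDs is
# collected and then passed into a single IN (...) clause, which performs
# poorly on the primary once the list grows to ~85K values.
policy_ids = ScanResultPolicy.where(project_id: project.id).pluck(:id) # no limit

SoftwareLicensePolicy
  .where(project_id: project.id)
  .where(scan_result_policy_id: policy_ids) # IN (id1, id2, ..., id85000)
  .pluck(:id)
```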
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or other information, as laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally.
By default, all information we can share will be public, in accordance with our transparency value.
Security Note:
If anything abnormal is found during the course of your investigation, please do not hesitate to contact security.
This issue now has the CorrectiveActionsNeeded label. This label will be removed automatically when there is at least one related issue labeled with corrective action or ~"infradev".
Having a related issue with these labels helps ensure a similar incident doesn't happen again.
If you are certain that this incident doesn't require any corrective actions, add the CorrectiveActionsNotNeeded label to this issue with a note explaining why.
Looks like we have a lot of slow queries (exceeding log_min_duration_statement, which is 1s) consisting of lots of numeric ID values, such as:
0000,"duration: 1526.876 ms bind <unnamed>: /*application:sidekiq,correlation_id:adb689ff7e0ebed5a8dafb77da2c6086,jid:0b1f5ff4d84f447e26794032,endpoint_id:Security::ProcessScanResultPolicyWorker,db_config_name:main*/ SELECT ""software_license_policies"".""id"" FROM ""software_license_policies"" WHERE ""software_license_policies"".""project_id"" = [redacted] AND ""software_license_policies"".""scan_result_policy_id"" IN ([redacted], [redacted], [redacted], [redacted], ...
To prevent /var/log from filling up, I temporarily did:
```
gitlabhq_production=# show log_min_duration_statement;
 log_min_duration_statement
----------------------------
 1s
(1 row)

gitlabhq_production=# alter system set log_min_duration_statement = '10s';
ALTER SYSTEM

gitlabhq_production=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
```
UPD: still moving big postgresql.csv log (renamed to postgresql.csv.bak.prodissue17168) to /var/opt/gitlab/postgresql/postgres-log-backup-112223, progress: 82G of 89G
That ^^ didn't help, but @msmiley confirmed with `lsof | grep postgresql.csv.bak.prodissue1716` that it's not Postgres that is still holding the deleted file open and preventing the disk space from being reclaimed:
We're seeing CPU saturation on our Sidekiq shards and a spike from Security::ProcessScanResultPolicyWorker, now up to 1 million jobs in the Sidekiq queue.
Slow queries consistent with the huge number of IDs.
Looking into whether we can isolate the worker onto its own Sidekiq shard.
Currently investigating disk saturation from database logs on the primary.
Ran out of disk space on primary main (DB logs) - Nikolay S, Biren S
Current theory is that a specific Enterprise customer's usage of our policies is causing the spike in the Sidekiq worker Security::ProcessScanResultPolicyWorker - Jamie Reid has reached out to the user, and Alan P and Phil C from Secure will help understand their use case and how it might be attributed to the root cause.
Separating the worker into its own Sidekiq shard by flipping a feature flag to defer the worker - Stephanie Jackson, Matt Smiley
We've disabled the worker from running and are seeing updated metrics. Apdex is growing and Sidekiq saturation is dropping.
Outstanding items:
Still not able to see metrics; SREs to restart the metrics node.
DBREs still looking into database logs taking up disk space.
Secure Engineering: Will need to communicate to the Enterprise customer the impact of disabling this worker and how it affects the enforcement of their policies.
@stejacks-gitlab could we update the incident status please if this is mitigated? Please also provide the remaining tasks, DRI, and ETA for when we can fully resolve. Thank you.
Monitoring for Database logs causing out of disk space (paged at 100%)
Better documentation for how to disable the worker via the FF / chatops in the runbook for SREs.
Secure Engineering to review why this Enterprise customer's activity caused this saturation (e.g. the 85,000 IDs generated in the IN clause of the SELECT query) - actual fix and how to test for this in future. Improving DB queries:
Populate this IN list with an upper bound (RuboCop rule?) - see the sketch after this list.
Move our read-only queries to the replica.
Understand whether the Sidekiq worker metrics, overall or in the `catchall` shard, could have indicated (alerted us to) the system impact given the saturation.
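As a reference for the "upper bound" item above, one hedged sketch of bounding the IN list by processing IDs in batches. `each_batch` is the batching helper used across the GitLab codebase; the batch size and surrounding names are illustrative assumptions, not an agreed design:

```ruby
# Hypothetical sketch: instead of one giant IN (...) list, pluck at most
# 1_000 IDs at a time (the batch size is an assumption, not a decided value).
scan_result_policies.each_batch(of: 1_000) do |batch|
  ids = batch.pluck(:id) # bounded: at most 1_000 IDs per IN clause

  SoftwareLicensePolicy
    .where(project_id: project.id, scan_result_policy_id: ids)
    .pluck(:id)
end
```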
I agree about adding a RuboCop rule to catch whenever we are using `pluck` without setting a limit.
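For reference, a minimal sketch of what such a cop could look like, assuming RuboCop's standard node-pattern API; the cop name, namespace, and message are placeholders, not an existing cop:

```ruby
# Hypothetical cop: flags `pluck` calls whose receiver chain has no `limit`.
module RuboCop
  module Cop
    module Database
      class PluckWithoutLimit < Base
        MSG = 'Call `limit` before `pluck` to avoid loading an unbounded list of IDs.'

        # Match any `something.pluck(...)` call and capture the receiver.
        def_node_matcher :pluck_call, <<~PATTERN
          (send $_ :pluck ...)
        PATTERN

        def on_send(node)
          pluck_call(node) do |receiver|
            add_offense(node) unless limited?(receiver)
          end
        end

        private

        # Walk up the receiver chain looking for a `limit(...)` call.
        def limited?(node)
          return false unless node&.send_type?
          return true if node.method?(:limit)

          limited?(node.receiver)
        end
      end
    end
  end
end
```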
We will need at some point to re-enable the "run_sidekiq_jobs_Security::ProcessScanResultPolicyWorker" feature flag to allow that worker to run once again, but we need code changes to land before we're okay with that. I believe someone is working on that, but Cheryl has more details.
Let the current code run at low concurrency (5 jobs) until the MR above gets merged and reaches production.
Then re-check the job's throughput while still keeping it on the database_throttled shard. It ought to improve significantly just from the query performance improvement.
Consider moving the backlog of jobs to the quarantine shard, but be ready to disable them again via the feature flag (see the sketch below) if the workload on the primary db gets dangerously busy.
If all looks well, move the job back to the catchall shard.
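For the feature-flag steps above, a hedged sketch of toggling that flag from a Rails console using GitLab's `Feature` helpers. The flag name is taken from this issue; whether console or chatops is the right channel should follow the runbook:

```ruby
# Hypothetical console sketch for the mitigation / re-enable steps above.
flag = "run_sidekiq_jobs_Security::ProcessScanResultPolicyWorker"

Feature.disable(flag) # keep the worker's jobs skipped (current mitigation)
# ... once the query fix has been deployed and verified:
Feature.enable(flag)  # let the worker run again
```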
To prevent on-call from getting constantly paged about the database_throttled SLO, I've created 2 new silences and added them to the next steps, to be expired once we fix the queue length:
This is one of the queries that I found running in the primary database:
SELECT "scan_result_policy_violations"."id" FROM "scan_result_policy_violations" WHERE "scan_result_policy_violations"."scan_result_policy_id" IN ( SELECT "scan_result_policies"."id" FROM "scan_result_policies" WHERE "scan_result_policies"."security_orchestration_policy_configuration_id" = xxxx AND "scan_result_policies"."project_id" = xxxx ) ORDER BY "scan_result_policy_violations"."id" ASC, "scan_result_policy_violations"."updated_at" ASC LIMIT 1
We do see json.db_duration_s improvements, but it's hard to say, since the queries were already performing well because the large projects are not updating anything.
Mek Stittri changed title from "2023-11-22: SidekiqServiceSidekiqQueueingApdexSLOViolationSingleShard for urgent-cpu-bound" to "2023-11-22: High database load caused slow and unresponsive merge requests and CI pipelines"
Mek Stittri changed the description
We moved this worker to database_throttled last night, which has a concurrency of 5 and a max of 1 replica, so it's going to take a very long time to go through the backlog of 850k jobs. I just approved and merged gitlab-com/gl-infra/k8s-workloads/gitlab-com!3244 (merged), which moves the worker to quarantine, which has a concurrency of 15 with up to 50 max replicas.