Root cause: A SELECT query on the primary database was inefficient under load (the query generated 85K IDs in its IN clause), and the worker ended up saturating the Sidekiq queue while trying to process the resulting backlog of jobs.
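For illustration, a minimal sketch of the kind of pattern that can produce such a query. The model and column names are borrowed from the logged SQL later in this issue; the actual application code is not shown here, so treat the shape of this snippet as an assumption:

```ruby
# Hypothetical sketch, not the actual worker code: an unbounded list of IDs is
# collected and then passed into a single IN (...) clause, which performs
# poorly on the primary once the list grows to ~85K values.
policy_ids = ScanResultPolicy.where(project_id: project.id).pluck(:id) # no limit

SoftwareLicensePolicy
  .where(project_id: project.id)
  .where(scan_result_policy_id: policy_ids) # IN (id1, id2, ..., id85000)
  .pluck(:id)
```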
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or other information, as laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally.
By default, all information we can share will be public, in accordance with our transparency value.
Security Note:
If anything abnormal is found during the course of your investigation, please do not hesitate to contact security.
This issue now has the CorrectiveActionsNeeded label. This label will be removed automatically when there is at least one related issue labeled with corrective action or ~"infradev".
Having a related issue with these labels helps ensure a similar incident doesn't happen again.
If you are certain that this incident doesn't require any corrective actions, add the CorrectiveActionsNotNeeded label to this issue with a note explaining why.
Looks like we have a lot of slow queries (exceeding log_min_duration_statement, which is 1s) consisting of lots of numeric ID values, such as:
0000,"duration: 1526.876 ms bind <unnamed>: /*application:sidekiq,correlation_id:adb689ff7e0ebed5a8dafb77da2c6086,jid:0b1f5ff4d84f447e26794032,endpoint_id:Security::ProcessScanResultPolicyWorker,db_config_name:main*/ SELECT ""software_license_policies"".""id"" FROM ""software_license_policies"" WHERE ""software_license_policies"".""project_id"" = [redacted] AND ""software_license_policies"".""scan_result_policy_id"" IN ([redacted], [redacted], [redacted], [redacted], ...
To prevent /var/log from filling up, I temporarily did:
```
gitlabhq_production=# show log_min_duration_statement;
 log_min_duration_statement
----------------------------
 1s
(1 row)

gitlabhq_production=# alter system set log_min_duration_statement = '10s';
ALTER SYSTEM

gitlabhq_production=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
```
UPD: still moving big postgresql.csv log (renamed to postgresql.csv.bak.prodissue17168) to /var/opt/gitlab/postgresql/postgres-log-backup-112223, progress: 82G of 89G
That ^^ didn't help, but @msmiley confirmed with `lsof | grep postgresql.csv.bak.prodissue1716` that it's not Postgres that is still holding the deleted file open and preventing the disk space from being reclaimed:
We're seeing CPU saturation on our Sidekiq shards and a spike from Security::ProcessScanResultPolicyWorker, now up to 1 million jobs in the Sidekiq queue.
Slow queries consistent with the huge number of IDs.
Looking into whether we can isolate the worker onto its own Sidekiq shard.
Currently investigating disk saturation from database logs on the primary.
Ran out of disk space on primary main (DB logs) - Nikolay S, Biren S
Current theory is that a specific Enterprise customer's usage of our policies is causing the spike in the Sidekiq worker Security::ProcessScanResultPolicyWorker - Jamie Reid has reached out to the user, and Alan P and Phil C from Secure will help understand their use case and how it might be attributed to the root cause.
Separating the worker into its own Sidekiq shard by flipping a feature flag to defer the worker - Stephanie Jackson, Matt Smiley
We've disabled the worker from running and are seeing updated metrics. Apdex is growing and Sidekiq saturation is dropping.
Outstanding items:
Still not able to see metrics; SREs to restart the metrics node.
DBREs still looking into database logs taking up disk space.
Secure Engineering: Will need to communicate to the Enterprise customer the impact of disabling this worker and how it affects the enforcement of their policies.
@stejacks-gitlab could we update the incident status please if this is mitigated? Please also provide the remaining tasks, DRI, and ETA for when we can fully resolve. Thank you.
Monitoring for Database logs causing out of disk space (paged at 100%)
Better documentation for how to disable the worker via the FF / chatops in the runbook for SREs.
Secure Engineering to review why this Enterprise customer's activity caused this saturation (e.g. the 85,000 IDs generated in the IN clause of the SELECT query) - actual fix and how to test for this in future. Improving DB queries:
Populate this IN list with an upper bound (RuboCop rule?) - see the sketch after this list.
Move our read-only queries to the replica.
Understand whether the Sidekiq worker metrics, overall or in the `catchall` shard, could have indicated (alerted us to) the system impact given the saturation.
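As a reference for the "upper bound" item above, one hedged sketch of bounding the IN list by processing IDs in batches. `each_batch` is the batching helper used across the GitLab codebase; the batch size and surrounding names are illustrative assumptions, not an agreed design:

```ruby
# Hypothetical sketch: instead of one giant IN (...) list, pluck at most
# 1_000 IDs at a time (the batch size is an assumption, not a decided value).
scan_result_policies.each_batch(of: 1_000) do |batch|
  ids = batch.pluck(:id) # bounded: at most 1_000 IDs per IN clause

  SoftwareLicensePolicy
    .where(project_id: project.id, scan_result_policy_id: ids)
    .pluck(:id)
end
```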
I agree about adding a RuboCop rule to catch whenever we are using `pluck` without setting a limit.
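For reference, a minimal sketch of what such a cop could look like, assuming RuboCop's standard node-pattern API; the cop name, namespace, and message are placeholders, not an existing cop:

```ruby
# Hypothetical cop: flags `pluck` calls whose receiver chain has no `limit`.
module RuboCop
  module Cop
    module Database
      class PluckWithoutLimit < Base
        MSG = 'Call `limit` before `pluck` to avoid loading an unbounded list of IDs.'

        # Match any `something.pluck(...)` call and capture the receiver.
        def_node_matcher :pluck_call, <<~PATTERN
          (send $_ :pluck ...)
        PATTERN

        def on_send(node)
          pluck_call(node) do |receiver|
            add_offense(node) unless limited?(receiver)
          end
        end

        private

        # Walk up the receiver chain looking for a `limit(...)` call.
        def limited?(node)
          return false unless node&.send_type?
          return true if node.method?(:limit)

          limited?(node.receiver)
        end
      end
    end
  end
end
```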
We will need at some point to re-enable the "run_sidekiq_jobs_Security::ProcessScanResultPolicyWorker" feature flag to allow that worker to run once again, but we need code changes to land before we're okay with that. I believe someone is working on that, but Cheryl has more details.
Let the current code run at low concurrency (5 jobs) until the MR above gets merged and reaches production.
Then re-check the job's throughput while still keeping it on the database_throttled shard. It ought to improve significantly just from the query performance improvement.
Consider moving the backlog of jobs to the quarantine shard, but be ready to disable them again via the feature flag (see the sketch below) if the workload on the primary db gets dangerously busy.
If all looks well, move the job back to the catchall shard.
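For the feature-flag steps above, a hedged sketch of toggling that flag from a Rails console using GitLab's `Feature` helpers. The flag name is taken from this issue; whether console or chatops is the right channel should follow the runbook:

```ruby
# Hypothetical console sketch for the mitigation / re-enable steps above.
flag = "run_sidekiq_jobs_Security::ProcessScanResultPolicyWorker"

Feature.disable(flag) # keep the worker's jobs skipped (current mitigation)
# ... once the query fix has been deployed and verified:
Feature.enable(flag)  # let the worker run again
```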
To prevent on-call from getting constantly paged about the database_throttled SLO, I've created 2 new silences and added them to the next steps, to be expired once we fix the queue length:
This is one of the queries that I found running in the primary database:
SELECT "scan_result_policy_violations"."id" FROM "scan_result_policy_violations" WHERE "scan_result_policy_violations"."scan_result_policy_id" IN ( SELECT "scan_result_policies"."id" FROM "scan_result_policies" WHERE "scan_result_policies"."security_orchestration_policy_configuration_id" = xxxx AND "scan_result_policies"."project_id" = xxxx ) ORDER BY "scan_result_policy_violations"."id" ASC, "scan_result_policy_violations"."updated_at" ASC LIMIT 1
We do see json.db_duration_s improvements, but it's hard to say, since the queries were already performing well because the large projects are not updating anything.
Mek Stittri changed title from "2023-11-22: SidekiqServiceSidekiqQueueingApdexSLOViolationSingleShard for urgent-cpu-bound" to "2023-11-22: High database load caused slow and unresponsive merge requests and CI pipelines"
Mek Stittri changed the description
We moved this worker to database_throttled last night, which has a concurrency of 5 and a max of 1 replica, so it's going to take a very long time to go through the backlog of 850k jobs. I just approved and merged gitlab-com/gl-infra/k8s-workloads/gitlab-com!3244 (merged), which moves the worker to quarantine, which has a concurrency of 15 with up to 50 max replicas.