Background processing is slow. One main impact is that CI is not working properly and jobs are not shown as properly completing because job traces are not available.
Current Status
14:14 - Redis trace chunks is running out of memory, which impacts the availability of job logs. Redis is generally getting saturated, which affects background processing.
14:18 - We’re going to try scaling up Redis trace chunks as a temporary mitigation strategy
14:52 - Promoted to S1 incident
15:06 - Resized redis-tracechunks fleet has now stabilized. Memory utilization is growing, but we have quite a bit of headroom.
15:41 - We have restarted affected pods.
16:03 - Restarting of pods improved the situation. Sidekiq workers started to process jobs from the queue. The queue size is decreasing. We are monitoring the status of the application.
16:31 - We keep monitoring the current status of the application. Background jobs started after the pod restarts should be processed normally. However, older ones are still in the queue. It might take ~20-30 minutes until the queue is empty.
16:46 - The background jobs queue is empty. All stuck jobs should be processed. We're verifying that the system is stable.
17:27 - The incident is resolved.
Summary for CMOC notice / Exec summary:
Customer Impact: Slow background processing and CI not working properly
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally.
By default, all information we can share will be public, in accordance with our transparency value.
Security Note:
If anything abnormal is found during the course of your investigation, please do not hesitate to contact security.
Redis trace chunks is running out of memory. Impact of this is not 100% clear. We’re going to try scaling up Redis trace chunks as a temporary mitigation strategy
This issue now has the CorrectiveActionsNeeded label. This label will be removed automatically when there is at least one related issue that is labeled with corrective action or ~"infradev".
Having a related issue with these labels helps ensure a similar incident doesn't happen again.
If you are certain that this incident doesn't require any corrective actions, add the
CorrectiveActionsNotNeeded label to this issue with a note explaining why.
Thanks for taking part in this incident! It looks like this incident needs an async Incident Review issue; please use the Incident Review link in the incident's description to create one.
We're posting this message because this issue meets the following criteria:
If you are certain that this incident doesn't require an incident review, add the
IncidentReviewNotNeeded label to this issue with a note explaining why.
Bumping to S1, CI is heavily affected since it's reliant on job traces which are stuck in Redis. From a user's point of view CI jobs will appear stuck.
During the redis-tracechunks outage, Sidekiq tried and failed more than 343,000 times to send exceptions to Sentry; each attempt got a 5xx error, which then got logged to stdout/stderr. https://log.gprd.gitlab.net/app/r/s/B9n2w
The engineers who use Sentry don't get much value from these errors being in Sentry. I think we should remove them.
I'm going to mark the incident as resolved. We still need to identify the root cause of the pod saturation, but the incident's impact on customers should be mitigated by now.
This incident was automatically closed because it has the IncidentResolved label.
Note: All incidents are closed automatically when they are resolved, even when there is a pending
review. Please see the Incident Workflow
section on the Incident Management handbook page for more information.
Example of a "bad pod", this time as seen from Prometheus.
The pod starts off processing jobs. After the postgres incident, something changes, and although it continues to run, it no longer processes any jobs, going into a state of senescence, running but not doing any work.
THIS IS INTERESTING: we see no more jobs being completed, BUT the sidekiq_running_jobs metric indicates that the running jobs became "frozen" at the time of entering senescence. After that, nothing changed:
Here's a graph which plots the number of catchall pods which completed zero jobs in the preceding 5 minute period.
What's interesting is that the senescence happened very quickly, in about a 2-minute period starting at 12:58. Given the 5-minute lookback in the graph above, this means the cause started at around 12:53, a few minutes after the initial postgres incident.
It's a strange coincidence, but the senescence appears to have affected pretty close to 50% of pods. This strikes me as odd.
Also interesting is that the total number of catchall pods actually decreased during the incident.
UPDATE: a deployment to sidekiq took place during the incident. The active pods also started roughly around the same time. This is still interesting, but less so.
Every pod that turned senescent was spawned in a short period of time (a few minutes) during the original postgres outage (starting at 12h43Z). Is there something about starting up during this incident that corrupted these pods and led to this situation?
Hypothesis: did the fact that the sidekiq deployment took place during the patroni incident contribute to the outage? It's interesting to note that catchall was the only deployment of sidekiq taking place while the patroni incident was ongoing. Did these initial conditions lead to the pods becoming unresponsive, through bad connection pools or some such?
I'm not quite sure what that means, other than perhaps the Sidekiq threads were mostly tied up?
@stanhu not a single job completed on any of the threads during this time, across dozens of pods. According to metrics there was a heterogeneous mixture of jobs running, so unlikely that a single poisonous payload tied up all workers simultaneously. Feels to me that something else happened.
not a single job completed on any of the threads during this time, across dozens of pods. According to metrics there was a heterogeneous mixture of jobs running, so unlikely that a single poisonous payload tied up all workers simultaneously. Feels to me that something else happened.
This could make sense and lead to the type of failure we saw: connections don't respect the timeout, and hang due to a postgres incident, and then never recover.
Unfortunately, I did not find any logs in that incident either.
However, a few days after that, in an incident that fortunately had logs, I found that a single job can slowly cause multiple pods to run at 100% CPU. #18489 (comment 2095573446). I am surprised that we allow Sidekiq jobs to run forever.
@tkuah one of the things that's quite different this time is the Sidekiq pods that stopped processing jobs in this incident (what I've been referring to as "senescence"). This behaviour had major downstream impacts, such as Redis-Tracechunks filling up and failing.
If you take the graphs from my comment in #18538 (comment 2098953138) and apply them to last week's incident, they don't show this behaviour.
The only time during #18489 (closed) that pods were not processing was during the postgres incident, when the database was physically unavailable. For this incident, the inactivity persisted long past this point and made the incident much worse.
I am surprised that we allow Sidekiq jobs to run forever.
It doesn't appear that the Sidekiq workers were stuck running jobs. Rather, they appear to be stuck between fetching jobs. (Correction: there were 25 stuck jobs, see #18538 (comment 2099286926).)
@stanhu @andrewn I downloaded the logs for this job. Out of the 1031 jobs started, 1006 were completed (either drop, fail, or done). 25 jobs started but did not seem to complete:
@tkuah I think we should definitely review the performance of the HPA as it certainly didn't help the situation. One of the confounding factors was that the HPA didn't kick in because so many of the pods were running at low CPU (see the lower-cpu cohort in graph 2 in this comment #18538 (comment 2098953138)). I know there have been previous discussions around this which I don't have the full details of, but (intuitively) I do think that driving the Sidekiq HPA based on queue backlog would be better than driving it on CPU.
I downloaded the logs for this job. Out of the 1031 jobs started, 1006 were completed (either drop, fail, or done). 25 jobs started but did not seem to complete:
Ok, nice @tkuah! This is helpful. This seems to tie in with the timeout hypothesis #18538 (comment 2099211758) that @stanhu posted. Perhaps these jobs were running when the pg connections "froze".
The fact that they are a mixture of jobs indicates that this probably wasn't a poisoned payload.
I wonder if there is some thread-safety bug that manifests in ruby-pg when a PostgreSQL connection goes away and the client needs to reconnect. ruby-pg makes use of libpq: https://www.postgresql.org/docs/current/libpq-async.html.
If we look in detail at how set_client_encoding works:
pgconn_async_get_last_result() does something similar to pgconn_discard_results(), except it will read the data and return the result.
I suppose there might be a scenario where the sockets are sitting idle for responses that never come, but I'm not seeing it yet.
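For readers less familiar with the async API, below is a minimal sketch of the kind of libpq-style consume/busy loop that ruby-pg drives under the hood. This is simplified and not the gem's actual C implementation, and the connection parameters are illustrative; the point is that the loop blocks on socket readability, so a response that never arrives means waiting forever:

```ruby
require "pg"

# Illustrative connection; parameters are placeholders.
conn = PG.connect(dbname: "gitlabhq_production")

conn.send_query("SELECT 1")

# Pump data off the socket until a full result is available.
while conn.is_busy
  IO.select([conn.socket_io]) # blocks until readable; a lost response means we wait here forever
  conn.consume_input
end

result = conn.get_result
puts result.getvalue(0, 0)
```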
If we remove encoding: unicode from database.yml as the maintainer suggested, we probably wouldn't see the error in set_client_encoding anymore. However, all other database queries also go through the same async I/O path, so I wouldn't expect this to solve this particular issue. But I'm not sure.
I checked that the database.yml has TCP keepalives enabled, and it looks like PgBouncer is configured to use the system TCP keepalive settings.
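As a quick spot-check, the keepalive parameters can be inspected from a Rails console. This is only a sketch: keepalives, keepalives_idle, keepalives_interval, and keepalives_count are standard libpq connection parameters that database.yml can pass through, but the exact keys used on gitlab.com may differ.

```ruby
# Rails console sketch: show whichever libpq TCP keepalive parameters are
# present in the active database configuration (absent keys mean the libpq
# defaults, i.e. the system settings, apply).
cfg = ActiveRecord::Base.connection_db_config.configuration_hash
pp cfg.slice(:keepalives, :keepalives_idle, :keepalives_interval, :keepalives_count)
```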
Perhaps we should try to reproduce this on a test instance by creating a network blackhole between PgBouncer and PostgreSQL?
Of those 10 pods in the logs, only 1 (gitlab-sidekiq-catchall-v2-694fb57bd8-wnlz9) shows the behavior @andrewn highlighted in #18538 (comment 2098953138)
This shows several workers, but the distribution is very much like what we would expect anyway src. I think this means something is happening outside of the execution of a job that breaks the Sidekiq processes.
@tkuah @stanhu yeah, I agree. One thing to point out, which I think may well have had an impact on this incident: the catchall fleet was literally starting to roll out as the Postgres incident started happening. As the sidekiq pods were coming online they were connecting to a less-than-ideal-state postgres cluster. The only deployment that was impacted in this way was catchall, and this may have had an effect on the course of the incident.
I tried reproducing this issue today, to no avail, with one node running Rails (Puma and Sidekiq) and another node running PgBouncer and PostgreSQL. I made sure the TCP keepalive settings matched what I think is on gitlab.com (#4089 (closed)):
I used 20.times { Chaos::DbSpinWorker.perform_async(300, 1) } to launch workers (and set retry: false to prevent the jobs from being retried) that would query the database every second.
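For anyone else trying to recreate this setup, here is a rough stand-in for that kind of worker. This is a hypothetical sketch, not the actual Chaos::DbSpinWorker implementation: it assumes a Rails app with Sidekiq loaded, runs for a given number of seconds, and touches the database once per interval, with retries disabled so failed jobs don't re-enqueue.

```ruby
# Hypothetical stand-in for a DB-spinning chaos worker (not GitLab's
# Chaos::DbSpinWorker). Assumes a Rails app with Sidekiq loaded.
class DbSpinTestWorker
  include Sidekiq::Worker

  sidekiq_options retry: false

  # duration: total seconds to run; interval: seconds between queries
  def perform(duration, interval)
    deadline = Time.now + duration
    while Time.now < deadline
      # One round-trip through the connection pool / PgBouncer / PostgreSQL.
      ActiveRecord::Base.connection.execute("SELECT 1")
      sleep interval
    end
  end
end

# Mirroring the command above:
# 20.times { DbSpinTestWorker.perform_async(300, 1) }
```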
I used iptables rules to drop incoming TCP traffic on the PgBouncer port (6432). Every job failed reliably with something like:
"exception.class":"ActiveRecord::StatementInvalid","exception.message":"PG::ConnectionBad: PQconsumeInput() could not receive data from server: Connection timed out\n","exception.backtrace":["activerecord (7.0.8.4) lib/active_record/connection_adapters/postgresql/database_statements.rb:48:in `exec'","activerecord (7.0.8.4) lib/active_record/connection_adapters/postgresql/database_statements.rb:48:in `block (2 levels) in execute'","activesupport (7.0.8.4) lib/active_support/concurrency/share_lock.rb:187:in `yield_shares'","activesupport (7.0.8.4) lib/active_support/dependencies/interlock.rb:41:in `permit_concurrent_loads'","activerecord (7.0.8.4) lib/active_record/connection_adapters/postgresql/database_statements.rb:47:in `block in execute'",
Bob posted some logs above that appeared to fail around the 375 s mark; I wonder if that is hitting some timeout:
It's interesting that we only got 12 of these hits; I would have expected hundreds more if things were working properly. These errors also have this backtrace, which suggests the load balancing code tried to mark the replica offline and reconnect:
@rehab and I were looking at examples of pods not processing any jobs. The one we looked at was gitlab-sidekiq-catchall-v2-694fb57bd8-lhtsf.
Looking at the full lifecycle of the pod, we see a lot of activity in the first 10 minutes, then a 5 minute gap, then another short burst of activity, followed by 2h40m of inactivity:
This suggests the worker threads were getting slowly tied up between 12:48 and 12:56. There is a wide distribution of job classes, so it's likely not job-specific.
We also picked one of those requeued job ids to see if it got picked up later. Indeed, jid1405b356fbf1806011076481 got picked up about 35 minutes later and finished very quickly:
One discrepancy we found though is that for this particular pod, the CPU utilization was low.
If we compare @andrewn's earlier example pod, gitlab-sidekiq-catchall-v2-694fb57bd8-swgqh against this one gitlab-sidekiq-catchall-v2-694fb57bd8-lhtsf, we see that they both exhibit the same pattern of sidekiq_jobs_completion_count flatlining:
So this raises the question: Is CPU utilization actually a good indicator? It appears that at least some of the pods not getting work done are busy on CPU.
@stanhu Dropping network traffic may not be an accurate simulation of what happened. The patroni host was still online. The kernel networking stack was still working. But several processes were hanging waiting on the block device.
Maybe sending a SIGSTOP to the postgres backend is a closer match?
Dropping network traffic may not be an accurate simulation of what happened. The patroni host was still online. The kernel networking stack was still working. But several processes were hanging waiting on the block device.
@igorwwwwwwwwwwwwwwwwwwww Right, I just wanted to try something relatively straightforward to confirm that the pg gem was able to detect network timeouts like this.
Maybe sending a SIGSTOP to the postgres backend is a closer match?
I will try that too. In my test environment, I just added a PostgreSQL replica, enabled object storage and live CI traces, and tried launching 10 simultaneous jobs. I tried using stress-ng to bog down all the CPUs and disk. So far I haven't gotten things to fail in the same way, but maybe we need to insert random delays in PostgreSQL.
It appears that at least some of the pods not getting work done are busy on CPU.
Interesting, also one thing to note is that once they get into this state, the total number of busy database client connections doesn't seem to change:
# Disconnects all connections in the pool, and clears the pool.
#
# The pool first tries to gain ownership of all connections. If unable to
# do so within a timeout interval (default duration is
# <tt>spec.db_config.checkout_timeout * 2</tt> seconds), then the pool is forcefully
# disconnected without any regard for other connection owning threads.
# Discards all connections in the pool (even if they're currently
# leased!), along with the pool itself. Any further interaction with the
# pool (except #spec and #schema_cache) is undefined.
#
# See AbstractAdapter#discard!
def discard! # :nodoc:
Let's say we have a connection pool of size 25. In a degraded state, it might take at least 50 seconds to do the connection checkout. But if the verify step is slow (e.g. slow logs indicate the SET call took over a second), this might add another 25 seconds. There may be other SET calls that take 1 second each, so this disconnect could take minutes.
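Putting rough numbers on that (the per-connection costs below are assumptions lifted from the estimate above, not measurements):

```ruby
# Back-of-envelope estimate of how long a pool-wide disconnect! could take in a
# degraded state. All per-connection costs are assumptions from the comment
# above, not measured values.
pool_size           = 25
checkout_per_conn_s = 2.0 # assumed time to gain ownership of each connection
verify_per_conn_s   = 1.0 # assumed time for the SET/verification round-trip

checkout_total = pool_size * checkout_per_conn_s # => 50.0
verify_total   = pool_size * verify_per_conn_s   # => 25.0

puts "rough worst case: #{checkout_total + verify_total} seconds"
# Additional slow SET calls during reconnection would push this toward minutes.
```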
This might not be related since we're using PostgreSQL, but it is intriguing that the PostgreSQL maintainer said:
Thanks for the report! I wanted evidence that this wasn't a
ruby-pg-specific problem, so I set up a test case with
Python/psycopg2.
I was able to reproduce a hang when all of the following were true:
psycopg2's async mode was enabled
the client performs a PQconsumeInput/PQisBusy loop, waiting on
socket read events when the connection is busy (I used
psycopg2.extras.wait_select() for this)
the server splits a large message over many large TLS records
the server packs the final ReadyForQuery message into the same
record as the split message's final fragment
I believe we are using TLS, and perhaps these catchall jobs are making large database queries?
The issue on the replica started at 12:43 UTC, a couple of minutes after the deployment started and around the time the Sidekiq pods were getting restarted as part of that deployment (most probably a coincidental time as @igorwwwwwwwwwwwwwwwwwwww suggested). So as long as the bad replica was stuck on iowait, Sidekiq pods thought it was healthy.
Actually, the first minute where we see a Sidekiq pod marking a replica offline is 12:53 UTC, which is roughly the peak of the iowait on the bad node, and 10 minutes after the node started struggling.
I don't know if I'm relying too heavily on the reported times here, but we start seeing a recovery on the replica after the biggest chunk of pods marked it as offline at 12:56 UTC.
@rehab Thanks, that helps me see when the database started to go sideways and when the pods detected an issue. The logs also show that some pods recovered just fine after marking the host offline and then online, but some did not.
@reprazent mentioned gitlab-sidekiq-catchall-v2-694fb57bd8-wnlz9 above. It's interesting that at 12:51:00 we see two Marking host as offline messages logged 2 milliseconds apart from each other:
At 12:55:01.01 there was one PQconsumeInput() ERROR: query_wait_timeout that failed after 376 seconds in Ci::ArchiveTraceWorker with the stack trace above. Again, notice that the stacktrace includes when the host is marked offline. source
If we look at jobs that started for that pod, we can see that after 12:48 new jobs started to taper off:
12:48:00 and on: The jobs attempted to check out a connection from that suspect host.
They tried issuing a SQL query.
Ci::ArchiveTraceWorker and some other job attempted to mark the host as down via pool.disconnect!.
Rails ConnectionPool attempts to acquire all connections so that it can verify them. However, it's possible that it can't get them because all the jobs have already claimed them. Some of them may also be attempting to call pool.disconnect!.
Some ConnectionPool managed to acquire the connections (or timeouts occurred), but then it attempted to verify the connection by sending a SQL command.
That verification failed with a ConnectionBad exception, so ConnectionPool attempted to reconnect with an initial SET client_encoding.
This query timed out, and the job failed.
However, without a thread backtrace, we don't know exactly what happened to the other jobs.
Thinking out loud, I wonder if:
Multiple calls to pool.disconnect! caused some deadlock condition.
Ci::ArchiveTraceWorker attempted to perform a host disconnect, but since ConnectionPool attempted to reconnect all its connections, every other job was waiting for this to finish.
The reconnection attempts got stuck for some reason.
As I said earlier, it does seem to me that ConnectionPool#disconnect! doesn't do exactly what we want here: we don't care about verifying the connections in the pool, because we want them all to go away.
I wonder if we can simulate this issue by having 25 Sidekiq jobs all attempt to perform a SQL query and then force a host disconnect.
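A rough sketch of what such a repro could look like (the worker name is made up, this is not the production load-balancer code, and it assumes that calling disconnect! from inside jobs is a close enough stand-in for what the load balancer does when it marks a host offline):

```ruby
# Hypothetical repro: saturate the pool with jobs that each run a query and
# then force a pool-wide disconnect, to see whether concurrent
# ConnectionPool#disconnect! calls can wedge each other.
class ForcedDisconnectTestWorker
  include Sidekiq::Worker

  sidekiq_options retry: false

  def perform
    # Hold a connection and do some work on it, as a normal job would.
    ActiveRecord::Base.connection.execute("SELECT pg_sleep(1)")

    # Then tear down the whole pool while other jobs still hold connections,
    # approximating what happens when a host is marked as down.
    ActiveRecord::Base.connection_pool.disconnect!
  end
end

# Enqueue enough jobs to cover a pool of 25 connections:
# 25.times { ForcedDisconnectTestWorker.perform_async }
```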
I wasn't able to reproduce a lockup, but I've confirmed that gitlab-org/gitlab!166231 (merged) should eliminate reconnection attempts when the host is marked down. This issue is actually fixed in Rails 7.1 already via https://github.com/rails/rails/pull/44576, so we can remove that workaround once we upgrade.