Background processing is slow. One main impact is that CI is not working properly and jobs are not shown as properly completing because job traces are not available.
Current Status
14:14 - Redis trace chunks is running out of memory, which impacts the availability of job logs. Redis is generally getting saturated, which affects background processing.
14:18 - We’re going to try scaling up Redis trace chunks as a temporary mitigation strategy
14:52 - Promoted to S1 incident
15:06 - Resized redis-tracechunks fleet has now stabilized. Memory utilization is growing, but we have quite a bit of headroom.
15:41 - We have restarted affected pods.
16:03 - Restarting of pods improved the situation. Sidekiq workers started to process jobs from the queue. The queue size is decreasing. We are monitoring the status of the application.
16:31 - We keep monitoring the current status of the application. Background jobs started after the pod restarts should be processed normally. However, older ones are still in the queue. It might take ~20-30 minutes until the queue is empty.
16:46 - The background jobs queue is empty. All stuck jobs should be processed. We're verifying that the system is stable.
17:27 - The incident is resolved.
Summary for CMOC notice / Exec summary:
Customer Impact: Slow background processing and CI not working properly
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally.
By default, all information we can share will be public, in accordance with our transparency value.
Security Note:
If anything abnormal is found during the course of your investigation, please do not hesitate to contact security.
Redis trace chunks is running out of memory. Impact of this is not 100% clear. We’re going to try scaling up Redis trace chunks as a temporary mitigation strategy
This issue now has the CorrectiveActionsNeeded label. This label will be removed automatically when there is at least one related issue that is labeled with corrective action or ~"infradev".
Having a related issue with these labels helps ensure a similar incident doesn't happen again.
If you are certain that this incident doesn't require any corrective actions, add the
CorrectiveActionsNotNeeded label to this issue with a note explaining why.
Thanks for taking part in this incident! It looks like this incident needs an async Incident Review issue; please use the Incident Review link in the incident's description to create one.
We're posting this message because this issue meets the following criteria:
If you are certain that this incident doesn't require an incident review, add the
IncidentReviewNotNeeded label to this issue with a note explaining why.
Bumping to S1, CI is heavily affected since it's reliant on job traces which are stuck in Redis. From a user's point of view CI jobs will appear stuck.
During the redis-tracechunks outage, Sidekiq tried and failed more than 343,000 times to send exceptions to Sentry; each attempt got a 5xx error, which then got logged to stdout/stderr. https://log.gprd.gitlab.net/app/r/s/B9n2w
The engineers who use Sentry don't get much value from these errors being in Sentry. I think we should remove them.
I'm going to mark the incident as resolved. We still need to identify the root cause of the pod saturation, but the incident's impact on customers should be mitigated by now.
This incident was automatically closed because it has the IncidentResolved label.
Note: All incidents are closed automatically when they are resolved, even when there is a pending
review. Please see the Incident Workflow
section on the Incident Management handbook page for more information.
Example of a "bad pod", this time as seen from Prometheus.
The pod starts off processing jobs. After the postgres incident, something changes, and although it continues to run, it no longer processes any jobs, going into a state of senescence, running but not doing any work.
THIS IS INTERESTING: we see no more jobs being completed, BUT the sidekiq_running_jobs metric indicates that the running jobs became "frozen" at the time of entering senescence. After that, nothing changed:
Here's a graph which plots the number of catchall pods which completed zero jobs in the preceding 5 minute period.
What's interesting is that the senescence happened very quickly, in about a 2-minute period starting at 12:58. Given the 5-minute lookback in the graph above, this means the cause started at around 12:53, a few minutes after the initial postgres incident.
It's a strange coincidence, but the senescence appears to have affected pretty close to 50% of pods. This strikes me as odd.
Also interesting is that the total number of catchall pods actually decreased during the incident.
UPDATE: a deployment to sidekiq took place during the incident. The active pods also started roughly around the same time. This is still interesting, but less so.
Every pod that turned senescent was spawned in a short period of time (a few minutes) during the original postgres outage (starting at 12h43Z). Is there something about starting up during this incident that corrupted these pods and led to this situation?
Hypothesis: did the fact that the sidekiq deployment took place during the patroni incident contribute to the outage? It's interesting to note that catchall was the only deployment of sidekiq taking place while the patroni incident was ongoing. Did these initial conditions lead to the pods becoming unresponsive, through bad connection pools or some such?
I'm not quite sure what that means, other than perhaps the Sidekiq threads were mostly tied up?
@stanhu not a single job completed on any of the threads during this time, across dozens of pods. According to metrics there was a heterogeneous mixture of jobs running, so unlikely that a single poisonous payload tied up all workers simultaneously. Feels to me that something else happened.
not a single job completed on any of the threads during this time, across dozens of pods. According to metrics there was a heterogeneous mixture of jobs running, so unlikely that a single poisonous payload tied up all workers simultaneously. Feels to me that something else happened.
This could make sense and lead to the type of failure we saw: connections don't respect the timeout, and hang due to a postgres incident, and then never recover.
Unfortunately, I did not find any logs in that incident either.
However, a few days after that, in an incident that fortunately had logs, I found that a single job can slowly cause multiple pods to run at 100% CPU. #18489 (comment 2095573446). I am surprised that we allow Sidekiq jobs to run forever.
@tkuah one of the things that's quite different this time is the Sidekiq pods that stopped processing jobs in this incident (what I've been referring to as "senescence"). This behaviour had major downstream impacts, such as Redis-Tracechunks filling up and failing.
If you take the graphs from my comment in #18538 (comment 2098953138) and apply them to last week's incident, they don't show this behaviour.
The only time during #18489 (closed) that pods were not processing was during the postgres incident, when the database was physically unavailable. For this incident, the inactivity persisted long past this point and made the incident much worse.
I am surprised that we allow Sidekiq jobs to run forever.
It doesn't appear that the Sidekiq workers were stuck running jobs. Rather, they appear to be stuck between fetching jobs. (Correction: there were 25 stuck jobs, see #18538 (comment 2099286926).)
@stanhu @andrewn I downloaded the logs for this job. Out of the 1031 jobs started, 1006 were completed (either drop, fail, or done). 25 jobs started but did not seem to complete:
@tkuah I think we should definitely review the performance of the HPA as it certainly didn't help the situation. One of the confounding factors was that the HPA didn't kick in because so many of the pods were running at low CPU (see the lower-cpu cohort in graph 2 in this comment #18538 (comment 2098953138)). I know there have been previous discussions around this which I don't have the full details of, but (intuitively) I do think that driving the Sidekiq HPA based on queue backlog would be better than driving it on CPU.
I downloaded the logs for this job. Out of the 1031 jobs started, 1006 were completed (either drop, fail, or done). 25 jobs started but did not seem to complete:
Ok, nice @tkuah! This is helpful. This seems to tie in with the timeout hypothesis #18538 (comment 2099211758) that @stanhu posted. Perhaps these jobs were running when the pg connections "froze".
The fact that they are a mixture of jobs indicates that this probably wasn't a poisoned payload.
I wonder if there is some thread-safety bug that manifests in ruby-pg when a PostgreSQL connection goes away and the client needs to reconnect. ruby-pg makes use of libpq: https://www.postgresql.org/docs/current/libpq-async.html.
If we look in detail at how set_client_encoding works:
pgconn_async_get_last_result() does something similar to pgconn_discard_results(), except it will read the data and return the result.
I suppose there might be a scenario where the sockets are sitting idle for responses that never come, but I'm not seeing it yet.
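For readers less familiar with the async API, below is a minimal sketch of the kind of libpq-style consume/busy loop that ruby-pg drives under the hood. This is simplified and not the gem's actual C implementation, and the connection parameters are illustrative; the point is that the loop blocks on socket readability, so a response that never arrives means waiting forever:

```ruby
require "pg"

# Illustrative connection; parameters are placeholders.
conn = PG.connect(dbname: "gitlabhq_production")

conn.send_query("SELECT 1")

# Pump data off the socket until a full result is available.
while conn.is_busy
  IO.select([conn.socket_io]) # blocks until readable; a lost response means we wait here forever
  conn.consume_input
end

result = conn.get_result
puts result.getvalue(0, 0)
```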
If we remove encoding: unicode from database.yml as the maintainer suggested, we probably wouldn't see the error in set_client_encoding anymore. However, all other database queries also go through the same async I/O path, so I wouldn't expect this to solve this particular issue. But I'm not sure.
I checked that the database.yml has TCP keepalives enabled, and it looks like PgBouncer is configured to use the system TCP keepalive settings.
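As a quick spot-check, the keepalive parameters can be inspected from a Rails console. This is only a sketch: keepalives, keepalives_idle, keepalives_interval, and keepalives_count are standard libpq connection parameters that database.yml can pass through, but the exact keys used on gitlab.com may differ.

```ruby
# Rails console sketch: show whichever libpq TCP keepalive parameters are
# present in the active database configuration (absent keys mean the libpq
# defaults, i.e. the system settings, apply).
cfg = ActiveRecord::Base.connection_db_config.configuration_hash
pp cfg.slice(:keepalives, :keepalives_idle, :keepalives_interval, :keepalives_count)
```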
Perhaps we should try to reproduce this on a test instance by creating a network blackhole between PgBouncer and PostgreSQL?
Of those 10 pods in the logs, only 1 (gitlab-sidekiq-catchall-v2-694fb57bd8-wnlz9) shows the behavior @andrewn highlighted in #18538 (comment 2098953138)
This shows several workers, but the distribution is very much like what we would expect anyway src. I think this means something is happening outside of the execution of a job that breaks the Sidekiq processes.
@tkuah @stanhu yeah, I agree. One thing to point out, which I think may well have had an impact on this incident: the catchall fleet was literally starting to roll out as the Postgres incident started happening. As the sidekiq pods were coming online they were connecting to a less-than-ideal-state postgres cluster. The only deployment that was impacted in this way was catchall, and this may have had an effect on the course of the incident.
I tried reproducing this issue today, to no avail, with one node running Rails (Puma and Sidekiq) and another node running PgBouncer and PostgreSQL. I made sure the TCP keepalive settings matched what I think is on gitlab.com (#4089 (closed)):
I used 20.times { Chaos::DbSpinWorker.perform_async(300, 1) } to launch workers (and set retry: false to prevent the jobs from being retried) that would query the database every second.
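For anyone else trying to recreate this setup, here is a rough stand-in for that kind of worker. This is a hypothetical sketch, not the actual Chaos::DbSpinWorker implementation: it assumes a Rails app with Sidekiq loaded, runs for a given number of seconds, and touches the database once per interval, with retries disabled so failed jobs don't re-enqueue.

```ruby
# Hypothetical stand-in for a DB-spinning chaos worker (not GitLab's
# Chaos::DbSpinWorker). Assumes a Rails app with Sidekiq loaded.
class DbSpinTestWorker
  include Sidekiq::Worker

  sidekiq_options retry: false

  # duration: total seconds to run; interval: seconds between queries
  def perform(duration, interval)
    deadline = Time.now + duration
    while Time.now < deadline
      # One round-trip through the connection pool / PgBouncer / PostgreSQL.
      ActiveRecord::Base.connection.execute("SELECT 1")
      sleep interval
    end
  end
end

# Mirroring the command above:
# 20.times { DbSpinTestWorker.perform_async(300, 1) }
```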
I used iptables rules to drop incoming TCP traffic on the PgBouncer port (6432). Every job failed reliably with something like:
"exception.class":"ActiveRecord::StatementInvalid","exception.message":"PG::ConnectionBad: PQconsumeInput() could not receive data from server: Connection timed out\n","exception.backtrace":["activerecord (7.0.8.4) lib/active_record/connection_adapters/postgresql/database_statements.rb:48:in `exec'","activerecord (7.0.8.4) lib/active_record/connection_adapters/postgresql/database_statements.rb:48:in `block (2 levels) in execute'","activesupport (7.0.8.4) lib/active_support/concurrency/share_lock.rb:187:in `yield_shares'","activesupport (7.0.8.4) lib/active_support/dependencies/interlock.rb:41:in `permit_concurrent_loads'","activerecord (7.0.8.4) lib/active_record/connection_adapters/postgresql/database_statements.rb:47:in `block in execute'",
Bob posted some logs above that appeared to fail around the 375 s mark; I wonder if that is hitting some timeout:
It's interesting that we only got 12 of these hits; I would have expected hundreds more if things were working properly. These errors also have this backtrace, which suggests the load balancing code tried to mark the replica offline and reconnect:
@rehab and I were looking at examples of pods not processing any jobs. The one we looked at was gitlab-sidekiq-catchall-v2-694fb57bd8-lhtsf.
Looking at the full lifecycle of the pod, we see a lot of activity in the first 10 minutes, then a 5 minute gap, then another short burst of activity, followed by 2h40m of inactivity:
This suggests the worker threads were getting slowly tied up between 12:48 and 12:56. There is a wide distribution of job classes, so it's likely not job-specific.
We also picked one of those requeued job ids to see if it got picked up later. Indeed, jid1405b356fbf1806011076481 got picked up about 35 minutes later and finished very quickly:
One discrepancy we found though is that for this particular pod, the CPU utilization was low.
If we compare @andrewn's earlier example pod, gitlab-sidekiq-catchall-v2-694fb57bd8-swgqh against this one gitlab-sidekiq-catchall-v2-694fb57bd8-lhtsf, we see that they both exhibit the same pattern of sidekiq_jobs_completion_count flatlining:
So this raises the question: Is CPU utilization actually a good indicator? It appears that at least some of the pods not getting work done are busy on CPU.
@stanhu Dropping network traffic may not be an accurate simulation of what happened. The patroni host was still online. The kernel networking stack was still working. But several processes were hanging waiting on the block device.
Maybe sending a SIGSTOP to the postgres backend is a closer match?
Dropping network traffic may not be an accurate simulation of what happened. The patroni host was still online. The kernel networking stack was still working. But several processes were hanging waiting on the block device.
@igorwwwwwwwwwwwwwwwwwwww Right, I just wanted to try something relatively straightforward to confirm that the pg gem was able to detect network timeouts like this.
Maybe sending a SIGSTOP to the postgres backend is a closer match?
I will try that too. In my test environment, I just added a PostgreSQL replica, enabled object storage and live CI traces, and tried launching 10 simultaneous jobs. I tried using stress-ng to bog down all the CPUs and disk. So far I haven't gotten things to fail in the same way, but maybe we need to insert random delays in PostgreSQL.
It appears that at least some of the pods not getting work done are busy on CPU.
Interesting, also one thing to note is that once they get into this state, the total number of busy database client connections doesn't seem to change:
# Disconnects all connections in the pool, and clears the pool.
#
# The pool first tries to gain ownership of all connections. If unable to
# do so within a timeout interval (default duration is
# <tt>spec.db_config.checkout_timeout * 2</tt> seconds), then the pool is forcefully
# disconnected without any regard for other connection owning threads.
# Discards all connections in the pool (even if they're currently
# leased!), along with the pool itself. Any further interaction with the
# pool (except #spec and #schema_cache) is undefined.
#
# See AbstractAdapter#discard!
def discard! # :nodoc:
Let's say we have a connection pool of size 25. In a degraded state, it might take at least 50 seconds to do the connection checkout. But if the verify step is slow (e.g. slow logs indicate the SET call took over a second), this might add another 25 seconds. There may be other SET calls that take 1 second each, so this disconnect could take minutes.
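Putting rough numbers on that (the per-connection costs below are assumptions lifted from the estimate above, not measurements):

```ruby
# Back-of-envelope estimate of how long a pool-wide disconnect! could take in a
# degraded state. All per-connection costs are assumptions from the comment
# above, not measured values.
pool_size           = 25
checkout_per_conn_s = 2.0 # assumed time to gain ownership of each connection
verify_per_conn_s   = 1.0 # assumed time for the SET/verification round-trip

checkout_total = pool_size * checkout_per_conn_s # => 50.0
verify_total   = pool_size * verify_per_conn_s   # => 25.0

puts "rough worst case: #{checkout_total + verify_total} seconds"
# Additional slow SET calls during reconnection would push this toward minutes.
```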
This might not be related since we're using PostgreSQL, but it is intriguing that the PostgreSQL maintainer said:
Thanks for the report! I wanted evidence that this wasn't a
ruby-pg-specific problem, so I set up a test case with
Python/psycopg2.
I was able to reproduce a hang when all of the following were true:
psycopg2's async mode was enabled
the client performs a PQconsumeInput/PQisBusy loop, waiting on
socket read events when the connection is busy (I used
psycopg2.extras.wait_select() for this)
the server splits a large message over many large TLS records
the server packs the final ReadyForQuery message into the same
record as the split message's final fragment
I believe we are using TLS, and perhaps these catchall jobs are making large database queries?
The issue on the replica started at 12:43 UTC, a couple of minutes after the deployment started and around the time the Sidekiq pods were getting restarted as part of that deployment (most probably a coincidental time as @igorwwwwwwwwwwwwwwwwwwww suggested). So as long as the bad replica was stuck on iowait, Sidekiq pods thought it was healthy.
Actually, the first minute where we see a Sidekiq pod marking a replica offline is 12:53 UTC, which is roughly the peak of the iowait on the bad node, and 10 minutes after the node started struggling.
I don't know if I'm relying too heavily on the reported times here, but we start seeing a recovery on the replica after the biggest chunk of pods marked it as offline at 12:56 UTC.
@rehab Thanks, that helps me see when the database started to go sideways and when the pods detected an issue. The logs also show that some pods recovered just fine after marking the host offline and then online, but some did not.
@reprazent mentioned gitlab-sidekiq-catchall-v2-694fb57bd8-wnlz9 above. It's interesting that at 12:51:00 we see two Marking host as offline messages logged 2 milliseconds apart from each other:
At 12:55:01.01 there was one PQconsumeInput() ERROR: query_wait_timeout that failed after 376 seconds in Ci::ArchiveTraceWorker with the stack trace above. Again, notice that the stacktrace includes when the host is marked offline. source
If we look at jobs that started for that pod, we can see that after 12:48 new jobs started to taper off:
12:48:00 and on: The jobs attempted to check out a connection from that suspect host.
They tried issuing a SQL query.
Ci::ArchiveTraceWorker and some other job attempted to mark the host as down via pool.disconnect!.
Rails ConnectionPool attempts to acquire all connections so that it can verify them. However, it's possible that it can't get them because all the jobs have already claimed them. Some of them may also be attempting to call pool.disconnect!.
Some ConnectionPool managed to acquire the connections (or timeouts occurred), but then it attempted to verify the connection by sending a SQL command.
That verification failed with a ConnectionBad exception, so ConnectionPool attempted to reconnect with an initial SET client_encoding.
This query timed out, and the job failed.
However, without a thread backtrace, we don't know exactly what happened to the other jobs.
Thinking out loud, I wonder if:
Multiple calls to pool.disconnect! caused some deadlock condition.
Ci::ArchiveTraceWorker attempted to perform a host disconnect, but since ConnectionPool attempted to reconnect all its connections, every other job was waiting for this to finish.
The reconnection attempts got stuck for some reason.
As I said earlier, it does seem to me that ConnectionPool#disconnect! doesn't do exactly what we want here: we don't care about verifying the connections in the pool, because we want them all to go away.
I wonder if we can simulate this issue by having 25 Sidekiq jobs all attempt to perform a SQL query and then force a host disconnect.
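A rough sketch of what such a repro could look like (the worker name is made up, this is not the production load-balancer code, and it assumes that calling disconnect! from inside jobs is a close enough stand-in for what the load balancer does when it marks a host offline):

```ruby
# Hypothetical repro: saturate the pool with jobs that each run a query and
# then force a pool-wide disconnect, to see whether concurrent
# ConnectionPool#disconnect! calls can wedge each other.
class ForcedDisconnectTestWorker
  include Sidekiq::Worker

  sidekiq_options retry: false

  def perform
    # Hold a connection and do some work on it, as a normal job would.
    ActiveRecord::Base.connection.execute("SELECT pg_sleep(1)")

    # Then tear down the whole pool while other jobs still hold connections,
    # approximating what happens when a host is marked as down.
    ActiveRecord::Base.connection_pool.disconnect!
  end
end

# Enqueue enough jobs to cover a pool of 25 connections:
# 25.times { ForcedDisconnectTestWorker.perform_async }
```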
I wasn't able to reproduce a lockup, but I've confirmed that gitlab-org/gitlab!166231 (merged) should eliminate reconnection attempts when the host is marked down. This issue is actually fixed in Rails 7.1 already via https://github.com/rails/rails/pull/44576, so we can remove that workaround once we upgrade.