The apdex has recovered; however, we are still investigating the root cause. A concurrent increase in autovacuum activity on ci_pending_builds and ci_running_builds likely contributed to the statement timeouts.
13:30 - increase in autovacuum on ci_pending_builds and ci_running_builds begins
13:35 - Statement timeouts begin on ci_pending_builds
13:37 - Large spike in statement timeouts on project_builds table starts, apdex begins to fall
14:42 - alejandro declares incident in Slack.
14:50 - Statement timeouts on ci_pending_builds end
15:05 - apdex recovers
15:15 - autovacuum on ci_pending_builds and ci_running_builds returns to normal levels
Corrective Actions
Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
...
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases, laid out in our handbook page. This might include the summary, the timeline, or any other bits of information. Any of this confidential data will be in a linked issue, only visible internally.
By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
Service(s) affected:
Team attribution:
Time to detection:
Minutes downtime or degradation:
Metrics
Customer Impact
Who was impacted by this incident? (i.e. external customers, internal customers)
...
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
...
How many customers were affected?
...
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
...
What were the root causes?
...
Incident Response Analysis
How was the incident detected?
...
How could detection time be improved?
...
How was the root cause diagnosed?
...
How could time to diagnosis be improved?
...
How did we reach the point where we knew how to mitigate the impact?
...
How could time to mitigation be improved?
...
What went well?
...
Post Incident Analysis
Did we have other events in the past with the same root cause?
...
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
...
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
@stanhu very likely. There were a bunch of statement timeouts on project_builds, which I noted below right before I saw your comment. I was wrong and misread the query.
The output wouldn't go to the logs, but this error message (from the screenshot I linked above) looks like it might have been from someone typing on the database console?
Yeah, for security reasons, the parameters are filtered in the slow logs in Kibana. In the future, you would need to get the exact query from the CSV directly.
The PostgreSQL logs don't seem to have that many mentions of locks during this time period. Mostly these queries were canceled due to statement timeouts. Maybe there was a query estimate issue here. I wish we had the EXPLAIN output.
However, this might be an issue with Kibana – I couldn't find any autovacuum runs prior to June 16. So it looks like we can rely only on data since that date.
That being said, acute spikes -- especially on June 23 and 24 -- are obvious.
Tuple stats show that the workload is "INSERT, then DELETE" – the numbers of INSERTs and DELETEs match, a "queue-like" workload (a scalability anti-pattern for Postgres MVCC):
Other tables have tuple stats for dates prior to June 16 – but ci_running_builds doesn't. So, most likely, the INSERTs/DELETEs started on that date, causing frequent autovacuuming.
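For reference, the same churn can be read straight from cumulative statistics; a minimal query against pg_stat_user_tables (illustrative only, not the exact source of the charts above) would be:

```sql
-- Compare cumulative INSERT vs DELETE counts and autovacuum activity
-- for the two queue-like CI tables (illustrative check, not the dashboard source).
SELECT relname,
       n_tup_ins,          -- rows inserted since stats reset
       n_tup_del,          -- rows deleted since stats reset
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       autovacuum_count
FROM pg_stat_user_tables
WHERE relname IN ('ci_pending_builds', 'ci_running_builds');
```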
A quick guard against too-frequent autovacuums (which, as we can see above and as one would expect, don't save us from the bloat, because this workload is an anti-pattern for Postgres MVCC) is raising autovacuum_vacuum_threshold and autovacuum_analyze_threshold from the default of 50 to a few thousand.
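A minimal sketch of that per-table change (the values below are placeholders for "a few thousand", not an agreed setting):

```sql
-- Raise per-table autovacuum/autoanalyze trigger thresholds so that a small,
-- high-churn table is not vacuumed/analyzed every few dozen row changes.
ALTER TABLE ci_pending_builds SET (
  autovacuum_vacuum_threshold  = 5000,
  autovacuum_analyze_threshold = 5000
);
ALTER TABLE ci_running_builds SET (
  autovacuum_vacuum_threshold  = 5000,
  autovacuum_analyze_threshold = 5000
);
```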
To scale better, these tables and this workload need a redesign (e.g. 3 partitions, an INSERT-only workload to the "hot" partition plus periodic TRUNCATE for the "idle" partitions, or simply time-decay automated partitioning with partition pruning -- this could be done with TimescaleDB).
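A very rough sketch of the time-decay idea with native declarative partitioning (table names, column list, and ranges are hypothetical; the real schema would carry more columns and indexes):

```sql
-- Queue-like data partitioned by time; an old partition is TRUNCATEd
-- instead of being DELETEd from row by row, so there is no dead-tuple churn.
CREATE TABLE ci_pending_builds_partitioned (
    build_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE ci_pending_builds_p_idle
    PARTITION OF ci_pending_builds_partitioned
    FOR VALUES FROM ('2021-06-23') TO ('2021-06-24');

CREATE TABLE ci_pending_builds_p_hot
    PARTITION OF ci_pending_builds_partitioned
    FOR VALUES FROM ('2021-06-24') TO ('2021-06-25');

-- Once a partition no longer receives new rows, it can simply be truncated:
TRUNCATE TABLE ci_pending_builds_p_idle;
```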
Had a call with @grzesiek (thanks!); some numbers – current and forecasted for a 2-year horizon:
both ci_running_builds and ci_pending_builds are ~2000 rows now and will grow 10x (to 20k rows)
operation (INSERT/DELETE) rates: 10-20 per sec now, and 100-200 in 2y
INSERTs and DELETEs are happening in different transactions. However, some transactions have more than 1 statement (operations with ci_builds involved)
the duration of the "big query" is 100ms now – not very good, but kind of acceptable
from the plan of the "big query" we see that an index scan is applied to [almost] all of them anyway – it looks like there is no need for frequent stats recalculation (example: https://explain.depesz.com/s/HoDO) -- we can consider tuning autovacuum to run much, much less frequently (both the VACUUM and ANALYZE parts) // <-- this is for short-term ideas
there are business requirements that mean no rows older than 24h (by created_at) are allowed in both tables -- we can rely on this in the future when choosing a better strategy // <-- this is for longer-term ideas
This data should be useful for mitigating the problem of frequent autovacuums and frequent stats recalculation (a few times per minute during workday periods without spikes). However, there is still the question of what caused such huge spikes on June 24.
Double-checking the thesis @stanhu raised above that we had a lot of autovacuum runs for ci_[pending|running]_builds during the incident (starting June 24, ~13:40 UTC) – I cannot confirm it. The screenshot in comment #4972 (comment 610881804) looks like it's about autovacuums for ci_running_builds, but those bars at ~13:40 include not only autovacuums – the majority of them are statement timeouts. The query in Kibana was simply ci_running_builds (https://log.gprd.gitlab.net/goto/aa1657ff116dcdf92745ae30e8f1619c):
If we "zoom" in on one of the first high bars, we see that statement timeouts make up the majority of those log entries:
Cc @grzesiek @vitabaks @Finotto @stanhu @ahanselka @abrandl – this is important: high autovacuum frequency is not directly related to the problem. However, I still think that we might be dealing with a "plan flip" again. But I doubt it's related to the stats of ci_[pending|running]_builds – as we see from the plans, we scan all ~2k rows anyway (example: https://explain.depesz.com/s/HoDO), so the freshness of the stats for these small tables is not really important. So it might be a flip related to a recalculation of stats on some large table.
Conclusion & proposal
Frequent autovacuum on ci_[pending|running]_builds turned out to be a red herring.
We have seen timing coincidences like this before, where a vacuum on ci_builds finishing was closely related (time-wise) to another pathological symptom (statement timeouts) coming to an end.
The interesting part is that the query called out here (https://explain.depesz.com/s/HoDO) doesn't use ci_builds. Whether or not there is causality I'm not sure, but I wanted to note that we've seen this pattern before.
@abrandl the ci_builds table is being used when INSERTing into and DELETEing from ci_pending/running_builds, because we perform these operations in a transaction together with an UPDATE ci_builds SET ... during build status transitions (state machine).
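Roughly, a status transition looks like this (a simplified sketch; the actual statements, columns, and values in the application differ):

```sql
-- Simplified shape of a build state transition: the queue tables are
-- modified in the same transaction as the UPDATE on ci_builds.
BEGIN;
UPDATE ci_builds
   SET status = 'running', updated_at = now()
 WHERE id = 12345;                               -- hypothetical build id
DELETE FROM ci_pending_builds WHERE build_id = 12345;
INSERT INTO ci_running_builds (build_id, created_at)
VALUES (12345, now());
COMMIT;
```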
I wonder if this was actually a database-wide problem, not really something that was CI-specific. We just happen, by coincidence, to alert on the outcome of the builds queuing query degradation.
On the other hand, the marginalia sampler shows that mostly queuing queries were affected. Not sure why the rails_sql_latency per category shows other categories too.
I see a significant spike in the LWLock and IPC wait types (not sure what they mean):
LWLock: The backend is waiting for a lightweight lock. Each such lock protects a particular data structure in shared memory. wait_event will contain a name identifying the purpose of the lightweight lock. (Some locks have specific names; others are part of a group of locks each with a similar purpose.)
IPC: The server process is waiting for some activity from another process in the server. wait_event will identify the specific wait point.
How can I access the logs to see exactly which LWLock / IPC wait events we had to wait on?
@grzesiek thank you for the additional monitoring pictures.
In general, all of this shows normal behavior for a situation with very slow queries (which is not normal, of course, and the reasons are still not fully clear) -- and, as a result, many more active sessions.
How can I access the logs to see exactly which LWLock / IPC wait events we had to wait on?
As you mentioned above, this is sampled data -- it is based on observing pg_stat_activity, which has the columns wait_event and wait_event_type. LWLock indicates work with the buffer pool; IPC – inter-process communication. Note that the duration of those wait events is unknown – they may be VERY brief. We just observed samples, once per N seconds, and saw how many backends were "sitting" in which state.
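For an ad-hoc look, the same kind of sample can be taken manually; an illustrative query (not the exact sampler implementation) is:

```sql
-- Snapshot of what active backends are currently waiting on; running this
-- repeatedly (e.g. with psql's \watch 5) approximates the sampler's view.
SELECT now() AS sampled_at,
       wait_event_type,
       wait_event,
       count(*) AS backends
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2, 3
ORDER BY backends DESC;
```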
In my opinion, the main hypothesis is still a plan flip.
Ok, thanks @NikolayS! What are the next steps? Do we need to modify our infra to log plans for slow queries and wait for another incident to progress with this investigation?
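For context, one possible way to capture plans for slow queries (an option to evaluate, not a decision) is the auto_explain contrib module; the thresholds below are examples only:

```sql
-- Log the plan of any statement slower than 1s into the PostgreSQL log.
-- In production this would normally be configured via shared_preload_libraries
-- and postgresql.conf rather than per-session LOAD/SET.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '1s';
SET auto_explain.log_analyze      = off;   -- keep overhead low: plans without actual timings
SET auto_explain.log_format       = 'json';
```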
Marking this incident as resolved; if we think more investigation is needed, let's break it out into a new issue. If it would benefit from a sync discussion, feel free to also add a ~"review-requested" label and we can put it on the next review agenda.