More information will be added as we investigate the issue.
Customers who believe they are affected by this incident should subscribe to this issue or monitor our status page for further updates.
Summary for CMOC notice / Exec summary:
Customer Impact: Degradation in overall apdex over the weekend, hovering around the SLO threshold.
Corrective actions should be added here as soon as the incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
...
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or other details, as laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally.
By default, all information we can share will be public, in accordance with our transparency value.
🤖 GitLab Bot 🤖 changed title from The rails_primary_sql SLI of the patroni service (main stage) has an apdex violating SLO to 2022-02-13: The rails_primary_sql SLI of the patroni service (main stage) has an apdex violating SLO
The heavily scanned index is the table's primary key's index: error_tracking_errors_pkey
The following series of graphs shows an example 5-hour timespan when the rails_primary_sql apdex dipped in association with the error_tracking feature category: 2022-02-13 06:00 - 11:00 UTC.
Apdex dip
For context, this is the shape of the apdex dip, for comparison with the other graphs.
High ratio of index tuples read versus index scans performed
On the primary db, the error_tracking_errors table's primary key index (error_tracking_errors_pkey) is the only highly active index, and its read rate roughly matches the shape of the apdex regression shown above.
However, the rate of running index scans did not appear to increase when the rate of rows returned did. There are a few potential reasons for this. The simplest would be that a small percentage of index scans is matching many rows, but that would be a little unusual for a pkey index that's usually used for single-row lookups. We may find more clues about this in the slow query logs or in the periodic captures of pg_stat_activity.
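As a reference sketch (not necessarily the exact query behind the graphs above), the underlying counters are exposed by the standard pg_stat_user_indexes view. They are lifetime counters, so sampling something like the following repeatedly and differencing gives the scan and tuple-read rates being compared here:

-- Sketch: cumulative usage counters for the pkey index; sample periodically
-- and difference over an interval to turn them into rates.
SELECT relname,
       indexrelname,
       idx_scan,       -- index scans started on this index
       idx_tup_read,   -- index entries returned by scans of this index
       idx_tup_fetch,  -- live table rows fetched by simple index scans
       round(idx_tup_read::numeric / NULLIF(idx_scan, 0), 1) AS tuples_read_per_scan
FROM pg_stat_user_indexes
WHERE indexrelname = 'error_tracking_errors_pkey';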
UPDATE statements from endpoint /api/:version/error_tracking/collector/api/:id/store
Periodic snapshots of pg_stat_activity for the same 5-hour timespan highlighted in the above comment #6341 (comment 841046790) show us a sample of the many short-duration queries touching table error_tracking_errors.
It would also have shown any queries that ran for multiple minutes, but there were none: xact_age_s was always less than 14 seconds.
Some useful take-aways:
The shape of the timeline matches the pkey index's read rate (and correlates with the apdex dip). This result is consistent with there being many low-cardinality (perhaps single-row) index reads.
All of these sampled statements were single-row UPDATE statements, filtered by a single id value.
All of these sampled statements had marginalia indicating the query was being run by rails endpoint POST /api/:version/error_tracking/collector/api/:id/store. We can look in the rails logs for more info on hits to that endpoint, including its actual call rate.
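For anyone reproducing this kind of capture, here is a minimal sketch of a one-off snapshot in the spirit of the periodic pg_stat_activity captures above (the actual periodic capture tooling is separate and not shown here):

-- Sketch of a single snapshot; xact_age corresponds to the xact_age_s field above.
-- The marginalia comment added by Rails appears in the query text.
SELECT now() AS sampled_at,
       pid,
       state,
       wait_event_type,
       wait_event,
       now() - xact_start AS xact_age,
       query
FROM pg_stat_activity
WHERE query LIKE '%error_tracking_errors%'
  AND pid <> pg_backend_pid();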
Rails logs show these requests gradually became slower
Next, let's focus on the client-side rails logs to examine hits to endpoint POST /api/:version/error_tracking/collector/api/:id/store.
Filter to that endpoint, and only include requests that made at least one hit to the primary db. Use the same 5-hour timespan we have been examining above, so the results are directly comparable.
Results:
A subset of the requests gradually became slower as the apdex worsened.
The request rate did not increase when the apdex dropped.
Two projects generated most of these requests. One of them had a fairly constant request rate that dipped along with the apdex.
This may be associated with making numerous updates to a tiny number of rows; that idea is still speculative at this point. So far the evidence suggests that single-row UPDATEs were driving the regression, and while the primary db has better defenses than the replica dbs against dense accumulations of dead index items, it still incurs some overhead. The last graph in the following series shows that the slow-down was still on a timescale of milliseconds per query, which is plausible for having to walk many dead index items even without doing heap fetches to check row visibility. This is still speculation, but it may be testable: for example, the abrupt recovery around 09:10 UTC may correlate with vacuum visiting that index.
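One cheap way to test that (a sketch, not something we necessarily captured at the time) is to watch the table-level dead-tuple counter and autovacuum timestamps around the recovery point:

-- Sketch: dead-tuple accumulation and vacuum timing for the hot table.
SELECT relname,
       n_live_tup,
       n_dead_tup,        -- dead row-versions vacuum has not yet cleaned up
       last_autovacuum,
       autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'error_tracking_errors';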
Anyway, on to the data:
Stable overall write request rate to endpoint error_tracking/collector/api/:id/store
Top project's request rate dropped as response time worsened
The following graph counts the endpoint's requests grouped by project. This shows 2 projects are associated with most of these requests.
Consider this graph jointly with the next one. The next graph shows that the response time for these requests started increasing at the same time that the request rate for the green project started dropping.
Notes:
This green project (project id 31838683) had the largest request rate and is the only project whose request rate decreased when performance worsened.
This suggests its requests may have been affected and the client may be exhibiting backoff behavior as a reaction to the slow response times.
These requests may have caused the regression, or they may just be victims of it.
The two peaks rise steadily. The abrupt drop at 09:10 UTC could potentially be due to either client behavior (e.g. stops updating the same rows) or server-side behavior (e.g. vacuum cleans up the index's pointers to dead tuples).
Only project 31838683
99th percentile peaks at 50 ms.
Excluding project 31838683
The shape is similar, but the scale is much smaller -- 99th percentile peaks at 7 ms.
The autovacuum logs show that the error_tracking_errors table was being autovacuumed frequently (due to the high change rate).
And as anticipated, at 09:10 UTC when endpoint response time abruptly returned to normal and the apdex recovered, vacuum had just removed 180K dead tuples.
If those dead tuples accumulated on a frequently accessed id value, then every UPDATE (or other) statement looking up that id would have had to walk a long list of pointers to dead heap tuples. Even with the benefit of dead-tuple hints, if those 180K dead row-versions were concentrated on a much smaller number of primary key values, those values would become proportionally expensive to look up. We do not have direct observations via profiling or discrete counters, but circumstantially this seems plausible, since we know that many single-row UPDATE statements (not INSERT statements) were being run for a single project, possibly updating the same error record repeatedly.
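The standard statistics views cannot show per-key dead-entry counts, but as a rough way to confirm overall dead-entry accumulation on the index at a point in time, the pgstattuple extension (if installed; note that it scans the whole relation, so it is not free on a busy primary) reports them:

-- Sketch: requires the pgstattuple extension; scans the entire index,
-- so run sparingly on a busy primary.
SELECT tuple_count,
       dead_tuple_count,
       dead_tuple_percent
FROM pgstattuple('error_tracking_errors_pkey');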
In terms of practical impact, this workload may have mainly affected itself, not other customers. I think it is possible for this to escalate to the point where an overall db performance regression causes an outage (e.g. by making these requests slow and frequent enough to saturate the shared db connection pool), but that would be an extreme scenario, and the combination of autovacuum and statement timeouts provides pretty decent defense against that worst-case outcome.
From an application perspective, I wonder if we should limit the max number of updates for a single error_tracking_errors row. If we implement it, that limit could be pretty large (1K, maybe 10K) and should probably be optional or configurable. I'm mainly thinking of this as mitigation for a potential vector for malicious abuse.
Thanks @stanhu! That does look like the same thing -- the /envelope and /store endpoints both pass the extracted payload to CollectErrorService. I'll link these issues.
When I glanced over API::ErrorTracking::Collector and ErrorTracking::CollectErrorService, I didn't notice anything that looked like rate limiting, but I may have missed it.
Here's more direct evidence that a small number of individual rows were being frequently updated. These come from the slow logs, so they are a small subset of the actual UPDATE statements -- only the ones that were slower than 1 second (i.e. the long tail of the regression).
This log file contains 24 hours of postgres CSV logs ending at 2022-02-13 13:27 UTC, so it contains the 5-hour timespan we were examining above.
msmiley@patroni-v12-05-db-gprd.c.gitlab-production.internal:~$ sudo cat /var/log/gitlab/postgresql/postgresql.csv.1 | grep 'error_tracking_errors' | grep 'endpoint_id:POST /api/:version/error_tracking/collector/api/:id/' | grep -P -o -e 'WHERE ""error_tracking_errors"".""id"" = \d+' | sort | uniq -c | sort -rn | head -n 30
  43059 WHERE ""error_tracking_errors"".""id"" = 70398
   3663 WHERE ""error_tracking_errors"".""id"" = 67419
   2802 WHERE ""error_tracking_errors"".""id"" = 71017
   1249 WHERE ""error_tracking_errors"".""id"" = 67751
     11 WHERE ""error_tracking_errors"".""id"" = 67435
      6 WHERE ""error_tracking_errors"".""id"" = 67428
      3 WHERE ""error_tracking_errors"".""id"" = 67442
      1 WHERE ""error_tracking_errors"".""id"" = 71241
      1 WHERE ""error_tracking_errors"".""id"" = 67418
Oh goodness, why are we using a bitmap index scan for a single-row lookup on a primary key index?
gitlabhq_production=# explain select * from error_tracking_errors where id = 70398 ;
                                        QUERY PLAN
-----------------------------------------------------------------------------------------
 Bitmap Heap Scan on error_tracking_errors  (cost=1.92..3.43 rows=1 width=203)
   Recheck Cond: (id = 70398)
   ->  Bitmap Index Scan on error_tracking_errors_pkey  (cost=0.00..1.92 rows=1 width=0)
         Index Cond: (id = 70398)
(4 rows)
So in this case there is a tiny difference in the estimated cost of a normal Index Scan (3.84) versus a Bitmap Index Scan (3.43).
The actual wallclock time is currently quick for both and faster for the Bitmap Index Scan. The planning time for both is 0.5 ms, and the execution time is 0.3 versus 0.6 ms.
gitlabhq_production=# explain ( analyze, buffers ) select * from error_tracking_errors where id = 70398 ;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on error_tracking_errors  (cost=1.92..3.43 rows=1 width=203) (actual time=0.214..0.254 rows=1 loops=1)
   Recheck Cond: (id = 70398)
   Heap Blocks: exact=1
   Buffers: shared hit=72
   ->  Bitmap Index Scan on error_tracking_errors_pkey  (cost=0.00..1.92 rows=1 width=0) (actual time=0.029..0.029 rows=407 loops=1)
         Index Cond: (id = 70398)
         Buffers: shared hit=5
 Planning Time: 0.527 ms
 Execution Time: 0.275 ms
(9 rows)

gitlabhq_production=# set enable_bitmapscan = off ;
SET
gitlabhq_production=# explain ( analyze, buffers ) select * from error_tracking_errors where id = 70398 ;
                                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using error_tracking_errors_pkey on error_tracking_errors  (cost=0.41..3.84 rows=1 width=203) (actual time=0.387..0.536 rows=1 loops=1)
   Index Cond: (id = 70398)
   Buffers: shared hit=117
 Planning Time: 0.512 ms
 Execution Time: 0.552 ms
(5 rows)

gitlabhq_production=# reset all ;
RESET
I was wondering earlier if we might be using a bitmap index scan, because the index usage statistics showed a much larger rate of increase for idx_tup_read than for idx_tup_fetch. Bitmap index scans do not increment idx_tup_fetch.
The column statistics look reasonable. They correctly reflect that the column's values are unique, and the histogram shows a plausible distribution of values. Our example value (70398) is in a bucket with 5 other values (lower/upper bounds of 70396 and 70401), and the nearby buckets have similar widths. Moreover, the planner correctly estimates that approximately 1 row should match the filter expression.
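For reference, the column-level statistics described here are visible in the standard pg_stats view, roughly like this:

-- Sketch: planner statistics for the id column. n_distinct = -1 means the
-- planner believes every value is unique; histogram_bounds gives the buckets.
SELECT n_distinct,
       null_frac,
       correlation,
       histogram_bounds
FROM pg_stats
WHERE tablename = 'error_tracking_errors'
  AND attname = 'id';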
The table-level statistics show a mean row density of less than 1 live row per 8 KB page. That unusual state may be part of what's influencing the planner to give a slightly smaller estimated cost to the bitmap index scan. That's just a guess; I haven't run the numbers through the costing model.
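The row-density figure can be approximated from pg_class (a sketch; reltuples and relpages are themselves estimates maintained by vacuum/analyze):

-- Sketch: mean live rows per 8 KB heap page, from planner-level estimates.
SELECT relname,
       reltuples,
       relpages,
       round(reltuples::numeric / NULLIF(relpages, 0), 2) AS rows_per_page
FROM pg_class
WHERE relname = 'error_tracking_errors';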
I'm going to stop poking at the planner's decision here -- both options seem reasonable.
I still think the main issue is that we accumulated a lot of dead tuples on a hotly used index item, driving up the duration of queries that had to traverse that index item. (It seems plausible that there were at least a few of these hot keys; I'm just focusing on 70398 as a representative example.)
@igorwwwwwwwwwwwwwwwwwwww pinged me over the weekend on how I would go about diagnosing a slow-burn alert like this.
Since then, I think @msmiley has done an incredible job of getting to the cause of the issue, so I don't think there's much I need to add to that excellent analysis.
The only thing I'll add is some starting points for an investigation like this (I think this was Igor's original ask of me).
The rails_primary_sql SLI is constructed in a fairly unusual manner, using Rails metrics rather than Postgres metrics, and this is often a clue as to where to start the investigation.
From this definition, we know that primary SQL queries issued from Rails taking longer than 50ms (0.05s) will count as errors on the rails_primary_sql apdex.
We also know that this only concerns the primary database (it's in the name of the SLI, the description, and the name of the underlying metric, gitlab_sql_primary_duration_seconds_bucket).
Since we know this is reported from Rails, I'll jump straight into the Rails logs.
Rails logs a field, json.db_primary_duration_s. One thing to be aware of is that this field is the sum of the durations of all primary SQL statements executed within the request, whereas the apdex measures each individual statement. In other words, when searching for requests that would lead to this error burn, we may need a higher threshold than 50ms. At a stab, I'll start with 1s.
Let's plot the number of requests with db_primary_duration_s > 1s and see how that looks (using a 4-day period).
This looks like a classic slow-burn situation. Could this lead us to the cause? source
Let's break that down by feature_category to see if there's a particular feature that's generating this error burn: source
Very clearly, this is an issue with the error_tracking feature.
Now, let's check whether this is coming from a particular client. Filter for only feature_category=error_tracking, and then break the chart down by json.meta.client_id.
Yep, it's mostly one client. For the overall stability of the platform, it might be worth blocking this client while the investigation continues.
I don't think this necessarily adds anything new over what @msmiley has investigated, just writing down the technique I use for digging into these types of issues.