More information will be added as we investigate the issue.
Customers who believe they are affected by this incident should subscribe to this issue or monitor our status page for further updates.
Summary for CMOC notice / Exec summary:
Customer Impact: Degradation in overall apdex over the weekend, hovering around the SLO threshold.
Corrective actions should be added here as soon as the incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
...
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or other details, as laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally.
By default, all information we can share will be public, in accordance with our transparency value.
🤖 GitLab Bot 🤖 changed title from The rails_primary_sql SLI of the patroni service (main stage) has an apdex violating SLO to 2022-02-13: The rails_primary_sql SLI of the patroni service (main stage) has an apdex violating SLO
The heavily scanned index is the table's primary key's index: error_tracking_errors_pkey
The following series of graphs shows an example 5-hour timespan when the rails_primary_sql apdex dipped in association with the error_tracking feature category: 2022-02-13 06:00 - 11:00 UTC.
Apdex dip
For context, this is the shape of the apdex dip, for comparison with the other graphs.
High ratio of index tuples read versus index scans performed
On the primary db, the error_tracking_errors table's primary key index (error_tracking_errors_pkey) is the only highly active index, and its read rate roughly matches the shape of the apdex regression shown above.
However, the rate of running index scans did not appear to increase when the rate of rows returned did. There are a few potential reasons for this. The simplest would be that a small percentage of index scans is matching many rows, but that would be a little unusual for a pkey index that's usually used for single-row lookups. We may find more clues about this in the slow query logs or in the periodic captures of pg_stat_activity.
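As a reference sketch (not necessarily the exact query behind the graphs above), the underlying counters are exposed by the standard pg_stat_user_indexes view. They are lifetime counters, so sampling something like the following repeatedly and differencing gives the scan and tuple-read rates being compared here:

-- Sketch: cumulative usage counters for the pkey index; sample periodically
-- and difference over an interval to turn them into rates.
SELECT relname,
       indexrelname,
       idx_scan,       -- index scans started on this index
       idx_tup_read,   -- index entries returned by scans of this index
       idx_tup_fetch,  -- live table rows fetched by simple index scans
       round(idx_tup_read::numeric / NULLIF(idx_scan, 0), 1) AS tuples_read_per_scan
FROM pg_stat_user_indexes
WHERE indexrelname = 'error_tracking_errors_pkey';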
UPDATE statements from endpoint /api/:version/error_tracking/collector/api/:id/store
Periodic snapshots of pg_stat_activity for the same 5-hour timespan highlighted in the above comment #6341 (comment 841046790) show us a sample of the many short-duration queries touching table error_tracking_errors.
It would also have shown any queries that ran for multiple minutes, but there were none: xact_age_s was always less than 14 seconds.
Some useful take-aways:
The shape of the timeline matches the pkey index's read rate (and correlates with the apdex dip). This result is consistent with there being many low-cardinality (perhaps single-row) index reads.
All of these sampled statements were single-row UPDATE statements, filtered by a single id value.
All of these sampled statements had marginalia indicating the query was being run by rails endpoint POST /api/:version/error_tracking/collector/api/:id/store. We can look in the rails logs for more info on hits to that endpoint, including its actual call rate.
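For anyone reproducing this kind of capture, here is a minimal sketch of a one-off snapshot in the spirit of the periodic pg_stat_activity captures above (the actual periodic capture tooling is separate and not shown here):

-- Sketch of a single snapshot; xact_age corresponds to the xact_age_s field above.
-- The marginalia comment added by Rails appears in the query text.
SELECT now() AS sampled_at,
       pid,
       state,
       wait_event_type,
       wait_event,
       now() - xact_start AS xact_age,
       query
FROM pg_stat_activity
WHERE query LIKE '%error_tracking_errors%'
  AND pid <> pg_backend_pid();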
Rails logs show these requests gradually became slower
Next, let's focus on the client-side rails logs to examine hits to endpoint POST /api/:version/error_tracking/collector/api/:id/store.
Filter to that endpoint, and only include requests that made at least one hit to the primary db. Use the same 5-hour timespan we have been examining above, so the results are directly comparable.
Results:
A subset of the requests gradually became slower as the apdex worsened.
The request rate did not increase when the apdex dropped.
Two projects generated most of these requests. One of them had a fairly constant request rate that dipped along with the apdex.
This may be associated with making numerous updates to a tiny number of rows; that idea is still speculative at this point. So far the evidence suggests that single-row UPDATEs were driving the regression, and while the primary db has better defenses than the replica dbs against dense accumulations of dead index items, it still incurs some overhead. The last graph in the following series shows that the slow-down was still on a timescale of milliseconds per query, which is plausible for having to walk many dead index items even without doing heap fetches to check row visibility. This is still speculation, but it may be testable: for example, the abrupt recovery around 09:10 UTC may correlate with vacuum visiting that index.
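One cheap way to test that (a sketch, not something we necessarily captured at the time) is to watch the table-level dead-tuple counter and autovacuum timestamps around the recovery point:

-- Sketch: dead-tuple accumulation and vacuum timing for the hot table.
SELECT relname,
       n_live_tup,
       n_dead_tup,        -- dead row-versions vacuum has not yet cleaned up
       last_autovacuum,
       autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'error_tracking_errors';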
Anyway, on to the data:
Stable overall write request rate to endpoint error_tracking/collector/api/:id/store
Top project's request rate dropped as response time worsened
The following graph counts the endpoint's requests grouped by project. This shows 2 projects are associated with most of these requests.
Consider this graph jointly with the next one. The next graph shows that the response time for these requests started increasing at the same time that the request rate for the green project started dropping.
Notes:
This green project (project id 31838683) had the largest request rate and is the only project whose request rate decreased when performance worsened.
This suggests its requests may have been affected and the client may be exhibiting backoff behavior as a reaction to the slow response times.
These requests may have caused the regression, or they may just be victims of it.
The two peaks rise steadily. The abrupt drop at 09:10 UTC could potentially be due to either client behavior (e.g. stops updating the same rows) or server-side behavior (e.g. vacuum cleans up the index's pointers to dead tuples).
Only project 31838683
99th percentile peaks at 50 ms.
Excluding project 31838683
The shape is similar, but the scale is much smaller -- 99th percentile peaks at 7 ms.
The autovacuum logs show that the error_tracking_errors table was being autovacuumed frequently (due to the high change rate).
And as anticipated, at 09:10 UTC when endpoint response time abruptly returned to normal and the apdex recovered, vacuum had just removed 180K dead tuples.
If those dead tuples accumulated on a frequently accessed id value, then every UPDATE (or other) statement looking up that id would have had to walk a long list of pointers to dead heap tuples. Even with the benefit of dead-tuple hints, if those 180K dead row-versions were concentrated on a much smaller number of primary key values, those values would become proportionally expensive to look up. We do not have direct observations via profiling or discrete counters, but circumstantially this seems plausible, since we know that many single-row UPDATE statements (not INSERT statements) were being run for a single project, possibly updating the same error record repeatedly.
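The standard statistics views cannot show per-key dead-entry counts, but as a rough way to confirm overall dead-entry accumulation on the index at a point in time, the pgstattuple extension (if installed; note that it scans the whole relation, so it is not free on a busy primary) reports them:

-- Sketch: requires the pgstattuple extension; scans the entire index,
-- so run sparingly on a busy primary.
SELECT tuple_count,
       dead_tuple_count,
       dead_tuple_percent
FROM pgstattuple('error_tracking_errors_pkey');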
In terms of practical impact, this workload may have mainly affected itself, not other customers. I think it is possible for this to escalate to the point where an overall db performance regression causes an outage (e.g. by making these requests slow and frequent enough to saturate the shared db connection pool), but that would be an extreme scenario, and the combination of autovacuum and statement timeouts provides pretty decent defense against that worst-case outcome.
From an application perspective, I wonder if we should limit the max number of updates for a single error_tracking_errors row. If we implement it, that limit could be pretty large (1K, maybe 10K) and should probably be optional or configurable. I'm mainly thinking of this as mitigation for a potential vector for malicious abuse.
Thanks @stanhu! That does look like the same thing -- the /envelope and /store endpoints both pass the extracted payload to CollectErrorService. I'll link these issues.
When I glanced over API::ErrorTracking::Collector and ErrorTracking::CollectErrorService, I didn't notice anything that looked like rate limiting, but I may have missed it.
Here's more direct evidence that a small number of individual rows were being frequently updated. These come from the slow logs, so they are a small subset of the actual UPDATE statements -- only the ones that were slower than 1 second (i.e. the long tail of the regression).
This log file contains 24 hours of postgres CSV logs ending at 2022-02-13 13:27 UTC, so it contains the 5-hour timespan we were examining above.
msmiley@patroni-v12-05-db-gprd.c.gitlab-production.internal:~$ sudo cat /var/log/gitlab/postgresql/postgresql.csv.1 | grep 'error_tracking_errors' | grep 'endpoint_id:POST /api/:version/error_tracking/collector/api/:id/' | grep -P -o -e 'WHERE ""error_tracking_errors"".""id"" = \d+' | sort | uniq -c | sort -rn | head -n 30
  43059 WHERE ""error_tracking_errors"".""id"" = 70398
   3663 WHERE ""error_tracking_errors"".""id"" = 67419
   2802 WHERE ""error_tracking_errors"".""id"" = 71017
   1249 WHERE ""error_tracking_errors"".""id"" = 67751
     11 WHERE ""error_tracking_errors"".""id"" = 67435
      6 WHERE ""error_tracking_errors"".""id"" = 67428
      3 WHERE ""error_tracking_errors"".""id"" = 67442
      1 WHERE ""error_tracking_errors"".""id"" = 71241
      1 WHERE ""error_tracking_errors"".""id"" = 67418
Oh goodness, why are we using a bitmap index scan for a single-row lookup on a primary key index?
gitlabhq_production=# explain select * from error_tracking_errors where id = 70398 ;
                                        QUERY PLAN
-----------------------------------------------------------------------------------------
 Bitmap Heap Scan on error_tracking_errors  (cost=1.92..3.43 rows=1 width=203)
   Recheck Cond: (id = 70398)
   ->  Bitmap Index Scan on error_tracking_errors_pkey  (cost=0.00..1.92 rows=1 width=0)
         Index Cond: (id = 70398)
(4 rows)
So in this case there is a tiny difference in the estimated cost of a normal Index Scan (3.84) versus a Bitmap Index Scan (3.43).
The actual wallclock time is currently quick for both and faster for the Bitmap Index Scan. The planning time for both is 0.5 ms, and the execution time is 0.3 versus 0.6 ms.
gitlabhq_production=# explain ( analyze, buffers ) select * from error_tracking_errors where id = 70398 ;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on error_tracking_errors  (cost=1.92..3.43 rows=1 width=203) (actual time=0.214..0.254 rows=1 loops=1)
   Recheck Cond: (id = 70398)
   Heap Blocks: exact=1
   Buffers: shared hit=72
   ->  Bitmap Index Scan on error_tracking_errors_pkey  (cost=0.00..1.92 rows=1 width=0) (actual time=0.029..0.029 rows=407 loops=1)
         Index Cond: (id = 70398)
         Buffers: shared hit=5
 Planning Time: 0.527 ms
 Execution Time: 0.275 ms
(9 rows)

gitlabhq_production=# set enable_bitmapscan = off ;
SET
gitlabhq_production=# explain ( analyze, buffers ) select * from error_tracking_errors where id = 70398 ;
                                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using error_tracking_errors_pkey on error_tracking_errors  (cost=0.41..3.84 rows=1 width=203) (actual time=0.387..0.536 rows=1 loops=1)
   Index Cond: (id = 70398)
   Buffers: shared hit=117
 Planning Time: 0.512 ms
 Execution Time: 0.552 ms
(5 rows)

gitlabhq_production=# reset all ;
RESET
I was wondering earlier if we might be using a bitmap index scan, because the index usage statistics showed a much larger rate of increase for idx_tup_read than for idx_tup_fetch. Bitmap index scans do not increment idx_tup_fetch.
The column statistics look reasonable. They correctly reflect that the column's values are unique, and the histogram shows a plausible distribution of values. Our example value (70398) is in a bucket with 5 other values (lower/upper bounds of 70396 and 70401), and the nearby buckets have similar widths. Moreover, the planner correctly estimates that approximately 1 row should match the filter expression.
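For reference, the column-level statistics described here are visible in the standard pg_stats view, roughly like this:

-- Sketch: planner statistics for the id column. n_distinct = -1 means the
-- planner believes every value is unique; histogram_bounds gives the buckets.
SELECT n_distinct,
       null_frac,
       correlation,
       histogram_bounds
FROM pg_stats
WHERE tablename = 'error_tracking_errors'
  AND attname = 'id';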
The table-level statistics show a mean row density of less than 1 live row per 8 KB page. That unusual state may be part of what's influencing the planner to give a slightly smaller estimated cost to the bitmap index scan. That's just a guess; I haven't run the numbers through the costing model.
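The row-density figure can be approximated from pg_class (a sketch; reltuples and relpages are themselves estimates maintained by vacuum/analyze):

-- Sketch: mean live rows per 8 KB heap page, from planner-level estimates.
SELECT relname,
       reltuples,
       relpages,
       round(reltuples::numeric / NULLIF(relpages, 0), 2) AS rows_per_page
FROM pg_class
WHERE relname = 'error_tracking_errors';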
I'm going to stop poking at the planner's decision here -- both options seem reasonable.
I still think the main issue is that we accumulated a lot of dead tuples on a hotly used index item, driving up the duration of queries that had to traverse that index item. (It seems plausible that there were at least a few of these hot keys; I'm just focusing on 70398 as a representative example.)
@igorwwwwwwwwwwwwwwwwwwww pinged me over the weekend on how I would go about diagnosing a slow-burn alert like this.
Since then, I think @msmiley has done an incredible job of getting to the cause of the issue, so I don't think there's much I need to add to that excellent analysis.
The only thing I'll add is some starting points for an investigation like this (I think this was Igor's original ask of me).
The rails_primary_sql SLI is constructed in a fairly unusual manner, using Rails metrics rather than Postgres metrics, and this is often a clue as to where to start the investigation.
From this definition, we know that primary SQL queries issued from Rails taking longer than 50ms (0.05s) will count as errors on the rails_primary_sql apdex.
We also know that this only concerns the primary database (it's in the name of the SLI, the description, and the name of the underlying metric, gitlab_sql_primary_duration_seconds_bucket).
Since we know this is reported from Rails, I'll jump straight into the Rails logs.
Rails logs a field, json.db_primary_duration_s. One thing to be aware of is that this field is the sum of the durations of all primary SQL statements executed within the request, whereas the apdex measures each individual statement. In other words, when searching for requests that would lead to this error burn, we may need a higher threshold than 50ms. At a stab, I'll start with 1s.
Let's plot the number of requests with db_primary_duration_s > 1s and see how that looks (using a 4-day period).
This looks like a classic slow-burn situation. Could this lead us to the cause? source
Let's break that down by feature_category to see if there's a particular feature that's generating this error burn: source
Very clearly, this is an issue with the error_tracking feature.
Now, let's check whether this is coming from a particular client. Filter for only feature_category=error_tracking, and then break the chart down by json.meta.client_id.
Yep, it's mostly one client. For the overall stability of the platform, it might be worth blocking this client while the investigation continues.
I don't think this necessarily adds anything new over what @msmiley has investigated, just writing down the technique I use for digging into these types of issues.