Database load spiked, dramatically increasing error rates and degrading response times.
Automatic re-indexing was enabled over the weekend. A re-index of the routes table dropped the statistics for a functional index on that table, which made queries against it much less efficient. A manual re-analyze of the table fixed the issue. Automatic re-indexing has been disabled.
Outage from 2020-10-24 09:16 to 10:25.
Timeline
All times UTC.
2020-10-24
09:12 - starting to reindex index_on_routes_lower_path
09:14 - swapped old and new index by name and deleted the old index. This caused table statistics to go missing and made queries slow.
09:15 - starting to get alerts and pages all over the place
09:19 - hphilipps declares incident in Slack using /incident declare command, paging IMOC and CMOC.
09:51 - started by trying to block a particular URL deleting tags, but now blocking the entire identified network that is sending the troublesome requests. (This later turned out to be unrelated.)
10:05 - DB load is low, but pg_bouncer_sync_replica_pool saturation is 100%
jwtController - main stage has a lot of timeouts - looking at blocking
Will affect registry
10:19 - the block to jwt_auth appears to have made a difference.
Things in pg_bouncer/patroni appear to be recovering.
10:21 - removing initial project and IP block as they did not appear to make a difference. Leaving block on jwt_auth
10:54 - suspecting automatic reindexing to have caused the issue, disabling the feature flag: /chatops run feature set database_reindexing false
11:08 - found that a rerun of statistics on the route table changed the query plan
11:12 - removed the temporary blocks of request paths that were related to the issues.
11:20 - all traffic is operating normally
Incident Review
Summary
An automatic DB re-indexing cronjob, which was enabled this weekend, led to missing table statistics for an important index, which made all queries relying on that index very inefficient and slowed down the whole database. This caused an APDEX drop to around 50% for most services between 09:16 and 10:25 UTC (69 minutes), and for the container registry service until 11:20 (124 minutes).
Service(s) affected: All services depending on the DB
EOC got paged for increased backend error rates and a failing Pingdom check.
How could detection time be improved?
Detection worked well.
How did we reach the point where we knew how to mitigate the impact?
After checking Cloudflare traffic and whether we had had a DB failover, we started spending time looking for abusive requests and identifying the responsible controllers. The pgbouncer dashboard controller panel and OnGres's analysis of the dominating slow queries helped us identify and block the /jwt/auth endpoint as the main producer of DB load as a first mitigation. As @abrandl had informed the EOC about enabling the re-indexing cronjob the prior day (and it was mentioned in the oncall handover), we started to suspect a possible correlation and began looking into that. Checking the re-indexing cron job logs, @ahegyi and @jacobvosmaer-gitlab found that the index index_on_routes_lower_path had just been re-indexed when the incident started, and OnGres confirmed that the query plan for the slow query had stopped using the index because of missing statistics, which made those queries very inefficient. OnGres finally fixed that by running ANALYZE on the routes table, and we disabled the re-index feature flag to prevent further events.
How could time to mitigation be improved?
Better documentation on how to page OnGres.
Improvements and training on how to interpret slow queries and connect them to controllers and endpoints.
Alert if only a few queries are dominating the DB request times.
Post Incident Analysis
How was the root cause diagnosed?
See above.
How could time to diagnosis be improved?
See above.
Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
Because a functional index was missing statistics, leading to a bad query plan.
Why was the index missing statistics?
Because an automatic re-indexing job was re-creating the index and swapping the new index with the old one, but for functional indices this leads to missing statistics unless ANALYZE is run immediately afterwards.
Why didn't we run ANALYZE during the automated re-indexing?
The reindexing job had already been running in staging without issues for a few days, but because indices are selected randomly, there is a high chance it hadn't hit a functional index so far, and going through all indices would have taken many weeks. We all assumed that if it worked for some indices, it would also work for the rest.
A lot of statement timeouts (31918) with the same query happening on patroni-02
Query:
2020-10-24 10:33:56.707 GMT,"gitlab","gitlabhq_production",45379,"127.0.0.1:35348",5f9402d4.b143,6,"SELECT",2020-10-24 10:32:52 GMT,53/2400320742,0,ERROR,57014,"canceling statement due to statement timeout",,,,,,"SELECT ""projects"".""id"" AS t0_r0, ""projects"".""name"" AS t0_r1, ""projects"".""path"" AS t0_r2, ""projects"".""description"" AS t0_r3, ""projects"".""created_at"" AS t0_r4, ""projects"".""updated_at"" AS t0_r5, ""projects"".""creator_id"" AS t0_r6, ""projects"".""namespace_id"" AS t0_r7, ""projects"".""last_activity_at"" AS t0_r8, ""projects"".""import_url"" AS t0_r9, ""projects"".""visibility_level"" AS t0_r10, ""projects"".""archived"" AS t0_r11, ""projects"".""merge_requests_template"" AS t0_r12, ""projects"".""star_count"" AS t0_r13, ""projects"".""merge_requests_rebase_enabled"" AS t0_r14, ""projects"".""import_type"" AS t0_r15, ""projects"".""import_source"" AS t0_r16, ""projects"".""avatar"" AS t0_r17, ""projects"".""approvals_before_merge"" AS t0_r18, ""projects"".""reset_approvals_on_push"" AS t0_r19, ""projects"".""merge_requests_ff_only_enabled"" AS t0_r20, ""projects"".""issues_template"" AS t0_r21, ""projects"".""mirror"" AS t0_r22, ""projects"".""mirror_last_update_at"" AS t0_r23, ""projects"".""mirror_last_successful_update_at"" AS t0_r24, ""projects"".""mirror_user_id"" AS t0_r25, ""projects"".""shared_runners_enabled"" AS t0_r26, ""projects"".""runners_token"" AS t0_r27, ""projects"".""build_coverage_regex"" AS t0_r28, ""projects"".""build_allow_git_fetch"" AS t0_r29, ""projects"".""build_timeout"" AS t0_r30, ""projects"".""mirror_trigger_builds"" AS t0_r31, ""projects"".""public_builds"" AS t0_r32, ""projects"".""pending_delete"" AS t0_r33, ""projects"".""last_repository_check_failed"" AS t0_r34, ""projects"".""last_repository_check_at"" AS t0_r35, ""projects"".""container_registry_enabled"" AS t0_r36, ""projects"".""only_allow_merge_if_pipeline_succeeds"" AS t0_r37, ""projects"".""has_external_issue_tracker"" AS t0_r38, ""projects"".""repository_storage"" AS t0_r39, ""projects"".""request_access_enabled"" AS t0_r40, ""projects"".""has_external_wiki"" AS t0_r41, ""projects"".""repository_read_only"" AS t0_r42, ""projects"".""lfs_enabled"" AS t0_r43, ""projects"".""description_html"" AS t0_r44, ""projects"".""only_allow_merge_if_all_discussions_are_resolved"" AS t0_r45, ""projects"".""repository_size_limit"" AS t0_r46, ""projects"".""service_desk_enabled"" AS t0_r47, ""projects"".""printing_merge_request_link_enabled"" AS t0_r48, ""projects"".""auto_cancel_pending_pipelines"" AS t0_r49, ""projects"".""cached_markdown_version"" AS t0_r50, ""projects"".""last_repository_updated_at"" AS t0_r51, ""projects"".""ci_config_path"" AS t0_r52, ""projects"".""disable_overriding_approvers_per_merge_request"" AS t0_r53, ""projects"".""delete_error"" AS t0_r54, ""projects"".""storage_version"" AS t0_r55, ""projects"".""resolve_outdated_diff_discussions"" AS t0_r56, ""projects"".""remote_mirror_available_overridden"" AS t0_r57, ""projects"".""only_mirror_protected_branches"" AS t0_r58, ""projects"".""pull_mirror_available_overridden"" AS t0_r59, ""projects"".""jobs_cache_index"" AS t0_r60, ""projects"".""external_authorization_classification_label"" AS t0_r61, ""projects"".""mirror_overwrites_diverged_branches"" AS t0_r62, ""projects"".""external_webhook_token"" AS t0_r63, ""projects"".""pages_https_only"" AS t0_r64, ""projects"".""packages_enabled"" AS t0_r65, ""projects"".""merge_requests_author_approval"" AS t0_r66, ""projects"".""pool_repository_id"" AS t0_r67, 
""projects"".""runners_token_encrypted"" AS t0_r68, ""projects"".""bfg_object_map"" AS t0_r69, ""projects"".""detected_repository_languages"" AS t0_r70, ""projects"".""merge_requests_disable_committers_approval"" AS t0_r71, ""projects"".""require_password_to_approve"" AS t0_r72, ""projects"".""emails_disabled"" AS t0_r73, ""projects"".""max_pages_size"" AS t0_r74, ""projects"".""max_artifacts_size"" AS t0_r75, ""projects"".""pull_mirror_branch_prefix"" AS t0_r76, ""projects"".""remove_source_branch_after_merge"" AS t0_r77, ""projects"".""marked_for_deletion_at"" AS t0_r78, ""projects"".""marked_for_deletion_by_user_id"" AS t0_r79, ""projects"".""suggestion_commit_message"" AS t0_r80, ""projects"".""autoclose_referenced_issues"" AS t0_r81, ""routes"".""id"" AS t1_r0, ""routes"".""source_id"" AS t1_r1, ""routes"".""source_type"" AS t1_r2, ""routes"".""path"" AS t1_r3, ""routes"".""created_at"" AS t1_r4, ""routes"".""updated_at"" AS t1_r5, ""routes"".""name"" AS t1_r6 FROM ""projects"" LEFT OUTER JOIN ""routes"" ON ""routes"".""source_type"" = 'Project' AND ""routes"".""source_id"" = ""projects"".""id"" WHERE ((LOWER(routes.path) = LOWER('e-rahn/e-rahn-idp-app'))) ORDER BY ""projects"".""id"" ASC LIMIT 1 /*application:web,correlation_id:LsvVSXfLrp8*/",,,"puma: cluster worker 14: 3833 [gitlab-puma-worker] - 10.220.2.4"
explain analyze SELECT "projects"."id" AS t0_r0, "projects"."name" AS t0_r1, "projects"."path" AS t0_r2, "projects"."description" AS t0_r3, "projects"."created_at" AS t0_r4, "projects"."updated_at" AS t0_r5, "projects"."creator_id" AS t0_r6, "projects"."namespace_id" AS t0_r7, "projects"."last_activity_at" AS t0_r8, "projects"."import_url" AS t0_r9, "projects"."visibility_level" AS t0_r10, "projects"."archived" AS t0_r11, "projects"."merge_requests_template" AS t0_r12, "projects"."star_count" AS t0_r13, "projects"."merge_requests_rebase_enabled" AS t0_r14, "projects"."import_type" AS t0_r15, "projects"."import_source" AS t0_r16, "projects"."avatar" AS t0_r17, "projects"."approvals_before_merge" AS t0_r18, "projects"."reset_approvals_on_push" AS t0_r19, "projects"."merge_requests_ff_only_enabled" AS t0_r20, "projects"."issues_template" AS t0_r21, "projects"."mirror" AS t0_r22, "projects"."mirror_last_update_at" AS t0_r23, "projects"."mirror_last_successful_update_at" AS t0_r24, "projects"."mirror_user_id" AS t0_r25, "projects"."shared_runners_enabled" AS t0_r26, "projects"."runners_token" AS t0_r27, "projects"."build_coverage_regex" AS t0_r28, "projects"."build_allow_git_fetch" AS t0_r29, "projects"."build_timeout" AS t0_r30, "projects"."mirror_trigger_builds" AS t0_r31, "projects"."public_builds" AS t0_r32, "projects"."pending_delete" AS t0_r33, "projects"."last_repository_check_failed" AS t0_r34, "projects"."last_repository_check_at" AS t0_r35, "projects"."container_registry_enabled" AS t0_r36, "projects"."only_allow_merge_if_pipeline_succeeds" AS t0_r37, "projects"."has_external_issue_tracker" AS t0_r38, "projects"."repository_storage" AS t0_r39, "projects"."request_access_enabled" AS t0_r40, "projects"."has_external_wiki" AS t0_r41, "projects"."repository_read_only" AS t0_r42, "projects"."lfs_enabled" AS t0_r43, "projects"."description_html" AS t0_r44, "projects"."only_allow_merge_if_all_discussions_are_resolved" AS t0_r45, "projects"."repository_size_limit" AS t0_r46, "projects"."service_desk_enabled" AS t0_r47, "projects"."printing_merge_request_link_enabled" AS t0_r48, "projects"."auto_cancel_pending_pipelines" AS t0_r49, "projects"."cached_markdown_version" AS t0_r50, "projects"."last_repository_updated_at" AS t0_r51, "projects"."ci_config_path" AS t0_r52, "projects"."disable_overriding_approvers_per_merge_request" AS t0_r53, "projects"."delete_error" AS t0_r54, "projects"."storage_version" AS t0_r55, "projects"."resolve_outdated_diff_discussions" AS t0_r56, "projects"."remote_mirror_available_overridden" AS t0_r57, "projects"."only_mirror_protected_branches" AS t0_r58, "projects"."pull_mirror_available_overridden" AS t0_r59, "projects"."jobs_cache_index" AS t0_r60, "projects"."external_authorization_classification_label" AS t0_r61, "projects"."mirror_overwrites_diverged_branches" AS t0_r62, "projects"."external_webhook_token" AS t0_r63, "projects"."pages_https_only" AS t0_r64, "projects"."packages_enabled" AS t0_r65, "projects"."merge_requests_author_approval" AS t0_r66, "projects"."pool_repository_id" AS t0_r67, "projects"."runners_token_encrypted" AS t0_r68, "projects"."bfg_object_map" AS t0_r69, "projects"."detected_repository_languages" AS t0_r70, "projects"."merge_requests_disable_committers_approval" AS t0_r71, "projects"."require_password_to_approve" AS t0_r72, "projects"."emails_disabled" AS t0_r73, "projects"."max_pages_size" AS t0_r74, "projects"."max_artifacts_size" AS t0_r75, "projects"."pull_mirror_branch_prefix" AS t0_r76, "projects"."remove_source_branch_after_merge" 
AS t0_r77, "projects"."marked_for_deletion_at" AS t0_r78, "projects"."marked_for_deletion_by_user_id" AS t0_r79, "projects"."suggestion_commit_message" AS t0_r80, "projects"."autoclose_referenced_issues" AS t0_r81, "routes"."id" AS t1_r0, "routes"."source_id" AS t1_r1, "routes"."source_type" AS t1_r2, "routes"."path" AS t1_r3, "routes"."created_at" AS t1_r4, "routes"."updated_at" AS t1_r5, "routes"."name" AS t1_r6 FROM "projects" LEFT OUTER JOIN "routes" ON "routes"."source_type" = 'Project' AND "routes"."source_id" = "projects"."id" WHERE ((LOWER(routes.path) = LOWER('e-rahn/e-rahn-idp-app'))) ORDER BY "projects"."id" ASC LIMIT 1; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=1.00..17.56 rows=1 width=807) (actual time=34728.127..34728.128 rows=1 loops=1) -> Nested Loop (cost=1.00..1253785.01 rows=75690 width=807) (actual time=34728.126..34728.126 rows=1 loops=1) -> Index Scan using index_routes_on_source_type_and_source_id on routes (cost=0.56..1040277.49 rows=79168 width=80) (actual time=34727.973..34727.973 rows=1 loops=1) Index Cond: ((source_type)::text = 'Project'::text) Filter: (lower((path)::text) = 'e-rahn/e-rahn-idp-app'::text) Rows Removed by Filter: 15827982 -> Index Scan using projects_pkey on projects (cost=0.43..2.70 rows=1 width=727) (actual time=0.129..0.129 rows=1 loops=1) Index Cond: (id = routes.source_id) Planning Time: 0.682 ms Execution Time: 34728.200 ms
gitlabhq_production=# \timingTiming is on.gitlabhq_production=# analyze routes;ANALYZETime: 7133.985 ms (00:07.134)gitlabhq_production=# explain analyze SELECT "projects"."id" AS t0_r0, "projects"."name" AS t0_r1, "projects"."path" AS t0_r2, "projects"."description" AS t0_r3, "projects"."created_at" AS t0_r4, "projects"."updated_at" AS t0_r5, "projects"."creator_id" AS t0_r6, "projects"."namespace_id" AS t0_r7, "projects"."last_activity_at" AS t0_r8, "projects"."import_url" AS t0_r9, "projects"."visibility_level" AS t0_r10, "projects"."archived" AS t0_r11, "projects"."merge_requests_template" AS t0_r12, "projects"."star_count" AS t0_r13, "projects"."merge_requests_rebase_enabled" AS t0_r14, "projects"."import_type" AS t0_r15, "projects"."import_source" AS t0_r16, "projects"."avatar" AS t0_r17, "projects"."approvals_before_merge" AS t0_r18, "projects"."reset_approvals_on_push" AS t0_r19, "projects"."merge_requests_ff_only_enabled" AS t0_r20, "projects"."issues_template" AS t0_r21, "projects"."mirror" AS t0_r22, "projects"."mirror_last_update_at" AS t0_r23, "projects"."mirror_last_successful_update_at" AS t0_r24, "projects"."mirror_user_id" AS t0_r25, "projects"."shared_runners_enabled" AS t0_r26, "projects"."runners_token" AS t0_r27, "projects"."build_coverage_regex" AS t0_r28, "projects"."build_allow_git_fetch" AS t0_r29, "projects"."build_timeout" AS t0_r30, "projects"."mirror_trigger_builds" AS t0_r31, "projects"."public_builds" AS t0_r32, "projects"."pending_delete" AS t0_r33, "projects"."last_repository_check_failed" AS t0_r34, "projects"."last_repository_check_at" AS t0_r35, "projects"."container_registry_enabled" AS t0_r36, "projects"."only_allow_merge_if_pipeline_succeeds" AS t0_r37, "projects"."has_external_issue_tracker" AS t0_r38, "projects"."repository_storage" AS t0_r39, "projects"."request_access_enabled" AS t0_r40, "projects"."has_external_wiki" AS t0_r41, "projects"."repository_read_only" AS t0_r42, "projects"."lfs_enabled" AS t0_r43, "projects"."description_html" AS t0_r44, "projects"."only_allow_merge_if_all_discussions_are_resolved" AS t0_r45, "projects"."repository_size_limit" AS t0_r46, "projects"."service_desk_enabled" AS t0_r47, "projects"."printing_merge_request_link_enabled" AS t0_r48, "projects"."auto_cancel_pending_pipelines" AS t0_r49, "projects"."cached_markdown_version" AS t0_r50, "projects"."last_repository_updated_at" AS t0_r51, "projects"."ci_config_path" AS t0_r52, "projects"."disable_overriding_approvers_per_merge_request" AS t0_r53, "projects"."delete_error" AS t0_r54, "projects"."storage_version" AS t0_r55, "projects"."resolve_outdated_diff_discussions" AS t0_r56, "projects"."remote_mirror_available_overridden" AS t0_r57, "projects"."only_mirror_protected_branches" AS t0_r58, "projects"."pull_mirror_available_overridden" AS t0_r59, "projects"."jobs_cache_index" AS t0_r60, "projects"."external_authorization_classification_label" AS t0_r61, "projects"."mirror_overwrites_diverged_branches" AS t0_r62, "projects"."external_webhook_token" AS t0_r63, "projects"."pages_https_only" AS t0_r64, "projects"."packages_enabled" AS t0_r65, "projects"."merge_requests_author_approval" AS t0_r66, "projects"."pool_repository_id" AS t0_r67, "projects"."runners_token_encrypted" AS t0_r68, "projects"."bfg_object_map" AS t0_r69, "projects"."detected_repository_languages" AS t0_r70, "projects"."merge_requests_disable_committers_approval" AS t0_r71, "projects"."require_password_to_approve" AS t0_r72, "projects"."emails_disabled" AS t0_r73, "projects"."max_pages_size" AS t0_r74, 
"projects"."max_artifacts_size" AS t0_r75, "projects"."pull_mirror_branch_prefix" AS t0_r76, "projects"."remove_source_branch_after_merge" AS t0_r77, "projects"."marked_for_deletion_at" AS t0_r78, "projects"."marked_for_deletion_by_user_id" AS t0_r79, "projects"."suggestion_commit_message" AS t0_r80, "projects"."autoclose_referenced_issues" AS t0_r81, "routes"."id" AS t1_r0, "routes"."source_id" AS t1_r1, "routes"."source_type" AS t1_r2, "routes"."path" AS t1_r3, "routes"."created_at" AS t1_r4, "routes"."updated_at" AS t1_r5, "routes"."name" AS t1_r6 FROM "projects" LEFT OUTER JOIN "routes" ON "routes"."source_type" = 'Project' AND "routes"."source_id" = "projects"."id" WHERE ((LOWER(routes.path) = LOWER('e-rahn/e-rahn-idp-app'))) ORDER BY "projects"."id" ASC LIMIT 1; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------ Limit (cost=7.05..7.05 rows=1 width=787) (actual time=1.571..1.571 rows=1 loops=1) -> Sort (cost=7.05..7.05 rows=1 width=787) (actual time=1.570..1.570 rows=1 loops=1) Sort Key: projects.id Sort Method: quicksort Memory: 25kB -> Nested Loop (cost=1.00..7.04 rows=1 width=787) (actual time=1.560..1.561 rows=1 loops=1) -> Index Scan using index_on_routes_lower_path on routes (cost=0.56..3.58 rows=1 width=80) (actual time=1.535..1.536 rows=1 loops=1) Index Cond: (lower((path)::text) = 'e-rahn/e-rahn-idp-app'::text) Filter: ((source_type)::text = 'Project'::text) -> Index Scan using projects_pkey on projects (cost=0.43..3.45 rows=1 width=707) (actual time=0.020..0.020 rows=1 loops=1) Index Cond: (id = routes.source_id) Planning Time: 1.931 ms Execution Time: 1.649 ms
Analyze was fast (7 seconds). It made the Postgres query planner switch one index on routes for another one and add an extra sort stage. However, this plan is many orders of magnitude better, taking 2 ms to execute the query (vs. 35K ms).
I think "10:14 incident starts" above is wrong. It should be 9:14 and that removes the mystery of why there would be a 1 hour delay between manipulating the index and the incident starting. There was no 1 hour delay.
Database load in the replicas seems to have gone back to pre-incident levels. See graph below. Dashed blue vertical line is the instant the analyze was executed:
What we can conclude for now is that the ANALYZE routes; fixed the statistics for the offending query and made the plan improve dramatically. A red flag showing how bad the original plan was: an index scan was returning 80K rows (a very high value for an index read) and, worse, it was filtering out 15M rows.
Why were the statistics on this table so far off? It also appears clearly correlated with the reindex operation that was run, which started at practically the same time as the incident. The next step is to investigate why this reindex operation caused the statistics to be off.
The plan is to work on this on Monday next week, now that the incident is resolved.
The logs for index_on_routes_lower_path for today:
{"severity":"INFO","time":"2020-10-24T09:12:25.669Z","correlation_id":null,"message":"Starting reindex of index_on_routes_lower_path"}{"severity":"INFO","time":"2020-10-24T09:12:25.672Z","correlation_id":null,"message":"creating replacement index tmp_reindex_11902236"}{"severity":"DEBUG","time":"2020-10-24T09:12:25.673Z","correlation_id":null,"message":"replacement index definition: CREATE INDEX CONCURRENTLY tmp_reindex_11902236 ON public.routes USING btree (lower((path)::text))"}{"severity":"INFO","time":"2020-10-24T09:13:59.928Z","correlation_id":null,"message":"swapping replacement index tmp_reindex_11902236 with index_on_routes_lower_path"}{"severity":"INFO","time":"2020-10-24T09:13:59.930Z","correlation_id":null,"method":"with_lock_retries","class":"Gitlab::Database::Reindexing::ConcurrentReindex","message":"Lock timeout is set","current_iteration":1,"lock_timeout_in_ms":100}{"severity":"INFO","time":"2020-10-24T09:13:59.935Z","correlation_id":null,"method":"with_lock_retries","class":"Gitlab::Database::Reindexing::ConcurrentReindex","message":"Migration finished","current_iteration":1,"lock_timeout_in_ms":100}{"severity":"INFO","time":"2020-10-24T09:13:59.938Z","correlation_id":null,"message":"Removing index public.tmp_reindex_11902236"}
So this process has three steps:
1. Create a new index tmp_reindex_11902236 with the same definition as index_on_routes_lower_path
2. In a transaction, rename and swap tmp_reindex_11902236 with index_on_routes_lower_path
3. Drop the old tmp_reindex_11902236
As a result, the index_on_routes_lower_path index has been swapped with a freshly built index.
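For illustration, this is roughly what those three steps amount to in SQL, based on the reindexing log above. The actual implementation lives in Gitlab::Database::Reindexing::ConcurrentReindex; the intermediate rename and the lock-retry handling shown here are simplified assumptions, not the exact statements the job issues.

-- 1) Build the replacement index (definition taken verbatim from the log above)
CREATE INDEX CONCURRENTLY tmp_reindex_11902236 ON public.routes USING btree (lower((path)::text));

-- 2) Swap the two indexes by renaming inside a short transaction (lock timeout / retries omitted)
BEGIN;
ALTER INDEX index_on_routes_lower_path RENAME TO tmp_reindex_old;      -- intermediate name is illustrative
ALTER INDEX tmp_reindex_11902236 RENAME TO index_on_routes_lower_path;
ALTER INDEX tmp_reindex_old RENAME TO tmp_reindex_11902236;
COMMIT;

-- 3) Drop what is now the old index (cannot run inside a transaction)
DROP INDEX CONCURRENTLY tmp_reindex_11902236;

-- The missing step that caused this incident: rebuild the expression statistics
ANALYZE public.routes;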
There should never be a time where the index is not usable/does not exist. We make sure the newly created index is not marked INVALID before switching it. There is a time between steps where the same index definition exists twice (the newly built, temporary index and the original one).
In comparison, pg_repack performs step (2) at a lower level and swaps data files instead of renaming indexes.
Sounds like there is state somewhere in PG that remembers it has to use the old index, in spite of EXPLAIN in (3) picking up the new index. And running ANALYZE would then reset that state.
As an aside, it seems that as long as there is statistical information available for the expression (from the original index), the new index will already get used. We can see that as soon as the new index has been created, it gets used - even though it doesn't have direct statistics associated with it. Only when we drop the original index do we also remove the statistics information - which leads to the index not being used anymore.
The ANALYZE is needed here because the index is a functional index (lower(path)).
A non-functional, complete index can rely on the base table's statistics, since values are obviously the same. However, a functional index may contain different values; a partial index (with WHERE) obviously contains a subset of the rows. In both cases, statistics for the base table are not applicable.
So yes, an extra ANALYZE will be required for indexes that are either functional or partial, prior to dropping the original one.
For a freshly created functional index, these statistics are indeed not present until the next ANALYZE. This is a takeaway that is also relevant in general, since we should be running ANALYZE after creating those indexes through migrations. Created gitlab-org/gitlab#272992 to track.
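For reference, a quick way to check whether a functional index currently has expression statistics is to count its pg_statistic entries, as the test cases further down also do. This is a sketch assuming direct psql access, with index_on_routes_lower_path used as the example.

-- Expression statistics for a functional index live under the index's own OID
SELECT count(*) FROM pg_statistic WHERE starelid = 'index_on_routes_lower_path'::regclass;

-- A count of 0 right after a (re)build means the planner has nothing to work with;
-- rebuilding the statistics is a plain ANALYZE on the table
ANALYZE routes;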
an extra ANALYZE will be required for indexes that are either functional or partial,
@ahachete This is reproducible for functional indexes, but I wasn't sure where to look for statistics for partial indexes. How can we validate that this behavior also applies to partial indexes?
Correct - partial indexes use the same statistics as the main table; my earlier comment was too quick. Partial indexes do not have a different distribution, binning, or set of values from that of the main table. Even though they may contain just a subset of the entries, the values don't change. They would require separate statistics if you could query the index, and just the index, as if it were a separate table - but that's not the case.
Ok, thanks for confirming @ahachete - this aligns with my (now updated) understanding, too. So we would only need to run ANALYZE when touching functional indexes, not for regular or partial indexes.
Interestingly, not all functional indexes require this - for example, an index on COALESCE(a, b) doesn't yield separate statistics in my testing. So there is smartness baked into that. We don't need to do the same though as it doesn't hurt to just ANALYZE the table afterwards.
FWIW, we also only have a low number of functional indexes anyways - so this is rather the exceptional case, too.
In the fog of the incident, it didn't occur to me to look at pg_stat_statements. Given the time of week of the incident, with no application changes or feature flag toggles, I initially assumed it was user misuse.
Checking in on pg_stat_statements with hindsight, we see that we had a very strong signal indicating the problem:
We clearly see that three queries switched from consuming almost no database time to dominating it.
The queries had ids 4587915552457638526, 1735293465391623069 and 1575299198194618637
Focusing on just these queries, we see that the time spent on these queries went from close to 0 before the incident, to many seconds-per-second per postgres host.
Incidentally, these three queries are all of the form:
SELECT ... FROM "projects" LEFT OUTER JOIN "routes" ON "routes"."source_type" = $1 AND "routes"."source_id" = "projects"."id" WHERE ((LOWER(routes.path) = LOWER($2))) ORDER BY "projects"."id" ASC LIMIT $3
More hindsight: we also had some signals from the Postgres slowlogs. This chart shows the total duration logged in slowlog entries from web and api workers
As a corrective action, having non-redacted logs of the SQL statements may have saved us from having to log into postgres servers and sift through logs. This has currently been disabled due to security concerns, but we should consider whether we could redact string values (i.e., everything between quotes) from SQL statements instead of redacting the full statement.
Agreed. There should be no changes to Postgres below C2 on anything that is not part of the regular functional application code. Also, these changes must always be reviewed by a DBRE (which did not happen on #2849 (closed)).
@cdu1 @craig-gomes - we will need help from your team to investigate and author the incident review. Please coordinate with @dawsmith (IMOC for this incident).
@albertoramos your team's help will also be needed here, in reviewing the CM workflow and outcome.
I can be the author and driver for the incident review. This will be my first in this role and I may have some questions. Who is the best person to contact if I have questions after reading through the process?
@craig-gomes - I can be that person. In this case, I would be happy to set up a meeting to "pair" on filling things in if you do have some questions after a first look.
@hphilipps and I can be responsible for some questions around timeline and response, but we'd really appreciate some help on explaining the root cause and 5 whys around the cause. It looks like @abrandl's comments above do a good job explaining the how.
It may be worthwhile figuring out if we would have had a chance finding this issue in staging.
In staging, the reindexing feature had been enabled for a couple of days (I just disabled it now). It rebuilt 81 indexes, among which were 3 functional indexes. No issues had been detected with this through regular QA testing.
OTOH, it didn't hit the index on routes (like in this incident) - which is a critical one indeed.
The (broken) assumption here was that it would be enough to see a couple of invocations of reindexing, and that we didn't need to reindex all indexes in staging to validate that this works correctly.
What I don't know is if we would have noticed the same problem in staging, given staging is much smaller in size. Would it be ok to simulate the problem in staging and see if we'd notice?
@abrandl - thank you. This is an interesting note. How does the cron determine which indexes to re-index then? Just checking that broken assumption more. If we tested in staging, but the re-index then picked different tables, is there some way to make the tests between staging and production more deterministic?
How does the cron determine which indexes to re-index then?
Currently we choose indexes at random. This is an MVP kind of solution. We plan to replace this with estimating bloat levels and choosing the most impactful indexes first (highest bloat ratio, largest bloat size, or similar).
If we tested in staging, but then the re-index picked different tables, is there some way to make the tests between staging and production more deterministic?
We've used staging more in terms of validating correctness of the procedure - i.e. checking the desired outcome is matched. This of course relies on your own perception of what you think needs to happen - which didn't include the problem we ran into.
Ideally, we'd have a production-sized dedicated environment for this. We would reindex all existing indexes and then perform some kind of testing on it. This could have surfaced the issue at hand.
So a combination of a thin cloned production database, a separate GitLab environment for it and running gitlab-qa might have helped here.
A potential corrective action that we should consider: adding a causal alert that notifies us when a single query or small set of queries is dominating postgres.
We could use anomaly detection for this, but this may be tricky as the pg_stat_statements metrics have very high cardinality and using 1w ranges for stddev_over_time queries may impact Prometheus adversely.
An alternative approach, which I suspect may have pretty good accuracy too, would be an alert along the lines of "if the top 3 statements, by total query time, are consuming greater than X% of the total for 5 minutes or more".
This query shows how more than 80% of time spent in postgres queries was dominated by 3 queries. In a healthy database, load is spread through a far wider variety of queries.
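As a rough sketch of what such a check could compute (assuming pg_stat_statements and the pre-PostgreSQL-13 total_time column; a real alert would be expressed as a rate over the exported Prometheus metrics rather than over these cumulative counters):

-- Share of cumulative query time consumed by the top 3 statements
SELECT round((100 * sum(total_time) FILTER (WHERE rnk <= 3) / sum(total_time))::numeric, 1) AS top3_pct
FROM (
  SELECT total_time,
         rank() OVER (ORDER BY total_time DESC) AS rnk
  FROM pg_stat_statements
) ranked;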
Why haven't we encountered this before? We have done manual repacking a few times.
In previous repacking efforts, we've used manual invocations of pg_repack. This implements index repacking a bit differently, on a lower level. It swaps data files on disk instead of renaming and dropping the old index, which effectively retains the statistical information and doesn't require ANALYZE.
We've chosen to implement this in the application as an automated feature. This way, we can only leverage standard SQL - hence we ran into this issue, since we now had to rename and drop the existing index.
Why did we lose statistics information while reindexing?
We follow the procedure recommended in Postgres documentation: Create a new, identical index - swap indexes by renaming - ultimately drop the old index.
A subset of functional indexes needs statistical information to inform the query planner. Those statistics are tied to the existence of the index and are not built automatically with the index. Hence, when creating the new index, we need to additionally build those statistics. This did not happen, so as soon as we dropped the original index, those statistics were effectively missing.
This in turn led to the query planner making the wrong choice, effectively not using the index. Hence we saw a significant increase in query times and database load.
Why didn't this cause problems in staging?
In staging, we enabled reindexing for about a week prior to production. It touched about 80 indexes and this didn't lead to any noticeable problems. The index that caused this incident was not among those 80 indexes.
It is unknown at this point whether we would even have noticed in staging, given its smaller size and much lower traffic. Statistics are also auto-generated (autoanalyze) in the background, so we may not even have run into the same situation (timing-wise).
Just to clarify any confusion, PostgreSQL supports functions in index expressions, and we often call this kind of index a functional index.
For the planner to correctly understand the values generated by functional indexes, PostgreSQL needs to generate statistics for them. Regular indexes (even partial ones) don't require index statistics, that is, a new pg_statistic entry.
So when we create a new functional index or rebuild it concurrently (because that creates a new relation), we SHOULD ALWAYS execute ANALYZE immediately after the (re)build process.
Have a look at the self-contained test case below:
-- 1) Setup
BEGIN;
DROP TABLE IF EXISTS t;
CREATE TABLE t (f1 BIGSERIAL PRIMARY KEY, f2 TEXT) WITH (autovacuum_enabled=false);
INSERT INTO t(f2) SELECT repeat(chr(65+(random()*26)::int), (random()*300)::int) FROM generate_series(1, 10000);
CREATE INDEX t_idx1 ON t(f2);
CREATE INDEX t_idx2 ON t(lower(f2));
COMMIT;

-- 2) Check statistics (should have ZERO statistics for all indexes)
SELECT count(*) FROM pg_statistic WHERE starelid = 't'::regclass;
SELECT count(*) FROM pg_statistic WHERE starelid = 't_pkey'::regclass;
SELECT count(*) FROM pg_statistic WHERE starelid = 't_idx1'::regclass;
SELECT count(*) FROM pg_statistic WHERE starelid = 't_idx2'::regclass;

-- 3) Run ANALYZE to create table and functional index statistics
ANALYZE t;

-- 4) Check statistics (statistics should exist for INDEX t_idx2)
SELECT count(*) FROM pg_statistic WHERE starelid = 't'::regclass;
SELECT count(*) FROM pg_statistic WHERE starelid = 't_pkey'::regclass;
SELECT count(*) FROM pg_statistic WHERE starelid = 't_idx1'::regclass;
SELECT count(*) FROM pg_statistic WHERE starelid = 't_idx2'::regclass;

-- 5) Create new index CONCURRENTLY and swap it. Here we'll end up
-- with a new fresh functional index without statistics :(
CREATE INDEX CONCURRENTLY _tmp_idx2 ON t(lower(f2));
BEGIN;
ALTER INDEX t_idx2 RENAME TO _old_t_idx2;
ALTER INDEX _tmp_idx2 RENAME TO t_idx2;
COMMIT;
DROP INDEX CONCURRENTLY _old_t_idx2;
SELECT count(*) FROM pg_statistic WHERE starelid = 't'::regclass;
SELECT count(*) FROM pg_statistic WHERE starelid = 't_pkey'::regclass;
SELECT count(*) FROM pg_statistic WHERE starelid = 't_idx1'::regclass;
SELECT count(*) FROM pg_statistic WHERE starelid = 't_idx2'::regclass;

-- 6) Run ANALYZE to create table and functional index statistics
ANALYZE t;

-- 7) Check statistics (should exist for t_idx2)
SELECT count(*) FROM pg_statistic WHERE starelid = 't'::regclass;
SELECT count(*) FROM pg_statistic WHERE starelid = 't_pkey'::regclass;
SELECT count(*) FROM pg_statistic WHERE starelid = 't_idx1'::regclass;
SELECT count(*) FROM pg_statistic WHERE starelid = 't_idx2'::regclass;
Great example @fabrizio.ongres. Here is also an additional example showing that in the case of pg_repack, the stats persist (as @abrandl well explained above):
postgres=# create table t1 as select i from generate_series(1, 1000000) i;
SELECT 1000000
postgres=# create index i1 on t1 using btree((i+1));
CREATE INDEX
postgres=# select count(*) from pg_statistic where starelid = (select oid from pg_class where relname = 'i1');
 count
-------
     0
(1 row)
postgres=# analyze t1;
ANALYZE
postgres=# select count(*) from pg_statistic where starelid = (select oid from pg_class where relname = 'i1');
 count
-------
     1
(1 row)
postgres=# \! pg_repack --only-indexes -t t1 postgres -e 2>&1 | grep -i analyze
postgres=#
postgres=# -- no ANALYZE reported ^^
postgres=#
postgres=# select count(*) from pg_statistic where starelid = (select oid from pg_class where relname = 'i1');
 count
-------
     1
(1 row)
postgres=# -- stats are still there ^^
Hi @NikolayS, thanks for sending the documentation patch to -hackers; I will review it.
The pg_repack tool preserves the statistics because the old catalog entry is not dropped. At the end of the repack it just swaps the catalog entries between the new and old relation:
Even when swapping the catalog, pg_repack also executes an ANALYZE by default on the base table at the end of the rebuild. This was introduced a long time ago and is safe because the rebuild process can take a long time and a lot of new data can change the statistics, so it is safe to run it as a post-task.
https://github.com/reorg/pg_repack/commit/5fe3f037
On the Postgres side, the big problems I can see are:
Lack of documentation about the importance of statistics for functional indexes.
With the new CONCURRENTLY option for REINDEX, the statistics for functional indexes are lost, whereas with a normal REINDEX this doesn't happen - so to me it's a bug.
Even when swapping the catalog, pg_repack also executes an ANALYZE by default on the base table at the end of the rebuild.
Note that in my example, I use -e, which reports all SQL commands issued; ANALYZE is not among them. When processing only index(es), pg_repack doesn't issue ANALYZE, and I don't see it in the links you've provided (I see ANALYZE at the end of processing when we deal with table-level repacking, and also the ability to skip it using --no-analyze - neither of which is relevant for index-only processing). I think we all agree that pg_repack doesn't have the problem of losing statistics because - as you mentioned - it swaps files at the physical level, so the old statistics are preserved.
As for the issue of REINDEX CONCURRENTLY – a great find! We all should be very careful.
@NikolayS you're correct... Sorry for the confusion! I misunderstood that part of the pg_repack source code a bit. With --only-indexes it doesn't issue ANALYZE; it only does so when a table is provided.
About the REINDEX CONCURRENTLY issue, we're working with the Postgres community to find a good fix for it.
This is a viable approach @Finotto (update pg_index ... set indisvalid = false where ...). Such an index will continue receiving updates but won't be used by the planner. Updating indisvalid back to true would bring the index back into the picture. This is an unofficial way of dealing with indexes, although I have used it several times and it worked as expected.
I agree that this would be helpful for this particular case and for the index maintenance not involving pg_repack; there is a separate problem with testing practices though, we can improve them.
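For reference, a minimal sketch of the indisvalid approach described above - a direct catalog update that needs superuser privileges and is, as noted, unofficial; index_on_routes_lower_path is just the example index:

-- Hide the index from the planner; it still receives updates, so re-enabling is instant
UPDATE pg_index SET indisvalid = false
WHERE indexrelid = 'index_on_routes_lower_path'::regclass;

-- Rollback: let the planner consider the index again
UPDATE pg_index SET indisvalid = true
WHERE indexrelid = 'index_on_routes_lower_path'::regclass;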
@Finotto @NikolayS We've indeed used this approach before, for example to validate that an index isn't needed anymore.
What isn't entirely clear to me with this suggestion is how this should work for an automated maintenance.
Disabling the index to allow for an "easier rollback" later effectively suggests that we intend to keep the disabled index around for a while. During this time, it'll cause overhead as it's continuously going to be updated. That's perhaps negligible, but we will need to work on many indexes in sequence - so there would be a time where quite a few disabled indexes would be around, which may incur considerable overhead.
How do we (automatically) decide when to remove the index finally? Would we wait an hour or a day until we drop it?
In case there is indeed a problem with the new index, what would the rollback look like? I doubt we'd go in and manually rename and re-enable the index. This would have to be automated, too.
Considering this situation: We have two valid and identical indexes. Once we drop one of them, we still have one valid index around. There may be side effects from that (like with the missing statistics), but the index itself is intact and usable. Hence, we don't risk prolonged times where we'd have to build an entirely new index (which can take even hours).
I'm not on board with this approach yet - I think it would cause more problems and complexity than it solves. I'm happy to be convinced though, so feel free to add more details or create a separate issue for the proposal.
I was thinking about this issue: what if we create an EventTrigger that calls ANALYZE at ddl_command_end for any CREATE INDEX where pg_index.indexprs IS NOT NULL (just indexes on expressions)?
@fabrizio.ongres I suspect you meant pg_index.indexprs IS NOT NULL? Just to double check - partial indexes don't suffer from this (they don't have extra statistics).
I think we'll end up just implementing the additional ANALYZE in the reindexing logic, but the EventTrigger is an interesting option, too. Thanks for that!
@fabrizio.ongres I suspect you meant pg_index.indexprs IS NOT NULL? Just to double check - partial indexes don't suffer from this (they don't have extra statistics).
You're correct @abrandl thanks for point it out. Fixed my previous comment to avoid any confusion.
I think we'll end up just implementing the additional ANALYZE in the reindexing logic, but the EventTrigger is an interesting option, too. Thanks for that!
Cool! But just keep in mind that the EventTrigger can take care of CREATE INDEX too. I mean, when you create a new index (as part of your dev process) and this index is on an expression, it will need an ANALYZE. So the idea of the EventTrigger is to cover both cases: 1) reindexing and 2) new indexes on expressions.
But just keep in mind that the EventTrigger can take care of CREATE INDEX too.
Good point, thanks @fabrizio.ongres . I do like the idea - but, IIRC, you will have to have SUPERUSER to create event triggers - which is unfortunately not the case for us typically. So I think we'll have to make this explicit - both for reindexing gitlab-org/gitlab#272997 (closed) but also any helpers that add functional indexes gitlab-org/gitlab#272992. Please let me know if you see a way around that, it'd be nice to patch this away in the background.
@abrandl unfortunately the permission system hasn't evolved enough to allow us to grant CREATE EVENT TRIGGER to non-superusers.
Anyway, every time I needed a regular user to execute a statement as a superuser (aka sudo), I implemented a PL/pgSQL function as a wrapper around the permission system to EXECUTE statements. But there is a big security risk, because with this approach a regular user can execute anything as SUPERUSER... though we can implement some logic inside to prevent users from running certain commands.
Anyway, I'll leave here an example of the EventTrigger I have in mind:
CREATE OR REPLACE FUNCTION analyze_index_on_expression()
RETURNS event_trigger AS $$
DECLARE
  r RECORD;
BEGIN
  FOR r IN
    SELECT indrelid::regclass::text AS table_name,
           object_identity AS index_name
    FROM pg_event_trigger_ddl_commands()
    JOIN pg_index ON indexrelid = objid AND indexprs IS NOT NULL
  LOOP
    RAISE LOG 'Index % using expression created! Executing analyze on table %', r.index_name, r.table_name;
    EXECUTE format('ANALYZE %s', r.table_name);
  END LOOP;
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER trigger_analyze_index_on_expression
  ON ddl_command_end
  WHEN tag IN ('CREATE INDEX')
  EXECUTE PROCEDURE analyze_index_on_expression();
As noted upthread, I believe one of the keys is to thoroughly test any migration or database operation on a proper test environment.
This incident proves that minor bugs, even during lower-activity periods, may cause significant degradation - even total effective downtime, as happened here. Testing the migrations is not enough. Testing on development or staging environments is not enough either.
These changes need to be tested on an environment as close to production as possible. This involves two dimensions:
Data equal or extremely close to production data.
Production or close-to-production traffic. Ideally, we should use traffic mirroring for this. Unfortunately, there are no practical solutions for this yet. Alternatively, we can simulate production traffic. There was work done in this regard for the 9.6-11 upgrade, and there's more in progress in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11608. We may want to leverage this until proper traffic mirroring can be used.
@dawsmith - For the corrective action number 8 related to following along on a DB change, I propose we track this suggested CA via an issue that creates an MR against the workflow for the Change Management handbook page.
While working on the reindexing fix, I noticed that there are a couple of inconsistencies in terms of database structure (production/staging vs. what we expect in the codebase), which are related to this incident:
Even if we had executed the reindexing on the problematic index (the one that caused the outage) in staging, this extra/unexpected index (see 2 above) would have hidden the problem and we would never have noticed in staging.
To be extra clear - no, we did not run reindexing on the problematic index in staging (but on many other indexes). Point is just that even if we had, it likely would have gone unnoticed due to the inconsistency.
Hence it's important that we understand inconsistencies and align the database structure in staging and production with what we have in the codebase as closely as possible. The epic is gitlab-org&3928 (comment 444891958), since this isn't only a GitLab.com problem - but GitLab.com is likely the biggest offender in that context (longest-living installation). It also has a good overview of differences compared to what we expect in the codebase.