Multiple spec failures with error PG::QueryCanceled: ERROR: canceling statement due to statement timeout
Problem Summary
Similar issue: #390752 (closed)
Today a job was reported to have taken over 100 minutes because 135 specs failed with the error PG::QueryCanceled: ERROR: canceling statement due to statement timeout.
Complete error message:
ActionView::Template::Error:
PG::QueryCanceled: ERROR: canceling statement due to statement timeout
CONTEXT: while inserting index tuple (0,51) in relation "index_plans_on_name"
The specs involved are:
spec/features/discussion_comments/merge_request_spec.rb
spec/features/protected_tags_spec.rb
spec/features/issuables/markdown_references/internal_references_spec.rb
spec/features/projects/container_registry_spec.rb
spec/features/merge_requests/user_lists_merge_requests_spec.rb
spec/features/issues/incident_issue_spec.rb
spec/features/merge_request/merge_request_discussion_lock_spec.rb
spec/features/projects/branches/user_creates_branch_spec.rb
spec/features/projects/integrations/user_uses_inherited_settings_spec.rb
spec/features/search/user_searches_for_users_spec.rb
spec/features/explore/user_explores_projects_spec.rb
spec/features/tags/developer_views_tags_spec.rb
Proposed steps
Investigate what contributed to the statement timeout. If the automatic retry consistently results in a long-running job like this one, we should also consider alternative approaches: either set a threshold for how many tests are allowed to retry, or stop retrying for such occurrences and let it fail.
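As a rough illustration of the threshold idea, here is a minimal sketch in plain Ruby. The `RetryBudget` class and its `max_retried_examples` parameter are hypothetical names invented for this example; they are not part of GitLab's tooling or of any retry gem, and a real implementation would need to hook into whatever retry mechanism the suite uses.

```ruby
# Hypothetical helper: caps how many failed examples may be retried in one job.
# Not part of GitLab's tooling; names and threshold are assumptions.
class RetryBudget
  def initialize(max_retried_examples:)
    @max_retried_examples = max_retried_examples
    @retried = 0
  end

  # Returns true and consumes one unit of budget if a retry is still allowed;
  # returns false once the budget is spent, so the failure is reported as-is.
  def allow_retry?
    return false if exhausted?

    @retried += 1
    true
  end

  def exhausted?
    @retried >= @max_retried_examples
  end
end

# Four failing examples ask for a retry, but only two retries are permitted.
budget = RetryBudget.new(max_retried_examples: 2)
decisions = Array.new(4) { budget.allow_retry? }
```

With a budget like this, a job with 135 failing specs would stop retrying after the first few failures instead of re-running every one of them.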
Investigation Summary
We found that these failed jobs always start with ActiveRecord::RecordNotFound errors such as:
ActiveRecord::RecordNotFound:
Couldn't find Project with 'id'=6
and
!!! before_all transaction has been already rollbacked and could work incorrectly
Subsequent retries, or other tests that have state dependencies, could then result in the statement timeout error described in the title.
The misused let_it_be is the culprit. See the required actions below for how we should mitigate this problem going forward.
Resolution/Required Actions
After thorough investigation, we believe the problem is not limited to the specs above, as the error is found in multiple specs, with the timeout occurring on different indexes. This is caused by abusing let_it_be in tests without properly avoiding leaking state, as described in https://github.com/test-prof/test-prof/blob/ccd99b169b9e54c6ad7d705a9088919bad75ad1f/docs/recipes/let_it_be.md#state-leakage-detection
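To make the leak concrete, here is a plain-Ruby simulation; it does not use RSpec or test-prof. The `Project` struct, the `DB` hash, and `load_project` are hypothetical stand-ins for a model, its database row, and a finder. The two mitigations shown mirror, as far as I understand the linked documentation, what the `reload: true` / `refind: true` and `freeze: true` modifiers of let_it_be are meant to provide.

```ruby
# Plain-Ruby simulation of state leaking through an object shared across examples.
# All names here are hypothetical stand-ins, not test-prof or GitLab APIs.
Project = Struct.new(:id, :name)

# Stand-in for the database row the shared object was built from.
DB = { 6 => { name: "original" } }.freeze

def load_project(id)
  Project.new(id, DB.fetch(id)[:name].dup)
end

# Built once and reused by every "example", like a let_it_be record.
shared_project = load_project(6)

# Example 1 mutates the shared object in memory. In a real suite the database
# change would be rolled back, but the Ruby object keeps the new value.
shared_project.name = "renamed"

# Example 2 reads the same object and sees the state leaked from example 1.
leaked_name = shared_project.name

# Mitigation A: load a fresh copy for each example (the reload/refind idea).
fresh_project = load_project(6)

# Mitigation B: freeze the shared object so an accidental mutation fails
# loudly instead of leaking silently (the freeze idea).
frozen_project = load_project(6).freeze
mutation_detected = false
begin
  frozen_project.name = "renamed again"
rescue FrozenError
  mutation_detected = true
end
```

Freezing only detects the problem; the spec that mutates the record still has to be changed to create its own data or to work on a reloaded copy.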
Required actions for closing this issue:
- Update https://gitlab.com/gitlab-org/gitlab/-/blob/master/tooling/danger/specs.rb to include instructions on how to avoid leaking states, and document this under https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html
Follow up Actions:
- Identify all places in the repo where specs leak state and address each file. This, however, is going to be an ongoing task and will take time, so I'm going to mark it optional for closing the issue.