Fix error handling for Elasticsearch tasks API
Background
Found while working on #351381 (closed)
The Elasticsearch Tasks API used to include a failures section in the JSON response but now appears to include error instead. I suspect this is due to the upgrade to Elasticsearch 8.X. Unfortunately, there is no documentation on the JSON response for Tasks.
Proposal
Create a single interface to interact with the Elasticsearch Tasks API and have it handle checking whether a task exists, is completed, and contains failures or error messages in the response. We should check for the failures and error sections in the JSON response (to support ES 7.X and 8.X) as well as completed. Note: there are cases where a task is completed: true but also has errors; see the example below.
Example JSON response:
{
  "completed": true,
  "task": {
    "node": "C8SSIuPRRhuteYwuqyzg1A",
    "id": 21940431,
    "type": "transport",
    "action": "indices:data/write/update/byquery",
    "status": {
      "total": 0,
      "updated": 0,
      "created": 0,
      "deleted": 0,
      "batches": 0,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 0,
      "requests_per_second": -1,
      "throttled_until_millis": 0
    },
    "description": "update-by-query [gitlab-production] updated with Script{type=inline, lang='painless', idOrCode='ctx._source.traversal_ids = '8338830-10851578-10851590-11090121-12845001-12845062-'', options={}, params={}}",
    "start_time_in_millis": 1674882167245,
    "running_time_in_nanos": 22122888,
    "cancellable": true,
    "cancelled": false,
    "headers": {
      "X-Opaque-Id": "b1bdfbc0e54d85014ef0a3b8e77d93f2",
      "trace.id": "6ac74c2c6a0e4b6cd2403ef30cd6e8c8"
    }
  },
  "error": {
    "type": "search_phase_execution_exception",
    "reason": "Partial shards failure",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 48,
        "index": "gitlab-production-20221024-1119",
        "node": "gOUN-D5LRcOxobqE_9tHYA",
        "reason": {
          "type": "exception",
          "reason": "Trying to create too many scroll contexts. Must be less than or equal to: [500]. This limit can be set by changing the [search.max_open_scroll_context] setting."
        }
      }
    ]
  }
}
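A minimal sketch of what the proposed single interface could look like, to make the checks above concrete. The class name TaskStatus and all of its method names are hypothetical and do not exist in the codebase. Since the response format is undocumented, the location of the failures section (top level or nested under response) is an assumption; only the error and completed keys are taken from the example response above.

```ruby
require 'json'

# Hypothetical wrapper around an already-parsed Tasks API response hash.
# It does not call Elasticsearch itself; it only interprets the response.
class TaskStatus
  def initialize(response)
    @response = response || {}
  end

  # A response without a "task" section is treated as "task not found".
  def exists?
    @response.key?('task')
  end

  def completed?
    @response['completed'] == true
  end

  # ES 7.X style. Assumption: failures may appear at the top level or
  # nested under "response"; both locations are checked.
  def failures
    Array(@response['failures']) + Array(@response.dig('response', 'failures'))
  end

  # ES 8.X style, as seen in the example response.
  def error
    @response['error']
  end

  def error?
    failures.any? || !error.nil?
  end

  # A task can be completed: true and still carry errors, so both
  # conditions have to be checked before treating it as successful.
  def succeeded?
    completed? && !error?
  end
end

# Trimmed-down version of the example response above.
body = <<~JSON
  {
    "completed": true,
    "task": { "id": 21940431 },
    "error": {
      "type": "search_phase_execution_exception",
      "reason": "Partial shards failure"
    }
  }
JSON

status = TaskStatus.new(JSON.parse(body))
puts status.completed?
puts status.error?
puts status.succeeded?
```

With a wrapper along these lines, callers of helper.task_status would ask succeeded? rather than reading completed directly, which is the case the example response shows going wrong.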
Every spot where tasks are used should be updated to the new code:
- call to client.tasks: ee/lib/gitlab/elastic/helper.rb
- migration call to helper.task_status: ee/elastic/migrate/20220713103500_delete_commits_from_original_index.rb
- call to elastic_helper.task_status: ee/app/services/elastic/cluster_reindexing_service.rb
- call to helper.task_status: ee/app/workers/concerns/elastic/migration_helper.rb
- call to helper.task_status: ee/elastic/migrate/20220118150500_delete_orphaned_commits.rb
- call to helper.task_status: ee/elastic/migrate/20220119120500_populate_commit_permissions_in_main_index.rb
- call to helper.task_status: ee/elastic/migrate/20221221110300_backfill_traversal_ids_to_blobs_and_wiki_blobs.rb
- call to helper.task_status: ee/spec/elastic/migrate/20220118150500_delete_orphaned_commits_spec.rb
Recommendation: The tests that make sure this method works should not be stubbed out; there should be a way to get a failed response back from the Tasks API. This will be important to make sure we don't have version incompatibilities with AWS OpenSearch and Elasticsearch.