Zero-downtime Reindexing unexpectedly fails with `Couldn't load task status` in 16.2.2
## Summary
A customer (internal ZD ticket) tried to initiate zero-downtime reindexing but ended up with a `Couldn't load task status` error. I was able to replicate this on my own instance on 16.2.2; details below.
On my instance, I only have a few projects and a few documents that were available to index. I upgraded from 15.9.x => 15.11.13 => 16.2.4, checking for ES migrations in between.
Output of `sudo gitlab-rake gitlab:elastic:info`:

```plaintext
root@sr-env-305341ba-omnibus:~# sudo gitlab-rake gitlab:elastic:info
Advanced Search
Server version: 7.17.3
Server distribution: elasticsearch
Indexing enabled: yes
Search enabled: yes
Requeue Indexing workers: no
Pause indexing: yes
Indexing restrictions enabled: no
File size limit: 1024 KiB
Indexing number of shards: 2
Indexing Queues
Initial queue: 0
Incremental queue: 0
Indices
- gitlab-production-20230728-1641:
number_of_shards: 5
number_of_replicas: 1
- gitlab-production-commits-20230728-1641:
number_of_shards: 5
number_of_replicas: 1
- gitlab-production-epics-20230821-1925:
number_of_shards: 5
number_of_replicas: 1
- gitlab-production-issues-20230728-1641:
number_of_shards: 5
number_of_replicas: 1
- gitlab-production-merge_requests-20230728-1641:
number_of_shards: 5
number_of_replicas: 1
- gitlab-production-notes-20230728-1641:
number_of_shards: 5
number_of_replicas: 1
- gitlab-production-projects-20230821-1910:
number_of_shards: 5
number_of_replicas: 1
- gitlab-production-users-20230728-1641:
number_of_shards: 5
number_of_replicas: 1
- gitlab-production-wikis-20230821-1830:
number_of_shards: 5
number_of_replicas: 1
```
I originally kicked off zero-downtime reindexing and noticed that it sat paused for a significant amount of time. Pausing indexing appears to be expected, but I was surprised by how long it took to start any reindexing task. I ended up adding some code to make sure it was going to be queued up, then unpaused indexing to confirm that indexing was working again. At that point, the reindexing tasks started up.
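For anyone trying to reproduce this, a minimal Rails console sketch of kicking off a reindex and toggling the indexing pause might look like the following; the model and setting names are my reading of the relevant entry points rather than a documented procedure.

```ruby
# Hedged sketch (gitlab-rails console): trigger zero-downtime reindexing and
# toggle the indexing pause. Names below are assumptions about the internals.

# Queue a new reindexing task; ElasticClusterReindexingCronWorker should pick
# it up on its next run.
Elastic::ReindexingTask.create!

# Confirm a task is considered in progress.
Elastic::ReindexingTask.running?   # => true

# The reindexing process pauses indexing itself; unpause it again once the
# reindex tasks have been queued.
Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: false)
```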
The process stalled on this last log message for 10 minutes:

```json
{"severity":"INFO","time":"2023-08-22T15:30:08.894Z","meta.caller_id":"ElasticClusterReindexingCronWorker","correlation_id":"8211cc161a057d51c4056378579e4e72","meta.root_caller_id":"Cronjob","meta.feature_category":"global_search","meta.client_id":"ip/","message":"Reindex task 6bdgde18QDW1BDaCAollQg:278784 from gitlab-production-users-20230728-1641 to gitlab-production-users-20230822-1530-reindex-1-0 started for slice 9."}
```
At that point, I noticed my Elasticsearch logs were filling up with errors like this:

```plaintext
[2023-08-22T15:30:08,355][WARN ][o.e.a.s.TransportClearScrollAction] [sr-env-305341ba-elasticsearch] Clear SC failed on node[{sr-env-305341ba-elasticsearch}{6bdgde18QDW1BDaCAollQg}{x6jzrj6USdCsYZgfxUJezA}{10.128.20.2}{10.128.20.2:9300}{cdfhilmrstw}{ml.machine_memory=8320143360, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=4160749568}]
java.lang.IllegalArgumentException: sliceId must be greater than 0 but was [-1]
```
Full error:

```plaintext
[2023-08-22T16:00:13,443][WARN ][o.e.a.s.TransportClearScrollAction] [sr-env-305341ba-elasticsearch] Clear SC failed on node[{sr-env-305341ba-elasticsearch}{6bdgde18QDW1BDaCAollQg}{x6jzrj6USdCsYZgfxUJezA}{10.128.20.2}{10.128.20.2:9300}{cdfhilmrstw}{ml.machine_memory=8320143360, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=4160749568}]
java.lang.IllegalArgumentException: sliceId must be greater than 0 but was [-1]
at org.elasticsearch.index.reindex.BulkByScrollTask$Status.checkPositive(BulkByScrollTask.java:882) ~[elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.index.reindex.BulkByScrollTask$Status.<init>(BulkByScrollTask.java:446) ~[elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.index.reindex.WorkerBulkByScrollTaskState.getStatus(WorkerBulkByScrollTaskState.java:96) ~[elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.index.reindex.BulkByScrollTask.getStatus(BulkByScrollTask.java:78) ~[elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.reindex.AbstractAsyncBulkByScrollAction.buildResponse(AbstractAsyncBulkByScrollAction.java:281) ~[?:?]
at org.elasticsearch.reindex.AbstractAsyncBulkByScrollAction.lambda$finishHim$3(AbstractAsyncBulkByScrollAction.java:562) ~[?:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) ~[elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.index.reindex.ClientScrollableHitSource.cleanup(ClientScrollableHitSource.java:129) ~[elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.index.reindex.ScrollableHitSource.lambda$close$2(ScrollableHitSource.java:115) ~[elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.index.reindex.ClientScrollableHitSource$2.onResponse(ClientScrollableHitSource.java:116) ~[elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.index.reindex.ClientScrollableHitSource$2.onResponse(ClientScrollableHitSource.java:112) ~[elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) ~[elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:82) ~[elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.search.ClearScrollController.onFreedContext(ClearScrollController.java:133) ~[elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.search.ClearScrollController.lambda$cleanScrollIds$2(ClearScrollController.java:117) ~[elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:43) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1471) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1549) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1529) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:41) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.search.SearchTransportService.lambda$registerRequestHandler$3(SearchTransportService.java:454) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:341) [x-pack-security-7.17.3.jar:7.17.3]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:417) [x-pack-security-7.17.3.jar:7.17.3]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:67) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.transport.TransportService.sendLocalRequest(TransportService.java:1037) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.transport.TransportService.access$100(TransportService.java:66) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.transport.TransportService$3.sendRequest(TransportService.java:143) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:975) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor.sendWithUser(SecurityServerTransportInterceptor.java:206) [x-pack-security-7.17.3.jar:7.17.3]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor.access$300(SecurityServerTransportInterceptor.java:53) [x-pack-security-7.17.3.jar:7.17.3]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$1.sendRequest(SecurityServerTransportInterceptor.java:167) [x-pack-security-7.17.3.jar:7.17.3]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:874) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.search.SearchTransportService.sendFreeContext(SearchTransportService.java:130) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.search.ClearScrollController.lambda$cleanScrollIds$4(ClearScrollController.java:114) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.search.SearchScrollAsyncAction.collectNodesAndRun(SearchScrollAsyncAction.java:116) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.search.ClearScrollController.cleanScrollIds(ClearScrollController.java:106) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.search.ClearScrollController.lambda$new$1(ClearScrollController.java:72) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.search.ClearScrollController.run(ClearScrollController.java:81) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.search.TransportClearScrollAction.doExecute(TransportClearScrollAction.java:39) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.search.TransportClearScrollAction.doExecute(TransportClearScrollAction.java:19) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:179) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.support.ActionFilter$Simple.apply(ActionFilter.java:53) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:177) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.apply(SecurityActionFilter.java:145) [x-pack-security-7.17.3.jar:7.17.3]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:177) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:154) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:82) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:95) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:73) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:407) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.client.support.AbstractClient.clearScroll(AbstractClient.java:637) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.index.reindex.ClientScrollableHitSource.clearScroll(ClientScrollableHitSource.java:112) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.index.reindex.ScrollableHitSource.close(ScrollableHitSource.java:115) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.reindex.AbstractAsyncBulkByScrollAction.finishHim(AbstractAsyncBulkByScrollAction.java:560) [reindex-client-7.17.3.jar:7.17.3]
at org.elasticsearch.reindex.Reindexer$AsyncIndexBySearchAction.finishHim(Reindexer.java:244) [reindex-client-7.17.3.jar:7.17.3]
at org.elasticsearch.reindex.AbstractAsyncBulkByScrollAction.refreshAndFinish(AbstractAsyncBulkByScrollAction.java:522) [reindex-client-7.17.3.jar:7.17.3]
at org.elasticsearch.reindex.AbstractAsyncBulkByScrollAction.prepareBulkRequest(AbstractAsyncBulkByScrollAction.java:370) [reindex-client-7.17.3.jar:7.17.3]
at org.elasticsearch.reindex.AbstractAsyncBulkByScrollAction$1.doRun(AbstractAsyncBulkByScrollAction.java:345) [reindex-client-7.17.3.jar:7.17.3]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.common.util.concurrent.RunOnce.run(RunOnce.java:29) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.index.reindex.WorkerBulkByScrollTaskState$DelayedPrepareBulkRequest.lambda$new$0(WorkerBulkByScrollTaskState.java:254) [elasticsearch-7.17.3.jar:7.17.3]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.3.jar:7.17.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
```
This is very similar to what the customer reported in the ticket.

The `Couldn't load task status` error only appeared after 10 minutes:

```json
{"severity":"ERROR","time":"2023-08-22T15:40:08.943Z","meta.caller_id":"ElasticClusterReindexingCronWorker","correlation_id":"8026fc978c3b1b097ef67dd0ca5483e4","meta.root_caller_id":"Cronjob","meta.feature_category":"global_search","meta.client_id":"ip/","message":"elasticsearch_reindex_error","error":"Couldn't load task status","gitlab_task_id":1,"gitlab_task_state":"reindexing"}
```
To confirm that my unpausing didn't impact anything, I restarted the reindexing one more time and verified that it ended up in the same state:
```ruby
irb(main):004:0> Elastic::ReindexingTask.current
=>
#<Elastic::ReindexingTask:0x00007fa4d32891a8
id: 2,
created_at: Tue, 22 Aug 2023 15:49:21.389153000 UTC +00:00,
updated_at: Tue, 22 Aug 2023 15:50:04.002551000 UTC +00:00,
state: "indexing_paused",
in_progress: true,
error_message: nil,
delete_original_index_at: nil,
max_slices_running: 60,
slice_multiplier: 2,
targets: nil>
irb(main):005:0> Elastic::ReindexingTask.running?
=> true
irb(main):008:0> Elastic::DataMigrationService.pending_migrations?
=> false
```
Again, the process stalled at slice 9 for 10 minutes before it eventually errored:
{"severity":"INFO","time":"2023-08-22T16:00:13.442Z","meta.caller_id":"ElasticClusterReindexingCronWorker","correlation_id":"8dfecfcabff6c5252649545b2c638324","meta.root_caller_id":"Cronjob","meta.feature_category":"global_search","meta.client_id":"ip/","message":"Reindex task 6bdgde18QDW1BDaCAollQg:292192 from gitlab-production-users-20230728-1641 to gitlab-production-users-20230822-1600-reindex-2-0 started for slice 9."}
{"severity":"ERROR","time":"2023-08-22T16:10:07.794Z","meta.caller_id":"ElasticClusterReindexingCronWorker","correlation_id":"9a522f7d2ebfdb1a67f543573140ee67","meta.root_caller_id":"Cronjob","meta.feature_category":"global_search","meta.client_id":"ip/","message":"elasticsearch_reindex_error","error":"Couldn't load task status","gitlab_task_id":2,"gitlab_task_state":"reindexing"}
The GitLab UI also shows `Documents reindexed: 0 (0.0%)` for each index. Running `sudo gitlab-rake gitlab:elastic:mark_reindex_failed` doesn't work because, once the task fails with this error, it is no longer in progress.
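Since the rake task is a no-op at that point, the stuck task can still be inspected from the Rails console. A hedged sketch; the `subtasks` association is an assumption about the schema:

```ruby
# Hedged sketch: inspect the failed reindexing task. Attribute names follow the
# irb output above; the subtasks association is an assumption.
task = Elastic::ReindexingTask.last

pp task.state          # no longer "reindexing", so mark_reindex_failed has nothing to abort
pp task.in_progress    # => false once the error has been recorded
pp task.error_message  # presumably the "Couldn't load task status" message from the logs

# Per-index details (source/target index names, Elasticsearch task IDs) should
# live on the associated records, if present.
task.subtasks.each { |subtask| pp subtask.attributes }
```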
I'm not totally clear on the following:

- Why we end up in a `Couldn't load task status` state. Is this a slice failure?
- What we are doing to trigger the reindexing via the API.
- What `Clear SC failed on node` means.
- What `sliceId must be greater than 0 but was [-1]` implies.

My understanding is that we're doing manual slicing, but I couldn't figure out how to determine whether we are sending a `-1` intentionally or not; a rough sketch of what manual slicing looks like is below.
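For reference on the slicing question, this is a rough sketch of what manually sliced reindexing looks like at the raw Elasticsearch level, using the index names from the logs above. I'm not claiming this is the exact request GitLab builds, but it illustrates that slice IDs are only valid from `0` to `max - 1`, so `-1` should never be sent explicitly:

```ruby
# Hedged sketch of Elasticsearch manual slicing (not necessarily what the worker
# sends). Each slice gets its own reindex request and its own task to poll.
client = Gitlab::Elastic::Helper.default.client  # assumption about the helper

max_slices = 20  # illustrative; the real slice count is derived from the task settings
task_ids = (0...max_slices).map do |slice_id|
  resp = client.reindex(
    wait_for_completion: false,  # return a task ID to poll instead of blocking
    body: {
      source: {
        index: 'gitlab-production-users-20230728-1641',
        slice: { id: slice_id, max: max_slices }  # id must satisfy 0 <= id < max
      },
      dest: { index: 'gitlab-production-users-20230822-1530-reindex-1-0' }
    }
  )
  resp['task']  # for example "6bdgde18QDW1BDaCAollQg:278784"
end
```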
## Steps to reproduce

See above.
## What is the current bug behavior?

Zero-downtime reindexing fails with `Couldn't load task status` with a default configuration on 16.2.2.
## What is the expected correct behavior?

Zero-downtime reindexing should not fail with `Couldn't load task status` with a default configuration on 16.2.2.