Praefect can't determine a healthy primary when failover is disabled
After enabling the gitaly_distributed_reads feature,
which relies on Praefect failover being enabled (it is supposed to be enabled by default):
[failover]
enabled = true
Each read operation failed with an error like:
{
"_index": "pubsub-praefect-inf-gprd-000191",
"_type": "_doc",
"_id": "ZQViuXMBhFFHcF3plPbQ",
"_version": 1,
"_score": null,
"_source": {
"@timestamp": "2020-08-04T12:11:51.901Z",
"ecs": {
"version": "1.4.0"
},
"host": {
"name": "pubsub-duplicate-praefect-inf-gprd"
},
"json": {
"peer.address": "10.220.8.16:46464",
"grpc.time_ms": 1.344,
"span.kind": "server",
"correlation_id": "zIYZlFC6yq2",
"tier": "sv",
"pid": 22418,
"level": "error",
"environment": "gprd",
"system": "grpc",
"error": "accessor call: get synced: primary is not healthy",
"stage": "main",
"grpc.service": "gitaly.CommitService",
"type": "praefect",
"grpc.request.deadline": "2020-08-04T12:12:00Z",
"hostname": "praefect-03-stor-gprd",
"msg": "finished streaming call with code Unknown",
"fqdn": "praefect-03-stor-gprd.c.gitlab-production.internal",
"grpc.request.fullMethod": "/gitaly.CommitService/FindCommits",
"grpc.meta.client_name": "gitlab-web",
"grpc.start_time": "2020-08-04T12:11:30Z",
"time": "2020-08-04T12:11:30.419Z",
"virtual_storage": "praefect-file01",
"grpc.code": "Unknown",
"grpc.method": "FindCommits",
"tag": "praefect",
"relative_path": "@hashed/fa/53/fa539965395b8382145f8370b34eab249cf610d2d6f2943c95b9b9d08a63d4a3.git",
"grpc.meta.deadline_type": "regular",
"shard": "default",
"grpc.meta.auth_version": "v2"
},
"type": "pubsub-duplicate-praefect-inf-gprd",
"message_id": "1416545797171672",
"publish_time": "2020-08-04T12:11:51.837Z"
},
...
}
This parameter controls the health checks that Praefect runs against the Gitaly nodes to verify whether they are healthy. The parameter is currently disabled, as the following log message shows:
{
"_index": "pubsub-praefect-inf-gprd-000191",
"_type": "_doc",
"_id": "uZBXtnMBczpL6wEeH0X1",
"_version": 1,
"_score": null,
"_source": {
"@timestamp": "2020-08-03T22:00:28.834Z",
"ecs": {
"version": "1.4.0"
},
"host": {
"name": "pubsub-duplicate-praefect-inf-gprd"
},
"type": "pubsub-duplicate-praefect-inf-gprd",
"message_id": "1416162865219229",
"publish_time": "2020-08-03T22:00:28.670Z",
"json": {
"virtual_storage": "praefect-file01",
"level": "info",
"environment": "gprd",
"hostname": "praefect-03-stor-gprd",
"shard": "default",
"tier": "sv",
"pid": 22418,
"tag": "praefect",
"time": "2020-08-03T22:00:15.746Z",
"stage": "main",
"msg": "Failover checks are disabled",
"fqdn": "praefect-03-stor-gprd.c.gitlab-production.internal",
"type": "praefect"
}
},
...
}
The log: https://log.gprd.gitlab.net/goto/29843b47d974157d6e315931467e9d76.
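To check exported log documents for this message, something like the following could be used. It is a minimal sketch: the type logDoc and the function parseLogDoc are hypothetical names, and only the field names (_source, json, hostname, virtual_storage, msg) are taken from the log document above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// disabledMsg is the message Praefect logged in the document quoted above.
const disabledMsg = "Failover checks are disabled"

// logDoc is a hypothetical type that picks out only the fields needed here
// from a log document shaped like the ones in this report.
type logDoc struct {
	Source struct {
		JSON struct {
			Hostname       string `json:"hostname"`
			VirtualStorage string `json:"virtual_storage"`
			Msg            string `json:"msg"`
		} `json:"json"`
	} `json:"_source"`
}

// failoverDisabled reports whether the entry is the "checks are disabled" message.
func (d logDoc) failoverDisabled() bool {
	return d.Source.JSON.Msg == disabledMsg
}

// parseLogDoc decodes one exported log document.
func parseLogDoc(raw []byte) (logDoc, error) {
	var doc logDoc
	if err := json.Unmarshal(raw, &doc); err != nil {
		return logDoc{}, fmt.Errorf("decode log document: %w", err)
	}
	return doc, nil
}

func main() {
	// Trimmed copy of the log document from this report.
	raw := []byte(`{
		"_source": {
			"json": {
				"virtual_storage": "praefect-file01",
				"hostname": "praefect-03-stor-gprd",
				"msg": "Failover checks are disabled"
			}
		}
	}`)

	doc, err := parseLogDoc(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("host=%s virtual_storage=%s failover_checks_disabled=%v\n",
		doc.Source.JSON.Hostname, doc.Source.JSON.VirtualStorage, doc.failoverDisabled())
}
```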
The GetShard method is supposed to return a node to operate on, but instead it returns the ErrPrimaryNotHealthy
error.
The gauge rate(gitaly_praefect_node_last_healthcheck_up[5m])
also shows no activity, which means Praefect sends no health check requests to the Gitaly nodes.
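The suspected failure mode can be modelled as follows. This is a simplified sketch and not Praefect's actual code: only the names GetShard and ErrPrimaryNotHealthy come from this report, while the node and shard types, the checkHealth method and the storage name gitaly-1 are hypothetical. It assumes that the health flag of the primary is written only by the health-check round, and that the round is skipped when failover is disabled.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPrimaryNotHealthy reuses the error name from the report; everything
// else in this file is a hypothetical model.
var ErrPrimaryNotHealthy = errors.New("primary is not healthy")

// node stands in for a Gitaly node tracked by Praefect.
type node struct {
	storage string
	healthy bool // written only by checkHealth, so it starts out false
}

// shard holds the primary of one virtual storage.
type shard struct {
	failoverEnabled bool
	primary         *node
}

// checkHealth simulates one health-check round. Assumption: with failover
// disabled the round is skipped, so the health flag is never updated.
func (s *shard) checkHealth(probe func(storage string) bool) {
	if !s.failoverEnabled {
		return
	}
	s.primary.healthy = probe(s.primary.storage)
}

// GetShard returns the primary, or ErrPrimaryNotHealthy if it was never
// marked healthy.
func (s *shard) GetShard() (*node, error) {
	if !s.primary.healthy {
		return nil, ErrPrimaryNotHealthy
	}
	return s.primary, nil
}

func main() {
	// The probe always succeeds, so the node itself is fine in both runs.
	alwaysUp := func(string) bool { return true }

	for _, enabled := range []bool{false, true} {
		s := &shard{failoverEnabled: enabled, primary: &node{storage: "gitaly-1"}}
		s.checkHealth(alwaysUp)
		_, err := s.GetShard()
		fmt.Printf("failover enabled=%v, GetShard error: %v\n", enabled, err)
	}
}
```

In this model the node is reachable in both runs. The error appears only because nothing ever marks the primary as healthy while failover is disabled, which would match the missing health check activity.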
/cc @zj-gitlab