Timeout on RepackFull and GarbageCollect in big repo

Hi folks,

On the Gitaly instance that hosts our biggest repositories, we recently observed an increase in timeouts for the RepackFull and GarbageCollect gRPCs on one of our larger repos.

Here are the logs:

[root@gitaly-2004 gitaly]# zgrep -E 'RepackFull|GarbageCollect' @400000005fcf41162b25111c.s | jq '.'
{
  "correlation_id": "aeLPWBA6BKa",
  "diskcache": "51b3f336-fb0e-454b-87fb-dd7ae38ddc9f",
  "grpc.meta.auth_version": "v2",
  "grpc.meta.client_name": "gitlab-sidekiq",
  "grpc.meta.deadline_type": "unknown",
  "grpc.method": "RepackFull",
  "grpc.request.deadline": "2020-12-08T13:41:11Z",
  "grpc.request.fullMethod": "/gitaly.RepositoryService/RepackFull",
  "grpc.request.glProjectPath": "<redacted>",
  "grpc.request.glRepository": "project-177",
  "grpc.request.repoPath": "@hashed/8c/d2/8cd2510271575d8430c05368315a87b9c4784c7389a47496080c1e615a2a00b6.git",
  "grpc.request.repoStorage": "gitaly-pool-001",
  "grpc.request.topLevelGroup": "@hashed",
  "grpc.service": "gitaly.RepositoryService",
  "grpc.start_time": "2020-12-08T07:41:11Z",
  "level": "info",
  "msg": "diskcache state change",
  "peer.address": "10.xxx.xxx.xx:22766",
  "pid": 31561,
  "span.kind": "server",
  "system": "grpc",
  "time": "2020-12-08T07:47:45.396Z"
}
{
  "correlation_id": "aeLPWBA6BKa",
  "error": "rpc error: code = Canceled desc = rpc error: code = Internal desc = signal: terminated",
  "grpc.code": "Canceled",
  "grpc.meta.auth_version": "v2",
  "grpc.meta.client_name": "gitlab-sidekiq",
  "grpc.meta.deadline_type": "unknown",
  "grpc.method": "RepackFull",
  "grpc.request.deadline": "2020-12-08T13:41:11Z",
  "grpc.request.fullMethod": "/gitaly.RepositoryService/RepackFull",
  "grpc.request.glProjectPath": "<redacted>",
  "grpc.request.glRepository": "project-177",
  "grpc.request.repoPath": "@hashed/8c/d2/8cd2510271575d8430c05368315a87b9c4784c7389a47496080c1e615a2a00b6.git",
  "grpc.request.repoStorage": "gitaly-pool-001",
  "grpc.request.topLevelGroup": "@hashed",
  "grpc.service": "gitaly.RepositoryService",
  "grpc.start_time": "2020-12-08T07:41:11Z",
  "grpc.time_ms": 394149.12,
  "level": "info",
  "msg": "finished unary call with code Canceled",
  "peer.address": "10.xxx.xxx.xx:22766",
  "pid": 31561,
  "span.kind": "server",
  "system": "grpc",
  "time": "2020-12-08T07:47:45.396Z"
}

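I wonder if this is due to a timeout config somewhere? Reading the GitLab Rails code, I think this call uses the long timeout, which is hard-coded at 6 hours (consistent with grpc.request.deadline being exactly 6 hours after grpc.start_time in the logs above), so it should not time out after 394 seconds.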
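To double-check the arithmetic, here is a minimal Python sketch that compares the deadline budget with the actual elapsed time of the canceled call. The field values are copied from the log above; the helper name `deadline_budget_seconds` is only illustrative and not part of Gitaly or GitLab.

```python
from datetime import datetime

# Timestamp format used by grpc.start_time and grpc.request.deadline in the logs.
FMT = "%Y-%m-%dT%H:%M:%SZ"

def deadline_budget_seconds(entry):
    """Seconds between the call start and the client-supplied deadline."""
    start = datetime.strptime(entry["grpc.start_time"], FMT)
    deadline = datetime.strptime(entry["grpc.request.deadline"], FMT)
    return (deadline - start).total_seconds()

# Fields copied from the canceled RepackFull entry above.
canceled = {
    "grpc.start_time": "2020-12-08T07:41:11Z",
    "grpc.request.deadline": "2020-12-08T13:41:11Z",
    "grpc.time_ms": 394149.12,
}

budget = deadline_budget_seconds(canceled)
elapsed = canceled["grpc.time_ms"] / 1000

print(budget / 3600)      # 6.0 hours allowed by the deadline
print(round(elapsed, 1))  # 394.1 seconds actually elapsed
```

So the call was canceled after using under 2% of its deadline, which suggests the 6 hour deadline itself is not what ended it.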

Or could it be that RepackFull is invoked too frequently, and a newer request cancels the older one?

For reference, this is a successful request that finished after about 30 minutes:

{
  "correlation_id": "l17D3GWggD9",
  "grpc.code": "OK",
  "grpc.meta.auth_version": "v2",
  "grpc.meta.client_name": "gitlab-sidekiq",
  "grpc.meta.deadline_type": "unknown",
  "grpc.method": "RepackFull",
  "grpc.request.deadline": "2020-12-08T14:54:51Z",
  "grpc.request.fullMethod": "/gitaly.RepositoryService/RepackFull",
  "grpc.request.glProjectPath": "<redacted>",
  "grpc.request.glRepository": "project-177",
  "grpc.request.repoPath": "@hashed/8c/d2/8cd2510271575d8430c05368315a87b9c4784c7389a47496080c1e615a2a00b6.git",
  "grpc.request.repoStorage": "gitaly-pool-001",
  "grpc.request.topLevelGroup": "@hashed",
  "grpc.service": "gitaly.RepositoryService",
  "grpc.start_time": "2020-12-08T08:54:51Z",
  "grpc.time_ms": 1777707.4,
  "level": "info",
  "msg": "finished unary call with code OK",
  "peer.address": "10.xxx.x.xx:4344",
  "pid": 31561,
  "span.kind": "server",
  "system": "grpc",
  "time": "2020-12-08T09:24:28.884Z"
}
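One way to test the idea that a newer RepackFull request cancels an older one is to look for calls on the same repository whose time windows overlap. Below is a rough Python sketch, assuming each "finished unary call" entry carries grpc.start_time and grpc.time_ms as in the logs above. The function names are hypothetical, and in practice the entries would come from parsing the full log file rather than being typed in by hand.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%SZ"

def call_interval(entry):
    """Return (start, end) of a finished call, using start time plus duration."""
    start = datetime.strptime(entry["grpc.start_time"], FMT)
    end = start + timedelta(milliseconds=entry["grpc.time_ms"])
    return start, end

def overlapping_calls(entries):
    """Return pairs of correlation IDs whose calls on the same repo overlap."""
    by_repo = {}
    for e in entries:
        if "grpc.time_ms" not in e:
            continue  # skip entries that are not "finished unary call" lines
        key = (e["grpc.method"], e["grpc.request.glRepository"])
        by_repo.setdefault(key, []).append(e)

    pairs = []
    for calls in by_repo.values():
        # ISO timestamps in the same timezone sort correctly as strings.
        calls.sort(key=lambda e: e["grpc.start_time"])
        for i, a in enumerate(calls):
            _, a_end = call_interval(a)
            for b in calls[i + 1:]:
                b_start, _ = call_interval(b)
                if b_start < a_end:
                    pairs.append((a["correlation_id"], b["correlation_id"]))
    return pairs

# The two finished calls shown in this post.
entries = [
    {"correlation_id": "aeLPWBA6BKa", "grpc.method": "RepackFull",
     "grpc.request.glRepository": "project-177",
     "grpc.start_time": "2020-12-08T07:41:11Z", "grpc.time_ms": 394149.12},
    {"correlation_id": "l17D3GWggD9", "grpc.method": "RepackFull",
     "grpc.request.glRepository": "project-177",
     "grpc.start_time": "2020-12-08T08:54:51Z", "grpc.time_ms": 1777707.4},
]

print(overlapping_calls(entries))  # []
```

These two calls do not overlap, so the successful request at 08:54 cannot be what canceled the 07:41 one. Running the same check over all RepackFull and GarbageCollect entries around 07:47 would show whether some other call was in flight at that moment.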

Current version: GitLab Omnibus 13.4.6-ee

Edited by Son Luong Ngoc