Backup restore rake task isn't handling read-only errors
We have multiple reports from customers who hit problems while restoring a backup to a Gitaly + Praefect cluster.
During the restore task, multiple repositories fail to restore, raising the following error:
[Failed] restoring group/repo (@hashed/eb/0c/eb0c9cdcl33tl33tr3d4c73d798bae1162fe27f18d482c)
Error 9:repository is in read-only mode. debug_error_string:{"created":"@1616683445.784983667","description":"Error received from peer ipv4:x.x.x.x:2305","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"repository is in read-only mode","grpc_status":9}
All of these errors are raised by Praefect, and by the same RPC: CreateRepositoryFromBundle. The end result is an instance with a number of missing repositories.
Why the repositories go into read-only mode during the restore task is being investigated in a separate issue. Regardless, the restore rake task itself should be robust enough to deal with such failures, either by handling the error or by implementing a retry mechanism, so that the restore does not end up incomplete.
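As a rough illustration of the retry idea, here is a minimal sketch. It is not the actual backup code: `ReadOnlyRepositoryError`, `restore_with_retry`, `restore_all`, and the `max_attempts`, `base_delay`, and `sleeper` parameters are all hypothetical names. The error class stands in for whatever exception the gRPC client raises for status 9 (FAILED_PRECONDITION), and the block passed in stands in for the real per-repository restore call.

```ruby
# Hypothetical stand-in for the gRPC status 9 (FAILED_PRECONDITION) error
# that surfaces as "repository is in read-only mode".
class ReadOnlyRepositoryError < StandardError; end

# Runs the given block for one repository, retrying with exponential
# backoff when the read-only error is raised. Returns true on success,
# false once max_attempts is exhausted. All names here are assumptions.
def restore_with_retry(repo_path, max_attempts: 3, base_delay: 1.0,
                       sleeper: Kernel.method(:sleep))
  attempts = 0
  begin
    attempts += 1
    yield repo_path
    true
  rescue ReadOnlyRepositoryError
    if attempts < max_attempts
      # Wait 1x, 2x, 4x ... base_delay before the next attempt.
      sleeper.call(base_delay * (2**(attempts - 1)))
      retry
    end
    false
  end
end

# Restores every repository and returns the paths that still failed,
# so the task can report them instead of silently leaving them missing.
def restore_all(repo_paths, **opts, &restore)
  repo_paths.reject { |path| restore_with_retry(path, **opts, &restore) }
end
```

With something along these lines, the task could print the list returned by `restore_all` and exit non-zero when it is not empty. Whether a retry helps at all depends on why the repository is in read-only mode, which is the subject of the separate issue.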
Workaround
Restore to a Praefect instance with a "clean" database. Do not do this with a functional production cluster. #3546 (comment 546966263)
Possible Solutions
- Investigate clearing the Praefect DB from the restore task.
- #3485 (closed) would address the root cause as a long-term solution. #3546 (comment 552349407)