Restore from backup fails with missing data if I/O is overloaded
Problem Statement
A customer is experiencing unreliable behavior when restoring from a successful backup, using Gitaly Cluster as their storage backend. Restores fail with errors (see the examples of recent failed restore attempts below) and leave data missing. The failures occur when I/O is saturated (writes fail), which is a serious concern: missing data makes the backup unreliable for disaster recovery.
Error 4: Deadline Exceeded. debug_error_string:{"created":"@1636467641.000279117","description":"Error received from peer ipv4:xxx.xxx.xxx.xxx:xxxx","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Deadline Exceeded","grpc_status":4}

Error 13: CreateRepositoryFromBundle: cmd wait failed fetching refs: error executing git hook
fatal: ref updates aborted by hook: exit status 128.
debug_error_string:{"created":"@1636560010.579215601","description":"Error received from peer ipv4:xxx.xxx.xxx.xxx:xxxx","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"CreateRepositoryFromBundle: cmd wait failed fetching refs: error executing git hook\nfatal: ref updates aborted by hook\n","grpc_status":13}
From the customer:
Backup restorations fail to complete successfully when performed against a 3-node Gitaly Cluster where each disk has 8000 IOPS, which is the official value recommended by Gitlab. If I change nothing other than the IOPS value of the target disks, backup restorations succeed.
Only after the customer increased the IOPS value to 16K did a full restore complete successfully. However, the integrity of a restore still appears to be at risk on any system under heavy I/O load (i.e. when the load exceeds the IOPS capability of the storage system); a rough probe for measuring sustained synchronous-write IOPS is sketched after the environment details below.
- GitLab version where this was experienced: gitlab-ee 13.12.12
- OS: Ubuntu 18.04.3
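To sanity-check whether the target disks actually sustain the recommended 8000 IOPS before changing disk sizing, a rough probe like the Go sketch below may help. It is an assumption-laden illustration, not an official GitLab tool: the file path and block size are assumptions, and a dedicated benchmark such as fio gives far more reliable numbers.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Rough synchronous-write probe: issue 4 KiB O_SYNC writes for a fixed duration
// on the disk under test and report how many completed per second. The path and
// parameters below are assumptions; adjust them to the Gitaly data disk.
func main() {
	const blockSize = 4096
	const duration = 10 * time.Second
	path := "/var/opt/gitlab/git-data/iops-probe.tmp" // hypothetical location on the disk under test

	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_SYNC, 0o600)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer os.Remove(path)
	defer f.Close()

	buf := make([]byte, blockSize)
	deadline := time.Now().Add(duration)
	ops := 0
	for time.Now().Before(deadline) {
		if _, err := f.Write(buf); err != nil {
			fmt.Fprintln(os.Stderr, "write:", err)
			os.Exit(1)
		}
		ops++
	}
	fmt.Printf("~%d synchronous 4 KiB writes/s (compare against the 8000 IOPS guidance)\n",
		ops/int(duration.Seconds()))
}
```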
Reach
See this internal ZD ticket from a Premium customer for more details. This issue appears similar to #244333 (closed), which was closed due to inactivity / lack of additional information, but it is referenced here because similar behavior has been reported before.
Impact
Being able to trust the integrity of a restore from a complete backup, even when the system is under heavy I/O load, is considered high impact. If this turns out to be a configuration-related problem, any recommendations to try would be appreciated.
Confidence
High confidence: the customer has done due diligence by repeating the failed restorations, and restores only succeeded after the IOPS value was increased to 16K.