Restore from backup fails with missing data if I/O is overloaded

Problem Statement

The customer is experiencing unreliable behavior when restoring from a successful backup, using Gitaly Cluster as their storage backend. Restores fail with errors and result in missing data. The failures occur when I/O is busy (writes fail), which is a cause for concern: because data ends up missing, the backup cannot be relied on for disaster recovery.
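
For context, restores of this kind are typically driven by GitLab's bundled backup task. A minimal sketch for an Omnibus installation of this era follows; the backup ID is a placeholder and the exact procedure is an assumption, since the customer's invocation is not shown in the ticket:

    # Stop the services that write to repositories and the database
    sudo gitlab-ctl stop puma
    sudo gitlab-ctl stop sidekiq

    # Restore a specific backup (the timestamp portion of the archive name)
    sudo gitlab-backup restore BACKUP=<backup_id>

    # Bring services back up
    sudo gitlab-ctl restart

Examples from the customer's recent failed restore attempts: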

    Error 4: Deadline Exceeded. debug_error_string: {"created":"@1636467641.000279117","description":"Error received from peer ipv4:xxx.xxx.xxx.xxx:xxxx","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Deadline Exceeded","grpc_status":4}

    Error 13: CreateRepositoryFromBundle: cmd wait failed fetching refs: error executing git hook
    fatal: ref updates aborted by hook: exit status 128.
    debug_error_string: {"created":"@1636560010.579215601","description":"Error received from peer ipv4:xxx.xxx.xxx.xxx:xxxx","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"CreateRepositoryFromBundle: cmd wait failed fetching refs: error executing git hook\nfatal: ref updates aborted by hook\n","grpc_status":13}
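
After a failed restore it may not be obvious which repositories are incomplete. Two checks that can help are sketched below, assuming default Omnibus paths (the paths are assumptions, not values from the ticket):

    # Run git fsck across every repository managed by the instance
    sudo gitlab-rake gitlab:git:fsck

    # On a Praefect node of the Gitaly Cluster: list repositories whose
    # latest changes are not fully replicated
    sudo /opt/gitlab/embedded/bin/praefect \
        -config /var/opt/gitlab/praefect/config.toml dataloss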

From the customer:

Backup restorations fail to complete successfully when performed against a 3-node Gitaly Cluster where each disk has 8000 IOPS, which is the official value recommended by GitLab. If I change nothing other than the IOPS value of the target disks, backup restorations succeed.

Only after the customer increased their IOPS value to 16K did a full restore complete successfully. However, the overall integrity of a restore appears to be at risk for any system under heavy I/O load (i.e. when the load exceeds the IOPS capability of the storage system).
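
To make the IOPS comparison reproducible, the underlying disk can be benchmarked directly with fio. The sketch below is one plausible mixed read/write test; the target directory, sizes, and job counts are assumptions, not values from the ticket:

    # Mixed random read/write test against the Gitaly storage path
    sudo fio --name=gitaly-iops --directory=/var/opt/gitlab/git-data \
        --rw=randrw --rwmixread=75 --bs=4k --size=1g --numjobs=4 \
        --iodepth=32 --direct=1 --runtime=60 --time_based --group_reporting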

  • GitLab version where this was experienced: gitlab-ee 13.12.12
  • OS: Ubuntu 18.04.3

Reach

See this internal ZD ticket from a Premium customer for more details. This issue seems similar to #244333 (closed), which was closed due to inactivity / lack of additional information, but it is included here because similar behavior has been reported before.

Impact

Restoring confidence in the integrity of a restore from a complete backup, even when the system is under heavy I/O load, is considered high impact. If this is due to a configuration-related problem, any recommendations to try would be appreciated.
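
One configuration knob that may be relevant to the Deadline Exceeded (gRPC status 4) errors above is the Gitaly timeout application settings. The sketch below raises the default timeout via the application settings API; the value shown is an example, and whether this helps under I/O saturation is an open question, not a confirmed fix:

    # Raise the default Gitaly timeout (55s out of the box) to 120s
    curl --request PUT --header "PRIVATE-TOKEN: <admin_token>" \
        "https://gitlab.example.com/api/v4/application_settings?gitaly_timeout_default=120"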

Confidence

High confidence: the customer has done due diligence, repeating the failed restorations until the IOPS value was increased to 16K.
