Checksum errors when migrating repositories
Hello Gitaly team,
During the migration of archived projects in production, we are observing migration failure rates on the order of 6% due to checksum verification failures (Sentry: issue 1615724). We need guidance on how to approach these failures and on what we can do on the Gitaly side to address them.
Background
As part of our cost management initiatives, we are migrating archived projects to HDD-based storage nodes. This past weekend we executed a trial run on GitLab.com against nfs-file28, which entailed moving about 4,650 projects.
This trial run was performed in 3 waves:
- 100 projects (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2204)
- 445 projects (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2205)
- 4000 projects (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2206)
Focusing on the 4,000-project wave, we found that 239 projects failed to move. While unrelated to this work, we also came across this section of Geo's documentation: https://docs.gitlab.com/ee/administration/geo/disaster_recovery/background_verification.html#reconcile-differences-with-checksum-mismatches
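For reference, the ~6% figure quoted above follows directly from this wave:

```python
# Failure rate for the 4,000-project wave, using the numbers above
failed, attempted = 239, 4000
rate = failed / attempted
print(f"{rate:.1%}")  # → 6.0%
```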
We picked a sample of projects to inspect (for instance, id 11678907):
gerir@beirut:~/Work/Infra/tmp/checksum_mismatch/11678907:cat file-ssd-28
50eaab75d6058b728f5e8635375e778abaa851f1 HEAD
ded1f6ece97cb69fbf6a769b877bdfa99262f3a4 refs/heads/PROXIMUS-STABLE-ANDROID
34c6687435c6420452737993898c707e1c4e6373 refs/heads/branding/proximus
50eaab75d6058b728f5e8635375e778abaa851f1 refs/heads/develop
943d374b1085fe15a245d06b6dc8820d29cf99a4 refs/heads/feature/gl-1
58bf1659c5f7c35fa88ed7467ad7f7290d870a3e refs/heads/feature/gl-2
b5fdbbfe0ee9179c28ec4af627cc2d15c6bbb52f refs/heads/feature/gl-982
e48904271a6f285ca9da64122b0b0b6eebc44ef7 refs/heads/master
0de19be939651601db708f07e1cc52ab6d7de6a3 refs/keep-around/0de19be939651601db708f07e1cc52ab6d7de6a3
310893e2681ce6e0679734b48316b9f68ecf7891 refs/keep-around/310893e2681ce6e0679734b48316b9f68ecf7891
4159c12cd514a2740b9fc19d544b8e9abc0fee0f refs/keep-around/4159c12cd514a2740b9fc19d544b8e9abc0fee0f
50eaab75d6058b728f5e8635375e778abaa851f1 refs/keep-around/50eaab75d6058b728f5e8635375e778abaa851f1
6717e5708eea0f68c81563b8395c4ea6cb7cb796 refs/keep-around/6717e5708eea0f68c81563b8395c4ea6cb7cb796
ad334ae2346310695909735b9f53cb5655f097b4 refs/keep-around/ad334ae2346310695909735b9f53cb5655f097b4
b5fdbbfe0ee9179c28ec4af627cc2d15c6bbb52f refs/keep-around/b5fdbbfe0ee9179c28ec4af627cc2d15c6bbb52f
dd8a26847be50f4e16f8a6cf04ef52714a703b99 refs/keep-around/dd8a26847be50f4e16f8a6cf04ef52714a703b99
e48904271a6f285ca9da64122b0b0b6eebc44ef7 refs/keep-around/e48904271a6f285ca9da64122b0b0b6eebc44ef7
310893e2681ce6e0679734b48316b9f68ecf7891 refs/merge-requests/1/head
dd8a26847be50f4e16f8a6cf04ef52714a703b99 refs/merge-requests/2/head
b5fdbbfe0ee9179c28ec4af627cc2d15c6bbb52f refs/merge-requests/3/head
gerir@beirut:~/Work/Infra/tmp/checksum_mismatch/11678907:cat file-hdd-01
e48904271a6f285ca9da64122b0b0b6eebc44ef7 HEAD
ded1f6ece97cb69fbf6a769b877bdfa99262f3a4 refs/heads/PROXIMUS-STABLE-ANDROID
34c6687435c6420452737993898c707e1c4e6373 refs/heads/branding/proximus
50eaab75d6058b728f5e8635375e778abaa851f1 refs/heads/develop
943d374b1085fe15a245d06b6dc8820d29cf99a4 refs/heads/feature/gl-1
58bf1659c5f7c35fa88ed7467ad7f7290d870a3e refs/heads/feature/gl-2
b5fdbbfe0ee9179c28ec4af627cc2d15c6bbb52f refs/heads/feature/gl-982
e48904271a6f285ca9da64122b0b0b6eebc44ef7 refs/heads/master
0de19be939651601db708f07e1cc52ab6d7de6a3 refs/keep-around/0de19be939651601db708f07e1cc52ab6d7de6a3
310893e2681ce6e0679734b48316b9f68ecf7891 refs/keep-around/310893e2681ce6e0679734b48316b9f68ecf7891
4159c12cd514a2740b9fc19d544b8e9abc0fee0f refs/keep-around/4159c12cd514a2740b9fc19d544b8e9abc0fee0f
50eaab75d6058b728f5e8635375e778abaa851f1 refs/keep-around/50eaab75d6058b728f5e8635375e778abaa851f1
6717e5708eea0f68c81563b8395c4ea6cb7cb796 refs/keep-around/6717e5708eea0f68c81563b8395c4ea6cb7cb796
ad334ae2346310695909735b9f53cb5655f097b4 refs/keep-around/ad334ae2346310695909735b9f53cb5655f097b4
b5fdbbfe0ee9179c28ec4af627cc2d15c6bbb52f refs/keep-around/b5fdbbfe0ee9179c28ec4af627cc2d15c6bbb52f
dd8a26847be50f4e16f8a6cf04ef52714a703b99 refs/keep-around/dd8a26847be50f4e16f8a6cf04ef52714a703b99
e48904271a6f285ca9da64122b0b0b6eebc44ef7 refs/keep-around/e48904271a6f285ca9da64122b0b0b6eebc44ef7
310893e2681ce6e0679734b48316b9f68ecf7891 refs/merge-requests/1/head
dd8a26847be50f4e16f8a6cf04ef52714a703b99 refs/merge-requests/2/head
b5fdbbfe0ee9179c28ec4af627cc2d15c6bbb52f refs/merge-requests/3/head
And:
gerir@beirut:~/Work/Infra/tmp/checksum_mismatch/11678907:diff file-ssd-28 file-hdd-01
1c1
< 50eaab75d6058b728f5e8635375e778abaa851f1 HEAD
---
> e48904271a6f285ca9da64122b0b0b6eebc44ef7 HEAD
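In other words, only the HEAD line differs: on nfs-file28 HEAD resolves to the tip of develop (50eaab75…), while on the HDD node it resolves to the tip of master (e4890427…). To our understanding, Gitaly derives its repository checksum by aggregating a hash per `<oid> <refname>` line, so a single diverging HEAD is enough to fail verification. A minimal sketch of that behavior (the XOR-of-SHA-1 aggregation is our assumption, not necessarily Gitaly's exact algorithm, and the abridged ref lists are illustrative):

```python
import hashlib

def refs_checksum(lines):
    """Aggregate a checksum over '<oid> <refname>' lines by XOR-ing the
    SHA-1 digest of each line. This mirrors our understanding of Gitaly's
    repository checksum; the exact ref filtering in production may differ."""
    acc = 0
    for line in lines:
        line = line.strip()
        if line:
            acc ^= int(hashlib.sha1(line.encode()).hexdigest(), 16)
    return f"{acc:040x}"

# Abridged ref listings from the diff above: only the HEAD line differs.
ssd = ["50eaab75d6058b728f5e8635375e778abaa851f1 HEAD",
       "e48904271a6f285ca9da64122b0b0b6eebc44ef7 refs/heads/master"]
hdd = ["e48904271a6f285ca9da64122b0b0b6eebc44ef7 HEAD",
       "e48904271a6f285ca9da64122b0b0b6eebc44ef7 refs/heads/master"]
print(refs_checksum(ssd) != refs_checksum(hdd))  # → True
```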
Moving Forward
- This issue does not block these migrations, so we can (and will) continue moving projects
It is difficult to estimate what this will translate to at scale (i.e., across all storage nodes and for future migrations), given that failures can occur for a variety of repositories. If we assume that file-28 provides a somewhat representative indication of failure rates and sizes, this will become a significant hurdle to meeting our cost management goals for storage:
- These failures leave orphaned repositories on the destination storage node, which will eventually require cleanup
- They also leave a percentage of repositories where they shouldn't remain (i.e., on SSDs)
- For the run above, the failed repositories accounted for about 7.7% of the space moved (~15% if we count orphaned repositories)
I will continue to collect failure rate data to refine our estimates.
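One triage step we have in mind while collecting that data is checking whether failures are all HEAD-only mismatches like the sample above. A sketch, with a hypothetical helper name, operating on ref listings captured the same way as the file-ssd-28 / file-hdd-01 files:

```python
# Hypothetical triage helper: given two ref listings (lines of
# '<oid> <refname>' as in the captured files above), report whether
# the two nodes differ only in their HEAD line.
def head_only_mismatch(src_lines, dst_lines):
    def to_map(lines):
        pairs = (line.split(None, 1) for line in lines if line.strip())
        return {ref.strip(): oid for oid, ref in pairs}
    src, dst = to_map(src_lines), to_map(dst_lines)
    same_refs = ({r: o for r, o in src.items() if r != "HEAD"} ==
                 {r: o for r, o in dst.items() if r != "HEAD"})
    return same_refs and src.get("HEAD") != dst.get("HEAD")
```

For project 11678907 this would return True when fed the two captured files, since only HEAD diverges; a False across many failed projects would tell us the root cause is broader than HEAD.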
What's your guidance?