Decide on disk layout for storage nodes
Following the results from https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6306, decide on a storage layout for vdevs. The 2 main contenders at the moment are:
- RAIDZ1 with 9 disks (8 usable, 1 parity)
- Single disk
A single-disk vdev is often called out as bad practice for ZFS, but GCP persistent disks (PDs) are advertised as having high availability, data redundancy, and error correction. We need to evaluate this and weigh up whether redundancy at the ZFS level is worth the cost on top of that.
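As a sketch only, the two candidate layouts would look roughly like this (the pool name `tank` and the PD device paths are placeholders, not decided values):

```shell
# Option 1: one RAIDZ1 vdev of 9 PDs (8 data + 1 parity).
# Device names are placeholders; real paths come from the instance.
zpool create tank raidz1 \
  /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
  /dev/sdg /dev/sdh /dev/sdi /dev/sdj

# Option 2: a single-disk vdev backed by one large PD, relying on
# GCP PD redundancy instead of ZFS-level redundancy.
zpool create tank /dev/sdb
```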
Throughput must also be a factor: the chosen configuration should not reduce throughput compared to what we have today.
The total usable filesystem space per node must be at least 16TB, the same as we have today (this overlaps with the quota discussion in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6850). Note that usable space will be less than the zpool space because of a reservation filesystem (see https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6850), and the zpool space may in turn be less than the total PD space provisioned because of RAIDZ redundancy (if we choose raidz).
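As a rough sizing sanity check, the gap between provisioned and zpool-usable space under RAIDZ1 can be computed like this (the 2TB per-PD size is an illustrative assumption, not a decided value):

```shell
# Illustrative RAIDZ1 sizing check; per-PD size is an assumption.
disks=9            # PDs per RAIDZ1 vdev
disk_tb=2          # assumed size of each PD, in TB
parity=1           # RAIDZ1 dedicates one disk's worth of space to parity
provisioned_tb=$(( disks * disk_tb ))
raidz_usable_tb=$(( (disks - parity) * disk_tb ))
echo "provisioned: ${provisioned_tb} TB, zpool-usable: ${raidz_usable_tb} TB"
```

Actual usable filesystem space would then be further reduced by the reservation filesystem.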
We intend to keep the same shard size as today. At the time of writing this is 32 x 16TB in prod. Changing this would involve changing too many things at once.
Another thing to take into consideration is that the zpool may need to be substantially larger than expected due to snapshot bloat after a git repack operation. A ZFS snapshot taken before the repack will reference disjoint blocks from one taken after, so the zpool disk usage for the repo is doubled until the old snapshot passes out of the retention window. If repacking occurred simultaneously for every repo on a GitLab installation (which is unlikely), our filesystem usage would spike to double; in that worst case we would need at least twice the usable filesystem space as the data we intend to store on the node. However, since repacking of different repos should be spread out in time in realistic scenarios, we could use a lower multiple.
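To make the worst case concrete (the 16TB figure is the per-node target above; the 2x multiple is the worst case described, and any lower multiple would be a judgement call):

```shell
# Worst case: every repo repacks within one snapshot retention window,
# so old snapshots pin a full second copy of every repo's blocks.
data_tb=16                         # usable data we intend to store per node
worst_case_tb=$(( data_tb * 2 ))   # all blocks rewritten while old snapshots persist
echo "worst-case usable space needed: ${worst_case_tb} TB"
```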
Also, decide on a storage layout for L2ARC. There is not much debate about this at the time of writing. The initially proposed configuration is 2 x 375GB ephemeral local SSDs (they are fixed size) connected via NVMe.
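As a sketch, attaching the two local SSDs as L2ARC would look something like this (the pool name and device paths are assumptions; GCP images typically expose local NVMe SSDs via udev symlinks under /dev/disk/by-id/, but this should be verified on the instance):

```shell
# Add both 375GB local NVMe SSDs as L2ARC cache devices.
# Device paths are assumed; check `ls /dev/disk/by-id/` on the instance.
zpool add tank cache \
  /dev/disk/by-id/google-local-nvme-ssd-0 \
  /dev/disk/by-id/google-local-nvme-ssd-1
```

Note that cache devices are safe to lose (L2ARC contents are disposable), which fits the ephemeral nature of local SSDs.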
We have already decided to keep using n1-standard-32 VMs for the storage nodes, so memory and CPU will not change.
Because we will need to tackle this very early in the project, the deliverable of this issue is not necessarily code, but a comment that we can use for reference when we come to write the terraform.
Performance characteristics of the candidate configurations are here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6883
Finally, satisfy yourself that restoring an old backup from GCE snapshots will not become too difficult if we choose raidz.