The lessons we learned hosting GitLab.com on the cloud
Since moving to our new cloud provider, we have been encountering problems with server reboots and disk errors.
A typical example: https://twitter.com/gitlabstatus/status/662189879997149184 "In the last 24h, we had two reboots of our Redis server seemingly caused by Azure (no kernel panics) and VHD read errors on the NFS server"
It seems that Azure offers fewer stability guarantees than AWS for individual instances. For stateless application servers this is not a big problem. For stateful servers, Azure recommends using its MS-SQL and Redis services (add LINK).
GitLab.com needs a file service and a PostgreSQL service, neither of which Azure provides. Therefore our options are:
- Engage with Azure support to determine the cause of the reboots and errors and see if they can be mitigated. Azure support seems to be helpful and responsive.
- Make our setup highly available with automatic failover. See issues for Redis, PostgreSQL and CephFS.
- Forget about the $500k credit for now (we can maybe use it to offer shared runners in the future) and move to another provider. Since our monthly AWS bill was above $45k, we should probably consider a bare metal provider such as Rackspace, Softlayer or Leaseweb; see #8 (closed).
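
To make the "automatic failover" option more concrete, here is a minimal sketch of the decision logic such a setup usually needs: count consecutive failed health checks on the primary and switch to a standby once a threshold is reached. This is an illustration only, not our actual configuration. The class name `FailoverMonitor`, the node labels `"primary"` and `"standby"`, and the threshold of 3 are all hypothetical. In practice this job would be handled by dedicated tooling for Redis, PostgreSQL and CephFS rather than hand-written code.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Tracks health checks for a primary node and decides when to fail over.

    Hypothetical sketch: names and defaults are assumptions for illustration.
    """

    failure_threshold: int = 3   # assumed: consecutive failures before failing over
    consecutive_failures: int = 0
    active: str = "primary"

    def record(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the node that should serve traffic."""
        if self.active != "primary":
            # Already failed over; stay on the standby until an operator fails back.
            return self.active
        if primary_healthy:
            # A single good check resets the counter, so brief blips do not trigger failover.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "standby"
        return self.active
```

Requiring several consecutive failures avoids failing over on a single dropped check, at the cost of a slightly longer outage before the standby takes over. Failing back is deliberately left manual in this sketch.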
I propose to explore all options at the same time since the current availability problems of GitLab.com are unacceptable. We should also publish the blog post we've been preparing ASAP.