RootController, DashboardController and other "root-level" controllers should tolerate single Gitaly node failure
Spun out of gitlab-com/gl-infra/scalability#33 (comment 227285158)
During incident gitlab-com/gl-infra/production#1222 (closed), a single Gitaly node, file-13, failed hard. The machine could not be reset, rebooted or removed. We are currently awaiting cloud provider intervention to reset.
If you were unlucky enough to have repositories on this file server (and open TODOs pointing to those repos) and attempted to visit https://gitlab.com, you would experience a 500 error.
The number of Gitaly nodes is always increasing as GitLab.com handles more Git content and more traffic. Assuming nodes fail independently, the probability that every node is up is f^n, where f is the availability of a single node and n is the number of Gitaly nodes, so the likelihood that at least one node is down is 1 - f^n. This means that as the number of Gitaly nodes increases, overall availability will decrease unless we build our system to be resilient to single-node failures.
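To make the trend concrete, here is a rough sketch of the arithmetic. It assumes node failures are independent, and the 99.9% per-node availability is an illustrative figure, not a measured GitLab.com value. The method names are made up for this example.

```ruby
# Probability that all n Gitaly nodes are up at the same time,
# assuming each node is independently available with probability f.
def all_nodes_available(f, n)
  f**n
end

# Probability that at least one of the n nodes is down.
def any_node_down(f, n)
  1.0 - all_nodes_available(f, n)
end

# Illustrative figures only: 99.9% per-node availability is an assumption.
[1, 10, 50, 100].each do |n|
  puts format("n=%3d  all up: %.4f  at least one down: %.4f",
              n, all_nodes_available(0.999, n), any_node_down(0.999, n))
end
```

The "all up" column shrinks as n grows, which is the effect described above: any page that needs every node to respond gets less available with each node added.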
Proposal
Certain "special" web controller/action endpoints should be hardened to tolerate failure of a single Gitaly node.
This could be done by rescuing around certain operations and/or template renders.
We should also have integration tests to ensure that these endpoints are capable of handling failure.
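A minimal sketch of what rescuing around an operation could look like, together with a simulated single-node failure of the kind an integration test might exercise. This is plain Ruby rather than actual GitLab code: `GitalyUnavailable`, `with_gitaly_fallback`, `last_commit_title` and the node name file-12 are all hypothetical stand-ins, since the issue does not name the real error class or helpers.

```ruby
# Hypothetical stand-in for whatever error the Gitaly client raises when a
# node is unreachable; the real class name is not given in this issue.
class GitalyUnavailable < StandardError; end

# Run the block; if the Gitaly call fails, return the fallback instead of
# letting the exception bubble up into a 500 for the whole page.
def with_gitaly_fallback(fallback)
  yield
rescue GitalyUnavailable => e
  warn "Gitaly call failed, degrading: #{e.message}"
  fallback
end

# Hypothetical repository lookup that fails for projects on a broken node.
def last_commit_title(project, broken_nodes)
  raise GitalyUnavailable, "#{project[:node]} unreachable" if broken_nodes.include?(project[:node])

  "Latest commit on #{project[:name]}"
end

projects = [
  { name: "alpha", node: "file-12" },
  { name: "beta",  node: "file-13" }
]

# Simulate the incident: file-13 is down, every other node is healthy.
titles = projects.map do |project|
  with_gitaly_fallback(nil) { last_commit_title(project, ["file-13"]) }
end
```

Only the project on the failed node degrades to the fallback value; the healthy project is unaffected. Rescuing a narrow error class, rather than `StandardError`, keeps unrelated bugs visible.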
Ideally all pages would be able to handle these failures, but for endpoints such as MergeRequestController the reward-to-effort ratio would likely be too low.
Candidate Controllers
RootController
Dashboard::ProjectsController
ProjectsController
How would this look?
For example, if a TODO fails to render because of a Gitaly request failure, we could render alternative content. Artist's impression below:
While this is less than ideal (especially my HTML skills!), it's better than getting a 500 error for the entire site, which will block you from doing any work.
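In code, the per-TODO fallback might look roughly like the sketch below. It is plain Ruby using the standard library's ERB escaping helper, not the real GitLab views. The `Todo` struct, `GitalyUnavailable`, the `todo-unavailable` CSS class, the placeholder wording and the node name file-12 are assumptions made for illustration.

```ruby
require "erb"

# Hypothetical stand-in for the error raised when a Gitaly node is unreachable.
class GitalyUnavailable < StandardError; end

# Hypothetical TODO record; `node` is the Gitaly node holding the target repo.
Todo = Struct.new(:title, :node)

# Render one TODO row. The raise simulates a Gitaly request failing mid-render.
def render_todo(todo, broken_nodes)
  raise GitalyUnavailable, todo.node if broken_nodes.include?(todo.node)

  "<li>#{ERB::Util.html_escape(todo.title)}</li>"
end

# Render the list, swapping in placeholder content for rows that fail
# instead of failing the whole page.
def render_todos(todos, broken_nodes)
  rows = todos.map do |todo|
    begin
      render_todo(todo, broken_nodes)
    rescue GitalyUnavailable
      '<li class="todo-unavailable">This item is temporarily unavailable.</li>'
    end
  end
  "<ul>\n#{rows.join("\n")}\n</ul>"
end

todos = [Todo.new("Review docs & fix typos", "file-12"), Todo.new("Fix pipeline", "file-13")]
html = render_todos(todos, ["file-13"])
puts html
```

The rescue sits around each row rather than around the whole list, so a single failing repository costs one row instead of the entire dashboard.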