RootController, DashboardController and other "root-level" controllers should tolerate single Gitaly node failure
During incident production#1222, a single Gitaly node, file-13, failed hard. The machine could not be reset, rebooted or removed. We are currently awaiting cloud provider intervention to reset it.
If you were unlucky enough to have repositories on this file server (and open TODOs pointing to those repos) and attempted to visit https://gitlab.com, you would get a 500 error.
Proposal
Certain "special" endpoints should be hardened to tolerate failure of a single Gitaly node.
This could be done by rescuing Gitaly failures around certain operations and/or template renders.
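As a rough illustration of what rescuing around an operation could look like, here is a minimal, self-contained Ruby sketch. Everything in it is hypothetical: `GitalyNodeUnavailable` stands in for whatever error the Gitaly client actually raises, and `tolerate_gitaly_failure`, `commit_count` and the storage name `file-01` are made up for the example. It is not existing GitLab code.

```ruby
# Hypothetical stand-in for the error raised when a Gitaly node is
# unreachable; this is not a real GitLab class.
class GitalyNodeUnavailable < StandardError; end

# Runs the block and returns `fallback` if the Gitaly node behind it is down,
# so one failing file server does not turn the whole page into a 500.
def tolerate_gitaly_failure(fallback: nil)
  yield
rescue GitalyNodeUnavailable => e
  warn "Gitaly call failed, using fallback: #{e.message}"
  fallback
end

# Hypothetical repository lookup: anything stored on file-13 fails.
def commit_count(storage)
  raise GitalyNodeUnavailable, "#{storage} is unreachable" if storage == "file-13"

  42
end

healthy = tolerate_gitaly_failure(fallback: 0) { commit_count("file-01") }
broken  = tolerate_gitaly_failure(fallback: 0) { commit_count("file-13") }
puts "healthy=#{healthy} broken=#{broken}"
```

Only the one known failure type is rescued here, so unrelated bugs would still surface as errors rather than being silently swallowed.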
We should also have integration tests to ensure that these endpoints are capable of handling failure. cc @meks
Ideally all pages would be able to handle these failures, but for endpoints such as MergeRequestController the reward-to-effort ratio would likely be too low.
cc @jacobvosmaer-gitlab @zj-gitlab
Candidate Controllers
RootController
Dashboard::ProjectsController
ProjectsController
How would this look?
For example, if a TODO fails to render because of a Gitaly request failure, we could render alternative content. Artist's impression below:
While this is less than ideal (especially my HTML skills!), it's better than getting a 500 error for the entire site, which will block you from doing any work.
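The per-TODO fallback could be sketched roughly as follows. This is plain Ruby rather than a real Rails view, and all names in it (`GitalyNodeUnavailable`, `Todo`, `render_todo`, `render_todo_list`, the `todo-unavailable` CSS class and the storage name `file-01`) are assumptions made up for the illustration, not existing GitLab code.

```ruby
require "erb"

# Hypothetical stand-in for the error raised when a Gitaly node is
# unreachable; this is not a real GitLab class.
class GitalyNodeUnavailable < StandardError; end

# Hypothetical, simplified TODO: a title plus the storage its repo lives on.
Todo = Struct.new(:title, :storage)

# Renders one TODO. Rendering needs repository data, so it fails when the
# repository lives on the broken node.
def render_todo(todo)
  raise GitalyNodeUnavailable, "#{todo.storage} is unreachable" if todo.storage == "file-13"

  "<li>#{ERB::Util.html_escape(todo.title)}</li>"
end

# Renders the whole list, swapping in placeholder content for any TODO whose
# Gitaly node is down instead of failing the entire page.
def render_todo_list(todos)
  items = todos.map do |todo|
    begin
      render_todo(todo)
    rescue GitalyNodeUnavailable
      '<li class="todo-unavailable">This item is temporarily unavailable.</li>'
    end
  end
  "<ul>#{items.join}</ul>"
end

todos = [Todo.new("Review MR", "file-01"), Todo.new("Fix bug", "file-13")]
puts render_todo_list(todos)
```

Rescuing inside the loop rather than around it means TODOs on healthy nodes still render normally, and only the affected entries are replaced.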