RootController, DashboardController and other "root-level" controllers should tolerate single Gitaly node failure

During incident production#1222, a single Gitaly node, file-13, failed hard. The machine could not be reset, rebooted, or removed. We are currently awaiting cloud provider intervention to reset it.

If you were unlucky enough to have repositories on this file server (and open TODOs pointing to those repos) and attempted to visit https://gitlab.com, you would get a 500 error:

image

Proposal

Certain "special" endpoints should be hardened to tolerate the failure of a single Gitaly node.

This could be done by rescuing around certain operations and/or template renders.
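As a rough illustration of "rescuing around certain operations", here is a minimal plain-Ruby sketch. `GitalyUnavailableError` and `with_gitaly_fallback` are hypothetical names invented for this example; the real exception raised by a failed Gitaly call, and the right place to rescue it, would need to be confirmed in the codebase.

```ruby
# Hypothetical stand-in for whatever exception a failed Gitaly call raises.
class GitalyUnavailableError < StandardError; end

# Runs the block. If the Gitaly-backed operation fails, logs the failure and
# returns the supplied fallback instead of letting the exception bubble up
# and turn the whole page into a 500.
def with_gitaly_fallback(fallback)
  yield
rescue GitalyUnavailableError => e
  warn "Gitaly request failed: #{e.message}"
  fallback
end

# A healthy call passes its result straight through.
healthy = with_gitaly_fallback("unavailable") { "abc123" }

# A failing call is contained and replaced by the fallback value.
failed = with_gitaly_fallback("unavailable") do
  raise GitalyUnavailableError, "storage node unreachable"
end
```

Only the specific Gitaly failure is rescued here, so unrelated bugs would still surface as errors rather than being silently swallowed.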

We should also have integration tests to ensure that these endpoints are capable of handling failure. cc @meks

Ideally, all pages would be able to handle these failures, but for endpoints such as MergeRequestController the reward-to-effort ratio would likely be too low.

cc @jacobvosmaer-gitlab @zj-gitlab

Candidate Controllers

  • RootController
  • Dashboard::ProjectsController
  • ProjectsController

How would this look?

For example, if a TODO fails to render because of a Gitaly request failure, we could render alternative content in its place. Artist's impression below:

image

While this is less than ideal (especially my HTML skills!), it's better than getting a 500 error for the entire site, which blocks you from doing any work.
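The per-TODO fallback could look something like the following sketch, where one failing item is replaced with placeholder markup while the rest of the list still renders. Everything here is an assumption made for illustration: the `Todo` struct, its `repo_healthy` flag (simulating whether the backing Gitaly node responds), the `GitalyUnavailableError` class, and the markup are not taken from the GitLab codebase.

```ruby
require "erb"

# Hypothetical stand-in for whatever exception a failed Gitaly call raises.
class GitalyUnavailableError < StandardError; end

# Hypothetical TODO record; repo_healthy simulates whether the Gitaly node
# holding the TODO's repository is reachable.
Todo = Struct.new(:title, :repo_healthy)

# Renders one TODO. Raises if the repository data cannot be fetched.
def render_todo(todo)
  raise GitalyUnavailableError, "repository storage unreachable" unless todo.repo_healthy

  "<li>#{ERB::Util.html_escape(todo.title)}</li>"
end

# Renders the whole list, rescuing per item so that a single failure
# degrades one row instead of the entire page.
def render_todo_list(todos)
  items = todos.map do |todo|
    begin
      render_todo(todo)
    rescue GitalyUnavailableError
      '<li class="todo-unavailable">This item is temporarily unavailable</li>'
    end
  end
  "<ul>#{items.join}</ul>"
end

todos = [
  Todo.new("Review MR on project A", true),
  Todo.new("Fix pipeline on project B", false) # repository on the failed node
]
html = render_todo_list(todos)
```

Rescuing inside the loop rather than around the whole list is the design choice that matters: users with a mix of healthy and affected repositories still see everything that can be shown.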

Edited by Rachel Nienaber