dashboards.gitlab.net (Grafana) unresponsive during S1 incidents

Summary

During severity 1 (S1) incidents, dashboards.gitlab.net usually becomes unresponsive. We think this is due to the flood of requests from everyone watching the incident.

Related Incident(s)

Originating issue(s): production#6253 (closed)

Desired Outcome/Acceptance criteria

  • dashboards.gitlab.net is available during high load.

Proposals/ideas:

  • Investigate whether dashboard auto-refresh can be disabled by default.
    • 5m
  • The EOC reminds people not to manually refresh dashboards, to ease load on our monitoring. 🍅
    • How to communicate this out? Reliability Discussion and #infra-lounge?
    • Automate using woodhouse?
  • Investigate whether Grafana scales with the increased workload. If it does not, can we rebuild it so that it does?

🍅 : The idea doesn't scale/requires manual work.
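
For the auto-refresh proposal, here is a minimal sketch of what changing the default could look like. It assumes the default is driven by the top-level `refresh` field of the Grafana dashboard JSON model, where an empty string means auto-refresh is off and a value such as `5m` sets the interval. The helper name `set_default_refresh` and the example dashboard are hypothetical. How our dashboards are actually generated and deployed is not covered here.

```python
import copy
import json


def set_default_refresh(dashboard: dict, interval: str = "") -> dict:
    """Return a copy of a dashboard JSON model with a new default auto-refresh.

    Assumption: the top-level "refresh" field controls the default.
    interval="" turns auto-refresh off; "5m" refreshes every five minutes.
    """
    updated = copy.deepcopy(dashboard)  # leave the caller's dict untouched
    updated["refresh"] = interval
    return updated


# Hypothetical dashboard that currently refreshes every 30 seconds.
example = {"title": "Example overview", "refresh": "30s", "panels": []}

# Auto-refresh disabled by default.
print(json.dumps(set_default_refresh(example), indent=2))

# Alternative: keep auto-refresh, but at a 5m interval.
print(set_default_refresh(example, "5m")["refresh"])
```

This only changes what a viewer gets when they first open a dashboard. Anyone can still pick a shorter interval or refresh manually, so it would reduce load rather than cap it.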

Associated Services

Corrective Action Issue Checklist

  • link the incident(s) this corrective action arose out of
  • give context for what problem this corrective action is trying to prevent from re-occurring
  • assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • assign a priority (this will default to 'priority::4')