Service Status Dashboard
It is often important to communicate the status and health of your SaaS offering to customers. With the ability to comprehensively monitor, alert, and provide high-level metrics (see gitlab-ce#38373), we have much of the information needed to provide a simplified status service for customers:
- Whether the service is publicly reachable, and the latency involved (#3046 (closed))
- Whether smoke tests are passing in production (not just whether pages load) (#3554)
- Whether alerts are firing, and whether they have been acknowledged as "real" (#3555 (closed))
- Insight into individual micro-services or components
We can use this information and more to:
- Provide insight into current status (up/down), for each component of the service
- Provide a historical view into key metrics
  - Response times
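As a rough illustration of how the signals listed above could roll up into a per-component up/down status, here is a minimal sketch. The `ComponentSignals` type, its field names, and the rule that any failing signal marks a component as down are all assumptions made for illustration. They are not part of any existing GitLab API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ComponentSignals:
    """Hypothetical bundle of monitoring signals for one component."""
    name: str
    reachable: bool            # publicly reachable from outside
    smoke_tests_passing: bool  # production smoke tests are green
    acknowledged_alerts: int   # alerts an administrator confirmed as real


def component_status(signals: ComponentSignals) -> str:
    """Return "down" if any signal indicates a problem, otherwise "up".

    Assumption: unacknowledged alerts do not affect the public status.
    """
    if not signals.reachable:
        return "down"
    if not signals.smoke_tests_passing:
        return "down"
    if signals.acknowledged_alerts > 0:
        return "down"
    return "up"


def dashboard(components: Iterable[ComponentSignals]) -> Dict[str, str]:
    """Map each component name to its current status."""
    return {c.name: component_status(c) for c in components}


if __name__ == "__main__":
    print(dashboard([
        ComponentSignals("web", True, True, 0),
        ComponentSignals("registry", True, False, 0),
    ]))
```

A real dashboard would likely want more than two states (for example, a degraded state), but the proposal only names up and down, so the sketch stops there.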
There are some considerations around keeping the status service itself available during certain events:
- Updates of GitLab
- Infrastructure issues if GitLab is hosted in a single provider/region
We can start down this road with the following steps:
- Build a public page to surface this information cleanly, perhaps leveraging GitLab Pages in some unique form.
- Enhance the alerting support to allow an administrator to acknowledge that an issue is in fact real, and later to mark it as resolved
- Once an event has been marked as valid, update and continue to update the service status dashboard until resolved
- Optionally, provide visibility into metrics that do not require manual intervention, such as worldwide latency and smoke tests
  - Eventually these checks could "vote": if more than X% of them turn red, an event is created automatically
- Provide a method for text and other information to be shared on this page, tweets to Twitter, etc.
- If there is significant uptake, we could stand up our own SaaS service to host the updates, and perhaps eventually execute the runners for testing
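The "vote" idea in the steps above could be sketched roughly as follows. This is only an illustration: the function name `should_create_event`, the check names, and the choice to treat an empty set of checks as "no event" are assumptions, and the threshold stands in for the X% left unspecified above.

```python
from typing import Mapping


def should_create_event(check_results: Mapping[str, bool],
                        threshold_pct: float) -> bool:
    """Return True if more than threshold_pct percent of checks are red.

    check_results maps a check name to True (green) or False (red).
    Assumption: with no checks at all, no event is created.
    """
    if not check_results:
        return False
    red = sum(1 for passing in check_results.values() if not passing)
    red_pct = red / len(check_results) * 100
    # Strictly "more than", matching the wording of the proposal.
    return red_pct > threshold_pct


if __name__ == "__main__":
    # Hypothetical automated checks; the names are illustrative only.
    checks = {
        "latency_us": False,
        "latency_eu": False,
        "latency_apac": True,
        "smoke_tests": True,
    }
    print(should_create_event(checks, threshold_pct=40))
```

Because the comparison is strict, a threshold of 50 with exactly half the checks red would not create an event. Whether that boundary should be inclusive is a design choice the proposal leaves open.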