[Meta] Monitoring Plan

Note: This issue is intended to serve as a rough outline for planning and prioritizing future Monitoring/Prometheus features, and is subject to change at any time. It should not be taken as a commitment to deliver any particular feature, or used for sales purposes.

Monitoring Plan

At the start of this year, we laid out a bold vision for the addition of monitoring capabilities to GitLab. As part of this project we will be focusing our efforts in two key areas: comprehensive monitoring of GitLab itself and integrated monitoring of end-user applications. In keeping with our philosophy of releasing early and often, we are planning to make steady progress throughout 2017 along both of these axes.

[TODO] Insert rationale for why each of these is important.

To begin the year, GitLab will release an initial integration with Prometheus. We will package Prometheus, together with some of the monitoring tools we use to operate GitLab.com, as part of the Omnibus package. This will make it easier for all of our customers to have SaaS-quality monitoring of their GitLab server on their own network.

With Prometheus packaged, we will then be able to leverage our recent auto deployment capability on Kubernetes to begin monitoring end-user applications automatically. By combining GitLab Auto Deploy with Kubernetes, we are able to offer monitoring of system resources without any additional software or requirements on the end-user application. It just works!

In Q1, these metrics will focus on two critical system resources, memory and CPU, which we will display in our Environment and Merge Request workflows.
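As a rough illustration of the memory and CPU metrics mentioned above, the sketch below builds the two PromQL queries and the corresponding Prometheus HTTP API URLs. It assumes the containers are scraped via cAdvisor (which exports `container_memory_usage_bytes` and `container_cpu_usage_seconds_total`), that Prometheus is reachable at `http://localhost:9090`, and that pods carry a hypothetical `environment` label. None of these details are settled by this plan.

```python
from urllib.parse import urlencode

# Assumed address of the Prometheus server; adjust for the actual install.
PROMETHEUS_URL = "http://localhost:9090"


def memory_query(environment: str) -> str:
    """PromQL for total memory in use by an environment's containers.

    The `environment` label is a hypothetical way to select the pods
    belonging to one GitLab environment.
    """
    return f'sum(container_memory_usage_bytes{{environment="{environment}"}})'


def cpu_query(environment: str, window: str = "5m") -> str:
    """PromQL for CPU cores used, averaged over `window`."""
    return (
        "sum(rate(container_cpu_usage_seconds_total"
        f'{{environment="{environment}"}}[{window}]))'
    )


def instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    for promql in (memory_query("production"), cpu_query("production")):
        print(instant_query_url(promql))
```

The URLs can be fetched with any HTTP client; the response is JSON whose `data.result` field holds the current value for each query.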

In Q2, we plan to further improve our ability to monitor customer apps, building on our earlier work. First, we will add support for end-users who do not use our Auto Deploy feature, increasing the number of projects that can be supported.

Monitoring of Deploy Boards

With the arrival of deploy boards, and in particular support for features like Canary Deployments, the power of an integrated monitoring solution becomes clear. GitLab will be able to monitor the success of the canary nodes and, if necessary, alert or take further action such as rolling back the deploy.
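One way the canary check described above could work is to compare the canary's error rate against the stable fleet and pick an action from the ratio. The sketch below is purely illustrative: the function name, the thresholds, and the choice of error rate as the health signal are all assumptions, not a planned design.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    ALERT = "alert"
    ROLLBACK = "rollback"


def evaluate_canary(
    canary_error_rate: float,
    baseline_error_rate: float,
    alert_ratio: float = 1.5,      # hypothetical threshold
    rollback_ratio: float = 3.0,   # hypothetical threshold
    min_error_rate: float = 0.01,  # ignore noise below 1% errors
) -> Action:
    """Decide what to do with a canary based on relative error rates."""
    # A canary with a negligible error rate is healthy regardless of ratio.
    if canary_error_rate < min_error_rate:
        return Action.CONTINUE

    # Floor the baseline so a near-zero baseline does not inflate the ratio.
    ratio = canary_error_rate / max(baseline_error_rate, min_error_rate)

    if ratio >= rollback_ratio:
        return Action.ROLLBACK
    if ratio >= alert_ratio:
        return Action.ALERT
    return Action.CONTINUE
```

For example, a canary at 2% errors against a 1% baseline would raise an alert, while 5% against 1% would trigger a rollback.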

Backlog

Alerting

For the GitLab server itself (will hopefully slot in earlier, perhaps Q2, based on Grafana/Prometheus config)
- https://gitlab.com/gitlab-org/gitlab-ce/issues/27495
For specific customer-monitored metrics

Issue integration

Automatically create issues for alerts, including a snapshot of the problematic performance metric (and deploy boards)
Let users create issues based on the performance metrics they are seeing, capture metrics around that timestamp, and allow commenting for collaboration
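To make the issue-integration idea concrete, here is a minimal sketch of turning an alert into an issue payload that records the metric window around the alert's timestamp. The `title`, `description`, and `labels` fields match the parameters accepted by the GitLab Issues API; the function name, the alert fields, the 15-minute window, and the `monitoring` label are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def alert_issue_payload(
    alert_name: str,
    metric: str,
    value: float,
    fired_at: datetime,
    window: timedelta = timedelta(minutes=15),  # hypothetical capture window
) -> dict:
    """Build the body for a new issue describing a fired alert."""
    start = fired_at - window
    end = fired_at + window
    description = "\n".join(
        [
            f"Alert `{alert_name}` fired at {fired_at.isoformat()}.",
            f"Metric: `{metric}` = {value}",
            f"Captured window: {start.isoformat()} to {end.isoformat()}",
        ]
    )
    return {
        "title": f"Alert: {alert_name}",
        "description": description,
        "labels": "monitoring",  # hypothetical label
    }


if __name__ == "__main__":
    fired = datetime(2017, 3, 1, 12, 0, tzinfo=timezone.utc)
    print(alert_issue_payload("HighMemory", "container_memory_usage_bytes", 2.1e9, fired))
```

The captured window gives the snapshot a fixed time range, so collaborators commenting on the issue later all look at the same data.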

Export of Performance Data

In the event that Support needs to collect performance data, we should provide a way to export this data and send it to GitLab Support for analysis.