Discussion: notification system metrics
Why is it important?
We are currently aiming to deliver Container Registry GMAU: Track usage (&8213), which will in turn unblock https://gitlab.com/groups/gitlab-org/-/epics/8732+. Usage and data visibility are also part of our FY24 roadmap. The data we export is consumed by Rails and can in turn be used to create usage dashboards. This makes the notification system increasingly important, so we need to make sure it is reliable.
Notification system
The current notification system for the container registry is a simple component that sends events to one or more configured endpoints. The registry can be configured to send such notifications as follows:
notifications:
  events:
    includereferences: true
  endpoints:
    - name: alistener
      disabled: false
      url: https://my.listener.com/event
      headers: <http.Header>
      timeout: 1s
      threshold: 10
      backoff: 1s
      ignoredmediatypes:
        - application/octet-stream
      ignore:
        mediatypes:
          - application/octet-stream
        actions:
          - pull
diagram source
sequenceDiagram
Client->>+Registry: Action: push | pull | delete
Registry ->>+ Registry: is action or media type ignored?
Registry->>+Endpoint(s): Event
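The "is action or media type ignored?" step in the diagram can be illustrated with a minimal sketch. It is not the registry's actual code: the `Event` and `IgnoreRules` names are hypothetical, and it assumes that an event is dropped when either its action or its media type appears in the endpoint's `ignore` block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified event: only the fields the filter needs."""
    action: str      # e.g. "push", "pull", "delete"
    media_type: str  # media type of the event target


@dataclass(frozen=True)
class IgnoreRules:
    """Mirrors the `ignore` block of an endpoint in the configuration."""
    mediatypes: frozenset = frozenset()
    actions: frozenset = frozenset()

    def should_ignore(self, event: Event) -> bool:
        # Assumption: matching EITHER list is enough to drop the event.
        return (
            event.action in self.actions
            or event.media_type in self.mediatypes
        )


# Same values as the example configuration.
rules = IgnoreRules(
    mediatypes=frozenset({"application/octet-stream"}),
    actions=frozenset({"pull"}),
)
```

With these rules, a manifest push would be forwarded to the endpoint, while any pull, and any event whose target is `application/octet-stream`, would be filtered out before delivery.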
In the current implementation, an event lives in memory until it is delivered to the configured endpoints.
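To make the in-memory behaviour concrete, here is a minimal sketch of a queue that keeps an event until delivery succeeds. The `InMemoryQueue` class and the `deliver` callback are hypothetical stand-ins, not the registry's implementation, and the sketch leaves out the `timeout`, `threshold` and `backoff` settings.

```python
from collections import deque
from typing import Callable, Deque


class InMemoryQueue:
    """Hypothetical stand-in for an in-memory notification queue."""

    def __init__(self, deliver: Callable[[dict], bool]) -> None:
        # `deliver` returns True when the endpoint accepted the event.
        self._deliver = deliver
        self._pending: Deque[dict] = deque()

    def enqueue(self, event: dict) -> None:
        self._pending.append(event)

    @property
    def pending(self) -> int:
        """Number of events still waiting for delivery."""
        return len(self._pending)

    def flush_once(self) -> int:
        """Deliver queued events in order, stopping at the first failure.

        Returns the number of events delivered in this pass.
        """
        delivered = 0
        while self._pending:
            if not self._deliver(self._pending[0]):
                break  # the event stays queued for a later attempt
            self._pending.popleft()
            delivered += 1
        return delivered
```

Because nothing in this sketch is persisted, any event still pending when the process exits is lost. That property is what makes the size of the pending queue worth monitoring.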
Supported events
The events that we currently support are:
- Manifest pushed (with/without tag information)
- Manifest deleted
- Tag deleted
- Blob pushed *
- Blob pulled *
- Blob mounted *
- Blob deleted *
* Although these events are being ignored on gitlab.com, they will be important for https://gitlab.com/groups/gitlab-org/-/epics/8732+.
Reliability of the notification system
We recently introduced a new Grafana dashboard, registry: Webhook Notifications Detail, via Add registry webhook notifications dashboard (gitlab-com/runbooks!5164 - merged). It shows some data about the notifications being sent out of the registry.
Current metrics
We have documented the current metrics being exported by the notification package. These are:
- registry_notifications_events_total{type="push"} (counter)
- registry_notifications_pending_total (gauge)
- registry_notifications_status_total{code="200 OK"} (counter)
- registry_notifications_errors_total (counter) (not used in dashboard yet)
These allow us to graph:
- rates of events sent per second
- success rate
- events queued for sending
- total counts of events sent in a given time
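As a rough illustration of how the first two graphs are derived from counters, the sketch below computes a per-second rate and a success ratio from raw counter samples. The helper names are hypothetical. It assumes that status label values start with the numeric HTTP code (as in `200 OK`) and that a counter which decreases has been reset to zero, for example by a restart.

```python
from typing import Dict, Optional


def per_second_rate(previous: float, current: float, interval_seconds: float) -> float:
    """Average per-second increase of a counter between two samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Assumption: a decreasing counter was reset and restarted from zero.
    increase = current - previous if current >= previous else current
    return increase / interval_seconds


def success_ratio(status_increase: Dict[str, float]) -> Optional[float]:
    """Share of 2xx responses among all responses in a time window.

    `status_increase` maps a status label (e.g. "200 OK") to how much the
    corresponding counter grew in the window. Returns None when no
    responses were recorded, so "no traffic" is not confused with
    "everything failed".
    """
    total = sum(status_increase.values())
    if total == 0:
        return None
    ok = sum(v for code, v in status_increase.items() if code.startswith("2"))
    return ok / total
```

The ratio only reflects responses that were actually received. A delivery attempt that never gets a response (a timeout, for instance) would not show up in a status-code counter and would have to be tracked through the errors counter instead.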
Problem
We inherited these metrics from the original implementation inside the registry. However, the nomenclature and some of the exported metrics are not clear. Some discussions around this can be found here and here.
Although we have a few graphs now, the data they show is not as clear or useful as it could be. For example, the Event delivery failure rate is usually zero, and the Event delivery error rate is also close to zero most of the time, even though we can see ups and downs in the graph.
A more informative graph would be Events per second (by Status Code), but all it tells us is which response status codes we get back from the notification endpoint.
Solution
This issue is meant for discussing the current metrics exposed by the notification system and for proposing new metrics, as well as updates to or removal of the existing ones. Ideally, we should create follow-up issues based on these conversations.
Key points to focus on:
- what do we want to be able to monitor?
- when do we want to be alerted?
- are the current metrics clear enough?
- do we need to recreate the metrics from scratch?
- other thoughts/ideas
@gitlab-org/ci-cd/package-stage/container-registry-group please have a look at this and feel free to start new threads for any different ideas you may have.
I'm setting a due date for the end of the week so that I can gather as much info as possible, but of course we can keep iterating as we move forward.
What's next
We are already enhancing these metrics via the following issues:
- chore(notifications): enhance metrics with acti... (#851 - closed)
- Container Registry: create webhook notification... (gitlab-com/runbooks#101)
- Create alerts for webhook notification error tr... (#876 - closed)
Once we have better visibility into the system, we may decide to focus on Webhook notifications with at-least-once delive... (&9161) before moving on.
Other
Related to: