Discuss alternatives for measuring "global metrics" in the Rails application with Prometheus
When building the `gitlab_maintenance_mode` metric, we opted to expose it to Prometheus through the Sidekiq exporter (see !114981). We chose the Sidekiq exporter because the metric does not depend on any other context such as the request or a specific project/user; for all practical purposes, it is "global" to the application itself.
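As a rough sketch of that setup, assuming the plain `prometheus-client` gem rather than GitLab's own metrics wrappers; the worker name and the settings lookup are illustrative, not the actual implementation:

```ruby
# Minimal sketch of the current approach: a periodic job sets a global gauge
# in the process-local registry, and the Sidekiq exporter serves that registry
# to Prometheus on scrape. Worker name and settings call are illustrative.
require 'prometheus/client'

class MaintenanceModeMetricWorker
  GAUGE = Prometheus::Client.registry.gauge(
    :gitlab_maintenance_mode,
    docstring: 'Whether GitLab maintenance mode is enabled (1) or not (0)'
  )

  # Scheduled (for example via sidekiq-cron) to run periodically on a Sidekiq pod.
  def perform
    enabled = Gitlab::CurrentSettings.maintenance_mode # assumption: application setting lookup
    GAUGE.set(enabled ? 1 : 0)
  end
end
```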
@reprazent explained the shortcomings of that approach below:
The scrape interval is every 15 seconds, and we use the default `query.lookback_delta` of 5 minutes. This means that a value will disappear if it wasn't scraped in the last 5 minutes. One thing to consider with these kinds of gauges, which we allow to be set from one pod exclusively, is the following:
- 00:00: `pod_1` sets the gauge to 1. Available metrics: `gitlab_maintenance_mode{pod="pod_1"} 1`
- 00:03: `pod_2` sets the gauge to 0. Available metrics: `gitlab_maintenance_mode{pod="pod_1"} 1` and `gitlab_maintenance_mode{pod="pod_2"} 0`
- 00:06: scrape information: `gitlab_maintenance_mode{pod="pod_2"} 0`
So this means that for a brief moment in time, we don't know which of those values is correct. For a maintenance mode flag that changes very infrequently this might be fine: we could query around it using a `max` and take the worst-case report. But I don't think this approach can be a blanket solution for other kinds of state that we want to gather from the application into metrics.

A better approach could be to have this information scraped through an exporter (for example https://gitlab.com/gitlab-org/gitlab-exporter/) that reads the information directly from a datasource on scrape.
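As a rough illustration of that scrape-time idea (this is not gitlab-exporter's actual code; the server, port, and the `application_settings.maintenance_mode` query are assumptions), a stand-alone probe could read the value straight from the database on every scrape, so the reported value can never go stale the way a gauge cached in one Sidekiq pod can:

```ruby
# Illustrative exporter that answers each scrape by querying the datasource
# directly. Class name, port, and SQL are assumptions for the sketch.
require 'webrick'
require 'pg'

class MaintenanceModeExporter < WEBrick::HTTPServlet::AbstractServlet
  def do_GET(_request, response)
    conn = PG.connect(dbname: 'gitlabhq_production') # assumed connection details
    enabled = conn.exec('SELECT maintenance_mode FROM application_settings LIMIT 1')
                  .getvalue(0, 0) == 't'
    response['Content-Type'] = 'text/plain; version=0.0.4'
    response.body = <<~METRICS
      # HELP gitlab_maintenance_mode Whether maintenance mode is enabled
      # TYPE gitlab_maintenance_mode gauge
      gitlab_maintenance_mode #{enabled ? 1 : 0}
    METRICS
  ensure
    conn&.close
  end
end

server = WEBrick::HTTPServer.new(Port: 9168)
server.mount('/metrics', MaintenanceModeExporter)
server.start
```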
Or we could consider pushing the metrics into Prometheus through a push-gateway: https://prometheus.io/docs/practices/pushing/
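For comparison, the push-gateway variant could look roughly like the sketch below, assuming a recent version of the `prometheus-client` gem; the gateway URL and job name are placeholders:

```ruby
# Rough sketch of the push-gateway alternative: after changing the maintenance
# state, the application pushes the gauge to a Pushgateway, which Prometheus
# then scrapes. Gateway URL and job name are placeholders.
require 'prometheus/client'
require 'prometheus/client/push'

registry = Prometheus::Client::Registry.new
gauge = registry.gauge(
  :gitlab_maintenance_mode,
  docstring: 'Whether GitLab maintenance mode is enabled (1) or not (0)'
)
gauge.set(1)

Prometheus::Client::Push.new(
  job: 'gitlab-global-metrics',
  gateway: 'http://pushgateway.example.com:9091'
).add(registry)
```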
I wonder if running this job every minute is too frequent.

From the perspective of the lookback delta, I don't think we need to run this job every minute; anything less than 5 minutes should be fine.
Because more than one machine can run Sidekiq with the exporter turned on, when you switch the maintenance state you may briefly see both the previous and the new value in Prometheus, depending on which machine gathered the metric. This should clean up and fix itself as further by-the-minute executions happen, but for that brief period you may not be sure what the actual state is.
@reprazent clarifies the push-gateway suggestion below:
Just for clarification: push-gateway is a Prometheus-ism (https://prometheus.io/docs/practices/pushing/) and it's not generally recommended. `gitlab-exporter` is a GitLab component that exports GitLab-specific metrics and is scraped by Prometheus. It already queries the database and `redis-sidekiq`. So perhaps setting all of these kinds of metrics together in a hash in Redis, and then exporting that using `gitlab-exporter`, is a valid approach? I think that's slightly nicer than the push approach. I'm not sure dedicated instances are running `gitlab-exporter`, so I think the approach here is your best bet to get unblocked quickly.
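A minimal sketch of that Redis-hash idea, with made-up key and field names and a local Redis; the "exporter side" below only stands in for what a gitlab-exporter probe would do on scrape:

```ruby
# Sketch of the Redis-hash idea: gitlab-rails writes global values into one
# Redis hash, and the exporter turns that hash into gauges on every scrape.
# The hash key, field names, and Redis connection details are made up.
require 'redis'

GLOBAL_METRICS_KEY = 'gitlab:global_metrics'

# Rails side: record the current state whenever it changes (or periodically).
def record_global_metric(name, value, redis: Redis.new)
  redis.hset(GLOBAL_METRICS_KEY, name, value)
end

# Exporter side: read the whole hash on scrape and render it as gauges in the
# Prometheus text exposition format.
def render_global_metrics(redis: Redis.new)
  redis.hgetall(GLOBAL_METRICS_KEY).map do |name, value|
    "# TYPE #{name} gauge\n#{name} #{value}"
  end.join("\n")
end

record_global_metric('gitlab_maintenance_mode', 1)
puts render_global_metrics
```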
I've proposed we consider using `gitlab-exporter` as a push gateway as well, so that, in addition to the ad-hoc queries it already runs internally to extract PostgreSQL metrics, `gitlab-rails` would push to it whatever "global metric" it needs to expose.
Having `gitlab-exporter` retrieve that type of data instead would require specific knowledge of the implementation, which would create a hard coupling from `gitlab-exporter` to a specific version/commit of `gitlab-rails`; that is not ideal. So it seems the push-gateway approach is the best direction.
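Purely as an illustration of the direction (gitlab-exporter has no push endpoint today; the `/push` path, port, and payload format below are invented for the sketch), the `gitlab-rails` side of such a push might look like:

```ruby
# Hypothetical sketch of gitlab-rails pushing a "global metric" to
# gitlab-exporter acting as a push gateway. gitlab-exporter does not expose
# such an endpoint today; the URL, path, and payload format are invented.
require 'net/http'
require 'uri'

def push_global_metric(name, value, exporter_url: 'http://localhost:9168')
  uri = URI.join(exporter_url, '/push') # hypothetical ingestion endpoint
  body = "# TYPE #{name} gauge\n#{name} #{value}\n"
  Net::HTTP.post(uri, body, 'Content-Type' => 'text/plain; version=0.0.4')
end

push_global_metric('gitlab_maintenance_mode', 0)
```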