I'm currently looking at a simple way to deploy this, ideally using something like Cloud Run and Managed Prometheus.
Trying to avoid a full-blown Kubernetes setup and custom domain names.
Of course, using the managed systems results in a series of caveats that must be worked through, such as auth limitations.
@nicholasklick my assumption is that this is to be a public demo, but to my knowledge we don't have dashboards integrated into GitLab under the Observability menu yet.
Is this going to be a problem? Do we just demo GOUI in isolation for the moment?
@joe-shaw demoing GOUI in isolation is ok since this would just be an internal demo for the team to show possibilities. How does that sound? We would not socialize it. I just want to understand if this is an easy low-hanging feature we can deliver.
Yes, it's fairly straightforward as an external datasource, as I will hopefully show. It'll be temporarily hosted on a public IP so I can show it in observe.gitlab.com rather than just running locally.
Doing this demo does feel somewhat tangential to actually having this component in our own cluster though, as discussed in https://gitlab.com/gitlab-org/opstrace/general/-/issues/89#note_1249885300. There are more barriers to getting that working, such as having long-running API access to GitLab without some user-initiated auth flow.
The default configuration may fetch too much data from the API, which became a problem for GitLab.com API requests. Context in gitlab-org/gitlab#387850 (closed)
I'm just writing up my findings now. I haven't found this project to work particularly well, partly due to API limitations but also due to some configuration problems.
The limitations of the exporter, particularly around configuring a sensible look-back window, make it problematic to use in its current state.
Even a moderate-sized group with a few active projects would cause a very large metric footprint that wouldn't necessarily show the user up-to-date information.
Pros:
- The project codebase is well structured and relatively up to date in terms of Go version and dependencies.
- The configuration seems very flexible, which may better suit our needs for group-specific settings.
- Supports an HA mode with a Redis cache, although we probably wouldn't use this.
- We can scrape group-level projects specifically, which would fit into our tenant model:
```yaml
# Dynamically fetch projects to monitor using a wildcard (optional)
wildcards:
  -
    # Define the owner of the projects we want to look for (optional)
    owner:
      # Name of the owner (required)
      name: foo
      # Owner kind: can be either 'group' or 'user' (required)
      kind: group
      # if owner kind is 'group', whether to include subgroups
      # or not (optional, default: false)
      include_subgroups: false
```
Cons:
- No official releases since August '22. This may indicate a lack of maintenance, especially since the latest release adds newer features that are yet to be completed.
- There are a few issues that indicate some features are not working as expected, which may be due to GitLab API changes. The maintainers are still active in the codebase and issues, but not frequently.
- When setting any of these values, the refs for branches and tags are consistent but are randomly missing lots of newer references. Age-related info also seems to be interpreted incorrectly.
  - It looks like the code uses the "Pipelines" API to infer these refs if these fields are set. It could be an issue with our API rather than this exporter.
- When running without these settings on a relatively large project, we quickly run into 429 (too many requests) errors, even with the inbuilt throttling. For https://gitlab.com/gitlab-org/gitlab-runner/ I could not get 100% of the last week's metrics before this happens.
- Does not explicitly handle 429 (too many requests) errors with a back-off strategy; instead, exceptions are logged when the library attempts to deserialize the response body. A rough sketch of what a back-off could look like follows this list.
- The number of API requests increases by an order of magnitude when we enable job-level data from pipelines. We need to be careful about doing this, especially if we're deploying many of these exporters (one per group or root namespace). This is throttled by default, the caveat of course being that it takes a long time for all metrics to become available.
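For reference, handling 429s with a back-off doesn't have to be complicated. Here's a minimal Go sketch of the general idea, purely illustrative and assuming a plain `net/http` client rather than the exporter's actual client code; it retries on 429, honouring the `Retry-After` header when present and falling back to exponential back-off otherwise:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// doWithBackoff retries a request when the server answers 429, up to maxRetries times.
func doWithBackoff(client *http.Client, req *http.Request, maxRetries int) (*http.Response, error) {
	backoff := time.Second
	for attempt := 0; ; attempt++ {
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxRetries {
			return resp, nil
		}
		// Prefer the server-provided Retry-After (in seconds) when present.
		wait := backoff
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				wait = time.Duration(secs) * time.Second
			}
		}
		resp.Body.Close()
		time.Sleep(wait)
		backoff *= 2 // exponential fallback when Retry-After is absent
	}
}

func main() {
	// Example: list pipelines for gitlab-runner via the REST API (public project, no auth).
	req, _ := http.NewRequest(http.MethodGet,
		"https://gitlab.com/api/v4/projects/gitlab-org%2Fgitlab-runner/pipelines", nil)
	resp, err := doWithBackoff(http.DefaultClient, req, 5)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```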
Resource Utilization
Using the following config:
```yaml
project_defaults:
  pull:
    refs:
      branches:
        enabled: true
        regexp: ".*"

# Public projects so we aren't exposing anything not already available
projects:
  - name: gitlab-org/gitlab-runner
```
just the one project, with all branches.
CPU usage for both the exporter and Prometheus remains low throughout runtime, with periodic spikes for API requests and scrapes.
Prometheus memory usage rises steadily given the number of metrics.
And when we start hitting 429 errors, the memory profile flattens.
The exporter memory footprint never exceeds that of Prometheus, which is good.
While initially promising, I ran into a few problems using this exporter effectively.
I was in the process of putting together a demo when I realized that a bunch of pipeline data was missing and that I could not cross-reference many of the recent pipelines for the projects I was testing.
If you'd still like me to finish the demo and show these issues, I certainly can.
Just having a think about how we could move this forward.
I think one of the limitations of the exporter is that its default operating mode is to try and scrape every pipeline and ref from a project. This gives some valuable historic information, but has a significant footprint for any project that has a decent age and level of activity.
We could investigate the GraphQL API to gather all this info in larger batches; you can see all the available fields here. In theory this could significantly reduce the number of in-flight requests.
Initially we would not get the depth of metrics, nor historical metrics, from projects, but it would likely be a stable long-term solution.
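To illustrate the idea (this is just a sketch of the approach, not existing exporter code), a single GraphQL request can return a page of pipelines with the fields we care about for one project; a `GITLAB_TOKEN` environment variable is assumed for auth:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

// One query returns up to 100 pipelines with their key attributes,
// instead of one REST request per pipeline.
const query = `
query($fullPath: ID!) {
  project(fullPath: $fullPath) {
    pipelines(first: 100) {
      nodes { id ref status duration createdAt }
      pageInfo { hasNextPage endCursor }
    }
  }
}`

func main() {
	body, _ := json.Marshal(map[string]any{
		"query":     query,
		"variables": map[string]string{"fullPath": "gitlab-org/gitlab-runner"},
	})

	req, _ := http.NewRequest(http.MethodPost, "https://gitlab.com/api/graphql", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+os.Getenv("GITLAB_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Decode just enough of the response to show the shape of the data.
	var out struct {
		Data struct {
			Project struct {
				Pipelines struct {
					Nodes []struct {
						ID        string  `json:"id"`
						Ref       string  `json:"ref"`
						Status    string  `json:"status"`
						Duration  float64 `json:"duration"`
						CreatedAt string  `json:"createdAt"`
					} `json:"nodes"`
				} `json:"pipelines"`
			} `json:"project"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	for _, p := range out.Data.Project.Pipelines.Nodes {
		fmt.Println(p.Ref, p.Status, p.Duration)
	}
}
```

Pagination via `pageInfo.endCursor` and per-ref filtering would still be needed, but the request count should drop substantially compared with walking the REST endpoints.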
@joe-shaw Great ideas. I had another, similar idea: mimicking the Datadog CI Visibility integration. They use event hooks to subscribe to, e.g., pipeline events and convert them into metrics and traces.
I did some initial research, reading the code and analysing the flow, in &41 (comment 1220256544). Maybe this could be a path with OpenTelemetry: a small GitLab integration app that sends the data to an OTel endpoint, which itself has plugins to ingest and transform the data. cc @kbychu
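To make that concrete, here's a rough Go sketch of the webhook-to-OTel idea (my own illustration under assumptions, not Datadog's integration nor existing GitLab code): a tiny HTTP receiver that turns GitLab `Pipeline Hook` events into an OpenTelemetry histogram metric. Meter provider / OTLP exporter wiring and the trace side are omitted:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// pipelineEvent captures the few fields we need from GitLab's pipeline webhook payload.
type pipelineEvent struct {
	ObjectAttributes struct {
		Status   string  `json:"status"`
		Duration float64 `json:"duration"`
		Ref      string  `json:"ref"`
	} `json:"object_attributes"`
	Project struct {
		PathWithNamespace string `json:"path_with_namespace"`
	} `json:"project"`
}

func main() {
	// Without a configured MeterProvider this is a no-op; in a real app you'd
	// wire up an OTLP exporter pointing at the collector endpoint.
	meter := otel.Meter("gitlab-ci-webhook")
	durations, err := meter.Float64Histogram("gitlab_ci_pipeline_duration_seconds")
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-Gitlab-Event") != "Pipeline Hook" {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		var ev pipelineEvent
		if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		// Record one observation per pipeline event, labelled by project, ref and status.
		durations.Record(r.Context(), ev.ObjectAttributes.Duration,
			metric.WithAttributes(
				attribute.String("project", ev.Project.PathWithNamespace),
				attribute.String("ref", ev.ObjectAttributes.Ref),
				attribute.String("status", ev.ObjectAttributes.Status),
			))
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

This pushes the cost onto event delivery rather than API polling, which sidesteps the rate-limiting problems above, at the price of no historical back-fill.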
We can offer a relatively simple exporter, but what about contributing to the project to complete things like `most_recent` and `max_age_seconds`, and other settings that limit how much it relies on the GitLab APIs?
@kbychu we certainly could try and fix the existing exporter for our use cases.
My hesitancy, however, would be around how quickly we could iterate, especially given there are a number of open PRs in the project and, aside from Dependabot's automated updates, the most recent material code change PR was in October.
Another avenue would be to reach out to the maintainers and see if they would allow members of our team to join.
heya, absolutely! Indeed, I did not have much time to look into the project over the past year. I am happy to help look into some of the blockers you may have, as well as welcome new maintainers!
@mvisonneau if you had time to look over the above findings and give any feedback that'd be great.
@kbychu we should probably have a spike to onboard the project and try to fix some of the issues we've found. We would of course be careful not to break any backwards compatibility with the exporter. Once we're comfortable with it, we can look into a process of maintainership with @mvisonneau.