I'm currently looking at a simple way to deploy this, ideally using something like Cloud Run and Managed Prometheus.
Trying to avoid a full-blown Kubernetes setup and custom domain names.
Of course, using the managed systems results in a series of caveats that must be worked through, such as auth limitations.
@nicholasklick my assumption is that this is to be a public demo, but to my knowledge we don't have dashboards integrated into GitLab under the Observability menu yet.
Is this going to be a problem? Do we just demo GOUI in isolation for the moment?
@joe-shaw demoing GOUI in isolation is ok since this would just be an internal demo for the team to show possibilities. How does that sound? We would not socialize it. I just want to understand if this is an easy low-hanging feature we can deliver.
Yes, it's fairly straightforward as an external datasource, as I will hopefully show. It'll be temporarily hosted on a public IP so I can show it in observe.gitlab.com rather than just running locally.
Doing this demo does feel somewhat tangential to actually having this component in our own cluster though, as discussed in https://gitlab.com/gitlab-org/opstrace/general/-/issues/89#note_1249885300. There are more barriers to getting that working, such as having long-running API access to GitLab without some user-initiated auth flow.
The default configuration may fetch too much data from the API, which became a problem for GitLab.com API requests. Context in gitlab-org/gitlab#387850 (closed)
I'm just writing up my findings now. I haven't found this project to work particularly well, partly due to API limitations but also due to some configuration problems.
The limitations of the exporter, particularly around configuring a sensible look-back window, make it problematic to use in its current state.
Even a moderate-sized group with a few active projects would cause a very large metric footprint that wouldn't necessarily show the user up-to-date information.
Pros:
- The project codebase is well structured and relatively up to date in terms of Go version and dependencies.
- The configuration seems very flexible, which may better suit our needs for group-specific settings.
- Supports an HA mode with a Redis cache, although we probably wouldn't use this.
- We can scrape group-level projects specifically, which would fit into our tenant model:
```yaml
# Dynamically fetch projects to monitor using a wildcard (optional)
wildcards:
  -
    # Define the owner of the projects we want to look for (optional)
    owner:
      # Name of the owner (required)
      name: foo
      # Owner kind: can be either 'group' or 'user' (required)
      kind: group
      # if owner kind is 'group', whether to include subgroups
      # or not (optional, default: false)
      include_subgroups: false
```
Cons:
- No official releases since August '22. This may indicate a lack of maintenance, especially since the latest release adds newer features that are yet to be completed.
- There are a few issues that indicate some features are not working as expected, which may be due to GitLab API changes. The maintainers are still active in the codebase and issues, but not frequently.
- When setting any of these values, the refs for branches and tags are consistent but are randomly missing lots of newer references. Age-related info also seems to be interpreted incorrectly.
  - It looks like the code uses the "Pipelines" API to infer these refs if these fields are set. It could be an issue with our API rather than this exporter.
- When running without these settings on a relatively large project, we quickly run into 429 (too many requests) errors, even with the inbuilt throttling. For https://gitlab.com/gitlab-org/gitlab-runner/ I could not get 100% of the last week's metrics before this happens.
- Does not explicitly handle 429 (too many requests) errors with a back-off strategy; instead, exceptions are logged when the library attempts to deserialize the response body. A rough sketch of what a back-off could look like follows this list.
- The number of API requests increases by an order of magnitude when we enable job-level data from pipelines. We need to be careful about doing this, especially if we're deploying many of these exporters (one per group or root namespace). This is throttled by default, the caveat of course being that it takes a long time for all metrics to become available.
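For reference, handling 429s with a back-off doesn't have to be complicated. Here's a minimal Go sketch of the general idea, purely illustrative and assuming a plain `net/http` client rather than the exporter's actual client code; it retries on 429, honouring the `Retry-After` header when present and falling back to exponential back-off otherwise:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// doWithBackoff retries a request when the server answers 429, up to maxRetries times.
func doWithBackoff(client *http.Client, req *http.Request, maxRetries int) (*http.Response, error) {
	backoff := time.Second
	for attempt := 0; ; attempt++ {
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxRetries {
			return resp, nil
		}
		// Prefer the server-provided Retry-After (in seconds) when present.
		wait := backoff
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				wait = time.Duration(secs) * time.Second
			}
		}
		resp.Body.Close()
		time.Sleep(wait)
		backoff *= 2 // exponential fallback when Retry-After is absent
	}
}

func main() {
	// Example: list pipelines for gitlab-runner via the REST API (public project, no auth).
	req, _ := http.NewRequest(http.MethodGet,
		"https://gitlab.com/api/v4/projects/gitlab-org%2Fgitlab-runner/pipelines", nil)
	resp, err := doWithBackoff(http.DefaultClient, req, 5)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```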
Resource Utilization
Using the following config:
```yaml
project_defaults:
  pull:
    refs:
      branches:
        enabled: true
        regexp: ".*"

# Public projects so we aren't exposing anything not already available
projects:
  - name: gitlab-org/gitlab-runner
```
just the one project, with all branches.
CPU usage for both the exporter and Prometheus remains low throughout runtime, with periodic spikes for API requests and scrapes.
Prometheus memory usage rises steadily given the number of metrics.
And when we start hitting 429 errors, the memory profile flattens.
The exporter memory footprint never exceeds that of Prometheus, which is good.
While initially promising, I ran into a few problems using this exporter effectively.
I was in the process of putting together a demo when I realized that a bunch of pipeline data was missing and that I could not cross-reference many of the recent pipelines for the projects I was testing.
If you'd still like me to finish the demo and show these issues, I certainly can.
Just having a think about how we could move this forward.
I think one of the limitations of the exporter is that its default operating mode is to try and scrape every pipeline and ref from a project. This gives some valuable historic information, but has a significant footprint for any project that has a decent age and level of activity.
We could investigate the GraphQL API to gather all this info in larger batches; you can see all the available fields here. In theory this could significantly reduce the number of in-flight requests.
Initially we would not get the depth of metrics, nor historical metrics, from projects, but it would likely be a stable long-term solution.
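To illustrate the idea (this is just a sketch of the approach, not existing exporter code), a single GraphQL request can return a page of pipelines with the fields we care about for one project; a `GITLAB_TOKEN` environment variable is assumed for auth:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

// One query returns up to 100 pipelines with their key attributes,
// instead of one REST request per pipeline.
const query = `
query($fullPath: ID!) {
  project(fullPath: $fullPath) {
    pipelines(first: 100) {
      nodes { id ref status duration createdAt }
      pageInfo { hasNextPage endCursor }
    }
  }
}`

func main() {
	body, _ := json.Marshal(map[string]any{
		"query":     query,
		"variables": map[string]string{"fullPath": "gitlab-org/gitlab-runner"},
	})

	req, _ := http.NewRequest(http.MethodPost, "https://gitlab.com/api/graphql", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+os.Getenv("GITLAB_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Decode just enough of the response to show the shape of the data.
	var out struct {
		Data struct {
			Project struct {
				Pipelines struct {
					Nodes []struct {
						ID        string  `json:"id"`
						Ref       string  `json:"ref"`
						Status    string  `json:"status"`
						Duration  float64 `json:"duration"`
						CreatedAt string  `json:"createdAt"`
					} `json:"nodes"`
				} `json:"pipelines"`
			} `json:"project"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	for _, p := range out.Data.Project.Pipelines.Nodes {
		fmt.Println(p.Ref, p.Status, p.Duration)
	}
}
```

Pagination via `pageInfo.endCursor` and per-ref filtering would still be needed, but the request count should drop substantially compared with walking the REST endpoints.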
@joe-shaw Great ideas. I had another, similar idea: mimicking the Datadog CI Visibility integration. They use event hooks to subscribe to, e.g., pipeline events and convert them into metrics and traces.
I did some initial research, reading the code and analysing the flow, in &41 (comment 1220256544). Maybe this could be a path with OpenTelemetry: a small GitLab integration app that sends the data to an OTel endpoint, which itself has plugins to ingest and transform the data. cc @kbychu
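To make that concrete, here's a rough Go sketch of the webhook-to-OTel idea (my own illustration under assumptions, not Datadog's integration nor existing GitLab code): a tiny HTTP receiver that turns GitLab `Pipeline Hook` events into an OpenTelemetry histogram metric. Meter provider / OTLP exporter wiring and the trace side are omitted:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// pipelineEvent captures the few fields we need from GitLab's pipeline webhook payload.
type pipelineEvent struct {
	ObjectAttributes struct {
		Status   string  `json:"status"`
		Duration float64 `json:"duration"`
		Ref      string  `json:"ref"`
	} `json:"object_attributes"`
	Project struct {
		PathWithNamespace string `json:"path_with_namespace"`
	} `json:"project"`
}

func main() {
	// Without a configured MeterProvider this is a no-op; in a real app you'd
	// wire up an OTLP exporter pointing at the collector endpoint.
	meter := otel.Meter("gitlab-ci-webhook")
	durations, err := meter.Float64Histogram("gitlab_ci_pipeline_duration_seconds")
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-Gitlab-Event") != "Pipeline Hook" {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		var ev pipelineEvent
		if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		// Record one observation per pipeline event, labelled by project, ref and status.
		durations.Record(r.Context(), ev.ObjectAttributes.Duration,
			metric.WithAttributes(
				attribute.String("project", ev.Project.PathWithNamespace),
				attribute.String("ref", ev.ObjectAttributes.Ref),
				attribute.String("status", ev.ObjectAttributes.Status),
			))
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

This pushes the cost onto event delivery rather than API polling, which sidesteps the rate-limiting problems above, at the price of no historical back-fill.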
We can offer a relatively simple exporter, but what about contributing to the project to complete things like `most_recent` and `max_age_seconds`, and other settings that limit how much it relies on the GitLab APIs?
@kbychu we certainly could try and fix the existing exporter for our use cases.
My hesitancy, however, would be around how quickly we could iterate, especially given there are a number of open PRs in the project and, aside from Dependabot's automated updates, the most recent material code change PR was in October.
Another avenue would be to reach out to the maintainers and see if they would allow members of our team to join.
heya, absolutely! Indeed, I did not have much time to look into the project over the past year. I am happy to help look into some of the blockers you may have, as well as welcome new maintainers!
@mvisonneau if you had time to look over the above findings and give any feedback that'd be great.
@kbychu we should probably have a spike to onboard the project and try to fix some of the issues we've found. We would of course be careful not to break any backwards compatibility with the exporter. Once we're comfortable with it, we can look into a process of maintainership with @mvisonneau.