Scalable bulk data export


Context

GitLab unifies the DevOps lifecycle in one tool with one database and data model. This provides opportunities for comprehensive reporting across stages. However, accessing the data itself in bulk is difficult.

Both the REST API and the GraphQL API are paginated for performance reasons. Reporting on a large number of objects, such as Pipelines, Jobs, and Issues, therefore requires thousands of API calls. This extraction process either takes a very long time to run or, when parallelized across multiple threads, places a high load on the instance and risks performance problems.
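To make the overhead concrete, here is a minimal sketch of such a paginated extraction against the REST API. The instance URL, project ID, and token are placeholders; the REST API caps `per_page` at 100 records.

```python
# Minimal sketch: exporting all issues of one project via the paginated REST API.
# Requires the `requests` package; GITLAB_URL, PROJECT_ID and TOKEN are placeholders.
import requests

GITLAB_URL = "https://gitlab.example.com"   # placeholder instance URL
PROJECT_ID = 42                             # placeholder project ID
TOKEN = "glpat-..."                         # placeholder personal access token

def export_issues():
    issues, page = [], 1
    while True:
        resp = requests.get(
            f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/issues",
            headers={"PRIVATE-TOKEN": TOKEN},
            params={"per_page": 100, "page": page},  # 100 is the maximum page size
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        issues.extend(batch)
        page += 1  # one HTTP round trip per 100 records
    return issues
```

Fetching millions of records this way means tens of thousands of requests, which is exactly the trade-off described above: slow sequential extraction or heavy parallel load on the instance.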

Use cases

Known use cases of enterprise customers are:

  1. Auditing / Compliance: Audit the whole DevOps lifecycle to have a provable record that all required processes were followed. GitLab provides compliance controls, such as Merge Request approvals, that prescribe compliant workflows, but heavily regulated environments also need reports showing that these workflows were actually followed.
  2. Business intelligence: Create dashboards and calculate metrics and KPIs needed for business reporting. While GitLab provides a number of dashboards for these purposes, many enterprise customers require more flexible and customizable solutions.

Current solution

A feasible solution right now is reading from the database directly, either via SQL queries or using an export tool such as pg_dump. The resulting data can then be imported into a data warehouse for efficient and flexible reporting.
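A sketch of this workaround, assuming read-only access to the production database and the `psycopg2` driver (the connection string and column selection are illustrative only):

```python
# Minimal sketch of the current workaround: reading a table straight from the
# GitLab PostgreSQL database and dumping it to CSV for a data warehouse load.
import psycopg2

DSN = "host=db.example.com dbname=gitlabhq_production user=readonly"  # placeholder

with psycopg2.connect(DSN) as conn, conn.cursor() as cur, open("issues.csv", "w") as out:
    # Column names are internal schema details and can change between releases,
    # which is what makes this approach brittle.
    cur.copy_expert(
        "COPY (SELECT id, project_id, title, state_id, created_at FROM issues) "
        "TO STDOUT WITH CSV HEADER",
        out,
    )
```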

The drawbacks of this solution are that it is

  1. brittle, as database schema changes will require query or ETL pipeline adjustments that are difficult to maintain.
  2. difficult to use, as the abstraction from technical database attributes to business objects that the API provides is missing.
  3. unsafe, as direct database access is not a supported use case and should be avoided.

Furthermore, certain objects, such as commits, branches, and tags, are not kept in the database and thus still require API or direct filesystem/Git access.

Proposal

Provide a way to export bulk data from GitLab at scale (i.e. whole tables, either non-paginated or with a massively increased page size). The scope should be defined by the API, which already documents the objects managed by GitLab. Ideally, also provide a query interface that can execute join queries across multiple objects.
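To illustrate the intent only, a client interaction could look like the sketch below. The `bulk_exports` endpoint, its parameters, and the NDJSON output format are hypothetical assumptions, not existing GitLab APIs or a committed design.

```python
# Hypothetical sketch only: one possible shape for a bulk export client.
import requests

GITLAB_URL = "https://gitlab.example.com"   # placeholder instance URL
TOKEN = "glpat-..."                         # placeholder personal access token

# Request a server-side export job for selected object types...
resp = requests.post(
    f"{GITLAB_URL}/api/v4/bulk_exports",    # hypothetical endpoint
    headers={"PRIVATE-TOKEN": TOKEN},
    json={"objects": ["issues", "pipelines", "jobs"], "format": "ndjson"},
)
export_id = resp.json()["id"]

# ...then stream the finished artifact once the job completes, instead of
# paging through thousands of API responses.
archive = requests.get(
    f"{GITLAB_URL}/api/v4/bulk_exports/{export_id}/download",  # hypothetical
    headers={"PRIVATE-TOKEN": TOKEN},
    stream=True,
)
with open("export.ndjson.gz", "wb") as f:
    for chunk in archive.iter_content(chunk_size=1 << 20):
        f.write(chunk)
```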
