
Draft: PoC CI components publishing via CLI

Fabio Pitino requested to merge poc-ci-components-publishing into master

The PoC

This is a PoC to understand what the publishing process of a new release to the CI/CD catalog would look like if we had a CLI and a separate Publish API endpoint. In this PoC we separate the creation of the release from the publishing of the release, primarily to avoid mixing responsibilities: the Publish endpoint must accept a payload with metadata to attach to the published release, and because this metadata is specific to CI/CD catalog resources we prefer a separate endpoint that we can control more precisely.

From an engineering perspective, moving metadata processing to the client side brings the advantage of not having to worry about compute power on the server. In the future we could extract more metadata from more components or generate more computationally intensive data (e.g. SBOM files).

The CLI would be doing the following:

flowchart LR
  cli([CLI - runs in CI Job]) --> s1[check that it's executed for a tag revision]
  s1 --> s2[check README] --> s3[check project description] --> s4[extract metadata for each component in repo]
  s4 --> release([Create Release for $CI_COMMIT_TAG])
  release --> publish([Publish release to CI/CD Catalog])

Screenshots of the job log on success and on failure (image.png, image.png).
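
The actual PoC is a Ruby script (scripts/catalog-publish); purely for illustration, here is a minimal Go sketch of the first checks in that flow, in the spirit of later folding it into the Go release-cli. Function names, file names, and messages are assumptions, not the real script:

package main

import (
    "errors"
    "fmt"
    "os"
)

// preflight mirrors the first checks in the flowchart above: the job must run
// for a tag and the repository must contain a README. The remaining steps
// (project description check, metadata extraction, release creation and
// publishing) are sketched in the Design section below.
func preflight() error {
    if os.Getenv("CI_COMMIT_TAG") == "" {
        return errors.New("not a tag pipeline: CI_COMMIT_TAG is empty")
    }
    if _, err := os.Stat("README.md"); err != nil {
        return errors.New("README.md is missing")
    }
    return nil
}

func main() {
    if err := preflight(); err != nil {
        fmt.Fprintln(os.Stderr, "publish pre-check failed:", err)
        os.Exit(1)
    }
    fmt.Println("pre-checks passed")
}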

How to set up and validate locally

  1. Create a component project on your GDK

  2. Ensure you have a Runner with shell executor

  3. Ensure you add a .gitlab-ci.yml with a job as follows:

    release_and_publish:
      rules:
        - if: $CI_COMMIT_TAG
      script:
        - ruby /path/to/gitlab-development-kit/gitlab/scripts/catalog-publish
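
Note: because of the rules clause, the job only runs for tag pipelines, so you need to create and push a tag (for example 0.1.0) to the component project to trigger it.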

Design

Reusing release-cli tool

This plan has not been validated with the group that maintains the release-cli (group::environments), so we don't know yet whether any design assumptions conflict with what the release-cli is designed for.

I went back and forth on whether to use a different tool or build on top of the release-cli. I think we could extend the release-cli to perform metadata extraction if the project is marked as a catalog resource. To do that, given the limitations of the CI_JOB_TOKEN, we would need to expose a predefined variable CI_CATALOG_RESOURCE_PROJECT=true that can be used by the CLI.

This means users would continue to use the release-cli as they do today. We would need to make a few changes in order to keep the CLI and the server versions compatible:

  1. Expose $CI_CATALOG_RESOURCE_PROJECT predefined variable.
  2. Make server-side changes (new Publish API, modify Release API to skip publishing)
  3. Update the release-cli to consume the new variable and extract metadata if $CI_SERVER_VERSION >= X, where X is the version containing the new Publish API. Otherwise use only the Release API as it does today, because the server will do the metadata extraction and publishing (see the sketch after this list).
    1. In this step the release-cli will use catalog_publish: false when calling the Release API, making it a pure release creation without server-side catalog publishing. The CLI then calls the Publish API separately.
  4. Issue a breaking change warning telling users to use the latest release-cli, because in GitLab X.Y we will stop processing publishing server-side and the Release API will only create releases (no publishing).
    1. The Release API will eventually ignore the catalog_publish param, treating it as always false. Old versions of the release-cli will then only create releases, and creating releases from the UI will do the same. Neither will be compatible with component project workflows, unless you consciously want to create a release without publishing it to the catalog.
    2. This will force users to use the new CLI if they want to publish to the catalog.
  5. Implement breaking change by removing server-side logic.
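
As a rough illustration of how the release-cli could gate between the old and new behaviour. The 17.0 threshold below is only a placeholder for the unspecified version X, and the variable handling is an assumption:

package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// clientSidePublish reports whether the CLI should extract metadata and call
// the new Publish API itself (steps 1-3 above), or fall back to today's
// behaviour where the server does extraction and publishing.
func clientSidePublish() bool {
    if os.Getenv("CI_CATALOG_RESOURCE_PROJECT") != "true" {
        return false // not a catalog resource project: plain release only
    }
    // 17.0 stands in for X, the first version that ships the Publish API.
    return serverAtLeast(os.Getenv("CI_SERVER_VERSION"), 17, 0)
}

// serverAtLeast does a minimal major.minor comparison of CI_SERVER_VERSION.
func serverAtLeast(version string, major, minor int) bool {
    parts := strings.SplitN(version, ".", 3)
    if len(parts) < 2 {
        return false
    }
    maj, errMaj := strconv.Atoi(parts[0])
    mnr, errMnr := strconv.Atoi(parts[1])
    if errMaj != nil || errMnr != nil {
        return false
    }
    return maj > major || (maj == major && mnr >= minor)
}

func main() {
    fmt.Println("client-side publishing:", clientSidePublish())
}
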
  • Handling of template components in the CLI means we have to duplicate the logic for extracting metadata out of the CI config header. The source of truth of this logic exists in Rails.
  • The release-cli can take input arguments or fall back to predefined CI variables. The tool is designed to take user inputs and forward them as Release API params; such inputs are not derived data and must be provided by the user. With component metadata extraction we deal mainly with derived data (the metadata extracted from files). To avoid collecting metadata that differs from the data we serve in GitLab, we must ensure that the tool runs in a very specific context (see the sketch after this list):
    • runs in a Git repository and the revision checked out is the same tag as the one being released.
    • runs against committed files without any uncommitted local changes.
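
For illustration, a small Go sketch of what such context checks could look like; the exact commands and error handling in the real tool are not decided here:

package main

import (
    "fmt"
    "os"
    "os/exec"
    "strings"
)

// ensureReleasableContext enforces the constraints above: no uncommitted
// changes, and HEAD is exactly the commit the released tag points to, so the
// metadata we extract matches what GitLab will serve for that release.
func ensureReleasableContext(tag string) error {
    status, err := exec.Command("git", "status", "--porcelain").Output()
    if err != nil {
        return err
    }
    if strings.TrimSpace(string(status)) != "" {
        return fmt.Errorf("working tree has uncommitted changes")
    }

    head, err := exec.Command("git", "rev-parse", "HEAD").Output()
    if err != nil {
        return err
    }
    tagCommit, err := exec.Command("git", "rev-parse", tag+"^{commit}").Output()
    if err != nil {
        return err
    }
    if strings.TrimSpace(string(head)) != strings.TrimSpace(string(tagCommit)) {
        return fmt.Errorf("HEAD does not match tag %s", tag)
    }
    return nil
}

func main() {
    if err := ensureReleasableContext(os.Getenv("CI_COMMIT_TAG")); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println("repository context is clean and matches the released tag")
}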

Release and Publish API

The Release API will get a new catalog_publish boolean parameter (default true to maintain the current behavior). The release-cli creates a release with catalog_publish: false so that just the release is created. If the release is successfully created we publish it together with the metadata already collected by the release-cli.

# POST api/v4/projects/:id/catalog/publish?release_tag="0.1.0"

{
  components: {
    templates: {
      component_1: { # name uniqueness ensured through the key
        inputs: {
          stage: { type: 'string', default: 'test', description: "..." }
        }
      }
    },
    steps: {
      component_1: { # technically a component with same name but different type
                    # can exist since we would always load the correct one.
        inputs: {
          concurrency: { type: 'number', default: 10, description: "..." }
        },
        outputs: {
          result: { description: "..." }
        } 
      }
    },
    images: { # this is a stretch for the sake of future thinking
      image_x: {
        label: '1.0',
        checksum: "..."
      }
    }
  },
  # other ideas below:
  categories: ['security', 'testing', 'reports'],
  readme_path: 'README.md',
  license: 'MIT',
  contact: 'info@example.com'
}
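
For illustration, a rough Go sketch of the proposed two-call sequence. The catalog_publish parameter and the Publish endpoint are proposals from this PoC rather than existing API surface, and authenticating the Publish call with JOB-TOKEN is an assumption:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "os"
)

// createReleaseThenPublish sketches the proposed two-step flow:
//  1. create the release with catalog_publish=false (Release API);
//  2. publish it to the catalog with the metadata already collected (Publish API).
func createReleaseThenPublish(metadata map[string]any) error {
    api := os.Getenv("CI_API_V4_URL")
    project := os.Getenv("CI_PROJECT_ID")
    tag := os.Getenv("CI_COMMIT_TAG")

    // Step 1: plain release, no server-side catalog publishing.
    // catalog_publish is the new parameter proposed above.
    release := map[string]any{"tag_name": tag, "catalog_publish": false}
    if err := post(fmt.Sprintf("%s/projects/%s/releases", api, project), release); err != nil {
        return fmt.Errorf("create release: %w", err)
    }

    // Step 2: publish the collected metadata to the catalog.
    publishURL := fmt.Sprintf("%s/projects/%s/catalog/publish?release_tag=%s", api, project, tag)
    if err := post(publishURL, metadata); err != nil {
        return fmt.Errorf("publish to catalog: %w", err)
    }
    return nil
}

// post sends a JSON payload authenticated with the job token.
func post(url string, payload any) error {
    body, err := json.Marshal(payload)
    if err != nil {
        return err
    }
    req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.Header.Set("JOB-TOKEN", os.Getenv("CI_JOB_TOKEN"))
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        return fmt.Errorf("POST %s: unexpected status %d", url, resp.StatusCode)
    }
    return nil
}

func main() {
    // Placeholder metadata; the real CLI would send what it extracted from the repository.
    metadata := map[string]any{"components": map[string]any{}}
    if err := createReleaseThenPublish(metadata); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}
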
  • We might also need to treat the POST :id/catalog/publish endpoint as "special": more like an internal endpoint that is not subject to the same deprecation strategies as the normal public API. We don't expect automations outside the CLI to use this API.
  • The Publish API endpoint may need a way to communicate deprecation warnings to the CLI. This could be done by setting a Warnings: header and ensuring that the CLI always prints its contents (see the sketch after this list). This way we can say things like The parameter `x` is deprecated and will be removed with GitLab 18.0. Ensure to upgrade the CLI to the latest version before then.
  • Additive changes are fine. Removals of parameters need to follow a deprecation strategy.
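
A minimal sketch of how the CLI could surface such warnings in the job log (the Warnings: header is the convention proposed above, not an existing API behaviour):

package main

import (
    "fmt"
    "net/http"
    "os"
)

// printServerWarnings prints any deprecation warnings attached to an API
// response, so they end up in the job log where the user can see them.
func printServerWarnings(resp *http.Response) {
    for _, warning := range resp.Header.Values("Warnings") {
        fmt.Fprintln(os.Stderr, "WARNING:", warning)
    }
}

func main() {
    // Synthetic response for demonstration; in the CLI this would be the
    // response returned by the Publish API call.
    resp := &http.Response{Header: http.Header{}}
    resp.Header.Add("Warnings", "The parameter `x` is deprecated and will be removed with GitLab 18.0. Ensure to upgrade the CLI to the latest version before then.")
    printServerWarnings(resp)
}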

CLI - API multi-version compatibility

Because the server and the CLI are rolled out separately, we need to ensure changes are made in a specific sequence. None of this is required today because the metadata is processed server-side.

  • To add a new parameter/metadata
    1. Add support to the server as optional data (API param, services, database migration, etc.).
    2. Add support to the CLI (collect metadata, validate it, send it as a param, etc.). See the sketch after these lists.
    3. Issue a breaking change if the data should become required. Send a deprecation warning via the API response. From then on the server always expects the data and users must upgrade to the latest version of the CLI.
  • To remove a parameter/metadata
    1. Ensure the server treats it as optional data (API param, services, database migration, etc.).
    2. Remove support from the CLI (no collection, no param sent).
    3. Issue a breaking change if we want to drop the optional data. Send deprecation warning via API response.
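
As an illustration of the additive path, new metadata can be modelled as optional fields that the CLI omits when nothing was collected, so old and new payloads stay compatible while the server treats the data as optional (the license field below is hypothetical):

package main

import (
    "encoding/json"
    "fmt"
)

// publishPayload models additive, optional metadata: `omitempty` keeps the
// license field out of the request when the CLI didn't collect it, so the
// payload stays valid for a server that treats the field as optional.
type publishPayload struct {
    Components map[string]any `json:"components"`
    License    string         `json:"license,omitempty"` // hypothetical new optional field
}

func main() {
    without, _ := json.Marshal(publishPayload{Components: map[string]any{}})
    with, _ := json.Marshal(publishPayload{Components: map[string]any{}, License: "MIT"})
    fmt.Println(string(without)) // {"components":{}}
    fmt.Println(string(with))    // {"components":{},"license":"MIT"}
}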

Observations

  • If we make the changes in the release-cli tool, users who already use it and are on the latest version should not see any changes.
  • The user can see exactly which checks we are executing, what succeeded, what failed and why. We extract the metadata and run the checks before even creating a release. After a release is created, it is published in the catalog and a clickable link to the catalog resource is displayed in the job log so that the user can see the result in the catalog.
  • We don't need to worry about the performance of the publishing process since the heavy lifting is done in a CI job. However, we still need (maybe higher) limits on the Rails side as we cannot have unlimited components in a release.
  • We need to watch out for multi-version compatibility issues between CLI and Rails if we need to change the parameters schema.
  • The CLI works very well for validating data that is defined in the repository (e.g. the README, component files, etc.). It relies on APIs for anything else. The CI_JOB_TOKEN is used in this PoC because it doesn't require the user to set up and maintain any tokens, but it can only call a limited set of endpoints. Today (in this PoC) we cannot use the API to check whether the project is marked as a catalog resource. If in the future we want to validate more data outside of the repository (e.g. other component types, other settings, etc.), we first need to solve the CI_JOB_TOKEN limitations.
  • We would still be relying on some server-side validations to ensure data consistency and to enforce policies. For example, we already know that some customers want to control which users can publish or which group is allowlisted to publish projects.
  • Not much of today's Rails-side code becomes redundant (~20 LOC).
  • Self-managed installations can pull the image of release-cli from SaaS and/or push it to their container registry.

Cost of doing this

Direct costs: estimated 10-12 person-weeks

  • Implement the changes for the tool in Go (team does not have expertise today).
  • Implement the changes for the server (edit Release API, create Publish API).
  • Convert CI config validation of spec:inputs into JSON schema to be shared with the CLI.
  • Define process for communicating changes and deprecations (e.g. release post, show warnings from the server in CLI output).
  • Retire/remove most of server-side processing.

Engineering costs:

  • Some level of duplication will be introduced between the CLI and the server-side logic (semantic version, validations, schemas, conventions).

Opportunity costs:

  • Move GA deadline
  • Delay customer adoption

Cost of doing nothing

What happens if we keep doing what we do today and postpone this decision? I want to make it clear that this is not a "now or never" situation but a "now or later" situation.

There is a belief that we will need to raise the limit well above today's 30 components per release, and that we may need to collect or generate more metadata in the future. While this might become true, there is no concrete plan as of today. We may live with the current design for a long time, or we may face its limitations soon.

Scenario 1 - keep it as is

Direct costs: N/A

Engineering costs:

  • Some level of duplication between Rails and the Step Runner once CI steps are supported. This is probably inevitable, as Rails needs to be able to validate a YAML syntax that uses CI steps and needs to reuse a published JSON schema shared with the Step Runner.
  • We have a hard limitation on the number of components we can support per release because we are parsing metadata synchronously with the Release API.

Opportunity costs:

  • Missing a breaking changes window while in Beta. Changing approach later may require a breaking change in GA.

Scenario 2 - have a soft limit of components per release

Rather than failing the publishing when more than 30 components are present, we would extract metadata for at most 30 components. Users can have an unlimited number of components in their project, but we don't index them all; they can still document them in their README. This is not a perfect solution, but it allows us to overcome the processing limitation pretty easily.

Direct/Engineering/Opportunity costs are pretty much the same as Scenario 1.

Scenario 3 - move metadata processing async

This is my least favourite option, as tech debt can easily emerge from it. Rather than doing this, we may be better off doing what's proposed in this PoC.

If we ever do this, we should change the meaning of the metadata to something we use internally (e.g. for search) and for presenting additional details. We won't be able to block a release from being published if metadata extraction fails. Publishing must occur after satisfying some lightweight checks; metadata extraction can happen asynchronously, but it's not critical.

Direct costs: estimated 5 person-weeks

  • Move metadata processing async.
  • Find creative UX workarounds to surface asynchronous failures.
  • Ensure system considers metadata as optional.

Personal remarks on decision making

The current implementation of the release process has been working well and has proven sufficient given the current requirements and limitations. We limit the number of components in a release to 30, which is generous today given that we want components to remain cohesive within a given release. We don't have any customer requirements to increase this limit or to collect more computationally intensive metadata. The current solution is fine as long as it doesn't need to change.

As described above, if the need to improve the publishing process arises, especially around metadata collection, we should implement the solution outlined in this PoC directly. We should not introduce async processing or other solutions that may create more technical debt in the long run. This needs to be a clear engineering strategy.

This is not a "now or never" decision but a "now or later" decision.

As we move to GA we should definitely take breaking changes into consideration. We recommend that users use the release-cli to publish their releases to the catalog, and we strongly discourage use of the Release UI. I believe that if at any point we need to implement this PoC, we can extend the release-cli to do the metadata processing and publishing. As users pull the latest version of the CLI, their workflow will automatically upgrade to the new client-side publishing described above. This means that only a minimal subset of projects could face the breaking change: those that pin a hard-coded version of the CLI and those in air-gapped installations. They will only need to pull a recent version of the CLI to continue publishing.

Given the analysis above, I don't consider this solution to be a blocker for GA, and implementing this PoC today would definitely impact the GA due date. There is also a level of risk, tied to unknown unknowns (customer feedback, undiscovered blockers, unanticipated delays, etc.), in doing this now and pushing GA out. I am assuming that it's acceptable to take a limited-impact breaking change if we need to change the publishing process in the future. We should test the new process extensively alongside the existing one, gather feedback, and switch when ready.
