Add support for deleting agent managed resources on environment stop
What does this MR do and why?
GitLab-managed Kubernetes resources can be used to provision Kubernetes objects as part of a deployment job. The user can select and configure these objects from a list of supported options. A typical example is a namespace to isolate the deployed application from other services running in the cluster.
When using ephemeral environments (for example, with Review apps), many resources can be created across many namespaces. When the environment is no longer required and stopped/removed in GitLab, the Kubernetes objects are left in the cluster, potentially consuming resources.
This change introduces a new environment template configuration option, `delete_resources`, which indicates that resources should be removed when the environment is stopped. Possible values are `never` and `on_stop` (a third option, `on_delete`, is planned). The default going forward is `on_stop`, which is configured in the default template.
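As a rough sketch, the option could appear in an environment template like this (the comment and surrounding structure are assumptions for illustration, not taken from the actual default template):

```yaml
# Hypothetical environment template fragment. Only the delete_resources
# key is the option introduced by this MR; valid values are never and
# on_stop (on_delete is planned).
delete_resources: on_stop
```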
Implementation
These changes build upon !186369 (merged) and !186770 (merged), where we persist information about the resources provisioned as part of a deployment job, as well as the deletion strategy specified by the environment template.
When an environment is stopped (`Environments::StopService`), we check these details to see whether the environment's associated resources should be deleted. If so, a worker is scheduled to handle the deletion asynchronously.
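The stop-time decision can be sketched roughly as follows. `schedule_deletion?` and its arguments are illustrative stand-ins, not the actual GitLab internals:

```ruby
# Hypothetical sketch of the check made when an environment stops:
# schedule the async deletion worker only when the environment's template
# asked for on-stop deletion and there are tracked resources to remove.
def schedule_deletion?(strategy, tracked_resources)
  strategy == 'on_stop' && tracked_resources.any?
end

schedule_deletion?('on_stop', ['namespace/my-env-1-2'])  # => true
schedule_deletion?('never', ['namespace/my-env-1-2'])    # => false
schedule_deletion?('on_stop', [])                        # => false
```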
The worker invokes a service that sends KAS relevant details about the environment: the slug, project it belongs to, which agent manages it, and the objects to be deleted. KAS then initiates the deletion process.
Some objects cannot be deleted immediately, because they depend on other objects that must be deleted first. For this reason, we poll KAS repeatedly until all objects are reported as deleted. When the deletion service instructs KAS to delete resources, KAS responds with the resources that have not yet finished being removed (referred to as `in_progress` objects in the response). When a non-empty list of objects is returned, the worker is re-queued after a delay. This repeats until either KAS returns an empty response (meaning there is no work left to do), or the number of attempts is exhausted (five reattempts over one hour), which results in an error status.
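The polling behavior described above can be sketched as follows. `FakeKas`, `reconcile_deletion`, and the retry constant are hypothetical stand-ins; in particular, the real worker re-enqueues itself with a delay rather than recursing in-process:

```ruby
MAX_ATTEMPTS = 5 # mirrors the "five reattempts" described above

# FakeKas stands in for the KAS client: each call to delete_resources
# returns the list of objects whose removal is still in progress.
FakeKas = Struct.new(:responses) do
  def delete_resources
    responses.shift
  end
end

# Poll until KAS reports nothing left in progress, or attempts run out.
def reconcile_deletion(kas, attempt: 1)
  in_progress = kas.delete_resources
  return :success if in_progress.empty?
  return :error if attempt >= MAX_ATTEMPTS

  reconcile_deletion(kas, attempt: attempt + 1)
end

reconcile_deletion(FakeKas.new([['my-env-1-2'], []]))  # => :success
reconcile_deletion(FakeKas.new([['my-env-1-2']] * 10)) # => :error
```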
References
How to set up and validate locally
You will need:
- GDK configured with a runner and KAS enabled, and a Premium subscription.
- A local Kubernetes cluster.
- Create a project, and register an agent in it with the following config:

  ```yaml
  ci_access:
    projects:
      - id: path/to/agent/project
        resource_management:
          enabled: true
  ```

- In the same project, commit a `.gitlab-ci.yml` with the following content:

  ```yaml
  deploy:
    environment:
      name: my-env
      kubernetes:
        agent: path/to/agent/project:<agent-name>
    script: echo deploy
  ```

- The generated pipeline should succeed, which means resources were created in the cluster. You can verify this by running `kubectl get namespaces` and observing that a namespace has been created with the format `my-env-X-Y`, where `X` and `Y` are the project and agent IDs respectively.
- Go to Operate -> Environments, and select the `my-env` environment.
- Click "Stop", and confirm the prompt.
- Re-run `kubectl get namespaces` and observe that the `my-env-` namespace from above is either absent or in the `Terminating` status (meaning removal is in progress).
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #507486 (closed)