Add support for deleting agent managed resources on environment stop
What does this MR do and why?
GitLab-managed Kubernetes resources can be used to provision Kubernetes objects as part of a deployment job. The user can select and configure these objects from a list of supported options. A typical example is a namespace to isolate the deployed application from other services running in the cluster.
When using ephemeral environments (for example, with Review apps), many resources can be created across many namespaces. When the environment is no longer required and stopped/removed in GitLab, the Kubernetes objects are left in the cluster, potentially consuming resources.
This change introduces a new environment template configuration option, `delete_resources`, which indicates that resources should be removed when the environment is stopped. Possible values are `never` and `on_stop` (a third option, `on_delete`, is planned). The default going forward is `on_stop`, which is configured in the default template.
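As a rough sketch, the option could appear in an environment template like this (the comment and surrounding structure are assumptions for illustration, not taken from the actual default template):

```yaml
# Hypothetical environment template fragment. Only the delete_resources
# key is the option introduced by this MR; valid values are never and
# on_stop (on_delete is planned).
delete_resources: on_stop
```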
Implementation
These changes build upon !186369 (merged) and !186770 (merged), where we persist information about the resources provisioned as part of a deployment job, as well as the deletion strategy specified by the environment template.
When an environment is stopped (`Environments::StopService`), we check these details to see whether the environment's associated resources should be deleted. If so, a worker is scheduled to handle the deletion asynchronously.
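The stop-time decision can be sketched roughly as follows. `schedule_deletion?` and its arguments are illustrative stand-ins, not the actual GitLab internals:

```ruby
# Hypothetical sketch of the check made when an environment stops:
# schedule the async deletion worker only when the environment's template
# asked for on-stop deletion and there are tracked resources to remove.
def schedule_deletion?(strategy, tracked_resources)
  strategy == 'on_stop' && tracked_resources.any?
end

schedule_deletion?('on_stop', ['namespace/my-env-1-2'])  # => true
schedule_deletion?('never', ['namespace/my-env-1-2'])    # => false
schedule_deletion?('on_stop', [])                        # => false
```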
The worker invokes a service that sends KAS relevant details about the environment: the slug, project it belongs to, which agent manages it, and the objects to be deleted. KAS then initiates the deletion process.
Some objects cannot be deleted immediately, because they depend on other objects that must be deleted first. For this reason, we poll KAS repeatedly until all objects are reported as deleted. When the deletion service instructs KAS to delete resources, KAS responds with the resources that have not yet finished being removed (referred to as `in_progress` objects in the response). When a non-empty list of objects is returned, the worker is re-queued after a delay. This repeats until either KAS returns an empty response (meaning there is no work left to do), or the number of attempts is exhausted (five reattempts over one hour), which results in an error status.
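The polling behavior described above can be sketched as follows. `FakeKas`, `reconcile_deletion`, and the retry constant are hypothetical stand-ins; in particular, the real worker re-enqueues itself with a delay rather than recursing in-process:

```ruby
MAX_ATTEMPTS = 5 # mirrors the "five reattempts" described above

# FakeKas stands in for the KAS client: each call to delete_resources
# returns the list of objects whose removal is still in progress.
FakeKas = Struct.new(:responses) do
  def delete_resources
    responses.shift
  end
end

# Poll until KAS reports nothing left in progress, or attempts run out.
def reconcile_deletion(kas, attempt: 1)
  in_progress = kas.delete_resources
  return :success if in_progress.empty?
  return :error if attempt >= MAX_ATTEMPTS

  reconcile_deletion(kas, attempt: attempt + 1)
end

reconcile_deletion(FakeKas.new([['my-env-1-2'], []]))  # => :success
reconcile_deletion(FakeKas.new([['my-env-1-2']] * 10)) # => :error
```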
References
How to set up and validate locally
You will need:
- GDK configured with a runner and KAS enabled, and a Premium subscription.
- A local Kubernetes cluster.
- Create a project, and register an agent in it with the following config:

  ```yaml
  ci_access:
    projects:
      - id: path/to/agent/project
        resource_management:
          enabled: true
  ```

- In the same project, commit a `.gitlab-ci.yml` with the following content:

  ```yaml
  deploy:
    environment:
      name: my-env
      kubernetes:
        agent: path/to/agent/project:<agent-name>
    script: echo deploy
  ```

- The generated pipeline should succeed, which means resources were created in the cluster. You can verify this by running `kubectl get namespaces` and observing that a namespace has been created with the format `my-env-X-Y`, where `X` and `Y` are the project and agent IDs respectively.
- Go to Operate -> Environments, and select the `my-env` environment.
- Click "Stop", and confirm the prompt.
- Re-run `kubectl get namespaces` and observe that the `my-env-` namespace from above is either absent or in the `Terminating` status (meaning removal is in progress).
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #507486 (closed)