Once https://gitlab.com/gitlab-org/gitlab-ce/issues/52494 is complete, a dedicated namespace will be created for each environment. However, when the environment is eliminated, the corresponding namespace will remain.
Further details
Proposal
When an environment is destroyed, the corresponding namespace should be destroyed as well.
In order to make this configurable by the user, we want to add a new option delete_namespace to <job>.environment.kubernetes. Using the environment.on_stop feature, deleting namespaces could be achieved by two jobs and a cluster with a * environment scope to deploy the review app:
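A minimal sketch of how that could look, assuming the proposed delete_namespace keyword existed (job names and scripts are illustrative, not an actual implementation):

```yaml
deploy_review:
  stage: deploy
  script:
    - ./deploy-review-app.sh            # illustrative deploy step
  environment:
    name: review/$CI_COMMIT_REF_NAME
    on_stop: stop_review

stop_review:
  stage: deploy
  script:
    - ./teardown-review-app.sh          # illustrative cleanup of the deployment itself
  environment:
    name: review/$CI_COMMIT_REF_NAME
    action: stop
    kubernetes:
      delete_namespace: true            # proposed option: GitLab would remove the namespace
  when: manual
```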
This makes sense to me. @tigerwnz can you see any problems here with feasibility? I imagine that we could basically hook into whatever code is deleting from the ci_environments table (or the namespace table) and then trigger a background worker to delete the corresponding namespace from the cluster.
What exactly is meant by the environment is eliminated? We don't currently destroy Kubernetes namespaces or support deleting an environment (except via the API); environments are simply stopped when no longer in use. If we had a process to destroy unused environments, then that would provide a suitable place to remove the namespace, but we shouldn't remove the namespace while the environment is still present (as it could be started again).
Assuming we have a process to destroy an environment, it should be straightforward to trigger a cleanup worker at the same time. The environment slug created by gitlab-ce#52494 would have to match the namespace, otherwise we would probably need a hard link between Clusters::KubernetesNamespace and Environment.
So I'm imagining some kind of flow where merging an MR deletes the branch, which in turn deletes the environment from GitLab (not sure if this happens or not), which in turn deletes the namespace for that environment.
The interesting thing here is that if we deleted the namespace we wouldn't actually need the CI YML code which is currently deleting the deployment to the namespace.
Ideally we could just make a minor change to Auto DevOps to delete the namespace instead of the deployment, but the problem is that the job won't have permission to delete the namespace: it only has edit privileges within the namespace, and I don't think that allows deleting the namespace itself.
If we're talking specifically about when environments are stopped (and not destroyed), then it would be simple to trigger a worker to clean up the unused resources - we could trigger it whenever an environment transitions to stopped (in much the same way that we trigger other Kubernetes-related workers).
My only worry would be the scenario where an environment is stopped, we remove the namespace, and then the user restarts the environment and expects everything to still be there. The namespace/service account would get recreated on the next deploy thanks to our JIT work, but any tokens would be regenerated.
if we deleted the namespace we wouldn't actually need the CI YML code which is currently deleting the deployment to the namespace.
Any opportunity to remove code from the Auto DevOps CI YML sounds good to me!
@tigerwnz @hfyngvason I'm wondering: are there any edge cases where we might not be able to delete a namespace that we could create?
I'm trying to go through them:
If we have a GitLab Managed Cluster, then we have cluster-admin rights for the cluster, and we should have no problems removing environments.
If the cluster is not managed by GitLab, the service account for the job might have namespace create rights without delete rights. This would cause the delete job to fail. Would the environment already be stopped by this time? Can I (adjust the rights and) re-run the job to remove the environment? What is the expected / optimal behaviour? I'd argue that as delete_namespace: true was added by the user, a job failure due to RBAC problems is theirs to correct by hand. So we don't want to allow re-runs, and we would expect the stop_review job to be otherwise completed (even if its final state is failed).
If we have a GitLab Managed Cluster, then we have cluster-admin rights for the cluster, and we should have no problems removing environments.
Correct! And the reason GitLab needs to step in is because the service account we pass down to the CI job does not have those rights.
If the cluster is not managed by GitLab, the service account for the job might have namespace create rights without delete rights
Correct! For this reason, I would propose that for non-managed clusters, we do not run a background job to clean up.
Instead, maybe we can draw inspiration from the namespace creation for non-managed clusters? It actually happens in the CI job (implemented here) as opposed to in a prerequisite background job. In this case, the job will fail if the namespace does not exist prior to deployment.
So for deletion, maybe we can set a CI variable when environment.kubernetes.delete_namespace is true, as a signal to the CI job that this behaviour is desired? When the flag is set, auto-deploy-image can attempt a deletion; if it fails, the CI job fails and can be retried.
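A rough sketch of what that could look like from the job's side (the KUBE_DELETE_NAMESPACE variable is hypothetical, and plain kubectl stands in for auto-deploy-image here):

```yaml
stop_review:
  stage: deploy
  script:
    - |
      # Hypothetical variable that GitLab would set when
      # environment.kubernetes.delete_namespace is true.
      if [ "$KUBE_DELETE_NAMESPACE" = "true" ]; then
        # Fails the job (and allows a retry) if the service account
        # lacks permission to delete the namespace.
        kubectl delete namespace "$KUBE_NAMESPACE"
      fi
  environment:
    name: review/$CI_COMMIT_REF_NAME
    action: stop
  when: manual
```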
1. What is the default value of environment.kubernetes.delete_namespace?
2. Should namespaces also be deleted if the cluster is not managed, or should we do nothing? (Currently we only create namespaces if the cluster is managed.)
3. Does a stop job get run when the environment is deleted via the UI?

Separately, why do we not have an environment.kubernetes.create_namespace CI setting?
1. I think the default would have to be false to avoid any surprises.
2. We should definitely try to provide this functionality for non-managed clusters too, though it will probably be implemented (and maybe even configured) differently (see the suggestion from @hfyngvason above). We do still create a namespace if the cluster isn't managed, though I seem to remember this was for legacy reasons.
3. I believe an environment must be stopped before it can be deleted via the UI, so we can probably assume the stop job runs in this case.

Following on from 2. above, this would be great to add at the same time, though it would be slightly odd having mismatching defaults (create defaults to true, delete defaults to false).
The user noticed that "On every MR a review app is deployed and once MR is merged almost all resources for that review-app is cleaned up in k8s except namespace."
I think the current workaround is to delete these namespaces manually. Then perhaps clearing the cluster cache to make sure they are in sync.
We're having this issue as well. I tried handling it manually in the on_stop job:
```yaml
script:
  - kubectl delete namespace $KUBE_NAMESPACE
```
However, the service account does not have permission to do this. So it would have to be initiated from the cluster-admin account provided to Gitlab itself.
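For reference, giving the job's service account that permission would require cluster-scoped RBAC along these lines (a sketch only; the role, binding, and service account names are placeholders, and the binding would have to be created with the cluster-admin credentials GitLab holds, since the namespaced edit role can't grant it):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: namespace-deleter                 # placeholder name
rules:
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["delete"]
    # resourceNames could narrow this to a single namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: review-app-namespace-deleter      # placeholder name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: namespace-deleter
subjects:
  - kind: ServiceAccount
    name: review-app-service-account      # placeholder: the environment's service account
    namespace: review-app-namespace       # placeholder: the environment's namespace
```

Granting this hands out namespace deletion rights beyond what the per-environment service account is meant to have, which is part of why doing the deletion from GitLab's side (which already holds cluster-admin) is the more natural fit.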
Any news on this?
EDIT: I wrote a small cleanup tool to cope with this for now, but I'd love it if Gitlab handled it for us. :)
They describe the following use case and issues with the proposed workarounds using kubectl delete:
- Running a CI job with cluster-admin privileges is not secure. This workaround is therefore not sustainable for us and will force us to make hard decisions when we have a security audit.
- I expect GitLab to delete the namespaces it creates. We create a lot of branches, and if each branch creates a namespace, there will be several thousand empty namespaces in our clusters. That is not a clean solution; part of an end-to-end integration is cleaning up after yourself.
A premium customer with 28 licenses is interested in this feature. Refer to the ZD ticket (for internal use only). They would like to see this feature targeted for 13.1.
We're interested too. There are now so many namespaces that some tooling times out trying to list them. There is nothing running in them, they are of no future interest.
Even if we were to delete the namespaces using kubectl, there's still the matter of having to clear the cluster cache.
@mbruemmer At the same time, I'm sure it won't fit into 13.2, so I've changed the milestone, and intend to prioritize it accordingly in the coming milestone.
With the GitLab Kubernetes Agent, we might get this out of the box, with the manifest project adding and removing the namespaces as it starts and stops review apps. I'd recommend we wait 1-2 months to see how the agent works in real-life situations; we might want to just support that workflow in the short term.
Looks like &3329 is the Epic for the GitLab Kubernetes Agent feature in question.
With #220912 (closed) being one of the last steps in the Epic's timeline, and with that issue being about introducing support for deploying from non-public manifest repositories, I guess that's the right issue to track for progress on this topic for now.
We really need to decide whether we wait for this or if we pursue a different solution. Our k8s is filling up with dead empty dynamic environment namespaces from just one project and we have decided to use that feature with more projects now, so it's going to get worse soon.
@nagyv-gitlab, you've mentioned that future builds of the K8s agent will provide this functionality.
Could you reference the related issue or share an ETA?
Thank you in advance.
I'm adding this issue to the %Backlog as we are focusing on the GitLab Kubernetes Agent, and there is a linked issue that should take care of cleanup. Still, it has to be said that we have a lot of work to do before we get there. Please feel free to provide feedback in either of these issues.
I would also note that if GitLab is deciding to focus on the Kubernetes Agent as the preferred way of deploying, then this needs to be clearly communicated to all users.
cc @nagyv-gitlab
@roman_pertl Thanks for replying! I definitely would love to communicate this as clearly as possible; at the same time, I admit that I don't know of any outstanding communication channels for this. I'd be happy to hear your suggestions on how it could be communicated more clearly!
@dadummy Thanks for your questions! As you might know, the features of the Agent are rather limited today. As a result, deprecating existing features is not planned yet. At the same time, we are focused fully on the Agent, and we don't have the bandwidth to extend support for the legacy approach. As for moving the Agent to Core, I'd like to invite you to provide feedback in this issue, and feel free to invite others there as well.
I agree that not cleaning up after ourselves is a poor product experience, and we want to fix this in the future. The best I can promise is that I'll make sure to support any community contributions in this area.
I'm seriously disappointed by GitLab prioritising the Kubernetes Agent, as we are currently building our workflow around Helm, which, for one, allows us to use hooks for more sophisticated installation and upgrade routines.
The idea of the agent sounds nice and all, but there are some inherent benefits to the 360 degrees of "freedom" you get when deploying with a CI job, where you can implement basically arbitrary logic, compared to only being able to provide pre-rendered manifests. This is just what immediately comes to mind when looking up the Agent without any experience with it, and in the end the discussion is probably misplaced in this ticket anyway.
I have just set up dynamic environments on K8s with a stop/cleanup job and found out it keeps breaking after the namespace is torn down.
In my stop job I perform some database cleanup and then at the end remove all related resources by executing
$ kubectl delete ns $namespace
which works fine for the teardown.
But then, when I want to re-deploy to the same namespace/branch, I get errors because GitLab thinks that namespace still exists. The solution is to manually go and clear the k8s cluster cache. I wasn't able to find a GitLab API for this cache clearing, which would have solved all this.
--
Kubernetes agent is not an option for us as we are not a premium customer.
What is the recommended way of cleaning up GitLab-created namespaces?
Agreed... if GitLab builds a k8s integration and claims it's "managed", there should be a mechanism for automating namespace syncing. This is a common use case, so I wouldn't consider the current behaviour "managed". I love the rest of the integration around "kube_namespace", but this one issue sours the pudding, you know...
I thought I was doing something wrong, but it turns out keeping around stale namespaces is a feature? If Gitlab can create a review-X namespace and handle all the internal routing to get it accessible, why is tearing it down after merge seemingly closed off to regular, non-premium users?
10+ seat premium user facing this issue as well. Has anyone found a viable workaround?
We are considering CronJobs to periodically remove "problem" namespaces, but this is a kludge of the highest order and still requires clearing the Kubernetes cluster cache in GitLab. What can we do to prioritize this?
Our solution for now was to write a script that leans heavily on the GitLab API: it reads the merged or closed MRs from a specific repo or set of repos, finds out whether there are review apps associated with them, destroys the review app (if that hasn't happened already), and deletes the namespace. We run this as a scheduled pipeline every 24 hours in the cluster management project for the cluster that the review apps run in.
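A trimmed-down sketch of that kind of scheduled cleanup job, reduced to the namespace-deletion part (the API_TOKEN variable, the image, and the assumption that the environment slug maps directly onto the namespace name are all specific to a given setup):

```yaml
clean_review_namespaces:
  image: my-registry/ci-tools:latest        # placeholder: needs curl, jq and kubectl
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - |
      # Delete the namespaces of review environments that are already stopped.
      # The slug-to-namespace mapping below is an assumption about this setup.
      curl --silent --header "PRIVATE-TOKEN: $API_TOKEN" \
        "$CI_API_V4_URL/projects/$CI_PROJECT_ID/environments?per_page=100" \
        | jq -r '.[] | select(.state == "stopped") | select(.name | startswith("review/")) | .slug' \
        | while read -r slug; do
            kubectl delete namespace "$slug" --ignore-not-found
          done
```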
- Within the .gitlab-ci.yml, terraform import the Kubernetes namespace into the Terraform state.
- When the environment is torn down, call terraform destroy.
That gitlab-terraform.sh script in registry.gitlab.com/gitlab-org/terraform-images/stable makes it easier to use terraform commands in a .gitlab-ci.yml context. My PR adds the missing import command.
A minimal Terraform file looks something like this:
```hcl
# file: tf/review/main.tf
variable "kube_namespace" {
  description = "Kubernetes Namespace for Review"
  type        = string
}

resource "kubernetes_namespace" "review" {
  metadata {
    name = var.kube_namespace
  }
}
```
Then something like this in the .gitlab-ci.yml... there are a few warts that are mentioned in the comments.
```yaml
stages:
  - plan
  - review

review:plan:
  stage: plan
  image: registry.gitlab.com/gitlab-org/terraform-images/stable:latest
  script:
    # I keep my patched gitlab-terraform.sh in my project
    - cd ${TF_ROOT}
    - cp ${CI_PROJECT_DIR}/etc/gitlab-terraform.sh /usr/bin/gitlab-terraform
    - gitlab-terraform init
    - gitlab-terraform validate
    - gitlab-terraform import kubernetes_namespace.review $TF_VAR_kube_namespace || true
    - gitlab-terraform plan
    - gitlab-terraform plan-json
  artifacts:
    name: plan
    paths:
      - ${TF_ROOT}/plan.cache
    reports:
      terraform: ${TF_ROOT}/plan.json
  environment:
    name: review/$CI_COMMIT_REF_NAME
    on_stop: review:stop_review
  cache:
    key: review-$CI_COMMIT_REF_NAME
    paths:
      - ${TF_ROOT}
  only:
    refs:
      - branches
  except:
    - main
  before_script:
    # https://gitlab.com/gitlab-org/gitlab/-/issues/295627
    # https://gitlab.com/gitlab-org/gitlab-runner/-/issues/27616
    - export KUBE_CONFIG_PATH="$KUBECONFIG"
    - export TF_VAR_gitlab_token_file="$GITLAB_TOKEN_FILE"
  variables:
    TF_ROOT: ${CI_PROJECT_DIR}/tf/review
    TF_ADDRESS: ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/terraform/state/review-${CI_COMMIT_REF_NAME}

review:review:
  stage: review
  image: registry.gitlab.com/gitlab-org/terraform-images/stable:latest
  script:
    - cd ${TF_ROOT}
    - gitlab-terraform apply
  needs:
    - "review:plan"
  cache:
    key: review-$CI_COMMIT_REF_NAME
    paths:
      - ${TF_ROOT}/.terraform
  environment:
    name: review/$CI_COMMIT_REF_NAME
    on_stop: review:stop_review
  only:
    refs:
      - branches
  except:
    - main
  before_script:
    # https://gitlab.com/gitlab-org/gitlab/-/issues/295627
    - export KUBE_CONFIG_PATH="$KUBECONFIG"
  variables:
    TF_ROOT: ${CI_PROJECT_DIR}/tf/review
    TF_ADDRESS: ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/terraform/state/review-${CI_COMMIT_REF_NAME}

review:stop_review:
  stage: review
  image: registry.gitlab.com/gitlab-org/terraform-images/stable:latest
  script:
    # Another wart... you have to externally clone to get the Terraform directory
    # If you don't really have one, you could instead just inject the main.tf file via echo
    - git init
    - git remote add origin https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<org>/<repo>.git
    - git fetch
    - git checkout -b main -f
    - cd ${TF_ROOT}
    - gitlab-terraform destroy
  environment:
    name: review/$CI_COMMIT_REF_NAME
    action: stop
  when: manual
  only:
    refs:
      - branches
  except:
    - main
  before_script:
    # https://gitlab.com/gitlab-org/gitlab/-/issues/295627
    - export KUBE_CONFIG_PATH="$KUBECONFIG"
    - export TF_VAR_gitlab_token_file="$GITLAB_TOKEN_FILE"
  variables:
    GIT_STRATEGY: none
    TF_ROOT: ${CI_PROJECT_DIR}/tf/review
    TF_ADDRESS: ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/terraform/state/review-${CI_COMMIT_REF_NAME}
    TF_VAR_kube_namespace: $KUBE_NAMESPACE
```
Another premium customer has requested a built-in ability to clean up the per-environment namespaces when using GitLab's Review Apps / Auto DevOps feature: https://gitlab.zendesk.com/agent/tickets/241498 (internal link)
Is there still no movement on this? I'm gonna be championing a migration to GitLab in my organisation (likely to be 20-30 premium seats), and this is one massive deficiency.
Very disappointing that this was targeted for release over 2 years ago and has just been constantly delayed, considering it's a real basic piece of functionality that's implied for a "managed cluster".
@rsheasby I understand your disappointment, but can't offer any relief. We are currently focused on building out the core features with the GitLab Kubernetes Agent, and we don't have the bandwidth to deliver this feature. The best I can offer is that we are ~"Accepting merge requests" and if there is a community contribution, we prioritize it above our own direction items.
I'm also a customer and could provide proof to support agents if it will increase the chances of getting this issue resolved. Currently we are maintaining scripts to remove unused namespaces. It would be nice to have this out of the box.