Users are building many Docker images as part of their pipelines. Many of these images are only needed for a short time. There isn't a good way for developers to delete these images, so they frequently don't. This results in either ballooning storage costs or an administrator trying to remove images manually via the Container Registry API, which is error-prone and inefficient.
User stories
As a systems administrator, I need to ensure that my development teams are expiring unused images, so that I can run garbage collection, delete them from storage, and lower the cost of storage.
As a developer, I need the ability to adjust my project's and Docker repositories' expiration policies, so that I can ensure that my important images (those required for a given release) are not accidentally removed.
Our self-managed customers can set (and forget) a reasonable expiration policy that is enforced by default for every new project in their organization.
GitLab.com enforces expiration policies on all new projects, making more images eligible for garbage collection so that, once online garbage collection is available, we can greatly reduce the cost of storage for the GitLab.com Container Registry.
Proposal
MVC
Introduce a new user interface that will allow project owners to view and update tag expiration policies for all new projects.
Project owners may enable/disable the feature at the project level in Project --> Settings --> CI/CD --> Tag Expiration Policies
Allow project owners to update the default policy at the project level for all Docker repositories in the project:
Expiration interval
Expiration schedule
Retain at least n tags
Expire tags matching the name (regex allowed)
User Interface (These designs are still in motion and are subject to change)
CI/CD Settings --> Docker Tag Expiration Policies
Docker Repository view (not included in this issue)
What we are not doing in the MVC
It's important to note that this issue does not include tag retention policies. This is because we are building this feature using the existing GitLab bulk delete API, which does not specify which tags to keep. We will add retention policies, but they won't be included in the MVC.
The MVC will not include policies at the Docker repository level, that will be done with: #37242
But it became more important with CI and registry integration. I've set up builds that push an image on every run so QA can easily test the result. Our RPM artifacts weigh around 200 MB, and the image layers are much bigger (without external squashing, for example, intermediate layers are stored too), so this will quickly exhaust the available space.
I think this is a really important feature. In workflows where the Container Registry stores images created in CI jobs, an image is often created on each commit, which means it is not unusual for such projects to consume around 2-4 gigabytes of storage each day.
Without a retention policy it is quite difficult to remove old images (there is no bulk delete feature). It is also quite common that only the latest image is relevant.
Maybe it would be nice to have a configurable retention policy, something like:
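A minimal sketch of what such a configurable retention policy could look like — every key name below is invented purely for illustration, not an existing GitLab setting:

```ruby
# Hypothetical per-project retention policy -- all keys are illustrative.
retention_policy = {
  enabled: true,
  keep_n: 10,         # always retain the 10 newest tags per repository
  older_than: '30d',  # tags older than 30 days become eligible for deletion
  name_regex: '.*'    # only tags matching this pattern may be expired
}
```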
This does seem very important. I wonder if a workaround (or even long-term solution) is to use S3 for storing images and then set an expiry policy on the S3 bucket.
I worry a little bit about the potential complexity of this. Removing intermediate images seems easy, but if it's later tagged as a production release, for example, you don't want to delete it anymore. Is that easy to handle? What about checking the last time an image was pulled, not just when it was pushed? If an otherwise deletable image is in active use somewhere, should it still be flushed?
@grzesiek's proposal does seem like the right place to start.
It'd probably require some kind of functionality where repositories can maintain regex whitelists/blacklists and associate a timeframe with them as well. So I could configure things like:
tags that match this regex: keep always
tags that match this regex: delete after 1 month
tags that match this regex: delete when the branch with the same name is deleted/merged
I propose configurable registry garbage collection where at least the administrator of the GitLab instance can configure the retention policy for all images in general. In this configuration it must be possible to exclude images with a specific tag (e.g. latest, production-*) from garbage collection.
I would like to see this within the registry section of gitlab.rb.
The period for an image should be counted not from the push date but from the date the image was last pushed/pulled. The period value should follow a time format where a suffix defines the unit (I don't know if this has a specific name), like 30d = 30 days or 5m = 5 months.
The ignore_tags option needs to accept regular expressions so that tag patterns can match.
I think this can be integrated into the garbage collection process easily (compared to project-level retention) and will quickly give us an advantage with fast-growing registries. I also propose to put this into the Community Edition, since a growing registry is especially a problem for non-enterprise projects with a limited amount of money/storage.
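Sketched as a gitlab.rb fragment, the proposal might read as follows — neither key exists today; both are assumptions made only to illustrate the idea:

```ruby
# Hypothetical gitlab.rb fragment for instance-wide registry retention.
# `retention_period` and `ignore_tags` are invented keys, not real settings.
registry['retention_period'] = '30d'                  # counted from last push/pull
registry['ignore_tags'] = ['latest', 'production-.*'] # regexes exempt from expiry
```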
The solution provided by @moenka would be sufficient for us too, instead of our proposed solution gitlab-ce#51456, but it should also be configurable on GitLab.com.
Currently we are tagging container images with version numbers in the registry, as shown below:
| Tag   | Tag ID    | Size      | Created      |
|-------|-----------|-----------|--------------|
| 1.3.0 | 9a368659e | 81.22 MiB | 2 months ago |
| 1.2.0 | 9abf5d6c4 | 77.71 MiB | 6 months ago |
| 1.1.0 | d9097950f | 77.57 MiB | 6 months ago |
We could add a latest and previous tag to these images. So for example add the latest tag to the same image that is tagged as 1.3.0. And add the previous tag to the same image tagged as 1.2.0.
If I configure the removal of images in the registry as suggested by @moenka:
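For illustration, assuming the invented retention_period/ignore_tags keys from @moenka's proposal, such a configuration might read:

```ruby
# Hypothetical configuration for this scenario (key names are invented):
registry['retention_period'] = '3m'              # images untouched for ~3 months expire
registry['ignore_tags'] = ['latest', 'previous'] # tags that exempt an image from expiry
```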
With this configuration I would expect that only the image with the tag 1.1.0 will be deleted and that the images AND tags with 1.2.0, 1.3.0, previous and latest would NOT be deleted.
So the 1.2.0 and 1.3.0 tags should also remain in the registry, because the related image is also tagged with one of the ignore_tags, being latest or previous.
I like the concept of a regex to apply exceptions to the default container expiration. It could work without any configuration, with some sane defaults like preserving latest.
Regex proposal:
Set default container expiration at the instance level
Set default regex overrides at instance level, e.g. latest should be kept
Allow groups and projects to override both of these settings
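As a rough sketch of how such regex overrides could interact with expiration — the method and parameter names here are assumptions, not an existing API:

```ruby
# Sketch: select the tags eligible for expiration, keeping any tag that
# matches one of the instance/group/project "keep" regex overrides.
def expirable_tags(tags, keep_regexes: ['\Alatest\z'])
  patterns = keep_regexes.map { |source| Regexp.new(source) }
  tags.reject { |tag| patterns.any? { |pattern| pattern.match?(tag) } }
end
```

With the sane default above, `expirable_tags(%w[latest v1.0 deadbeef])` would return every tag except `latest`.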
Another option is to allow users to define a specific expiration when pushing a container. This could be done in conjunction with a first class container building definition: https://gitlab.com/gitlab-org/gitlab-ce/issues/48913
I don't think that this should be part of configuration, but rather a set of defaults assigned to a given container image repository. Having that as part of .gitlab-ci.yml makes it really hard to understand what should be executed and when, as you can have multiple versions of the configuration (across different branches) and we don't know which one is the correct one.
I would rather see something separate, conceptually different from .gitlab-ci.yml, as that is imho not really the place to describe this. It should rather be a policy that is applied administratively, by a maintainer. Maybe we could simply allow defining a policy on a container image using the parameters of this API: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/24303?
@ayufan I also think putting this into .gitlab-ci.yml may be cumbersome. It's fundamentally different than artifacts in a few ways. There may be exceptions though, where one project may need to deviate from registry policy.
When thinking about labels, I think one underlying reason is to differentiate between ephemeral images used in CI/CD, and released images. i.e. the existence of a tag isn't sufficient as you should be tagging images with git SHAs, for example. Regex seems like the best suggestion for differentiating there, but might not solve everything.
the existence of a tag isn't sufficient as you should be tagging images with git SHAs
Yes, folks are probably not using image@SHA, so for dev/test branches I expect commit SHAs to be the most common.
For folks who are doing CD though, they may also be using commit SHAs rather than tagging a build number. You may not want to expire these in the same way you would a dev/test branch. For example you may have a blackout period over the holidays, tax season, etc., where an image could run for quite some time.
One way to solve this however, would be to apply an additional tag to images that are deployed. For example your regex could match commit SHA formats, but exclude any image which also has the deployed tag.
@ayufan can you confirm how the regex could behave on images with multiple tags?
One other idea, is to define a convention based around a specific tag like expire-xxx.
For example if we see a tag that fits that pattern, say expire-14d, we could automatically do so without requiring any regex's.
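A sketch of how such a convention could be parsed — the tag format and the helper are assumptions, not an implemented feature:

```ruby
# Convention sketch: a tag like `expire-14d` carries its own time-to-live.
EXPIRE_TAG_PATTERN = /\Aexpire-(\d+)d\z/

# Returns the number of days encoded in the tag, or nil for ordinary tags.
def expiry_days(tag)
  match = EXPIRE_TAG_PATTERN.match(tag)
  match && match[1].to_i
end
```

`expiry_days('expire-14d')` yields 14, while `expiry_days('latest')` yields nil, so ordinary tags are untouched.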
When thinking about labels, I think one underlying reason is to differentiate between ephemeral images used in CI/CD, and released images. i.e. the existence of a tag isn't sufficient as you should be tagging images with git SHAs, for example. Regex seems like the best suggestion for differentiating there, but might not solve everything.
We have separate repositories within project. People can prefer to use /pipelines vs /releases. They, as repository can have different retention policies. I don't like mangling tag names, it is not place for this.
That's an interesting idea, and also supports a convention model rather than requiring regex config. It would also help separate the noise when browsing the registry, once we have a better UX.
One concern with this though would be that if you had existing images, you would have to update the location of everywhere it is consumed.
I don't like mangling tag names, it is not place for this.
What is your concern with tag names? I realize it may not be very clean, but in theory you would only have to apply it to your ephemeral images used for testing.
Anything released would not need the extra tag, since you wouldn't want to expire.
@marin @twk3 @WarheadsSE do you have some thoughts/ideas here, since your projects are some of the biggest generators of images internally and are also consumed externally from the GitLab registry?
In thinking about the expiration tag idea a little more, there is the significant downside that tags are not exactly like labels, in that many systems treat them pretty first class. For example Quay, Docker Hub, and most importantly ourselves show a list of tags in their interfaces rather than images, which means you would see a lot of noise.
Note that a cleaner approach, in my opinion, is the approach GCR has which lists the images and their associated tags on the same line. This would reduce noise in the UI, make it more clear which images share a tag, but also reduce the noise the expiration tags would generate otherwise. (If we opted to go that route.) It's a much nicer method regardless of which expiration method we choose, I think.
Note that our approach of listing by tags is also very confusing, in that if you click the delete action on a tag you are actually deleting the whole image and all other tags: https://gitlab.com/gitlab-org/gitlab-ce/issues/21405. Listing by image id would align the UX with what actually happens, and what is available in the API.
We would like to see this feature in our EE instance as well! Garbage Collection is fine, but only addresses part of the problem. Project Level Retention Policies would be a great way to give our users control over their custom workflows and needs!
This would be really good to get, as it will allow us to give developers some easy control over which images we keep in the registry - at the moment they just have to live with whatever I choose when I remove tags to allow the garbage collector to free up some diskspace.
One thing I thought of when reading this issue: just keeping everything tagged with something matching a regular expression (e.g. production-.*) comes with the risk that developers will just tag everything with a new tag matching that regular expression, e.g. production-20190520, production-20190521, production-20190522, ...
As a GitLab EE admin, I need to restrict how many images users can keep per repo, depending on a retention period for images.
Please help before my storage is full because of stale images.
For now, are you able to clean up the storage manually? You can untag items and run garbage collection, although I recognize it doesn't solve the systemic problem.
How do I run garbage collection? Do I need to go to every user project and start untagging their images? That seems impractical, as there are hundreds of them. Even if I could automate this using the API, I won't do that, because I want users to do it themselves so that they can decide which images to keep and which to discard.
There are several ways to run garbage collection, including on a schedule. It's not policies and automation, but we are working towards that feature set now.
Even using the API you can only untag repositories for one project at a time (as far as I have found out; I would like to be proven wrong). I wrote a script to iterate over projects (we have over 3000, so it's not fast) and untag all but the 20 newest. It's crude, and probably not what our developers would want, but it allows me to keep the disk usage limited, and our developers wouldn't like it if it ran out (it's not a separate partition, so GitLab would probably fail).
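A script along those lines can be sketched as follows. The host and IDs are placeholders, the endpoint mirrors the project-level bulk delete API discussed elsewhere in this thread, and actually sending the request (plus iterating all projects) is left out:

```ruby
require 'uri'

# Sketch of the per-project cleanup described above: assemble the bulk-delete
# request that untags all but the `keep_n` newest tags for one repository.
def bulk_delete_url(gitlab_url, project_id, repository_id, keep_n: 20)
  params = URI.encode_www_form('name_regex' => '.*', 'keep_n' => keep_n)
  "#{gitlab_url}/api/v4/projects/#{project_id}" \
    "/registry/repositories/#{repository_id}/tags?#{params}"
end
```

Issuing this as a DELETE request for every repository of every project, then running garbage collection, approximates the "keep the 20 newest" policy described.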
We have to move this issue back a bit further. We are currently working through how to bring the container registry from viable to complete/lovable and the ability to do in-line garbage collection and set retention and expiration policies are critical features. We just have some more work to do before we can implement this feature.
Please don't wait for all of gitlab-ce#62885's features before starting work on this issue. In my CE install, it's resulted in a ~100% hosting cost increase over the last year and it's only getting worse. It's also made me consider switching to another repo manager... that disk space needs to be cleaned up.
@alexanderlabrie Have you been able to run garbage collection for your instance? It's not automated, but you should be able to clean up disk space at least.
In-line garbage collection and retention/expiration policies are definitely the most important issues for us and what's driving the discussion behind gitlab-ce#62885
I've been using the new gitlab-ctl registry-garbage-collect -m command as per gitlab-ce#25322, if this is what you're referring to, but I still have ~300 GB of blobs.
That is what I am referring to. :/ And have you been able to use the API to untag unused images, so that they can then be deleted with the gitlab-ctl registry-garbage-collect -m command?
Was unaware of this feature. Just tried curl --request DELETE --data 'name_regex=.*' --data 'older_than=1month' --header "PRIVATE-TOKEN: (token)" "https://(url)/api/v4/projects/(id)/registry/repositories/(id)/tags" and got a 404. Access token has all permissions and I've been able to run curl --header "PRIVATE-TOKEN: (token)" "https://(url)/api/v4/projects/(id)" so I think the problem is with /registry and what comes after.
It works now: turns out I was using the wrong repository ID. For anyone who might run into this issue: I used curl --header "PRIVATE-TOKEN: (token)" "https://(url)/api/v4/projects/(id)/registry/repositories" to get the repository ID of each project, then ran curl --request DELETE --data 'name_regex=.*' --data 'older_than=1month' --header "PRIVATE-TOKEN: (token)" "https://(url)/api/v4/projects/(id)/registry/repositories/(repository id)/tags" for each project, then ran gitlab-ctl registry-garbage-collect -m. Cleared 100 GB.
Also, a flimsy thing I've discovered: gitlab-ctl registry-garbage-collect -m should be run both before and after the DELETE commands.
Thanks for the help @trizzi. However, I reverted my GitLab installation to a backup, tried again, and was only able to delete ~20gb of blobs. So I reverted to the backup again and now I'm unable to delete any blobs at all. Coupled with the requirement to run gitlab-ctl registry-garbage-collect -m before running a DELETE command, which I found out when deleting the 20gb, this all feels very flimsy. I'm back to square one.
Sorry about that. So, now with the backup you are back up to 300 GB (or 280)?
One warning I should have given you is that the API can only be run once per hour, so you might be getting throttled right now. If it's helpful, we can have an engineer hop on a call and help troubleshoot the issue.
Either way, I know this is frustrating. Addressing these problems is our team's top priority. Thanks for trying to work through it with me.
A call would be great. Looked at the API rate limit settings and it doesn't seem like it'd be the issue (7200 max requests per user per period of 3600 seconds). I've been repeating the same steps yet I get different results.
@alexanderlabrie I apologize, but I don't think I ever received an email from you. We can still set up a call and work through the issue with tag removal and garbage collection if you are interested. (I just connected on LinkedIn and we can coordinate through there if that works for you)
Thanks a lot @trizzi, @sahbabou and @sabrams for taking the time to help today. Been dealing with this for more than a year... wasted a bunch of money... and now it's finally over. @sahbabou you are a rockstar.
@alexanderlabrie Thanks for joining the call yesterday, and thanks for your kind words. Your compliment has Neil Young's Rocking In The Free World stuck in my head, and since I don't know the lyrics, I'm not sure of what I'm actually hearing, but definitely a win
Thanks for working on what GitLab should be working on. I understand why they're not fixing it. What a disgrace to open source.
I digress: tried installing pyenv on Debian unsuccessfully. Plus I don't have docker on my GitLab server. None of this would be needed if GitLab took less than 3 years to address this issue.
I'm using it in our CI and it reduced our GitLab registry from 100 GB to 6 GB
The idea behind this script is to launch it as a job inside each project CI because it needs the project name/id.
On the README there is an example with a scheduled job.
P.S.: You don't necessarily need pyenv, you could run it without virtualenvs, just look at the Dockerfile, you need python=^3.4 installed and tell poetry to not use virtualenvs
@icamacho @nkipling @NicoOchoa @dcroft This is a meta issue that can definitely be broken down into a few smaller issues, but I wanted to get all of my thoughts down in one issue. Can you take a look and let me know what you think?
I'm not sure of the feasibility of incorporating the policy into gitlab.rb. I would be open to project level policies with some default settings as an MVC if that makes sense. And we can add and enforce instance level policies via gitlab.rb later. I'm also wondering about the feasibility of giving the option of removing images on merge.
I'm hoping that once we align on the above, we can decide on and implement an MVC for 12.5.
Please add instance level policies, either via gitlab.rb or via API, as soon as possible, and preferably before you add project level policies.
We have a lot of projects, and (most of) our developers don't want to bother with this (and I don't want to force them), i.e. most of our projects won't set anything, but I still need a way to keep this under control. There's a limit to how much storage we can reasonably throw at GitLab to store these images, and cloud storage is not an option for security reasons.
There is a risk that the Docker API cannot be extended to support group and instance level policies due to performance concerns. If this happens, it means that, similar to the garbage collection issue, we will have to fix this within our own fork of Docker. This is our long-term plan anyway, but it may take a bit longer. We are hiring Go engineers and will get this work scheduled as fast as possible, and I'm hopeful we can make some improvements and deliver value even sooner.
@trizzi reading through the issue it looks like we're solving an important problem here. A UX mockup would really help for understanding the retention policy configuration, at the moment reading the description that's one thing that's unclear to me. The other big thing, which you list in open questions, is what reasonable defaults means. I'd recommend starting a thread in this issue with that being the topic of conversation - maybe start with what you think is a reasonable proposal, and then everyone can provide feedback and refine it into something that works well.
Another question - it looks like this is a project-level setting, so we should consider an API for people who have lots of projects to be able to bulk-set policies. It can be in a follow-up issue, but we should do it. You should probably also have follow-up issues created for group and instance level so you can track the demand for those capabilities. I'd recommend including your follow-up plan (and issue links) in the issue description.
One minor thing that stood out is the background seems to be a restatement of the problem to solve, maybe they could be combined.
Feedback Wanted: Topic: Default Settings for Tag Retention and Expiration Policies
As our MVC of the feature we will auto-enable the feature on all NEW projects. It will be at the Docker repository level and allow for configuration at the project level. Users will be able to set policies based on:
Remove tags whose names match the regex COMMIT_SHA
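For illustration, one plausible shape for the COMMIT_SHA regex above — the exact pattern is an assumption — would match the 8- or 40-character hex tag names CI pipelines typically push:

```ruby
# Illustrative "looks like a commit SHA" check: matches 8-hex short SHAs
# and 40-hex full SHAs, the tag names CI pipelines commonly produce.
COMMIT_SHA_TAG = /\A\h{8}(\h{32})?\z/

def sha_like_tag?(tag)
  COMMIT_SHA_TAG.match?(tag)
end
```

Release-style tags such as `v1.3.0` or `latest` fall outside this pattern and would be retained.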
Estimated backend weight 5. This will be large enough that it will likely be worked on over 5-6 MRs. My initial thought is that after the first few, which will lay out the associations and any database changes, the rest could be worked on in parallel by multiple engineers. @nmezzopera, my initial take is that some backend design will need to be decided on before the frontend can commence (deciding the data structure of the retention_policy model); does that line up with your initial thoughts?
Backend unknowns:
How do we schedule recurring jobs?
Understanding if there are any other ~performance concerns not yet discovered in utilizing the existing GitLab bulk delete API for this functionality
@sabrams I agree with you that frontend work is best started after the data model and possibly a contract for the API are established. I have a suggestion to best parallelise the work: if the first MRs add a feature flag (on the controller, exposed to the frontend), we can start developing the buttons/UI elements to show the retention policies in the container registry, and maybe make a couple of proofs of concept so that we can iterate on the UX.
The benefit is not enormous but is something that could allow a better degree of parallelisation.
Estimated frontend weight of 3 with the assumption that we are going to have a private or public API to interact with
@icamacho @trizzi I guess answering this would help me and @sabrams decide on how to split FE/BE work, or even better whether this can be an iterative step to take after a first MVC release
I just checked the designs and the answer is yes! The question that remains is whether it brings value to do 'set it' and 'check that it is set' in two different issues/MRs/releases?
@trizzi Thanks for the answer! Sorry, I did not explain myself properly. What I meant is:
Would it be okay to open a follow-up issue where we move everything regarding the 'display of the retention policies' (aka design number two)?
My reasoning behind this is that saving and 'enforcing' the retention policies and displaying them are two different sets of work:
we save/set the retention policies in the settings page
we will use the project api to set/get the retention policies
we see the retention policies in the container registry pages
we will use the container registry API / controller to return if a container registry has a policy applied
Can we proceed in that way? Does it make sense (we could also leverage the epic nesting system for this)? I am happy to create the new issue and populate it with context
@trizzi is it likely that down the road we will want to explore expiration/retention policies for packages too? My gut reaction is that packages are very different and generally not as disposable as container tags/images, but I thought it was worth asking as it might be something to keep in mind as the BE data structure is designed.
@sabrams good question. As you mentioned, container tags/images are more disposable than packages and are treated differently. I think for packages, we will be more focused on setting and enforcing project/group level limits vs. expiration/retention policies.
@trizzi agreed! I also feel like offering a "remove packages built from this branch" option for packages built from non-master branches would be a primary tool for managing registry size.
@trizzi should we update the designs in the issue (and @nmezzopera subsequent issue)? I noticed that the description has our old version while the design tab has newer/more refined versions.
I am also happy to make that change after we've reviewed a design if you'd like.
@nmezzopera No problem, I just created this epic to better capture all of the related and subsequent issues. Feel free to add any issues to it for better tracking.
@axil Since this issue is part of this epic &2270 (closed) I was wondering if the documentation work concerning this issue is big enough to warrant a standalone issue? WDYT?
In my mind since this is a whole new section of the docs it would make sense!
@axil Here it is! #38078 (closed) I did not assign it to you directly, there is going to be probably a bit of frontend work too ( so I labeled it that way )
Should the feature be titled Policy instead of Policies? As far as I can see it, there's one policy per project, and under its umbrella, you can set various things.
If you check the other items under the Settings, some of them are sentence case some of them are not. We should be consistent. What about titling this "Container Registry tag expiration policy"? Same goes for all other titles inside this block, like "Tag retention policy", "Tag expiration policy", etc.
Should the feature be titled Policy instead of Policies? As far as I can see it, there's one policy per project, and under its umbrella, you can set various things.
For the MVC, it should probably be "Policy". In future iterations, each container repository will have its own unique policy based on the project/group level default. When we get there, should we switch it to "Policies"?
If you check the other items under the Settings, some of them are sentence case some of them are not. We should be consistent. What about titling this "Container Registry tag expiration policy"? Same goes for all other titles inside this block, like "Tag retention policy", "Tag expiration policy", etc.
@nmezzopera the helpers you were asking about are cadence_options, keep_n_options, and older_than_options, all defined in app/helpers/container_expiration_policies_helper.rb
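For context, a simplified sketch of the shape such option helpers might have — the real definitions live in app/helpers/container_expiration_policies_helper.rb, and the values below are illustrative, not the actual code:

```ruby
# Illustrative shape of the options helpers mentioned above (not the real code).
def cadence_options
  [
    { key: '1d',     label: 'Every day' },
    { key: '7d',     label: 'Every week' },
    { key: '1month', label: 'Every month' }
  ]
end

def keep_n_options
  [1, 5, 10, 25, 50, 100].map { |n| { key: n, label: "#{n} tags per image name" } }
end
```

Helpers like these feed the dropdowns in the expiration policy settings form.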
Tim Rizzi changed the title from "MVC: Container Registry tag expiration policies" to "Remove Docker images from the GitLab Container Registry using a project level policy"
@dcroft @trizzi @sabrams I've updated the workflow label to workflow::staging, but now I am not sure if this issue is broader than just the settings page + the backend; feel free to place it back if needed
@nmezzopera, how much work would it be to NOT display the form if the project does not have a policy? One of the items that we overlooked is that these policies should only be allowed on new projects, not existing ones, because an existing project may have many thousands of tags and the bulk delete would run very slowly if the policy is set up to delete large amounts of tags.
@sabrams What do you think about opening an issue where we can discuss all the implications? We are in no rush due to the feature flag anyhow right? Also what about the API?
@nmezzopera Yes, I need to make an update to the API too. This came from this discussion: #37242 (comment 273249662), many implications have been getting discussed here so I don't think we need an additional issue (especially because this was a condition originally defined in this issue). With the feature flag on, we are somewhat safe, but I'd like to see that get removed on .com soon if the rest of the feature is ready to go.
I can see why it's attractive to only do this for new projects, but many of us have been hoping this would become a better way to clean up the repositories of old projects, which can take a lot of disk space for no gain - and I believe most of us would expect that to be quite slow. So if you decide to go forward with restricting this to new projects, please remember that extending it would be welcomed.
@sabrams I totally missed that it was defined here in this issue! I will open an MR to 'block' the settings module from being used if the project does not have a container_expiration_policy_attributes dictionary; does this sound good to you?
Thanks for the note @hcgrove. The intent is absolutely to remove that restriction as soon as the tag deletion job can be optimized, as it is a limitation of the Docker Distribution API that causes a large backlog (our testing shows that the job can take days or more for even medium-sized registries). This initial release will be the minimal first step towards that goal, so we know things are working properly before opening it up to the wider registry.
A bug was identified in production testing that needs to be fixed before full release.
An MR was created to discuss and fix the problem: !24424 (merged)
MR Status
!24424 (merged) - In review, depending on comments from security and maintainers, a different approach may need to be taken that would take more development time.