I'd like to propose an option in .gitlab-ci.yml to only keep the latest artifact for each branch or pipeline. Maybe such an option could be used in combination with expire_in to let older artifacts expire after a certain amount of time. A good name for this option could be keep_latest.
Our use case: We automatically download the latest artifact for a few branches from GitLab. Over time quite a few artifacts might pile up, eating a considerable amount of disk space. Most branches are not updated regularly, so expire_in is not very helpful here.
~"feature proposal"
Proposal
Add an option keep_latest to the artifacts: section in .gitlab-ci.yml files. When this option is set to yes, for example, an artifact should only expire when a newer artifact is created.
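To make the intended rule concrete, here is a minimal Python sketch of "an artifact may only expire once a newer one exists". It is only an illustration: the Artifact class, its fields, and the choice to group by branch and job name are assumptions, not anything GitLab implements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    ref: str         # branch name (assumed grouping key)
    job: str         # job name (assumed grouping key)
    created_at: int  # creation time, e.g. a Unix timestamp


def expirable(artifacts):
    """Return the artifacts superseded by a newer one for the same (ref, job)."""
    newest = {}
    for a in artifacts:
        key = (a.ref, a.job)
        if key not in newest or a.created_at > newest[key].created_at:
            newest[key] = a
    # Everything that is not the newest of its group is allowed to expire.
    return [a for a in artifacts if newest[(a.ref, a.job)] is not a]
```

With keep_latest set, only the artifacts returned by expirable() would be subject to expire_in; the newest one per group would stay until it is replaced.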
This is pretty much what I'd like, too, except it would be great to give a number of old artifacts to keep. Something like keep_latest: 5 to keep the 5 most recent.
It sounds like a really good proposal. Our product packaging downloads artifacts from multiple repositories, and currently we have to set expire_in to an unnecessarily high value to be sure we always have artifacts available in some stale repositories (or we have to trigger builds periodically on those repositories).
With that keep_latest option we could reduce the disk cost and make a sysadmin happy!
Thank you everyone for the feedback. I like the keep_latest approach and its implications when used along with expire_in, even if I don't know how costly it is to access information about the number of artifacts from previous runs.
We also have some problems to address: how do you define whether artifacts from previous pipelines count? You may change the .gitlab-ci.yml, and this is quite hard to handle. You can change the job name but want it to be considered the same job, or you can change the script of the job and want it to be considered a different one. How do you handle pipelines run on different branches? Is it the latest three in the same branch, or in general?
Time-based expiration relies only on the artifact itself, while this approach relies on relationships between artifacts, which introduces complexity and edge cases.
I'd like to know what the community thinks about my considerations.
Will try to address points from my point of view... hopefully, it'll be helpful to you :)
Personally I'd be cool with having the change to .gitlab-ci.yml work retroactively, as one of the reasons I want this is to help save space as I've got some pipelines creating ridiculously huge artifacts.
I suppose one way to try to handle this differently would be to treat the artifact as an object and apply an expiration property to the artifact itself. This doesn't really address clearing that property if the .gitlab-ci.yml is changed after the fact though. Someone else might have a better solution for that!
As for changing of the job...
I guess the artifact should be an object with an 'expiration' property same as my previous suggestion..? I'm struggling to find another answer here.
As for artifact behavior on specific branches: if the artifacts code block lives under a job definition, then there are already ways to define that it applies to a specific branch, e.g. for review branches in my pipeline I'm using: only: - /^review-.*$/
I guess the best suggestion I can offer that sort of covers your points is to handle the artifact as an object in code in the background of gitlab (not sure how difficult that would be as I haven't dug into it myself...) and then give users an interface something like this maybe?:
# Inside a job context so that branches can be specified if the user wants to.
artifacts:
  paths:
    - ${artifact_path}
  expire_in: 1d
  keep_last: 3
  retroactive: True  # Default false
@snayler0 unfortunately not yet. The only workaround I see now is to add a job in your pipeline that removes old artifacts using API, but I understand it is nothing more than a quick hack.
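For anyone wanting to try that workaround, a rough Python sketch follows. It assumes your instance offers the API v4 "list project jobs" endpoint (GET /projects/:id/jobs) and the "delete job artifacts" endpoint (DELETE /projects/:id/jobs/:job_id/artifacts); the instance URL, the function names, the grouping by (ref, name), and the idea that a higher job id means a newer job are all assumptions made for illustration.

```python
import urllib.request

API = "https://gitlab.example.com/api/v4"  # hypothetical instance URL


def jobs_to_clean(jobs, keep=1):
    """Pick job ids whose artifacts can go, keeping the `keep` newest per (ref, name).

    `jobs` is the decoded JSON list returned by GET /projects/:id/jobs.
    """
    groups = {}
    for job in jobs:
        if job.get("artifacts_file"):  # only jobs that still hold an artifacts archive
            groups.setdefault((job["ref"], job["name"]), []).append(job)
    doomed = []
    for group in groups.values():
        group.sort(key=lambda j: j["id"], reverse=True)  # assumption: higher id = newer
        doomed.extend(j["id"] for j in group[keep:])
    return sorted(doomed)


def delete_artifacts(project_id, job_id, token):
    """Send DELETE /projects/:id/jobs/:job_id/artifacts for a single job."""
    req = urllib.request.Request(
        f"{API}/projects/{project_id}/jobs/{job_id}/artifacts",
        method="DELETE",
        headers={"PRIVATE-TOKEN": token},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A cleanup job at the end of the pipeline could fetch the job list, pass it to jobs_to_clean(), and call delete_artifacts() for each returned id. It remains a hack, as said above, and it needs a token with sufficient permissions.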
Let's say I have a job that triggers during a Rollback which needs a previous job artifact, this would effectively fail if you only keep the latest artifact.
@wakawaka54 For that reason I would prefer having a keep_last: <int> option to keep eg the last 3 artifacts that would be job-specific, rather than having a set expiry time.
This sounds like a great addition, would make our workflow much more streamlined. Would remove the need to fiddle with expiry times on a per project level.
Any news on this feature? In our Gitlab Enterprise edition we now have over 130GB of artifacts on 1 single branch, because it keeps all the older versions. Would be nice if we could only keep the latest.
That would greatly improve our workflow, the time it takes to back-up gitlab and disk space for storage.
One thing I like about the general spirit of the Gitlab-ci is its "stateless" approach. The fact that it is repeatable and the result is predictable.
So when we talk about keeping the "last few builds" I personally hope we talk about keeping the build artifacts associated with the last few commits of a branch.
I suppose that, from this point of view, we could continue to see the build artifacts as somewhat stateless and decoupled - i.e. the granularity of a build artifact would take into account a pipeline instance + job name + expiry / creation date but also a commit hash related to a branch.
Otherwise, if we only consider "the last few" from a time perspective, does it mean that triggering 3 times the same pipeline on the same branch would produce 3 different artifacts to keep? My personal answer to this would be no. Even if I could see why you might want that... it seems to me that this perspective brings a whole lot of complexity with it - i.e. the need for an artifacts manager and all the edge-cases to be handled "properly" (read: best-effort).
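To illustrate the commit-based reading, here is a small Python sketch. It assumes artifacts for one branch and one job are known as (commit_sha, pipeline_id) pairs in run order; the function name and the data shape are hypothetical.

```python
def keep_for_last_commits(runs, n):
    """Return the pipeline ids to keep for the last `n` distinct commits.

    `runs` is a list of (commit_sha, pipeline_id) tuples for one branch and
    one job, oldest run first. Re-running a commit replaces its earlier
    artifact instead of counting as an additional build.
    """
    if n <= 0:
        return set()
    latest_per_commit = {}
    for sha, pipeline_id in runs:
        latest_per_commit.pop(sha, None)      # re-insert so order follows the latest run
        latest_per_commit[sha] = pipeline_id
    return set(list(latest_per_commit.values())[-n:])
```

With this rule, triggering the same pipeline three times on the same commit still occupies a single slot out of the last few.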
EDIT: Cool, doing this feature might actually help with the "rollback" feature not being very reliable right now, see gitlab-org/gitlab-ce#39539
This would be nice when having e.g. both a backend and a frontend project, where building the backend requires the output from the frontend build.
However, another approach that might be even cleaner and would make this whole feature less important would be allowing a job to depend on a stage from another project: that way I could depend on a stage like 'frontend-repo/build', and if no artifacts from that stage exist, that project's CI would run to build them.
I have to add my voice to this one...
I have install package artifacts that need to be available.
Only the last successful build needs to be kept "forever"...
So:
# Inside a job context so that branches can be specified if the user wants to.
artifacts:
  paths:
    - ${artifact_path}
  expire_in: 1d
  keep_last: 1
@bikebilly Regarding the statelessness: could you attach keep_last to each artifact, along with expire_in, when you create the artifact? Then, when you do the hourly cleanup check, you could read keep_last from only the latest artifact on each branch, use that value as the basis for cleaning up the older artifacts on that branch, and fall back to expire_in for all artifacts beyond keep_last. I.e. skip the last keep_last artifacts before starting to check expire_in.
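A minimal Python sketch of that cleanup pass, to show the idea only. The dict keys, the cleanup() name, and the grouping by branch alone are assumptions; the real cleanup worker may look quite different.

```python
from datetime import datetime, timedelta


def cleanup(artifacts, now):
    """Return the artifacts an hourly cleanup pass would delete.

    Each artifact is a dict with the (assumed) keys 'branch',
    'created_at' (datetime), 'expire_in' (timedelta) and 'keep_last' (int).
    """
    by_branch = {}
    for a in artifacts:
        by_branch.setdefault(a["branch"], []).append(a)

    to_delete = []
    for items in by_branch.values():
        items.sort(key=lambda a: a["created_at"], reverse=True)  # newest first
        keep_last = items[0]["keep_last"]  # only the latest artifact's setting counts
        for a in items[keep_last:]:        # skip the protected newest artifacts
            if a["created_at"] + a["expire_in"] <= now:
                to_delete.append(a)        # fall back to plain expire_in
    return to_delete
```

Because keep_last is read from the newest artifact, a change to .gitlab-ci.yml takes effect for the whole branch on the next run, without rewriting older artifacts.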
uses Gitlab deployments API to check which images are or were in use, keeping the last n images per environment. That way you don't risk deleting images you are actively using or were using a short time ago.
uses Gitlab jobs API to get rid of jobs with artifacts that don't have an expiry set. These take up a lot of disk space over time too. And it gives developers an incentive to use artifacts expiry to keep their old jobs around if they have any interest in that.
needs Gitlab new enough to have API v4.
Feel free to test, comment and supply patches to improve it :-)
I am not sure this still makes sense. With the new release API, I think all use cases are "covered" now:
artifact => share files between CI jobs, lifecycle is CI / short
cache => share files between same CI job occurrences, lifecycle is CI / long
release => store files permanently, lifecycle is project / permanent
well...
afaict, the "release assets" need a storage, which is currently not
(directly) provided by the releases API.
i haven't found an easy to deploy method to automatically create
releases from a tag in a CI-job yet (but that's probably a problem with
documentation)
A lot of our repositories see bursts of build activity, with relatively large artifacts. To avoid running out of space, we set pretty aggressive timeouts. Unfortunately, this often causes manual jobs to later fail. Having the option to retain the most recent artifacts would be extremely useful.
Earlier in the thread, there were questions about how to key the artifacts (e.g., job name, branch, etc). If that's the hang-up, could it work like caching, where the key can be customized?
keep_latest would be simple and would add a lot of value, especially when combined with latest artifact links.
We've migrated to GitLab from an in-house CI system (built before CI was mainstream), and it's been heartening to see the GitLab CI feature set grow to cover most of our requirements. FWIW, our internal system used a dynamic pruning algorithm to strike a balance between disk usage and maintaining a spread of representative builds.
We kept:
Every build on the current day
One per hour for previous days in the last week
One per day for builds older than a week, but less than 6 weeks
One per week for builds between 6w and 3m old
One per month for builds older than 3 months
So rather than a boolean option for keep_latest, or a numeric keep_last, it might be worth considering a syntax that allows alternative pruning algorithms in the future, e.g.
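As a rough Python sketch of how the retention schedule above could be computed (this is not our in-house implementation; the function names, reading "3 months" as 90 days, and keeping the newest build in each bucket are all assumptions):

```python
from datetime import datetime, timedelta


def bucket(built_at, now):
    """Map a build time to a retention bucket; one build survives per bucket."""
    age = now - built_at
    if built_at.date() == now.date():
        return ("all", built_at)  # unique key per build: keep everything from today
    if age < timedelta(weeks=1):
        return ("hour", built_at.strftime("%Y-%m-%d %H"))
    if age < timedelta(weeks=6):
        return ("day", built_at.date())
    if age < timedelta(days=90):  # "3 months" approximated as 90 days
        return ("week", tuple(built_at.isocalendar())[:2])  # ISO year and week
    return ("month", (built_at.year, built_at.month))


def prune(build_times, now):
    """Return the build times to keep: the newest build in each bucket."""
    kept = {}
    for t in sorted(build_times):
        kept[bucket(t, now)] = t  # later builds overwrite earlier ones
    return sorted(kept.values())
```

A configurable syntax could then map a policy name to a function like bucket(), so other pruning algorithms can be added later without changing the option itself.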
GitLab is moving all development for both GitLab Community Edition
and Enterprise Edition into a single codebase. The current
gitlab-ce repository will become a read-only mirror, without any
proprietary code. All development is moved to the current
gitlab-ee repository, which we will rename to just gitlab in the
coming weeks. As part of this migration, issues will be moved to the
current gitlab-ee project.
If you have any questions about all of this, please ask them in our
dedicated FAQ issue.
Using "gitlab" and "gitlab-ce" would be confusing, so we decided to
rename gitlab-ce to gitlab-foss to make the purpose of this FOSS
repository more clear
I created a merge request for CE, and it got closed. What do I
need to do?
Everything in the ee/ directory is proprietary. Everything else is
free and open source software. If your merge request does not change
anything in the ee/ directory, the process of contributing changes
is the same as when using the gitlab-ce repository.
Will you accept merge requests on the gitlab-ce/gitlab-foss project
after it has been renamed?
No. Merge requests submitted to this project will be closed automatically.
Will I still be able to view old issues and merge requests in
gitlab-ce/gitlab-foss?
Yes.
How will this affect users of GitLab CE using Omnibus?
No changes will be necessary, as the packages built remain the same.
How will this affect users of GitLab CE that build from source?
Once the project has been renamed, you will need to change your Git
remotes to use this new URL. GitLab will take care of redirecting Git
operations so there is no hard deadline, but we recommend doing this
as soon as the projects have been renamed.
Where can I see a timeline of the remaining steps?