Currently artifact servers are unbounded and will simply grow until system-level errors prevent them from functioning correctly, at which point artifacts can no longer be pushed and manual intervention is required on the server.
This is very similar to #135 (closed), except that expiring least-recently-used artifacts on the remote server is much more complicated: if the user only has read access to the artifact share, the current implementation has no way to record that an artifact was downloaded or requested.
As an initial step, we might consider deleting the server cache entirely once a quota is reached.
Later, it will be interesting to enhance the protocol with a smart script which is safe to run on the server for unauthenticated users, so that we can record the last time an artifact was used, and then implement a smarter LRU expiry mechanism.
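A minimal sketch of that blunt initial approach, assuming a fixed quota, `du` for measuring repo size, and an archive-mode ostree repo; the paths, values and names are illustrative only:

```python
import os
import shutil
import subprocess

QUOTA_BYTES = 100 * 1024**3  # hypothetical quota for the artifact share

def enforce_quota(repo_path):
    # Measure how much disk space the artifact repo currently occupies
    size = int(subprocess.check_output(['du', '-sb', repo_path]).decode().split()[0])
    if size > QUOTA_BYTES:
        # Blunt initial step: throw the whole cache away and start over
        shutil.rmtree(repo_path)
        os.makedirs(repo_path)
        subprocess.run(['ostree', 'init', '--repo', repo_path, '--mode=archive-z2'],
                       check=True)
```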
FYI: this is starting to be a big issue for us, as we have to recreate the cache ostree repo from scratch every time the disk gets full. Solvable, but still quite annoying.
Note, as per the monthly IRC team meeting held today: a stop-gap solution will suffice to begin with, to solve the problems currently seen by other projects. We could then close this issue out and raise another one to capture the complete solution.
@tristanvb about your comment at #136 (comment 54681961), I think this needs to be fixed in ostree itself; it is not related to the web server implementation of the cache server, but to the internal ostree repo used to implement the cache itself.
We would like to have an ostree command like `ostree --repo=artifacts/ prune --include-refs --keep-younger-than="3 days ago"`, as `ostree --repo=artifacts/ prune --refs-only --keep-younger-than="3 days ago"` does not remove the refs themselves.
@jjardon, the thing is this: we want to have LRU (least recently used) expiry.

I.e. we want to remove artifacts which have not been downloaded in a very long time. This is important because you can often have, say, a base system image which was created only once, a long time ago, but is used very frequently and is expensive to recreate; we would not want to expire that artifact just because it was created a long time ago.

This means that we need to update some registry whenever an artifact is downloaded, in order to figure out which artifacts were used most recently; this requires some protocol or activity to take place on the server (see the sketch below).

It's possible that ostree could gain some enhancement to allow easier cleanup, but without any server-side protocol it can only clean up the older artifacts. That is very different from least-recently-used expiry; it is rather least-recently-created expiry.
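A minimal sketch of the kind of bookkeeping this implies, assuming a hypothetical registry file on the server mapping cache keys to last-download timestamps (the path, file format and function name are illustrative, not an actual BuildStream interface):

```python
import json
import os
import time

REGISTRY = '/srv/artifacts/last-used.json'  # hypothetical location of the registry

def record_download(cache_key):
    # Load the existing registry (cache key -> last download time), or start empty
    registry = {}
    if os.path.exists(REGISTRY):
        with open(REGISTRY) as f:
            registry = json.load(f)
    # Record that this artifact was downloaded just now
    registry[cache_key] = time.time()
    with open(REGISTRY, 'w') as f:
        json.dump(registry, f)
```

The point is only that something on the server must be updated on every download; whether that is a file, a database or ostree metadata is an open question.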
@tristanvb agreed, but even if you discovered that the least recently used artifacts are all the ones that are 2 days old and older, you would not be able to remove them from the ostree repo without the command I posted before.
@jjardon I don't think I understand what you mean here.
This is basically how LRU expiry would work:
How we use ostree to store artifacts:

- Artifacts are stored in ostree with a ref named after a cache key
- There is only ever one commit for a given cache key

We first have to know which artifact was downloaded most recently:

- This requires some bookkeeping external to ostree, so that we can record which artifact (ref) was downloaded last; this should allow us to iterate over artifacts starting with the one downloaded least recently and ending with the one downloaded most recently

We would want a configurable "quota" for the artifact share, so that it stays below, e.g., 100GB.

Whenever we try to push an artifact (i.e. when the server receives an artifact):

- We first iterate over the list, starting from the least recently used artifact, and delete artifacts one at a time in a loop, until enough free space is created for the new, incoming artifact
- Then we store the new incoming artifact
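To make the push-time loop concrete, here is a rough sketch of the eviction step, assuming the last-used registry idea above, a fixed quota, and `du` for measuring repo size; all names and values are illustrative:

```python
import subprocess

QUOTA_BYTES = 100 * 1024**3  # hypothetical 100GB quota for the artifact share

def repo_size(repo):
    # Rough measure of how much disk space the ostree repo occupies
    return int(subprocess.check_output(['du', '-sb', repo]).decode().split()[0])

def expire_artifact(repo, ref):
    # Delete the ref, then prune unreachable objects (the two ostree steps described below)
    subprocess.run(['ostree', 'refs', '--repo', repo, '--delete', ref], check=True)
    subprocess.run(['ostree', 'prune', '--repo', repo, '--refs-only'], check=True)

def make_room(repo, refs_by_last_use, incoming_size):
    # refs_by_last_use: artifact refs ordered from least to most recently downloaded,
    # produced from whatever download bookkeeping the server keeps
    for ref in refs_by_last_use:
        if repo_size(repo) + incoming_size <= QUOTA_BYTES:
            break
        expire_artifact(repo, ref)
    # ...after which the new incoming artifact is stored via the normal push path
```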
The way to delete artifacts in ostree is a solved problem, and works as follows:
The first command deletes the ref (you can delete many refs at a time); the second command will prune the repo so that no commit objects remain unless they are reachable via a ref.
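Presumably those are the standard `ostree refs --delete` and `ostree prune --refs-only` invocations, roughly like this (wrapped in Python only for illustration):

```python
import subprocess

def delete_and_prune(repo, refs):
    # First command: delete the refs for the artifacts being expired (many at a time is fine)
    subprocess.run(['ostree', 'refs', '--repo', repo, '--delete'] + list(refs), check=True)
    # Second command: prune commit objects no longer reachable from any ref
    subprocess.run(['ostree', 'prune', '--repo', repo, '--refs-only'], check=True)
```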
I don't think what you suggest is really workable, either:

- The --keep-younger-than="3 days ago" option would only work with ostree's concept of age
- ostree will use this to prune older commits on the same branch; however, BuildStream does not really use branches, every ref is a standalone artifact
- The concept of age in ostree commits relates to creation dates; ostree does not have any concept or attribute recording when a given ref was last used (i.e. accessed for reading)
@tristanvb ah, ok, I get you now; so the work here is to identify when each artifact key was last used, so that we can remove artifacts in LRU order.

The original ostree issue was opened because I actually wanted to clean everything older (by creation date) than a specific date, but of course LRU is better for this use case.
FWIW, we are working towards something similar on the artifact server, which should use the same logic as we do locally, but using creation date instead of last-used date.

This is not the ultimate fix, but it will be a better stop-gap solution than wiping out the entire cache every time we reach a limit.
It's still important to have LRU expiry at some point, because I suspect base runtimes will often be the kind of artifact that was created only once, a long time ago, but is downloaded very frequently and is expensive to recreate.
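As for the stop-gap itself, a rough sketch of creation-date based expiry on the server might look like the following, using ostree's GObject-introspection bindings to read commit timestamps; the age threshold and overall shape are illustrative, not the actual branch:

```python
import subprocess
import time

import gi
gi.require_version('OSTree', '1.0')
from gi.repository import Gio, OSTree

MAX_AGE = 30 * 24 * 3600  # hypothetical: expire artifacts created more than 30 days ago

def expire_by_creation_date(repo_path):
    repo = OSTree.Repo.new(Gio.File.new_for_path(repo_path))
    repo.open(None)
    _, refs = repo.list_refs(None, None)  # ref name -> commit checksum
    cutoff = time.time() - MAX_AGE

    # Collect refs whose commit (creation) timestamp is older than the cutoff
    stale = [ref for ref, checksum in refs.items()
             if OSTree.commit_get_timestamp(repo.load_commit(checksum)[1]) < cutoff]

    if stale:
        # Same two-step deletion as before: drop the refs, then prune unreachable objects
        subprocess.run(['ostree', 'refs', '--repo', repo_path, '--delete'] + stale, check=True)
        subprocess.run(['ostree', 'prune', '--repo', repo_path, '--refs-only'], check=True)
```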
@juergbi, @jennis: There is a problem with this which I think was overlooked.
This is high priority, so I'm hoping that we can find an excuse to ignore this for the short term, but here is the problem:
It came up in the remote execution proposal that the CAS Artifact Cache would not support summary files. This means that with CAS, we cannot predict what we're going to download during a build session.
This will involve some complex reworking of the scheduler, such that we can:

- Just try to download the things we need, and fall back to fetching sources and building otherwise
- Queue build-of-build dependencies dynamically during pipeline execution, so that we can continue to avoid downloading exorbitant base runtimes for bootstrapping procedures, which we don't need when the build dependencies are downloadable
It stands to reason that if we are going to be expiring artifacts on the artifact cache server, this change is going to make the summary file meaningless, because artifacts we thought we could download are likely to disappear during build sessions.
@juergbi: Do you think there is any way we can ignore this problem with semi-graceful failures, and hope that the negative impacts won't be too high in the short term?
Otherwise, do you think that you can move up the scheduler-related part of your CAS work and land those scheduler changes as soon as possible, ideally in advance of landing this remote artifact expiry work?
> Otherwise, do you think that you can move up the scheduler-related part of your CAS work and land those scheduler changes as soon as possible, ideally in advance of landing this remote artifact expiry work?
This is actually my current focus. I have a branch that appears to be functional in initial tests but it's not quite ready yet for master. I have to do some refactoring for a more sensible structure.