Currently artifact servers are unbounded and will simply grow until system-level errors prevent them from functioning correctly, at which point artifacts can no longer be pushed and manual intervention is required on the server.
This is very similar to #135 (closed), except that expiring least-recently-used artifacts on the remote server is much more complicated: if the user only has read access to the artifact share, the current implementation has no way to record that an artifact was downloaded or requested.
As an initial step, we might consider deleting the server cache entirely once a quota is reached.
Later, it will be interesting to enhance the protocol with a smart script which is safe to run on the server for unauthenticated users, so that we can record the last time an artifact was used, and then implement a smarter LRU expiry mechanism.
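A minimal sketch of that blunt initial approach, assuming a fixed quota, `du` for measuring repo size, and an archive-mode ostree repo; the paths, values and names are illustrative only:

```python
import os
import shutil
import subprocess

QUOTA_BYTES = 100 * 1024**3  # hypothetical quota for the artifact share

def enforce_quota(repo_path):
    # Measure how much disk space the artifact repo currently occupies
    size = int(subprocess.check_output(['du', '-sb', repo_path]).decode().split()[0])
    if size > QUOTA_BYTES:
        # Blunt initial step: throw the whole cache away and start over
        shutil.rmtree(repo_path)
        os.makedirs(repo_path)
        subprocess.run(['ostree', 'init', '--repo', repo_path, '--mode=archive-z2'],
                       check=True)
```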
FYI: this is starting to be a big issue for us, as we have to recreate the cache ostree repo from scratch every time the disk gets full. Solvable, but still quite annoying.
Note, as per the monthly IRC team meeting held today: a stop-gap solution will suffice to begin with, to solve the problems currently seen by other projects. We could then close this issue out and raise another one to capture the complete solution.
@tristanvb about your comment at #136 (comment 54681961), I think this needs to be fixed in ostree itself; it is not related to the web server implementation of the cache server, but to the internal ostree repo used to implement the cache itself.
We would like to have an ostree command like `ostree --repo=artifacts/ prune --include-refs --keep-younger-than="3 days ago"`, as `ostree --repo=artifacts/ prune --refs-only --keep-younger-than="3 days ago"` does not remove the refs themselves.
@jjardon, the thing is this: we want to have LRU (least recently used) expiry.

I.e. we want to remove artifacts which have not been downloaded in a very long time. This is important because you can often have, say, a base system image which was created only once, a long time ago, but is used very frequently and is expensive to recreate; we would not want to expire that artifact just because it was created a long time ago.

This means that we need to update some registry whenever an artifact is downloaded, in order to figure out which artifacts were used most recently; this requires some protocol or activity to take place on the server (see the sketch below).

It's possible that ostree could gain some enhancement to allow easier cleanup, but without any server-side protocol it can only clean up the older artifacts. That is very different from least-recently-used expiry; it is rather least-recently-created expiry.
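A minimal sketch of the kind of bookkeeping this implies, assuming a hypothetical registry file on the server mapping cache keys to last-download timestamps (the path, file format and function name are illustrative, not an actual BuildStream interface):

```python
import json
import os
import time

REGISTRY = '/srv/artifacts/last-used.json'  # hypothetical location of the registry

def record_download(cache_key):
    # Load the existing registry (cache key -> last download time), or start empty
    registry = {}
    if os.path.exists(REGISTRY):
        with open(REGISTRY) as f:
            registry = json.load(f)
    # Record that this artifact was downloaded just now
    registry[cache_key] = time.time()
    with open(REGISTRY, 'w') as f:
        json.dump(registry, f)
```

The point is only that something on the server must be updated on every download; whether that is a file, a database or ostree metadata is an open question.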
@tristanvb agreed, but even if you discovered that the least recently used artifacts are all the ones that are 2 days old and older, you would not be able to remove them from the ostree repo without the command I posted before.
@jjardon I don't think I understand what you mean here.
This is basically how LRU expiry would work:
How we use ostree to store artifacts:

- Artifacts are stored in ostree with a ref named after a cache key
- There is only ever one commit for a given cache key

We first have to know which artifact was downloaded most recently:

- This requires some bookkeeping external to ostree, so that we can record which artifact (ref) was downloaded last; this should allow us to iterate over artifacts starting with the one downloaded least recently and ending with the one downloaded most recently

We would want a configurable "quota" for the artifact share, so that it stays below, e.g., 100GB.

Whenever we try to push an artifact (i.e. when the server receives an artifact):

- We first iterate over the list, starting from the least recently used artifact, and delete artifacts one at a time in a loop, until enough free space is created for the new, incoming artifact
- Then we store the new incoming artifact
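To make the push-time loop concrete, here is a rough sketch of the eviction step, assuming the last-used registry idea above, a fixed quota, and `du` for measuring repo size; all names and values are illustrative:

```python
import subprocess

QUOTA_BYTES = 100 * 1024**3  # hypothetical 100GB quota for the artifact share

def repo_size(repo):
    # Rough measure of how much disk space the ostree repo occupies
    return int(subprocess.check_output(['du', '-sb', repo]).decode().split()[0])

def expire_artifact(repo, ref):
    # Delete the ref, then prune unreachable objects (the two ostree steps described below)
    subprocess.run(['ostree', 'refs', '--repo', repo, '--delete', ref], check=True)
    subprocess.run(['ostree', 'prune', '--repo', repo, '--refs-only'], check=True)

def make_room(repo, refs_by_last_use, incoming_size):
    # refs_by_last_use: artifact refs ordered from least to most recently downloaded,
    # produced from whatever download bookkeeping the server keeps
    for ref in refs_by_last_use:
        if repo_size(repo) + incoming_size <= QUOTA_BYTES:
            break
        expire_artifact(repo, ref)
    # ...after which the new incoming artifact is stored via the normal push path
```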
The way to delete artifacts in ostree is a solved problem, and works as follows:
The first command deletes the ref (you can delete many refs at a time); the second command will prune the repo so that no commit objects remain unless they are reachable via a ref.
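Presumably those are the standard `ostree refs --delete` and `ostree prune --refs-only` invocations, roughly like this (wrapped in Python only for illustration):

```python
import subprocess

def delete_and_prune(repo, refs):
    # First command: delete the refs for the artifacts being expired (many at a time is fine)
    subprocess.run(['ostree', 'refs', '--repo', repo, '--delete'] + list(refs), check=True)
    # Second command: prune commit objects no longer reachable from any ref
    subprocess.run(['ostree', 'prune', '--repo', repo, '--refs-only'], check=True)
```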
I don't think what you suggest is really workable, either:

- The --keep-younger-than="3 days ago" option would only work with ostree's concept of age
- ostree will use this to prune older commits on the same branch; however, BuildStream does not really use branches, every ref is a standalone artifact
- The concept of age in ostree commits relates to creation dates; ostree does not have any concept or attribute recording when a given ref was last used (i.e. accessed for reading)
@tristanvb ah, ok, I get you now; so the work here is to identify when each artifact key was last used, so that we can remove artifacts in LRU order.

The original ostree issue was opened because I actually wanted to clean everything older (by creation date) than a specific date, but of course LRU is better for this use case.
FWIW, we are working towards something similar on the artifact server, which should use the same logic as we do locally, but using creation date instead of last-used date.

This is not the ultimate fix, but it will be a better stop-gap solution than wiping out the entire cache every time we reach a limit.
It's still important to have LRU expiry at some point, because I suspect base runtimes will often be the kind of artifact that was created only once, a long time ago, but is downloaded very frequently and is expensive to recreate.
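As for the stop-gap itself, a rough sketch of creation-date based expiry on the server might look like the following, using ostree's GObject-introspection bindings to read commit timestamps; the age threshold and overall shape are illustrative, not the actual branch:

```python
import subprocess
import time

import gi
gi.require_version('OSTree', '1.0')
from gi.repository import Gio, OSTree

MAX_AGE = 30 * 24 * 3600  # hypothetical: expire artifacts created more than 30 days ago

def expire_by_creation_date(repo_path):
    repo = OSTree.Repo.new(Gio.File.new_for_path(repo_path))
    repo.open(None)
    _, refs = repo.list_refs(None, None)  # ref name -> commit checksum
    cutoff = time.time() - MAX_AGE

    # Collect refs whose commit (creation) timestamp is older than the cutoff
    stale = [ref for ref, checksum in refs.items()
             if OSTree.commit_get_timestamp(repo.load_commit(checksum)[1]) < cutoff]

    if stale:
        # Same two-step deletion as before: drop the refs, then prune unreachable objects
        subprocess.run(['ostree', 'refs', '--repo', repo_path, '--delete'] + stale, check=True)
        subprocess.run(['ostree', 'prune', '--repo', repo_path, '--refs-only'], check=True)
```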
@juergbi, @jennis: There is a problem with this which I think was overlooked.
This is high priority, so I'm hoping that we can find an excuse to ignore this for the short term, but here is the problem:
It came up in the remote execution proposal that the CAS Artifact Cache would not support summary files. This means that with CAS, we cannot predict what we're going to download during a build session.
This will involve some complex reworking of the scheduler, such that we can:

- Just try to download the things we need, and fall back to fetching sources and building otherwise
- Queue build-of-build dependencies dynamically during pipeline execution, so that we can continue to avoid downloading exorbitant base runtimes for bootstrapping procedures, which we don't need when the build dependencies are downloadable
It stands to reason that if we are going to be expiring artifacts on the artifact cache server, this change is going to make the summary file meaningless, because artifacts we thought we could download are likely to disappear during build sessions.
@juergbi: Do you think there is any way we can ignore this problem with semi-graceful failures, and hope that the negative impacts won't be too high in the short term?
Otherwise, do you think that you can move up the scheduler-related part of your CAS work and land those scheduler changes as soon as possible, ideally in advance of landing this remote artifact expiry work?
> Otherwise, do you think that you can move up the scheduler-related part of your CAS work and land those scheduler changes as soon as possible, ideally in advance of landing this remote artifact expiry work?
This is actually my current focus. I have a branch that appears to be functional in initial tests but it's not quite ready yet for master. I have to do some refactoring for a more sensible structure.