Guest cache

Motivation

Instead of releasing pool resources when a fully provisioned guest request is canceled, the request would be put into a cache. Future requests may then be satisfied from the cache, which avoids the provisioning process and saves time by reusing already configured guests.

In general, tests can be classified as destructive or non-destructive. This feature would have little use for destructive tests, but for non-destructive tests a cache could considerably lower the time needed for provisioning. Non-destructive tests include:

tests that do not modify system setup
tests that do modify the system but undo their changes - e.g. systemctl changes that are rolled back, package installation/removal, loopback devices based on files that are then removed, etc.
tests that are confined to their own box, e.g. via chroot or running inside a container.
tests that do modify the system, but in a well-known manner understood by their runner, and it's acceptable to use such a system again for another similar test.

Note that deciding whether the guest is reusable is out of scope; this is completely up to the Artemis client. Usually, after provisioning, the client applies their own setup, e.g. in the form of Ansible playbooks, to prepare the environment for the next steps. This may make changes to the guest environment, e.g. by installing Docker or Podman, which sounds like a destructive change from Artemis' point of view, but if such a guest is then used to run only tests that are encapsulated by containers, it is possible to reuse the guest for more such tests. This decision is owned by clients, and it's up to them to decide whether it's worth caching a given guest, given their workflow and use cases.

MVP

support provisioning from the cache, release into cache
user CLI: support cache specification
admin CLI: cache management - create, list, remove, list guests, release guests, enable/disable (via knob)
knobs: enabled/disabled, cache size
metrics: cache usage (size, current), hits/misses, removals, dead guests, forced removals
provisioning workflow changes:
- (pre-)routing to decide whether a cache has what was requested
- always assign a cache to a request, even a dummy "invisible" no-op one with size 0
- add a cache retrieval task flow parallel to the current provisioning task flow
implement keep-alive watchdog, testing cached guests
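
The cache-related MVP items (release into cache, provisioning from the cache, the enabled/size knobs, hit/miss metrics, and the no-op cache of size 0) can be sketched as a small in-memory model. This is only an illustration under assumptions, not Artemis code: the `Guest` and `Cache` classes, their fields, and the reduction of an environment to a single string are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Guest:
    # Hypothetical, simplified guest record.
    guestname: str
    environment: str              # assumption: one comparable string
    cache: Optional[str] = None   # cache label; None means "not cached"


@dataclass
class Cache:
    # Hypothetical cache record carrying the two MVP knobs and two metrics.
    name: str
    owner: str
    size: int = 0                 # knob: capacity; 0 gives a no-op cache
    enabled: bool = True          # knob: enabled/disabled
    hits: int = 0
    misses: int = 0

    def release(self, guest: Guest, guests: List[Guest]) -> bool:
        """Label the guest with this cache.

        Returns False when the cache is disabled or full, meaning the
        guest has to be released for real.
        """
        current = sum(1 for g in guests if g.cache == self.name)
        if not self.enabled or current >= self.size:
            return False
        guest.cache = self.name
        return True

    def acquire(self, environment: str, guests: List[Guest]) -> Optional[Guest]:
        """Return a matching cached guest and strip its label, or None on a miss."""
        if self.enabled:
            for guest in guests:
                if guest.cache == self.name and guest.environment == environment:
                    guest.cache = None
                    self.hits += 1
                    return guest
        self.misses += 1
        return None
```

With this shape, a request that names no cache can be given a `Cache` of size 0: every release falls through to a real release and every acquire is a miss, so the workflow does not need a separate "no cache" branch.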

Notes

a cache would be nothing more than a label applied to a guest - in general, when "release" is requested by the user, instead of releasing the guest, Artemis would slap a particular cache name on the guest.
caches shall have names and owners, to allow separation of users and use cases.
provisioning must be transparent for the client: acquire/release would carry a new field to instruct Artemis what cache to use, but whether the cache is used or not is up to Artemis. There should be no extra API endpoint for the "provision but from cache" workflow.
we will need to resolve the forced release - given the point above, the client releases the guest as per usual, and the guest may land in a cache, but from the maintainer's point of view, we will need a process for releasing cached guests. Similar to DELETE /guests/$guestname, this would probably be under the cache's endpoint, DELETE /cache/$cachename/$guestname.
Artemis will need some sort of keep-alive task, to verify cached machines are still alive - especially with Beaker, we encountered situations when Beaker took the machine back before its reservation expired.
a watermark would be needed, to limit the number of cached guests - which opens the question of cache knobs. Enabled/disabled and capacity are the most important ones we need from the start.
needs proper API and CLI support, to manipulate cached guests - release, test whether still alive, add to cache (although this is just an acquire + release with cache name specified)
metrics, of course - current numbers, how many reused, canceled, died, ...
shortened provisioning chain - if the machine exists, jump right before the preparation stage?
I like the term "shelve" instead of "cache" - when not needed anymore, shelve the guest
where to hook the cache/no-cache decision? route-guest-request deals with picking a pool for the given request, but when a cache can satisfy the request, we don't need to delve into route-guest-request, we can jump straight to the PREPARING stage. I'm not sure it's wise to extend route-guest-request with caching logic, maybe we need yet another task before this one, to decide the course of action, leaving route-guest-request to focus on actual provisioning only.
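
The "yet another task before route-guest-request" idea from the last note could look roughly like the sketch below: a pre-routing step that looks for a live cached guest and either short-cuts to the preparation stage or falls back to regular routing. Everything here is an assumption made for illustration: the function name, the `prepare-guest` task name, the tuple layout of cached guests, and the `is_alive` callback standing in for the keep-alive check. Only `route-guest-request` is a name taken from the notes.

```python
from enum import Enum
from typing import Callable, Iterable, Optional, Tuple


class NextTask(Enum):
    ROUTE_GUEST_REQUEST = "route-guest-request"  # task named in the notes
    PREPARE_GUEST = "prepare-guest"              # hypothetical entry to the PREPARING stage


def decide_course_of_action(
    cachename: Optional[str],
    wanted_environment: str,
    cached_guests: Iterable[Tuple[str, str, str]],  # (guestname, cachename, environment)
    is_alive: Callable[[str], bool],                # hypothetical keep-alive probe
) -> Tuple[NextTask, Optional[str]]:
    """Pick the next task for a guest request.

    On a cache hit, return the preparation task together with the name
    of the cached guest. Otherwise hand the request over to regular
    routing, which stays free of any caching logic.
    """
    if cachename is not None:
        for guestname, guest_cache, environment in cached_guests:
            if guest_cache != cachename or environment != wanted_environment:
                continue
            # Dead guests are skipped; a real task would likely also
            # schedule their removal and bump a "dead guests" metric.
            if is_alive(guestname):
                return NextTask.PREPARE_GUEST, guestname
    return NextTask.ROUTE_GUEST_REQUEST, None
```

Keeping the decision in its own function mirrors the concern in the note: route-guest-request never learns about caches, and the client-facing API stays the same whether the guest came from a cache or from a pool.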

Edited Nov 24, 2022 by Miloš Prchlík