Guest cache
Part of the Artemis roadmap.
Motivation
Instead of releasing pool resources when a fully provisioned guest request is canceled, the request would be put into a cache. Future requests may then be satisfied from the cache, either skipping the provisioning process entirely or saving time by reusing already configured guests.
In general, tests can be classified as destructive or non-destructive. This feature would have little use for destructive tests, but for non-destructive tests a cache could considerably lower the time needed for provisioning. Non-destructive tests include:
- tests that do not modify system setup
- tests that do modify the system but undo their changes - e.g. systemctl changes that are rolled back, package installation/removal, loopback devices based on files that are then removed, etc.
- tests that are confined to their own box, e.g. via chroot or by running inside a container.
- tests that do modify the system, but in a well-known manner understood by their runner, so that it's acceptable to use such a system again for another similar test.
Note that deciding whether the guest is reusable is out of scope; this is completely up to the Artemis client. Usually, after provisioning, the client applies their own setup, e.g. in the form of Ansible playbooks, to prepare the environment for the next steps. This may change the guest environment, e.g. by installing Docker or Podman, which sounds like a destructive change from Artemis' point of view. However, if such a guest is then used to run only tests that are encapsulated by containers, it is possible to reuse the guest for more such tests. This decision is owned by clients: it's up to them to decide whether it's worth caching a given guest, given their workflow and use cases.
MVP
- support provisioning from the cache, release into cache
- user CLI: support cache specification
- admin CLI: cache management - create, list, remove, list guests, release guests, enable/disable (via knob)
- knobs: enabled/disabled, cache size
- metrics: cache usage (size, current), hits/misses, removals, dead guests, forced removals
- provisioning workflow changes:
  - (pre-)routing to decide whether a cache has what was requested
  - always assign a cache to a request, even a dummy "invisible" no-op one with size 0
  - add a cache retrieval task flow parallel to the current provisioning task flow
- implement keep-alive watchdog, testing cached guests
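
The MVP items above can be illustrated with a minimal in-memory sketch. It is not Artemis code: the Guest and GuestCache classes, their fields, and the is_alive callback are hypothetical names chosen for illustration only. It shows release into a cache, provisioning from it, the enabled/size knobs (including the size 0 no-op cache), hit/miss counters, and a keep-alive pass.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Guest:
    # Hypothetical, minimal stand-in for a provisioned guest.
    guestname: str
    environment: str
    cache: Optional[str] = None  # name of the cache currently holding the guest


@dataclass
class GuestCache:
    # Hypothetical cache model: name, owner, two knobs and a few counters.
    name: str
    owner: str
    enabled: bool = True
    size: int = 0  # size 0 makes the cache a no-op
    guests: List[Guest] = field(default_factory=list)
    hits: int = 0
    misses: int = 0
    dead_guests: int = 0

    def release(self, guest: Guest) -> bool:
        """Put the guest into the cache; False means it must be released for real."""
        if not self.enabled or len(self.guests) >= self.size:
            return False
        guest.cache = self.name
        self.guests.append(guest)
        return True

    def acquire(self, environment: str) -> Optional[Guest]:
        """Return a cached guest matching the environment, or None on a miss."""
        if self.enabled:
            for guest in self.guests:
                if guest.environment == environment:
                    self.guests.remove(guest)
                    guest.cache = None
                    self.hits += 1
                    return guest
        self.misses += 1
        return None

    def prune_dead(self, is_alive: Callable[[Guest], bool]) -> List[Guest]:
        """Keep-alive pass: drop guests that fail the check and count them."""
        alive: List[Guest] = []
        dead: List[Guest] = []
        for guest in self.guests:
            (alive if is_alive(guest) else dead).append(guest)
        self.guests = alive
        self.dead_guests += len(dead)
        return dead
```

In this sketch a miss simply returns None, which the caller would treat as "fall back to regular provisioning". How environments are actually matched against cached guests is left open here; plain string equality is only a placeholder.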
Notes
- a cache would be nothing more than a label applied to a guest - in general, when "release" is requested by the user, instead of releasing the guest, Artemis would slap a particular cache name on the guest.
- caches shall have names and owners, to allow separation of users and use cases.
- provisioning must be transparent for the client: acquire/release would carry a new field to instruct Artemis what cache to use, but whether the cache is used or not is up to Artemis. There should be no extra API endpoint for the "provision but from cache" workflow.
- we will need to resolve the forced release - given the point above, the client releases the guest as usual, and the guest may land in a cache, but from the maintainer's point of view, we will need a process for releasing cached guests. Similar to DELETE /guests/$guestname, this would probably live under the cache's endpoint, DELETE /cache/$cachename/$guestname.
- Artemis will need some sort of keep-alive task to verify cached machines are still alive - especially with Beaker, we encountered situations where Beaker took the machine back before its reservation expired.
- a watermark would be needed to limit the number of cached guests - which opens the question of cache knobs. Enabled/disabled and capacity are the most important ones we need from the start.
- needs proper API and CLI support to manipulate cached guests - release, test whether still alive, add to cache (although this is just an acquire + release with the cache name specified)
- metrics, of course - current numbers, how many reused, canceled, died, ...
- shortened provisioning chain - if the machine exists, jump right before the preparation stage?
- I like the term "shelve" instead of "cache" - when not needed anymore, shelve the guest
- where to hook the cache/no-cache decision? route-guest-request deals with picking a pool for the given request, but when a cache can satisfy the request, we don't need to delve into route-guest-request; we can jump somewhere to the PREPARING stage. I'm not sure it's wise to extend route-guest-request with caching logic; maybe we need yet another task before this one to decide the course of action, leaving route-guest-request to focus on actual provisioning only.
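
The last note, a separate task that runs before route-guest-request and decides the course of action, could look roughly like the sketch below. Only the route-guest-request name comes from these notes; pre_route, NextTask, the retrieve-guest-from-cache task name, and the cache_has_match callback are all hypothetical and serve only to show the shape of the decision.

```python
from enum import Enum
from typing import Callable, Optional


class NextTask(Enum):
    # route-guest-request is the existing task mentioned in the notes;
    # the cache retrieval task name is a made-up placeholder.
    ROUTE = 'route-guest-request'
    CACHE_RETRIEVAL = 'retrieve-guest-from-cache'


def pre_route(
    cachename: Optional[str],
    environment: str,
    cache_has_match: Callable[[str, str], bool],
) -> NextTask:
    """Decide the course of action before any pool routing happens."""
    if cachename is not None and cache_has_match(cachename, environment):
        # A cached guest can satisfy the request: skip routing entirely.
        return NextTask.CACHE_RETRIEVAL
    # No cache requested, or nothing suitable in it: provision as usual.
    return NextTask.ROUTE
```

Keeping the decision in its own function leaves the routing task untouched, which matches the concern above about not mixing caching logic into route-guest-request.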