Pre-provisioning

Motivation

Pre-provisioning would speed up the provisioning of the most requested types of guests. The plan is to manage parts of the cache automatically, periodically refilling the cache(s) as needed to keep a predefined number of guests available.

The vast majority of work runs on guests that share a HW configuration, namely the default "flavor" used by a client (like the Testing Farm service, e.g. t2.small or c3.large). These requests would see shorter provisioning times.
Requests towards pools where provisioning takes more time (e.g. Beaker) would see an even greater benefit, because the gap between the original "long" provisioning time and a pre-provisioned machine that is ready to use is larger.
Depending on pool backends, the number of pre-provisioned machines can be tunable and relatively large. For example, with cheap spot requests, maintainers may decide to use 80 % of the available resources just to pre-provision the single most common flavor in the form of spot requests, reserving the remaining 20 % for less common requests.

MVP

given cache(s) keep at least N guests of a configured HW configuration
trivial policy: if $current < $N and $current + $N <= $cache.max-size: acquire()
user CLI: not applicable
admin CLI: enable/disable (via knob)
knobs: N control, enabled/disabled
metrics: provision/don't provision counter, request count
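
The trivial policy above can be sketched in Python as follows. This is only an illustration under stated assumptions: CacheState, should_acquire, tick and the decisions counter are hypothetical names, not existing Artemis code, and the condition is transcribed literally from the MVP line rather than refined.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CacheState:
    """Hypothetical snapshot of one cache; not an Artemis type."""

    current: int   # guests currently available in the cache ($current)
    max_size: int  # upper bound on the cache size ($cache.max-size)


# Stand-in for the "provision/don't provision" counter metric.
decisions: Counter = Counter()


def should_acquire(cache: CacheState, n: int, enabled: bool = True) -> bool:
    """Trivial MVP policy, transcribed from the condition above."""
    if not enabled:  # the enabled/disabled knob
        return False
    return cache.current < n and cache.current + n <= cache.max_size


def tick(cache: CacheState, n: int, enabled: bool = True) -> bool:
    """One periodic check: record the decision and return it."""
    decision = should_acquire(cache, n, enabled)
    decisions["provision" if decision else "dont-provision"] += 1
    return decision
```

A periodic task would call tick() for each managed cache and dispatch an acquire when it returns True; N and the enabled flag would come from knobs.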

Notes

should be transparent for clients, there should be no need for them to specify they are interested in pre-provisioned guests.
should play well with caches, too - clients using a guest cache might be interested in always having a few guests ready for use. For example, a minimum number of free guests in a cache, limited by the total number of guests in the cache.
could be a standalone component, or even a service, using Artemis metrics and APIs for input and control. However, it may be more natural to implement it as an Artemis component, similar to "worker" or "scheduler", rather than as a standalone service (who else would use it, and with what service other than Artemis?), and have it dispatch actual tasks as needed instead of going through the API.
should be task-based, to make use of the already existing mechanism of a workflow split into distinct and atomic steps, with workers ready to run new tasks.
there are two distinct phases: investigation and provisioning. Provisioning is probably straightforward, but investigation and decision-making would benefit from the concept of plugins or policies, similar to what we already have for routing. Instead of one hardcoded pre-provisioning policy, we could have several fully configurable policies and swap them as needed, each with a different approach to when to prepare machines and which ones.
we can discuss the pros and cons of multiple policies being applied at once, similar to routing policies that reduce the initial set of pools, or whether we want one policy in action at any time. Given the cache relation explained below, it seems natural to have these policies attached to caches (or pre-provisioning shelves).
the decision phase has two questions to answer: what flavors to provision, and how many of them. I can imagine pre-provisioning focusing on one flavor only, as well as handling several flavors, based on analysis of current traffic and/or configuration. A policy focusing on one preconfigured flavor would be a good starting point; with the pluggability mentioned above, it would be possible to introduce smarter and wider policies as needed.
policies might need to keep state somewhere - they will have access to all Artemis metrics, but they might need to analyze past behavior. We don't have experience with this yet; it needs to be decided whether to use the DB, Redis, or a combination of both.
should provide metrics and logs for good observability: various watermarks, decision counters, etc.
needs proper API and CLI support
knobs would be needed, global ones (enabled/disabled, policy enablement, etc.) as well as those needed to control and tweak specific policies
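
The pluggable decision phase described in the notes above could look roughly like the following sketch. All names here (ShelfSnapshot, PreprovisioningPolicy, SingleFlavorPolicy, decide) are assumptions made for illustration and do not correspond to existing Artemis interfaces. The example policy handles a single preconfigured flavor and answers both questions, which flavor and how many, by returning a mapping from flavor to the number of guests to acquire. Capping the result by the free room on the shelf is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


@dataclass(frozen=True)
class ShelfSnapshot:
    """Hypothetical view of a cache (shelf) handed to a policy."""

    free_by_flavor: Dict[str, int]  # free guests per flavor
    total: int                      # all guests on the shelf, free or in use
    max_size: int                   # upper bound on the shelf size


class PreprovisioningPolicy(Protocol):
    """Assumed plugin interface: which flavors to prepare, and how many."""

    def decide(self, shelf: ShelfSnapshot) -> Dict[str, int]:
        ...


@dataclass
class SingleFlavorPolicy:
    """Keep at least `minimum_free` free guests of one preconfigured flavor."""

    flavor: str
    minimum_free: int

    def decide(self, shelf: ShelfSnapshot) -> Dict[str, int]:
        free = shelf.free_by_flavor.get(self.flavor, 0)
        missing = max(0, self.minimum_free - free)
        room = max(0, shelf.max_size - shelf.total)
        count = min(missing, room)  # never grow past the shelf limit
        return {self.flavor: count} if count else {}
```

Smarter policies, for example ones analyzing recent traffic, would implement the same decide() method, so the provisioning phase would not need to know which policy produced the decision.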

The very first idea is to build this on top of the guest cache, given the overlap in the provisioning area: the client submits a request with a cache name specified, Artemis compares the request with the cache content, and if there are free guests, one of them is "resurrected" and fills the blank fields in the request. This is very similar to pre-provisioning, except for the cache name bit - in general, the user shouldn't need to specify a cache but should still benefit from pre-provisioning.

This suggests we might need some (global? per-user?) "invisible" cache that the pre-provisioning service would fill when enabled: a cache without a name, always used unless the user asks for a specific cache. That in turn suggests that the provisioning workflow as implemented today (route-guest-request -> acquire-guest-request -> update-guest-request -> prepare...) would be changed to always go through a cache, so that a cache is always inspected first. With pre-provisioning disabled, this default cache would have no free machines ready, and normal provisioning would follow; with pre-provisioning enabled, there would be guests to grab. The same workflow would work with a specific non-default cache. Adding this "invisible" default cache would therefore help us a lot with the implementation, reducing the number of exceptions in the workflow, while also giving us a shelf for pre-provisioned guests when clients use no cache (like they do these days).
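
A minimal sketch of this cache-first flow, assuming a hypothetical in-memory Shelves structure and a handle_request entry point (neither exists in Artemis under these names; real code would use the DB and tasks). The only task name taken from the text is route-guest-request, used here as the fallback into normal provisioning.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

DEFAULT_CACHE = ""  # hypothetical key for the nameless, "invisible" cache


class Shelves:
    """Hypothetical in-memory stand-in for guest caches, keyed by cache name."""

    def __init__(self) -> None:
        self._free: Dict[str, Dict[str, Deque[str]]] = defaultdict(
            lambda: defaultdict(deque)
        )

    def put(self, guest_id: str, flavor: str, cache: str = DEFAULT_CACHE) -> None:
        """Shelve a ready guest, e.g. one prepared by pre-provisioning."""
        self._free[cache][flavor].append(guest_id)

    def take(self, flavor: str, cache: str = DEFAULT_CACHE) -> Optional[str]:
        """Return a free guest of the given flavor, or None if there is none."""
        shelf = self._free[cache][flavor]
        return shelf.popleft() if shelf else None


def handle_request(shelves: Shelves, flavor: str, cache: Optional[str] = None) -> str:
    """Always inspect a cache first; fall back to normal provisioning."""
    name = cache if cache is not None else DEFAULT_CACHE
    guest_id = shelves.take(flavor, name)
    if guest_id is not None:
        return f"resurrected:{guest_id}"
    return "route-guest-request"  # existing workflow takes over
```

With pre-provisioning disabled nothing ever calls put() on the default cache, so every request without a cache name falls through to route-guest-request, which matches today's behavior.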

Caching would also take care of keep-alive accounting, and it would provide usage metrics to build upon.

Edited Jan 11, 2022 by Miloš Prchlík