Rules-based orchestration

Chris Martin requested to merge cmart/exosphere:rules-orchestration into master

This is a partial overhaul of Exosphere's asynchronous orchestration for cloud resources: enough to introduce new techniques and use them for some (but not all) of our orchestration work, with the intention of seeing how they work in practice and evolving the strategy.

Highlights

  • A UUID unique to each Exosphere client (introduced in !299 (merged)) is now set as metadata on OpenStack resources (initially just servers) created by the client; this is used to determine which OpenStack resources "provisioning orchestration" should be performed on.
  • Helpers.RemoteDataPlusPlus (I welcome a better name) is a data structure inspired by RemoteData.
    • It adds the following:
      • Timestamps of when data was requested, and when data/errors were received from the API
      • Simultaneous representation of "we have this data and we are also loading new data"
    • Better handling of errors: storing in our model when fetched data resulted in an error so that orchestration code can act on that error (e.g. "try again 10 seconds after error was received").
  • Current time is now stored in the model and used to populate RemoteDataPlusPlus. Rather than "poll this resource every X seconds", I implemented a new polling strategy which looks at whether results are loading, and doesn't ask for them again while waiting for a response. This will help us avoid hammering a heavily loaded API.
  • New orchestration engine inspired by this paper ("Experience with Rules-Based Programming for Distributed, Concurrent, Fault-Tolerant Code"). With nearly every trip through the Elm runtime, it looks for work to do (e.g. provisioning steps on a new server) and issues Cmds.
    • Server provisioning previously only happened if the UI was showing a view of that server. With the new orchestration module, now it happens regardless of which view the user is on.
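
As a rough sketch of the shape described above (names and details are illustrative, not necessarily the actual definition in Helpers.RemoteDataPlusPlus):

```elm
module Helpers.RemoteDataPlusPlus exposing (..)

import Time


{-| Whether we have ever received the data, and when. -}
type Haveness data
    = DontHave
    | DoHave data Time.Posix


{-| Whether a request is in flight (and since when); if not,
the most recent error (if any) and when it was received.
-}
type RefreshStatus error
    = NotLoading (Maybe ( error, Time.Posix ))
    | Loading Time.Posix


{-| Unlike RemoteData, this can represent "we have data and we are
also loading new data" at the same time, and it keeps the error and
timing information that orchestration code needs for decisions like
"try again 10 seconds after the error was received".
-}
type alias RemoteDataPlusPlus error data =
    { data : Haveness data
    , refreshStatus : RefreshStatus error
    }
```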

Working notes (kinda messy)

Design points

  • It should be the job of REST response handling code to update the model (or sub-model) with new information from API... not specify which requests need to be done next
  • It should be the job of orchestration code to specify which requests need to be done next
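
In Elm terms, the split might look like this (signatures only; the names are illustrative, not actual functions in the codebase):

```elm
-- Response handling: fold new API information into the model.
-- It does not decide what to request next.
receiveServers : Project -> List Server -> Model -> Model

-- Orchestration: look at the state of the model (including
-- in-flight request state) and decide which requests to issue next.
orchestrate : Time.Posix -> Model -> ( Model, Cmd Msg )
```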

Working todo

  • Each Exosphere client instance generates a strong random value (perhaps UUID) on first launch and encodes it in local storage (implemented in !299 (merged))
  • When a server is created, Exosphere sets metadata "exoClientUuid"; this is how we know which servers we should and shouldn't do provisioning things to
  • Don't allow the user to run most server actions (except perhaps lock/unlock and delete?) until provisioning is complete? Out of scope.
  • Use RemoteDataPlusPlus for list of networks, ports
  • Consider how to handle API timeouts/retries
    • Maybe just use timeout in Http module
    • Maybe use RemoteDataPlusPlus to track the time when data was requested (is loading), so that we know when it's been too long?
      • Maybe feed the current time into the update function
  • Do we need to store the time that we last received information about a server etc? TimedWebData?
  • Rename Task to Job
  • Consider refreshing model.currentTime on demand (on each trip through runtime) rather than every second (follow-up scope)
  • Consider switching order of arguments (data and error types) to RemoteDataPlusPlus?
  • Use orchestration engine to poll servers
  • Run orchestration every so many seconds (perhaps 2 or 5? not sure)
  • Consider an exoProp Maybe updatedTime for each server, so we can have one server polled more frequently than others
  • When we receive a server, handle the error case of 404 (server was deleted): remove it from the model and don't show an error to the user.
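
The "exoClientUuid" item above might look roughly like this when building the Nova create-server request body (hypothetical sketch; the actual field names and encoding in the codebase may differ):

```elm
import Json.Encode as Encode


{-| Metadata set on a newly created server, so that later we can tell
whether this client instance created it (and therefore whether we
should perform provisioning orchestration on it).
-}
serverMetadata : String -> Encode.Value
serverMetadata exoClientUuid =
    Encode.object
        [ ( "exoClientUuid", Encode.string exoClientUuid )
        , ( "exoProvisionComplete", Encode.string "false" )
        ]
```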

Orchestration implementation notes

Example Goal: "Create server and associate floating IP address"

  • Meta-condition: Our New Server

    • Server was created by this Exosphere client instance (UUID matches)
    • Nova metadata property "exoProvisionComplete" is "false"
    • Server was created recently? (say, last couple hours?)
  • Task: Assign floating IP address to new server

    • Rule: Poll status of new server
      • Conditions:
        • Meta-condition "Our New Server"
        • Server status is not yet Active
        • Last poll was >5 seconds ago? Not sure
      • Action:
        • Poll OpenStack for server status
    • Rule: Ask for Neutron networks
      • Conditions:
        • Meta-condition "Our New Server"
        • We don't have a list of Neutron networks (WebData "NotAsked" or "Failure")
      • Action:
        • Ask API for Neutron networks
        • Set Neutron networks in model to "Loading" status
    • Rule: Ask for Neutron ports
      • Conditions:
        • Meta-condition "Our New Server"
        • We don't know about a Neutron port belonging to the server, OR we don't have any Neutron ports at all (WebData "NotAsked" or "Failure")
      • Action:
        • Ask API for Neutron ports
    • Rule: Create a floating IP address
      • Conditions:
        • Meta-condition "Our New Server"
        • Doesn't have a floating IP address
        • We have a list of Neutron networks
        • We know about a Neutron port belonging to the server which doesn't have a floating IP associated
      • Action:
        • Create a floating IP address for the server

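One way to express the last rule above in code (a sketch; all helper names are hypothetical):

```elm
{-| Rule: create a floating IP address for the server, but only when
all of the rule's conditions hold. Returns Cmd.none when any
condition is unmet, so the evaluator can call it on every pass.
-}
createFloatingIpIfNeeded : Time.Posix -> Project -> Server -> Cmd Msg
createFloatingIpIfNeeded currentTime project server =
    if
        -- Meta-condition "Our New Server"
        isOurNewServer currentTime server
            -- Doesn't have a floating IP address
            && not (serverHasFloatingIp server)
            -- We have a list of Neutron networks
            && networksLoaded project
            -- We know about a port without an associated floating IP
            && (portWithoutFloatingIp project server /= Nothing)
    then
        requestCreateFloatingIp project server

    else
        Cmd.none
```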
The rule evaluator should run with every pass of the Elm runtime, perhaps only as long as there are unmet goals (how do we know this?). It should also run every few seconds, or something?

Questions for self to answer

  • How to avoid excessive/duplicate RPC (API calls)?
    • Keep track of outstanding API calls? In the model?
    • We can use "Loading" for list of Neutron networks, but how to keep track of when we already have Neutron ports but have asked again because we don't have the port we want?
      • Answer: RemoteDataPlusPlus
    • Is there a way we can run the rules evaluator with every trip through the runtime without polling server status each time?
      • Answer: Yes
    • Ooh, maybe a separate "port" WebData property for the server props.
  • What's the difference between asynchronous RPC and "event-based" or "message-based" approaches?
    • Answer
      • OpenStack API is RPC: requests result in responses
      • Elm runtime is event-driven: Cmds are sent, and Msgs are received
      • Orchestration engine doesn't respond directly to events. Instead it just looks at the state of the app (including the state of in-flight RPC) and does (mostly asynchronous) things to move OpenStack resources toward the desired end state
  • How to know when provisioning is complete? Do we need a boolean "exoProvisionComplete" server property (probably not), or should we deduce this by presence of (e.g. a floating IP and server password tag)?
    • Answer for now: we're currently using Helpers.serverNeedsFrequentPoll; it's not optimal.
  • Problem: How to track loading status of all servers when we request just one server.
    • Should we only request all servers to make the app simpler (albeit a heavier API user)?
    • Should we use an RDPP (RemoteDataPlusPlus) for each server?
    • Answer: RDPP-ish exoProps for each server, indicating:
      • The timestamp of when information about that particular server was received (which may be more recent than information about other servers)
      • Whether information about only that particular server is currently loading
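
Given an RDPP-style value that records in-flight state and error timestamps, the "avoid duplicate RPC" check might look like this (a sketch, assuming a record with a `refreshStatus` field shaped as described earlier; the actual helper may differ):

```elm
import Time


shouldRequest : Time.Posix -> RemoteDataPlusPlus error data -> Bool
shouldRequest currentTime rdpp =
    case rdpp.refreshStatus of
        -- A request is already in flight; don't pile on.
        Loading _ ->
            False

        -- Nothing in flight and no recent error; go ahead.
        NotLoading Nothing ->
            True

        -- Wait 10 seconds after receiving an error before retrying.
        NotLoading (Just ( _, errorReceivedTime )) ->
            Time.posixToMillis currentTime
                - Time.posixToMillis errorReceivedTime
                > 10000
```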

Follow-up work (TODO create issue(s) when this MR is merged)

  • #338 Completely replace context-dependent polling code (in State.processTick) with orchestration code (e.g. using it to poll volumes, get server passwords, console URLs)
  • #339 Handle HTTP errors in remaining API calls as this MR does for (e.g.) [request/receive]Networks, i.e. stop using Rest.Helpers.resultToMsg
  • #339 If it's working out well, consider replacing other uses of RemoteData with RemoteDataPlusPlus
  • #340 If possible, refresh model.currentTime on demand (on each trip through runtime) rather than every second
  • #341 "It might be handy (if only for debugging) to have some (hidden?) indicator that a server was created in the current client (not just by the same user)."

How to Test

Open at least two instances of the Exosphere client and log into the same OpenStack project as the same user in each. Create a couple of servers. There should not be any errors during server creation (e.g. no duplicate API calls to create/associate a floating IP address for the same server).

Everything else should work at least as well as it did before.

Screenshots

No visual changes

Edited by Chris Martin