Bi-directional communication between Exosphere and instances, part 2
This issue supersedes #373 (closed).
Problem/Opportunity Statement
Since #373 (closed) was created, Exosphere has grown to rely on the instance's console output (a.k.a. console log) as a uni-directional communication channel, from the instance to Exosphere, via the Nova API (console output endpoint). We're generally using JSON as the data serialization format. Instances use this channel to make the Exosphere client aware of:
- Progress of new instance setup
  - Example message: {exoSetup:"complete"}
- System resource usage (CPU, GPU, RAM, disk)
  - Example message: {"epoch": 1668711122, "cpuPctUsed": 2, "memPctUsed": 5, "rootfsPctUsed": 25, "gpuPctUsed": null}
- JupyterLab authentication token for the experimental workbench/workflow interaction
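To make the current mechanism concrete, here is a minimal sketch of how a client could pick JSON messages out of raw console output. It is not Exosphere's actual implementation: the function name `extract_messages` and the noise lines in `sample_log` are made up for illustration, and the sketch only accepts strictly valid JSON objects such as the resource usage example above.

```python
import json


def extract_messages(console_output: str) -> list:
    """Return every console log line that parses as a JSON object."""
    messages = []
    for line in console_output.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # syslog output and other noise
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or corrupted message
        if isinstance(parsed, dict):
            messages.append(parsed)
    return messages


# Hypothetical console output: only the middle line is a message for the client.
sample_log = "\n".join([
    "[   12.345678] cloud-init[901]: some unrelated boot output",
    '{"epoch": 1668711122, "cpuPctUsed": 2, "memPctUsed": 5, '
    '"rootfsPctUsed": 25, "gpuPctUsed": null}',
    "login: ",
])

messages = extract_messages(sample_log)
# The most recent resource usage message is the one with the largest epoch.
latest_usage = max(
    (m for m in messages if "epoch" in m),
    key=lambda m: m["epoch"],
    default=None,
)
```

Note that anything the client needs must still be present in the console output at the time it is fetched, which is where the limitations discussed further down come from.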
The biggest limitation is that it's only a one-way channel. Exosphere currently has no way to send data or commands to an existing, running instance. This capability would enable several new features, such as:
- Setting up new features/services on an existing running instance (rather than only at instance creation time).
  - e.g. a user creates an instance with no desktop environment and later decides they want one. We could offer an "Enable desktop" button in the UI.
- Making the instance aware of attached volume names, so the mount points make more sense (#831 (closed))
- Attempting un-mount of a volume before detaching it (#298 (closed))
- Handling mounting/unmounting of shared filesystems (#445 (closed))
- Challenge-response authentication to services running on instances (#656)
The opportunity is for a better, more forgiving UX, and more capabilities.
Further, the instance console log has the following limitations:
- Eventually, information in the console log goes away. This happens when an instance restarts, or when the log grows too long.
  - This is usually not a problem, but in some edge cases, it can cause Exosphere to miss an important message.
    - e.g. create an instance, immediately close Exosphere, and don't open it again until days later. Exosphere will consider the setup status to have timed out, because the {exoSetup:"complete"} message will have fallen off the end of the saved console log.
- The console log is a noisy channel: it also receives syslog output, which exacerbates the previous limitation.
- Cloud operators use the console log for troubleshooting broken instances. System resource usage messages every minute (like the example above) clutter the log for this purpose.
- Two processes can write to the console log at the same time, in which case their characters become interleaved. If this happens while a process on the instance is writing JSON for Exosphere to read, that JSON gets corrupted and Exosphere never gets the message.
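The last limitation can be illustrated with a small simulation. This is only a sketch: the `interleave` helper, the 8-character chunk size, and the syslog line are assumptions chosen for the demonstration, not a model of how a real console device schedules writes.

```python
import json


def interleave(a: str, b: str, chunk: int = 8) -> str:
    """Simulate two writers sharing one console by alternating small chunks."""
    pieces = []
    for i in range(0, max(len(a), len(b)), chunk):
        pieces.append(a[i:i + chunk])
        pieces.append(b[i:i + chunk])
    return "".join(pieces)


def try_parse(line: str):
    """Return the parsed JSON value, or None if the line is not valid JSON."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None


message = (
    '{"epoch": 1668711122, "cpuPctUsed": 2, "memPctUsed": 5, '
    '"rootfsPctUsed": 25, "gpuPctUsed": null}'
)
syslog_line = "systemd[1]: Started a hypothetical service."  # made-up noise

clean = try_parse(message)                               # parses normally
corrupted = try_parse(interleave(message, syslog_line))  # no longer valid JSON
```

Once the two writes are mixed, the reader cannot tell where the message starts or ends, so it is simply lost.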
What would success / a fix look like?
Imagine some sort of agent (daemon) running on the instance, which the Exosphere client can connect to using an HTTP API or websocket (via the cloud's existing UAP server). Over this connection, Exosphere and the instance can exchange data and commands in both directions. We can build new features on this communication channel, and perhaps migrate existing features to use it instead of the console log.
We should try to avoid a centrally-managed service for this that the cloud operator (or we) would need to maintain. We already require a UAP server for the instance interactions to work, so we can probably use it to connect to the agent without requiring anything new.
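As a starting point for discussion, here is a rough sketch of what such an agent could look like, using only the Python standard library. Everything specific in it is an assumption: the `/command` path, the `ping` and `echo` commands, and the request/response shape are hypothetical. It also has no authentication and does not involve the UAP server, both of which a real design would need to address.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical command table. Real handlers might mount a volume or
# enable a desktop environment; these two only demonstrate the round trip.
COMMANDS = {
    "ping": lambda args: {"pong": True},
    "echo": lambda args: {"echo": args},
}


class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/command":
            self._reply(404, {"error": "unknown path"})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            request = json.loads(self.rfile.read(length))
            handler = COMMANDS[request["command"]]
        except (ValueError, KeyError, TypeError):
            self._reply(400, {"error": "bad or unknown command"})
            return
        self._reply(200, {"result": handler(request.get("args"))})

    def _reply(self, status, body):
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def send_command(host, port, command, args=None):
    """Client side: POST one command to the agent and return its JSON reply."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        body = json.dumps({"command": command, "args": args}).encode()
        conn.request("POST", "/command", body=body,
                     headers={"Content-Type": "application/json"})
        return json.load(conn.getresponse())
    finally:
        conn.close()


# Run the agent on an ephemeral local port and exercise it once.
server = ThreadingHTTPServer(("127.0.0.1", 0), AgentHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

pong = send_command("127.0.0.1", port, "ping")
reply = send_command("127.0.0.1", port, "echo", {"volume": "my-data"})

server.shutdown()
server.server_close()
```

Unlike the console log, each request here gets an explicit reply, so the client knows whether the instance received and acted on a command. A websocket would additionally let the instance push messages without polling, at the cost of a non-stdlib dependency in this sketch.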