[MCCS-876] LRC refactor

Drew Devereux requested to merge mccs-876-lrc-refactor into main

This MR implements SP-2077 in accordance with the proposal in https://confluence.skatelescope.org/display/SE/MCCS-852%3A+Proposal+for+SP-2077.

The main elements are as follows:

Decoupling of Tango layer from concurrency mechanism

In the present implementation, the Tango layer is tightly coupled to a specific mechanism for providing concurrency. This is problematic because different devices will provide concurrency in different ways. For example:

  • a high-level Tango device that only interacts with other Tango devices (such as a TMC device or a subsystem controller device) can achieve concurrency simply by using event subscriptions and the asynchronous flavour of Tango command invocation. In that case, concurrency is provided by the Tango subsystem, and there is no need for a separate concurrency mechanism.

  • a low-level Tango device that interacts directly with hardware may need to implement a polling loop in order to monitor that hardware. Such a polling loop already provides a concurrency mechanism: to perform an action on the hardware asynchronously, one simply stashes a directive in shared memory for the polling loop to pick up on its next iteration. Again, there is no need for a separate concurrency mechanism.

  • even where a device does need to implement a concurrency mechanism, there should be flexibility in how this is achieved. If a team wanted to explore an asyncio approach, for example, we would not want to stifle that by forcing an incompatible concurrency mechanism on them.

For all these reasons, we want an implementation in which the Tango layer is decoupled from the specific concurrency mechanism, allowing it to work with a range of such mechanisms. This has been achieved by moving asynchronous task submission into the component manager (the concurrency mechanism was already implemented in the component manager, but until now tasks were submitted from the Tango device).

The way in which a Tango device calls a method on the component manager now depends on whether the Tango device expects it to be fast-running or long-running:

  • If fast-running, it calls

    component_manager.a_method(*args, **kwargs)

    and the task is assumed to be complete when a_method returns.

  • If long-running, it calls

    component_manager.a_method(*args, **kwargs, task_callback=a_task_callback)

    and does not assume the task to be complete when a_method returns. Rather, the component manager is expected to implement the method in such a way that the status, progress and eventual result of the task are communicated through calls to a_task_callback. Thus the component manager is not bound to use any particular concurrency mechanism; it can implement concurrency any way it wants, so long as that callback gets called.

Note that this means that the ska-tango-base implementation decides whether a command should be treated as long-running or not. In ska-tango-base, we implement the On() command so that it calls component_manager.on(task_callback=a_task_callback). Thus, every component manager must implement the on method so that it accepts a callback, and uses the callback to signal task status, progress and result. Not only are component managers free to use any concurrency mechanism they want, they could even implement the method synchronously; they just have to signal completion through the callback rather than by returning. Thus we are not forcing teams to implement this method asynchronously, but we are strongly signalling that it ought to be implemented that way. This aligns well with @sonja.vrcic's position that the use of long-running commands should be the status quo, not to be deviated from without a rationale.
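
To make the contract concrete, here is a minimal sketch of a component manager method implemented this way. It is illustrative only: the status enum, the callback keyword arguments, and the use of a plain thread are assumptions for the example, not the actual ska-tango-base code.

    import threading
    from enum import Enum
    from typing import Callable, Optional

    class IllustrativeTaskStatus(Enum):
        """Illustrative status values (names assumed for this sketch)."""
        IN_PROGRESS = 1
        COMPLETED = 2

    class IllustrativeComponentManager:
        def on(self, task_callback: Optional[Callable] = None) -> None:
            """Turn the component on, reporting the task lifecycle through the callback."""
            def _turn_on() -> None:
                if task_callback is not None:
                    task_callback(status=IllustrativeTaskStatus.IN_PROGRESS)
                # ... interact with the component here ...
                if task_callback is not None:
                    task_callback(status=IllustrativeTaskStatus.COMPLETED, result="On completed")

            # Any concurrency mechanism will do; a plain thread is used purely
            # for illustration. A synchronous implementation that calls the
            # callback before returning would also satisfy the contract.
            threading.Thread(target=_turn_on).start()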

Command tracker

There are a couple of other aspects that are specific to our Tango interface, and should be left behind in the Tango layer.

  1. We have a Tango attribute named longRunningCommandsInQueue, but the word "queue" is not, strictly speaking, accurate: the Tango device does not know that the concurrency mechanism is queue-based. What the Tango device really means by this is "the list of commands that I have submitted but haven't yet been told are in progress." Clearly the Tango device needs to keep track of this information, but it isn't reasonable to expect the concurrency mechanism to do so (remember that there may be many different concurrency mechanisms). Instead, the Tango layer now has a CommandTracker object that uses the task callback calls to maintain a record of the status and progress of each task. Attributes like longRunningCommandsInQueue are thus simply fast queries to the command tracker.

  2. Command IDs are how the Tango interface allows clients to monitor status and progress. They have nothing to do with the concurrency mechanism, so we should leave command IDs behind in the Tango layer. To achieve this, the generation of new IDs is performed in the Tango layer by the command tracker, and the task callback that is provided to a method is a partial that knows the ID of the command:

    component_manager.on(
        task_callback=functools.partial(update_command_info, command_id),
    )

    Thus the component manager and its concurrency mechanism are oblivious to command IDs; they simply call the provided callback, which associates the payload with the correct ID.

Update: Further explanation / rationale for the command tracker, in response to questions:

The job of the concurrency mechanism is to accept tasks and to run them asynchronously. It provides visibility of what it is doing by calling a callback whenever the status or progress of a command changes. This is all it does. It does not try to maintain a summary / model of the overall state of all commands, for two reasons:

  1. That is extraneous to its core business: it can do its job without doing this.

  2. We want to support a range of different concurrency mechanisms, without imposing a single model on them.

The Tango layer, however, can be queried on the overall state of all commands, so it needs a model for that. The command tracker maintains that model.

(By analogy with a game of chess: the concurrency mechanism does not try to keep track of the state of the board. All it does is report when a move has been made: move_callback(white="Nf3"). But the Tango layer does need to know the state of the board, because it publishes a boardState attribute. Therefore it has a BoardTracker object whose job is to maintain a model of the board, which it updates each time it receives a move. When a client asks the Tango device for the state of the board, that request is serviced by the BoardTracker.)

The command IDs are part of the model that is maintained by the CommandTracker. The concurrency mechanism doesn't need them or know them.
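
As an illustration of the idea (not the actual CommandTracker code; names and signatures here are assumptions), a minimal tracker might look like this:

    import functools
    import threading
    import uuid

    class IllustrativeCommandTracker:
        """Maintains a model of command status, keyed by command ID."""

        def __init__(self) -> None:
            self._lock = threading.Lock()
            self._statuses: dict = {}

        def new_command(self, command_name: str) -> str:
            """Generate a new command ID and record the command as submitted."""
            command_id = f"{uuid.uuid4()}_{command_name}"
            with self._lock:
                self._statuses[command_id] = "SUBMITTED"
            return command_id

        def update_command_info(self, command_id: str, status=None, result=None) -> None:
            """Task callback target: record updates against the command ID."""
            with self._lock:
                if status is not None:
                    self._statuses[command_id] = status

        def commands_in_queue(self) -> list:
            """Fast query backing an attribute like longRunningCommandsInQueue."""
            with self._lock:
                return [cid for cid, s in self._statuses.items() if s == "SUBMITTED"]

    # Wiring: the component manager only ever sees the bound callback.
    tracker = IllustrativeCommandTracker()
    command_id = tracker.new_command("On")
    task_callback = functools.partial(tracker.update_command_info, command_id)
    # component_manager.on(task_callback=task_callback)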

Command results are a black box

What a command returns is now irrelevant to the concurrency mechanism: it does not assume that all commands return a (ResultCode, string) tuple. Indeed, it knows nothing about ResultCodes, and it has no notion of whether a command has succeeded or failed. The result of a command is simply a black box payload that gets sent through the task callback.

Instead, the concurrency mechanism uses TaskState to indicate whether a task is REJECTED, QUEUED, IN_PROGRESS, COMPLETED, ABORTED, etc.

TODO: The Tango task status attribute is still returning ResultCode values. We should consider using TaskState to report command status, and restrict ResultCode to reporting actual outcomes. This would imply removing values like QUEUED from the ResultCode enum.

Concurrency mechanism

As mentioned above, we want to support a range of concurrency mechanisms. Teams should be free to implement a concurrency mechanism that best suits their situation. However, we still provide an implementation of a concurrency mechanism, for teams to use if they wish.

Use ThreadPoolExecutor

The concurrency mechanism provided is essentially the message-passing mechanism that was already present. However, I very belatedly realised that both the original MCCS implementation and the current ska-tango-base implementation were reinventing the wheel: this functionality is already available in the Python standard library as concurrent.futures.ThreadPoolExecutor. The task executor has thus been reimplemented on top of that, and provides pretty much the same functionality in only 100 lines of code.
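
A stripped-down sketch of an executor along these lines (this is not the actual executor.py; the status strings and callback keywords are assumptions for illustration):

    from concurrent.futures import ThreadPoolExecutor
    from typing import Callable, Optional

    class IllustrativeTaskExecutor:
        """Runs submitted tasks on a thread pool, reporting via callbacks."""

        def __init__(self, max_workers: int = 1) -> None:
            self._executor = ThreadPoolExecutor(max_workers=max_workers)

        def submit(self, task: Callable, task_callback: Optional[Callable] = None) -> None:
            """Queue a task; its lifecycle is reported through the callback."""
            if task_callback is not None:
                task_callback(status="QUEUED")

            def _run() -> None:
                if task_callback is not None:
                    task_callback(status="IN_PROGRESS")
                result = task()
                if task_callback is not None:
                    # The result is an opaque payload; the executor never inspects it.
                    task_callback(status="COMPLETED", result=result)

            self._executor.submit(_run)

        def shutdown(self) -> None:
            """Stop accepting new tasks and wait for running tasks to finish."""
            self._executor.shutdown(wait=True)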

Untracked tasks

A component manager may need to execute a task asynchronously even if that task did not originate in the Tango layer. For example, in order to monitor antenna health, MccsAntenna will occasionally need to obtain and analyse a spectrum signature. This definitely needs to be done asynchronously, but it will not be triggered by a Tango command. In such a case, it would be convenient to use the concurrency mechanism that we already have in place; yet this is a "private" task, and we would not want it to be reported by Tango attributes such as longRunningCommandsInQueue. This is possible in this design, because the component manager can submit a task to the task executor with an internal callback or no callback at all. The Tango layer thus never hears about this task, and the command tracker correctly continues to record only those tasks that were submitted from the Tango layer.
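
For example, a sketch of how such a private task might be submitted (illustrative; the method names here are assumed, and a plain ThreadPoolExecutor stands in for the task executor):

    from concurrent.futures import ThreadPoolExecutor

    class IllustrativeAntennaComponentManager:
        def __init__(self) -> None:
            self._executor = ThreadPoolExecutor(max_workers=1)

        def start_spectrum_analysis(self) -> None:
            # No task_callback is provided, so the Tango layer (and hence the
            # command tracker and attributes like longRunningCommandsInQueue)
            # never hears about this task.
            self._executor.submit(self._analyse_spectrum)

        def _analyse_spectrum(self) -> None:
            ...  # obtain and analyse a spectrum signature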

Note: At present, the abort_tasks method aborts everything, without distinguishing between tasks that originated from commands and tasks that didn't. We should think about whether this is right.

General loosening of device component coupling

The approach described above, of decoupling device components by passing a component a callback to call instead of an object to act upon, has been adopted in two other places in the code:

Component manager interface

We used to pass the state models into the component manager, and the component manager was responsible for acting upon them when the component state changed. This meant that the component manager had to know about state models and how to use them. Especially in light of recent discussions about whether the state models should be removed, we do not want the state models to be tightly coupled to any other component.

So now, the component manager simply calls callbacks to indicate changes of state. There are two callbacks: communication_state_changed_callback (for when the state of communications between the component manager and its component changes, e.g. the connection to the component drops out) and component_state_changed_callback (for when the component manager detects a change of component state). The latter callback is used for all state changes: the component manager doesn't know about state models, so it doesn't know the difference between the operation state model and the observation state model. It reports all state changes through that one callback. It is the responsibility of the device to implement that callback by acting on the right state model object in the right way.

Update: I have added an extensive explanation and rationale below, in response to concerns/questions.

The component manager monitors component state, but does not distinguish between the different special kinds of state (operation state, observation state, health state, etc.). That distinction is confined to the Tango device.

Examples:

  • The component manager announces "My component's power is not on". The Tango device says "Oh, that affects my operation state so I will feed it into my op_state_model so that my State attribute gets updated."

  • The component manager announces "I am now resourced". The Tango device says "Oh, that affects my observation state, so I will feed it into my obs_state_model so that my obsState attribute gets updated."

  • The component manager announces "My component has faulted". The Tango device says "Oh, that affects my operation state so I will feed it into my op_state_model so that my State attribute gets updated. Also it affects my health, so I will feed it into my health_model so that my healthState attribute gets updated."

  • The component manager announces "My component's backplane temperature is 36.5 degrees". The Tango device says "Oh, that affects my backplaneTemperature attribute, so I will update the attribute value for that, and push a change event.

In implementation terms, this means that the component manager is given only one callback to use for any changes to component state. But it has lots of flexibility in how it calls it:

  • component_state_callback(power=PowerState.ON)

  • component_state_callback(fault=True)

  • component_state_callback(obsfault=False, resourced=False, configured=False) # result of a Restart() command

  • component_state_callback(backplane_temperature=36.2, is_tpm3_on=True) # specific to a subrack device

The component manager simply reports changes, without knowing or caring what kinds of state are affected by those changes. As stated above, it is the Tango device that has to decide how to map these changes to the state attributes that it reports.
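
A sketch of the device-side dispatch (illustrative only; the state-model action names and attributes below are assumptions, not the real SKABaseDevice code):

    from enum import Enum

    class PowerState(Enum):
        OFF = 0
        ON = 1

    def component_state_changed(device, **state_change) -> None:
        """Route each reported change to the right state model or attribute.

        The device is assumed to expose op_state_model, health_model and
        push_change_event, as a Tango device in this design would.
        """
        if "power" in state_change:
            # Power affects operational state (the State attribute).
            action = "component_on" if state_change["power"] == PowerState.ON else "component_off"
            device.op_state_model.perform_action(action)
        if state_change.get("fault"):
            # A fault affects both operational state and health.
            device.op_state_model.perform_action("component_fault")
            device.health_model.component_fault(True)
        if "backplane_temperature" in state_change:
            # Device-specific state: update the attribute and push a change event.
            device.push_change_event("backplaneTemperature", state_change["backplane_temperature"])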

There are two advantages to doing it this way:

  1. If we decide to change the way that Tango devices report their state, we only have to update the Tango device itself; we don't have to update our entire codebase. For example (this is a bad idea but a good example): suppose we decided that resourcing didn't really fit with other aspects of observation state, and split it out into its own attribute. So we now have a resourcingState attribute for whether the device is resourced, and an obsState attribute for configuring / scanning / obsfault / abort, etc. Before, this change would have broken our entire codebase. Now, it simply requires an update to what the Tango device does when it receives a component state callback with resourced=True or resourced=False.

  2. This callback can be called with any keyword arguments at all -- we should think of it as a channel between the component manager and the Tango device, that we can send anything down. For example, MccsSubrack doesn't need an extra callback for reporting subrack-specific stuff like backplane temperature -- for a subrack, a change of backplane temperature is a change of state, so an update gets sent down that component_state_callback channel.

All of the above applies also to the other callback: communication_state_callback. The component manager simply announces "I have lost communication with my component", without knowing or caring how this affects device state. It is up to the Tango device layer to decide what to do with that information. Currently, the Tango device responds to this by feeding it into the op_state_model, which ends up pushing the State to UNKNOWN. If we also wanted this to affect the healthState attribute, then that would simply require feeding it into the health_model as well.
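
Sketched in the same illustrative style (the action and value names here are assumptions):

    def communication_state_changed(device, communication_state) -> None:
        """The component manager reports the fact; the device decides what it means."""
        if communication_state == "NOT_ESTABLISHED":
            # Feed the op state model; State ends up as UNKNOWN.
            device.op_state_model.perform_action("component_unknown")
            # If healthState should also react, feed the health model too:
            # device.health_model.communication_lost()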

Separation of commands from state models

We also used to pass the state models into the command classes, and the command classes would drive them directly. Again, this meant that the command classes had to know about state models and how to use them. But most of these command classes don't need to drive the state model at all. It is only the commands with transitional state, such as AssignResourcesCommand, that need to drive the state model, and even these only need to report on two events: the command starting, and the command finishing.

Again, we adopt the callback pattern: we have the option of passing a callback to the command initialiser. If a callback is provided, then it is called when the command starts and when it finishes. It is up to the device to implement that callback by acting on the right state model object in the right way. Thus the command knows nothing about state models.
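
For example (illustrative only: the action names and the single running flag in the callback signature are assumptions), the device might supply a callback along these lines when it constructs a command with transitional state:

    def assign_resources_command_callback(device, running: bool) -> None:
        """Called by the command when it starts (running=True) and when it finishes."""
        if running:
            device.obs_state_model.perform_action("assign_invoked")    # assumed action name
        else:
            device.obs_state_model.perform_action("assign_completed")  # assumed action name

    # The command just calls the callback; it never sees the state model:
    # AssignResourcesCommand(..., callback=functools.partial(assign_resources_command_callback, device))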

Update of command classes

The removal of state models from the command classes led to a substantial refactor:

  • We don't need a StateModelCommand any more, because we don't pass the state model into our command classes
  • We don't need an ObservationCommand any more, because we don't pass the state model into our command classes, and furthermore they make no distinction between operation and observation state
  • We don't need a CompletionCommand any more, because completion is signalled by calling the callback
  • We don't really need a ResponseCommand, because all it ever did was force us to return a (ResultCode, string) tuple, and the Tango layer does this anyway, so the command classes might as well treat the command result as a black-box payload.

Instead we have two basic command classes, and two special cases. The two basic command classes are:

  • FastCommand: a command class for commands that are fast and synchronous. When the class is called, it signals that it has started, runs the do hook, and signals that it has finished.
  • SlowCommand: a command class for commands that are slow and therefore implemented asynchronously. This is passed an extra callback at initialisation, through which it is informed when the asynchronous work is complete. When the class is called, it signals that it has started, runs the do hook, but doesn't signal that it has finished until it receives a call on that callback.

The two special cases are:

  • DeviceInitCommand: For most commands, users have lots of freedom in how they initialise the command class. But the InitCommand is called by SKABaseDevice.init_device(), which always calls it with the same argument. DeviceInitCommand is just a FastCommand with the initialisation signature mandated.
  • SubmittedSlowCommand: SlowCommand is a generic command class for slow, asynchronous commands. It doesn't know anything about our device model, so implementing a command such as OnCommand as a direct subclass of SlowCommand requires a lot of boilerplate to get a command ID from the command tracker, create the partial callback, submit the command to the component manager, etc. SubmittedSlowCommand implements all that boilerplate (see the sketch below), so commands that subclass it are much simpler. Better still, more than half of the command classes do not need to be implemented at all!
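
To illustrate what that boilerplate looks like (a simplified sketch only; the constructor arguments and method names are assumptions, not the real SubmittedSlowCommand signature):

    import functools

    class IllustrativeSubmittedSlowCommand:
        """Simplified sketch of the boilerplate that SubmittedSlowCommand removes."""

        def __init__(self, command_name, command_tracker, component_manager, method_name):
            self._command_name = command_name
            self._command_tracker = command_tracker
            self._component_manager = component_manager
            self._method_name = method_name

        def __call__(self, *args, **kwargs):
            # 1. Get a new command ID from the command tracker.
            command_id = self._command_tracker.new_command(self._command_name)
            # 2. Build the partial callback that binds the ID to status updates.
            task_callback = functools.partial(
                self._command_tracker.update_command_info, command_id
            )
            # 3. Hand the work to the component manager; it decides how to run it.
            method = getattr(self._component_manager, self._method_name)
            method(*args, task_callback=task_callback, **kwargs)
            # 4. Return immediately; status and result arrive through the callback.
            return command_id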

Other stuff

Folder structure reorganisation

We have said all along that the reference component managers are for testing and teaching purposes. This was not very clear in the folder structure; indeed, I found it quite confusing. The reference component managers are now in ska_tango_base.testing.

Segfault workaround

I had a torrid time with segfaults on push_change_event. In the end I came to the conclusion that EnsureOmniThread does not work as advertised. I implemented a workaround as described in https://skao.slack.com/archives/CECSS44LX/p1641535095015200. This has completely eliminated my segfaults, but it ain't pretty. I'd love to find a better solution. One of the ugliest consequences is that the Tango test harness has to sleep for 15 ms, to let the device state settle, before it yields to the test.

Not done

The following tasks are not yet done:

  • The integration tests have been lost.

  • The documentation has not been updated.

  • I just noticed that I forgot to restore two subarray_component_manager attributes; I should do that.

Advice for reviewing

It's a really big MR, sorry. I think, however, that all of the design changes are in the following five files:

  1. base_device.py
  2. base_component_manager.py
  3. reference_base_component_manager.py
  4. executor.py
  5. commands.py

If you do a thorough review of those files, then you'll have reviewed and will understand all the key design changes.

You might want to follow that with

  1. subarray_device.py
  2. subarray_component_manager.py
  3. reference_subarray_component_manager.py

and you'll see that we're really just applying the same concepts to the subarray.

All other file changes are simply applying the same concepts across many commands / devices / component managers. So it's really only code style that needs attention there.
