# [REFACT] Introduce new variables & state modular logic for probabilistic models
## What does the code in the MR do?
This MR introduces entirely new logic for handling model parameters, latent variables (realizations) and linked attributes (the old "attributes"). The aims of this change are:
- to make model specification more modular, so that new models can be defined quickly with very few changes (limited to 1-2 files at most, unlike previously);
- to make the computation of derived variables (the old "attributes") more efficient.
This MR is the first step towards a complete paradigm change that shall be implemented in follow-up MRs:
- it drafts the overall architecture, with main classes for variables that should be almost definitive;
- it temporarily breaks many models & some functionalities (marked as TODO/WIP/TMP).
## Where should the reviewer start?
- You may want to check first the abstract components (utils), which are quite independent of the rest:
  - `leaspy.utils.weighted_tensor`
  - `leaspy.utils.filtered_mapping_proxy`
  - `leaspy.utils.functional` (in particular the `NamedInputFunction` logic)
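To make the review easier, here is a minimal pure-Python sketch of the `NamedInputFunction` idea (the signature and the `model_mean` example below are illustrative, not Leaspy's actual API): a function is bundled with the names of the variables it consumes, so it can later be evaluated against any mapping of variable values.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Mapping, Tuple

@dataclass(frozen=True)
class NamedInputFunction:
    """Bundle a function with the names of the inputs it consumes.

    Illustrative sketch only: the real Leaspy class has a richer interface.
    """
    f: Callable
    parameters: Tuple[str, ...]

    def call(self, values: Mapping[str, float]):
        # Pull the named inputs from the mapping and apply the function.
        return self.f(*(values[p] for p in self.parameters))

# Hypothetical usage: a linked variable defined as a function of two others.
model_mean = NamedInputFunction(operator.add, ("g", "shift"))
print(model_mean.call({"g": 1.5, "shift": 0.5, "unused": 99.0}))  # 2.0
```

The benefit of this indirection is that the same symbolic definition can be evaluated against any state holding the named variables.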
- Then you can look at the whole `leaspy.variables` architecture:
  - Specification of variables --> the different classes implementing the `VariableInterface`: `Hyperparameter`, `ModelParameter`, `DataVariable`, `PopulationLatentVariable`, `IndividualLatentVariable`, `LinkedVariable`, plus a few small helper classes
  - `VariablesDAG`: a directed acyclic graph whose nodes are `VariableInterface` instances
  - `State`: a lazy container for the values of the variables described in a DAG
  - `StatelessDistributionFamily` (an interface representing stateless distribution families, i.e. no distribution parameters are stored in the instance) and `SymbolicDistribution` (the specification of a symbolic distribution, used for the priors of latent variables and for observation models)
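To fix intuitions, here is a toy sketch of the "lazy container over a variables DAG" idea (names and behavior are illustrative only; Leaspy's real `State`/`VariablesDAG` are much richer, with topological sorting, fine-grained invalidation of dependents, etc.): root variables hold assigned values, and linked variables are computed on first access from their parents and cached.

```python
from typing import Callable, Dict, Tuple

class LazyState:
    """Toy lazy container over a variables DAG (illustrative sketch).

    Root variables hold assigned values; linked variables are computed
    on first access from their parent variables, then cached.
    """
    def __init__(self, linked: Dict[str, Tuple[Callable, Tuple[str, ...]]]):
        self._linked = linked          # name -> (function, parent names)
        self._values: Dict[str, float] = {}

    def __setitem__(self, name: str, value: float) -> None:
        self._values[name] = value
        # Naive invalidation: drop all cached linked values on any assignment.
        for linked_name in self._linked:
            self._values.pop(linked_name, None)

    def __getitem__(self, name: str) -> float:
        if name not in self._values:
            f, parents = self._linked[name]  # KeyError if never assigned
            self._values[name] = f(*(self[p] for p in parents))
        return self._values[name]

# Hypothetical DAG: linked variable "model" depends on roots "g" and "v0".
state = LazyState({"model": (lambda g, v0: g * v0, ("g", "v0"))})
state["g"], state["v0"] = 2.0, 3.0
print(state["model"])  # 6.0 (computed lazily on first access, then cached)
```

The design point illustrated here is that consumers never compute derived variables by hand: they just read them from the state, and staleness is handled centrally.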
- Then you can look at the new `ObservationModel`, which is built on top of the `SymbolicDistribution` component and shall supplant the `NoiseModel` (note that models may contain multiple `ObservationModel`s at some point, e.g. for joint models, but the full implementation of this case is out of the scope of this MR)
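The "symbolic distribution" idea can be sketched in a few lines (class and variable names here are illustrative, not Leaspy's actual classes): the distribution stores only the *names* of its parameters, and actual values are looked up in a state at evaluation time.

```python
from dataclasses import dataclass
from statistics import NormalDist
from typing import Mapping

@dataclass(frozen=True)
class SymbolicNormal:
    """Sketch of a 'symbolic' distribution: it stores only the names of
    its parameters; actual values come from a state at call time.
    Illustrative of the idea, not Leaspy's actual API.
    """
    loc_var: str
    scale_var: str

    def instantiate(self, state: Mapping[str, float]) -> NormalDist:
        # Resolve the named parameters against the given state.
        return NormalDist(mu=state[self.loc_var], sigma=state[self.scale_var])

# Hypothetical observation model: y ~ Normal(model, noise_std),
# where "model" and "noise_std" are variables of the DAG/state.
obs = SymbolicNormal(loc_var="model", scale_var="noise_std")
dist = obs.instantiate({"model": 0.0, "noise_std": 1.0})
print(round(dist.pdf(0.0), 4))  # 0.3989 (standard normal density at 0)
```

Because the distribution is stateless, the same specification can serve both as a prior for a latent variable and as an observation model, parameterized by whatever variables the DAG provides.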
- Finally, you can look at the changes inside the core Leaspy modules that integrate this new architecture:
  - in models: `AbstractModel` > `AbstractMultivariateModel` > `MultivariateModel` > `UnivariateModel`
  - in the fit algorithm (the personalize algorithm is not done for now)
  - in the samplers (only very minor changes, unlike the previous items)

(The last commit is only CI-related, fixing the behavior of the CI pipeline when dealing with a git branch name containing a slash.)
## What are the main remaining points to implement?
Here is a recap of what was flagged as TODO/WIP/TMP in the code:
- Observation models:
  - Translate all legitimate noise models into the new observation model paradigm, and remove `leaspy.models.noise_models`
  - Design a keyword-based factory for observation models (for the most common use cases) like the one we had for noise models (cf. the comment in `AbstractModel`: `TODO? Factory of ObservationModel instead?`), plus a proper export routine for them (used when requesting `model.to_dict()`)
  - The simulation algorithm is left broken for now, and shall be adapted once the final observation model interface is settled
- A slightly cleaner `model.initialize`?
- Logistic parallel model
- Ordinal model extension (as a dedicated child class?):
  - in particular, masked samplers are currently not properly handled with the new `WeightedTensor` logic
  - in stateless distributions we only cope with the value being a `WeightedTensor`, not any of the distribution parameters, which could be needed for the `batched_deltas` case (but <!> think about compatibility checks between value & parameter weights if doing so!)
- Remove the whole `leaspy.io.realizations` sub-package when ready (in particular, move the `VariableType` enum elsewhere, or remove it and re-use the `PopulationLatentVariable` vs. `IndividualLatentVariable` dichotomy instead), as well as some legacy code that was left commented out for now
- Plotting and saving state in `FitOutputManager` (refactoring the plotting part is a broader work per se)
- After verifying that the functional tests pass, we may drop commit f8233851 (cf. the tests section below), all the snippets that were "quick & dirty" adaptations to pass the tests, and the `var_kws.setdefault("scale", var.prior.stddev.call(state))` line, and possibly regenerate the functional data
- Maybe clarify the hyperparameters for models:
  - some hyperparameters are mandatory to fully specify a model (i.e. to be able to instantiate its DAG), e.g. `dimension`, `source_dimension`
  - but there are also `Hyperparameter` variables in the DAG (like the fixed population "prior" std-dev)
  - and also some hyperparameters in the initialization of model parameters (e.g. the fixed `NOISE_STD`/`TAU_STD`/`XI_STD`)
  - --> maybe clarify this in the exported/saved model produced by `.to_dict()` (and reloaded with `.load_parameters()`), so as not to mix real model parameters with all these kinds of hyperparameters
- Some unit tests are missing (some were written inline in the abstract components), some existing ones need adaptation, and some docstrings are missing...
- No implementation of the jacobian in the logistic model for now (in reality we could use torch automatic differentiation so as not to write the jacobians of the variables of our DAG explicitly ourselves)
- Reverse model: (biomarker value, individual parameters) -> individual age at which this value is reached
- (Design decision: model having an internal state modified by the algorithms [current choice] vs. model only storing model parameters, with the state only instantiated during the algorithms)
- (Not sure it is crucial: `TODO: find a way to prevent re-computation of orthonormal basis since it should not have changed (v0_collinear update)` in the `center_xi` operation)
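Regarding the `WeightedTensor` logic mentioned above, here is a toy pure-Python stand-in for the concept (the real class wraps torch tensors; the `WeightedValues` name and API below are illustrative): values are paired with weights, a zero weight marking a masked/absent entry that must not contribute to aggregations.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WeightedValues:
    """Toy stand-in for the WeightedTensor idea (illustrative sketch):
    values paired with weights, where a zero weight marks a masked entry.
    """
    values: List[float]
    weights: List[float]

    def weighted_mean(self) -> float:
        # Masked entries (weight 0) do not contribute to the mean.
        total_weight = sum(self.weights)
        return sum(v * w for v, w in zip(self.values, self.weights)) / total_weight

# The masked 100.0 is ignored: only 1.0 and 3.0 contribute.
wv = WeightedValues(values=[1.0, 100.0, 3.0], weights=[1.0, 0.0, 1.0])
print(wv.weighted_mean())  # 2.0
```

The open question flagged above is precisely what happens when not only the value but also the distribution *parameters* carry such weights, and how their masks should be checked for compatibility.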
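The torch automatic-differentiation idea mentioned above (to avoid hand-written jacobians of DAG variables) can be sketched as follows; the `linked_variable` function is illustrative, not the actual logistic-model expression.

```python
import torch

def linked_variable(x: torch.Tensor) -> torch.Tensor:
    # Illustrative elementwise linked variable (not Leaspy's real formula).
    return torch.sigmoid(x)

x = torch.tensor([0.0, 1.0])
# Let torch compute the jacobian instead of writing it explicitly.
jac = torch.autograd.functional.jacobian(linked_variable, x)

# For an elementwise function the jacobian is diagonal, with entries
# d sigmoid(x)/dx = sigmoid(x) * (1 - sigmoid(x)).
s = torch.sigmoid(x)
assert torch.allclose(torch.diagonal(jac), s * (1 - s))
print(torch.diagonal(jac))
```

The trade-off is runtime cost versus maintenance: autodiff removes the risk of jacobians drifting out of sync with the variable definitions in the DAG.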
And a few other pending TODOs that were not addressed here:
- A proper architecture for models, as discussed during meetings
- Finalize metrics/loss handling in algorithms/models
- Extend `Data`/`Dataset` to be able to cope with different data types (e.g. continuous longitudinal values + survival time observations + ...)
- Currently the state is designed to store the values of one iteration only, but it could be extended to store the values for all iterations if useful (cf. also the comment in `State`: `TODO? ability to fork after several assignments?`)
## How can the code be tested?
Many models are broken by this MR; the only model really implemented is the logistic one (univariate or not) with a Gaussian observation model.
All the corresponding functional tests (for the fit step) should pass:
FIT (supported models + observation model only):

```shell
pytest \
  tests/functional_tests/api/test_api_fit.py::LeaspyFitTest::test_fit_logistic_scalar_noise \
  tests/functional_tests/api/test_api_fit.py::LeaspyFitTest::test_fit_univariate_logistic \
  tests/functional_tests/api/test_api_fit.py::LeaspyFitTest::test_fit_logistic_diag_noise_with_custom_tuning_no_sources \
  tests/functional_tests/api/test_api_fit.py::LeaspyFitTest::test_fit_logistic_diag_noise_mh \
  tests/functional_tests/api/test_api_fit.py::LeaspyFitTest::test_fit_logistic_diag_noise_fast_gibbs \
  tests/functional_tests/api/test_api_fit.py::LeaspyFitTest::test_fit_logistic_diag_noise
```
PERSONALIZE (without the offending models, nor scipy_minimize with jacobian):

```shell
pytest tests/functional_tests/api/test_api_personalize.py
```
ESTIMATE (only for supported models + observation models as of now):

```shell
pytest \
  tests/functional_tests/api/test_api_estimate.py::LeaspyEstimateTest::test_estimate_multivariate \
  tests/functional_tests/api/test_api_estimate.py::LeaspyEstimateTest::test_estimate_univariate
```
FIT + PERSONALIZE (simulation turned off):

```shell
pytest \
  tests/functional_tests/api/test_api.py::LeaspyAPITest::test_usecase_logistic_scalar_noise \
  tests/functional_tests/api/test_api.py::LeaspyAPITest::test_usecase_logistic_diag_noise
```
(Note that a special commit f8233851, "TMP: fix order of variables in loops", was introduced to reproduce exactly the same order of variables in sampling as previously, so as to pass those functional tests --> once we are sure about the iso-functionality of the MR, we should drop this commit and regenerate the functional data.)
## When is the MR due? (review deadline)
?
## What issues are linked to the MR?
Team discussions about the need for a general refactoring of models, in order to easily and properly integrate new models.