# [REFACT] Introduce new variables & state modular logic for probabilistic models
## What does the code in the MR do?
This MR introduces entirely new logic for handling model parameters, latent variables (realizations) and linked attributes (the old "attributes"). The aims of this change are:
- to make model specification more modular, so that new models can be defined quickly with very few changes (limited to 1-2 files at most, unlike previously);
- to make the computation of derived variables (the old "attributes") more efficient.
This MR is the first step towards a complete paradigm change that shall be implemented in follow-up MRs:
- it drafts the overall architecture, with main classes for variables that should be almost definitive;
- it temporarily breaks many models & some functionalities (marked as TODO/WIP/TMP).
## Where should the reviewer start?
- You may want to check first the abstract components (utils), which are quite independent of the rest:
  - `leaspy.utils.weighted_tensor`
  - `leaspy.utils.filtered_mapping_proxy`
  - `leaspy.utils.functional` (in particular the `NamedInputFunction` logic)
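To make the review easier, here is a minimal pure-Python sketch of the `NamedInputFunction` idea (the signature and the `model_mean` example below are illustrative, not Leaspy's actual API): a function is bundled with the names of the variables it consumes, so it can later be evaluated against any mapping of variable values.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Mapping, Tuple

@dataclass(frozen=True)
class NamedInputFunction:
    """Bundle a function with the names of the inputs it consumes.

    Illustrative sketch only: the real Leaspy class has a richer interface.
    """
    f: Callable
    parameters: Tuple[str, ...]

    def call(self, values: Mapping[str, float]):
        # Pull the named inputs from the mapping and apply the function.
        return self.f(*(values[p] for p in self.parameters))

# Hypothetical usage: a linked variable defined as a function of two others.
model_mean = NamedInputFunction(operator.add, ("g", "shift"))
print(model_mean.call({"g": 1.5, "shift": 0.5, "unused": 99.0}))  # 2.0
```

The benefit of this indirection is that the same symbolic definition can be evaluated against any state holding the named variables.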
- Then you can look at the whole `leaspy.variables` architecture:
  - Specification of variables --> the different classes implementing the `VariableInterface`: `Hyperparameter`, `ModelParameter`, `DataVariable`, `PopulationLatentVariable`, `IndividualLatentVariable`, `LinkedVariable`, plus a few small helper classes
  - `VariablesDAG`: a directed acyclic graph whose nodes are `VariableInterface` instances
  - `State`: a lazy container for the values of the variables described in a DAG
  - `StatelessDistributionFamily` (an interface representing stateless distribution families, i.e. no distribution parameters are stored in the instance) and `SymbolicDistribution` (the specification of a symbolic distribution, used for the priors of latent variables and for observation models)
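To fix intuitions, here is a toy sketch of the "lazy container over a variables DAG" idea (names and behavior are illustrative only; Leaspy's real `State`/`VariablesDAG` are much richer, with topological sorting, fine-grained invalidation of dependents, etc.): root variables hold assigned values, and linked variables are computed on first access from their parents and cached.

```python
from typing import Callable, Dict, Tuple

class LazyState:
    """Toy lazy container over a variables DAG (illustrative sketch).

    Root variables hold assigned values; linked variables are computed
    on first access from their parent variables, then cached.
    """
    def __init__(self, linked: Dict[str, Tuple[Callable, Tuple[str, ...]]]):
        self._linked = linked          # name -> (function, parent names)
        self._values: Dict[str, float] = {}

    def __setitem__(self, name: str, value: float) -> None:
        self._values[name] = value
        # Naive invalidation: drop all cached linked values on any assignment.
        for linked_name in self._linked:
            self._values.pop(linked_name, None)

    def __getitem__(self, name: str) -> float:
        if name not in self._values:
            f, parents = self._linked[name]  # KeyError if never assigned
            self._values[name] = f(*(self[p] for p in parents))
        return self._values[name]

# Hypothetical DAG: linked variable "model" depends on roots "g" and "v0".
state = LazyState({"model": (lambda g, v0: g * v0, ("g", "v0"))})
state["g"], state["v0"] = 2.0, 3.0
print(state["model"])  # 6.0 (computed lazily on first access, then cached)
```

The design point illustrated here is that consumers never compute derived variables by hand: they just read them from the state, and staleness is handled centrally.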
- Then you can look at the new `ObservationModel`, which is built on top of the `SymbolicDistribution` component and shall supplant the `NoiseModel` (note that models may contain multiple `ObservationModel`s at some point, e.g. for joint models, but the full implementation of this case is out of the scope of this MR)
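The "symbolic distribution" idea can be sketched in a few lines (class and variable names here are illustrative, not Leaspy's actual classes): the distribution stores only the *names* of its parameters, and actual values are looked up in a state at evaluation time.

```python
from dataclasses import dataclass
from statistics import NormalDist
from typing import Mapping

@dataclass(frozen=True)
class SymbolicNormal:
    """Sketch of a 'symbolic' distribution: it stores only the names of
    its parameters; actual values come from a state at call time.
    Illustrative of the idea, not Leaspy's actual API.
    """
    loc_var: str
    scale_var: str

    def instantiate(self, state: Mapping[str, float]) -> NormalDist:
        # Resolve the named parameters against the given state.
        return NormalDist(mu=state[self.loc_var], sigma=state[self.scale_var])

# Hypothetical observation model: y ~ Normal(model, noise_std),
# where "model" and "noise_std" are variables of the DAG/state.
obs = SymbolicNormal(loc_var="model", scale_var="noise_std")
dist = obs.instantiate({"model": 0.0, "noise_std": 1.0})
print(round(dist.pdf(0.0), 4))  # 0.3989 (standard normal density at 0)
```

Because the distribution is stateless, the same specification can serve both as a prior for a latent variable and as an observation model, parameterized by whatever variables the DAG provides.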
- Finally, you can look at the changes inside the core Leaspy modules that integrate this new architecture:
  - in models: `AbstractModel` > `AbstractMultivariateModel` > `MultivariateModel` > `UnivariateModel`
  - in the fit algorithm (the personalize algorithm is not done for now)
  - in the samplers (only very minor changes, unlike the previous items)

(The last commit is only CI-related, fixing the behavior of the CI pipeline when dealing with a git branch name containing a slash.)
## What are the main remaining points to implement?
Here is a recap of what was flagged as TODO/WIP/TMP in the code:
- Observation models:
  - Translate all legitimate noise models into the new observation model paradigm, and remove `leaspy.models.noise_models`
  - Design a keyword-based factory for observation models (for the most common use cases) like the one we had for noise models (cf. the comment in `AbstractModel`: `TODO? Factory of ObservationModel instead?`), plus a proper export routine for them (used when requesting `model.to_dict()`)
  - The simulation algorithm is left broken for now, and shall be adapted once the final observation model interface is settled
- A slightly cleaner `model.initialize`?
- Logistic parallel model
- Ordinal model extension (as a dedicated child class?):
  - in particular, masked samplers are currently not properly handled with the new `WeightedTensor` logic
  - in stateless distributions we only cope with the value being a `WeightedTensor`, not any of the distribution parameters, which could be needed for the `batched_deltas` case (but <!> think about compatibility checks between value & parameter weights if doing so!)
- Remove the whole `leaspy.io.realizations` sub-package when ready (in particular, move the `VariableType` enum elsewhere, or remove it and re-use the `PopulationLatentVariable` vs. `IndividualLatentVariable` dichotomy instead), as well as some legacy code that was left commented out for now
- Plotting and saving state in `FitOutputManager` (refactoring the plotting part is a broader work per se)
- After verifying that the functional tests pass, we may drop commit f8233851 (cf. the tests section below), all the snippets that were "quick & dirty" adaptations to pass the tests, and the `var_kws.setdefault("scale", var.prior.stddev.call(state))` line, and possibly regenerate the functional data
- Maybe clarify the hyperparameters for models:
  - some hyperparameters are mandatory to fully specify a model (i.e. to be able to instantiate its DAG), e.g. `dimension`, `source_dimension`
  - but there are also `Hyperparameter` variables in the DAG (like the fixed population "prior" std-dev)
  - and also some hyperparameters in the initialization of model parameters (e.g. the fixed `NOISE_STD`/`TAU_STD`/`XI_STD`)
  - --> maybe clarify this in the exported/saved model produced by `.to_dict()` (and reloaded with `.load_parameters()`), so as not to mix real model parameters with all these kinds of hyperparameters
- Some unit tests are missing (some were written inline in the abstract components), some existing ones need adaptation, and some docstrings are missing...
- No implementation of the jacobian in the logistic model for now (in reality we could use torch automatic differentiation so as not to write the jacobians of the variables of our DAG explicitly ourselves)
- Reverse model: (biomarker value, individual parameters) -> individual age at which this value is reached
- (Design decision: model having an internal state modified by the algorithms [current choice] vs. model only storing model parameters, with the state only instantiated during the algorithms)
- (Not sure it is crucial: `TODO: find a way to prevent re-computation of orthonormal basis since it should not have changed (v0_collinear update)` in the `center_xi` operation)
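Regarding the `WeightedTensor` logic mentioned above, here is a toy pure-Python stand-in for the concept (the real class wraps torch tensors; the `WeightedValues` name and API below are illustrative): values are paired with weights, a zero weight marking a masked/absent entry that must not contribute to aggregations.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WeightedValues:
    """Toy stand-in for the WeightedTensor idea (illustrative sketch):
    values paired with weights, where a zero weight marks a masked entry.
    """
    values: List[float]
    weights: List[float]

    def weighted_mean(self) -> float:
        # Masked entries (weight 0) do not contribute to the mean.
        total_weight = sum(self.weights)
        return sum(v * w for v, w in zip(self.values, self.weights)) / total_weight

# The masked 100.0 is ignored: only 1.0 and 3.0 contribute.
wv = WeightedValues(values=[1.0, 100.0, 3.0], weights=[1.0, 0.0, 1.0])
print(wv.weighted_mean())  # 2.0
```

The open question flagged above is precisely what happens when not only the value but also the distribution *parameters* carry such weights, and how their masks should be checked for compatibility.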
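The torch automatic-differentiation idea mentioned above (to avoid hand-written jacobians of DAG variables) can be sketched as follows; the `linked_variable` function is illustrative, not the actual logistic-model expression.

```python
import torch

def linked_variable(x: torch.Tensor) -> torch.Tensor:
    # Illustrative elementwise linked variable (not Leaspy's real formula).
    return torch.sigmoid(x)

x = torch.tensor([0.0, 1.0])
# Let torch compute the jacobian instead of writing it explicitly.
jac = torch.autograd.functional.jacobian(linked_variable, x)

# For an elementwise function the jacobian is diagonal, with entries
# d sigmoid(x)/dx = sigmoid(x) * (1 - sigmoid(x)).
s = torch.sigmoid(x)
assert torch.allclose(torch.diagonal(jac), s * (1 - s))
print(torch.diagonal(jac))
```

The trade-off is runtime cost versus maintenance: autodiff removes the risk of jacobians drifting out of sync with the variable definitions in the DAG.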
And a few other pending TODOs that were not addressed here:
- A proper architecture for models, as discussed during meetings
- Finalize metrics/loss handling in algorithms/models
- Extend `Data`/`Dataset` to be able to cope with different data types (e.g. continuous longitudinal values + survival time observations + ...)
- Currently the state is designed to store the values of one iteration only, but it could be extended to store the values for all iterations if useful (cf. also the comment in `State`: `TODO? ability to fork after several assignments?`)
## How can the code be tested?
Many models are broken by this MR; the only model really implemented is the logistic one (univariate or not) with a Gaussian observation model.
All the corresponding functional tests (for the fit step) should pass:
FIT (supported models + observation model only):

```shell
pytest \
  tests/functional_tests/api/test_api_fit.py::LeaspyFitTest::test_fit_logistic_scalar_noise \
  tests/functional_tests/api/test_api_fit.py::LeaspyFitTest::test_fit_univariate_logistic \
  tests/functional_tests/api/test_api_fit.py::LeaspyFitTest::test_fit_logistic_diag_noise_with_custom_tuning_no_sources \
  tests/functional_tests/api/test_api_fit.py::LeaspyFitTest::test_fit_logistic_diag_noise_mh \
  tests/functional_tests/api/test_api_fit.py::LeaspyFitTest::test_fit_logistic_diag_noise_fast_gibbs \
  tests/functional_tests/api/test_api_fit.py::LeaspyFitTest::test_fit_logistic_diag_noise
```
PERSONALIZE (without the offending models, nor scipy_minimize with jacobian):

```shell
pytest tests/functional_tests/api/test_api_personalize.py
```
ESTIMATE (only for supported models + observation models as of now):

```shell
pytest \
  tests/functional_tests/api/test_api_estimate.py::LeaspyEstimateTest::test_estimate_multivariate \
  tests/functional_tests/api/test_api_estimate.py::LeaspyEstimateTest::test_estimate_univariate
```
FIT + PERSONALIZE (simulation turned off):

```shell
pytest \
  tests/functional_tests/api/test_api.py::LeaspyAPITest::test_usecase_logistic_scalar_noise \
  tests/functional_tests/api/test_api.py::LeaspyAPITest::test_usecase_logistic_diag_noise
```
(Note that a special commit f8233851, "TMP: fix order of variables in loops", was introduced to reproduce exactly the same order of variables in sampling as previously, so as to pass those functional tests --> once we are sure about the iso-functionality of the MR, we should drop this commit and regenerate the functional data.)
## When is the MR due? (review deadline)
?
## What issues are linked to the MR?
Team discussions about the need for a general refactoring of models, in order to easily and properly integrate new models.