Service Responsibilities & Vision
In many of our discussions of the TA3-TA2 API it feels like we are talking past each other.
I think a significant reason for this is the lack of a shared high-level vision. We each maintain and advocate for our own unstated, implicit vision of D3M, and so far those visions have only been partially articulated.
Distributed system
Our end-goal vision is of D3M as an ultimately distributed ecosystem with each TA as a coherent, separate service where:
- TA3 is responsible for all problem formulation and ultimate evaluation of pipeline efficacy. TA2 does not intend to solve problems for the problem's sake alone.
- TA3 is responsible for bringing the power of human creativity and experience to bear on model creation. The 'pipeline language' could be thought of as an assembly language: something humans shape and control but do not write directly.
- TA2 is responsible for all direct interaction with primitives. The specifics of their implementation, and their canonical, program-level IDs, are opaque to the TA3 user.
- TA2 is responsible for assembling the best possible pipeline for a problem, either with specific user input or without.
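One way to picture the division of responsibilities listed above is as a narrow contract between the two services. The sketch below is only an illustration under stated assumptions: `ProblemSpec`, `PipelineHandle`, `TA2Service`, and every field name are hypothetical and are not part of any existing D3M API. TA3 sends a problem formulation, and TA2 returns an opaque handle without exposing primitives.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import uuid


@dataclass(frozen=True)
class ProblemSpec:
    """Hypothetical: what TA3 owns. A problem formulation, no primitive details."""
    dataset_uri: str
    target_column: str
    task_type: str  # e.g. "classification"
    metric: str     # e.g. "f1"
    hints: Dict[str, str] = field(default_factory=dict)  # optional user input


@dataclass(frozen=True)
class PipelineHandle:
    """Hypothetical: what TA3 gets back. An opaque ID plus a readable summary."""
    handle_id: str
    summary: str


class TA2Service:
    """Hypothetical TA2 facade; primitive IDs never cross this boundary."""

    def __init__(self) -> None:
        # Internal state only: handle ID -> ordered primitive names.
        self._pipelines: Dict[str, List[str]] = {}

    def search(self, problem: ProblemSpec) -> PipelineHandle:
        # Stand-in for real pipeline search: a fixed three-step pipeline.
        steps = ["impute", "encode", f"{problem.task_type}_model"]
        handle_id = uuid.uuid4().hex
        self._pipelines[handle_id] = steps
        summary = f"{len(steps)}-step pipeline optimizing {problem.metric}"
        return PipelineHandle(handle_id, summary)
```

In this sketch TA3 can evaluate and compare handles, and can pass hints with the problem, but it never names a primitive. That keeps pipeline assembly entirely on the TA2 side of the boundary.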
Tightly coupled system
However, using the specifics of the pipeline language as an inter-system contract pushes us away from this vision and towards a much more tightly coupled ecosystem. The strongest indicators of this are:
- Neither TA3 nor TA2 'owns' the primitives; rather, they each independently access and refer to a list of them (with the implication that the shared ID is a sufficiently stable indicator of deterministic behavior).
- The responsibility for pipeline assembly lies with neither TA3 nor TA2 exclusively.
Tight coupling is highly limiting
A tightly coupled system is inherently more fragile and almost certainly less capable. It restricts any performer's ability to use the best possible tool for the job and sends us down a rabbit hole of pinned versions, ambiguous expectations, and difficult debugging.
By making both parties responsible for everything, neither is responsible for anything. What happens when a TA3 submits a 'valid' pipeline and TA2 rejects it? We can avoid situations like these by separating concerns more clearly.
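The rejected-pipeline question above can be made concrete with a small sketch. Everything here is an assumption for illustration: the catalogs, the primitive IDs, and the `is_valid` rule are invented, not taken from any real TA2 or TA3. The point is only that when each side validates against its own copy of the primitive list, both can be right by their own rules and still disagree.

```python
from typing import List, Set

# Hypothetical primitive catalogs; the IDs are invented for illustration.
TA3_CATALOG: Set[str] = {"impute.v1", "encode.v1", "forest.v2"}
# TA2 has independently upgraded one primitive.
TA2_CATALOG: Set[str] = {"impute.v1", "encode.v1", "forest.v3"}


def is_valid(pipeline: List[str], catalog: Set[str]) -> bool:
    """A pipeline is 'valid' here if every primitive ID is in the catalog."""
    return all(step in catalog for step in pipeline)


pipeline = ["impute.v1", "encode.v1", "forest.v2"]
ta3_accepts = is_valid(pipeline, TA3_CATALOG)  # TA3 considers it valid
ta2_accepts = is_valid(pipeline, TA2_CATALOG)  # TA2 rejects the same pipeline
```

Neither check is buggy, yet the submission fails, and it is unclear which side should change.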
The purpose of TA3 as stated in the program is:
TA3: Human-model interaction that enables curation of
models by subject matter experts. A method and interface
will be developed to facilitate human-model interaction
that enables formal definition of modeling problems and
curation of automatically constructed models by users
who are not data scientists.
Given that the user is not a data scientist, we believe there is no need to expose so much data-science-specific language in the API. Does an SME care about hyperparameter tuning or swapping one model for another? Making that available now runs counter to program objectives and will cost us effort to unwind in the future.
The purpose of TA2 from the program docs is:
TA2: Automated composition of complex models. Techniques will be
developed for automatically selecting model primitives and for
composing selected primitives into complex modeling pipelines
based on user-specified data and outcome(s) of interest.
The direct assembling of pipelines is the responsibility of TA2.
If we intend to have a distributed ecosystem with distinct responsibilities, it is critical to operate explicitly with clear contracts and ownership.
Without boundaries, TA3-TA2 becomes essentially one service.
I think by trying to do everything we will in fact make it harder to do anything.