RETE Scheduling Architecture: Documentation & Simplification Plan
# RETE Scheduling Architecture: Documentation & Simplification Plan
**Context:** Issue #10 (Proposals A–F) is complete. This issue documents the current architecture for developer onboarding and tracks simplification opportunities.
**Related:** #10 (completed), MR !1655 (`feature/rete-pure-conditions`) · **Last updated:** 2026-03-09
---
## 1. System Overview
The RETE scheduling system reactively schedules Gap Filling Modules (GFMs) when their preconditions are met, replacing the old poll-every-GFM-every-batch approach. 30 GFMs total: 21 pure RETE, 9 hybrid.
```
┌───────────┐ ┌─────────┐ ┌────────────┐ ┌──────────┐ ┌────────────┐ ┌───────────┐
│ CalcGraph │───▶│ Working │───▶│ Alpha Net │───▶│ AND-Join │───▶│ Production │───▶│ Scheduler │
│ Mutation │ │ Memory │ │ (15 types) │ │ │ │ Node │ │ Loop │
└───────────┘ └─────────┘ └────────────┘ └──────────┘ └────────────┘ └─────┬─────┘
│
┌────────────────────────────────────────────────────────────────────────────────────┘
▼
┌───────────┐
│ GFM.run() │── new mutations ──▶ (back to CalcGraph ── reactive feedback loop)
└───────────┘
```
Mutations flow left-to-right through the RETE network. When a GFM executes, its output mutations feed back into CalcGraph, triggering new RETE activations — a **reactive feedback loop** that continues until the graph stabilizes.
---
## 2. Fact Lifecycle
CalcGraph mutations are converted to working memory facts by `MutationToFactConverter` (in `mutation_adapter.py`), then asserted into `ReteWorkingMemory`.
```
CalcGraph.apply_mutation() ──▶ ReteSubscriber ──▶ WorkingMemory.assert_fact()
```
| Mutation Type | Facts Produced |
|---|---|
| `PropMutation` | `property_value` + `property_type` |
| `PropListMutation` | `property_value` + `property_type` (for list props like glossary_tags) |
| `RawPropMutation` | `property_value` |
| `AddNodeMutation` | `node_exists` + `node_type_hierarchy` + initial property facts |
| `AddEdgeMutation` | `has_parent` (on child) + `has_child` (on parent) |
| `RemoveEdgeMutation` | retract `has_parent` + `has_child` |
| `DuplicateNodeMutation` | `node_exists` + all property facts |
| GFM completion | `gfm_completed` (key=gfm_name, value=True) |
Each fact is a `WorkingMemoryFact(node_uid, fact_type, key, value, timestamp)`. Working memory maintains four indexes: `_facts` (primary), `_parent_idx` / `_child_idx` (structural), `_type_idx` (node types), and `_gfm_completed_index` (dependency tracking).
---
## 3. The RETE Network
Alpha nodes test individual conditions. When ALL alphas for a GFM pass for a given node, the AND-join fires and the production node queues a `(gfm, uid)` activation.
**Example: ConservationGFM** (needs FlowNode + is_subdivision=False + OriginGFM done)
```
┌────────────────────┐
│ NodeTypeAlpha │──┐
│ (FlowNode) │ │
└────────────────────┘ │
│ ┌──────────┐ ┌────────────────┐
┌────────────────────┐ ├──▶│ AND-Join │───▶│ ProductionNode │──▶ "ready"
│ NodeAttributeAlpha │──┤ │ (all 3) │ │ Conservation │
│ (is_subdiv=False) │ │ └──────────┘ └────────────────┘
└────────────────────┘ │
│
┌────────────────────┐ │
│ DepCompletionAlpha │──┘
│ (OriginGFM done) │
└────────────────────┘
```
Each alpha maintains a `memory: set[uid]` of passing nodes. On mutation, only **subscribed** alphas are re-evaluated (via `_property_to_alphas` and `_fact_type_to_alphas` indexes). Relational alphas trigger directional fan-out to parents/children.
### Alpha Node Catalog (15 types)
| Category | Alpha Types |
|---|---|
| **Type checks** | `NodeTypeAlpha`, `NegatedNodeTypeAlpha` |
| **Property checks** | `PropertyExistsAlpha`, `PropertyNotExistsAlpha`, `PropertyTypeAlpha`, `PropertyTypeOrAlpha`, `PropertyValuePredicateAlpha` |
| **Relational** | `RelatedNodeAlpha`, `CrossNodePropertyAlpha` |
| **Compound** | `OrAlpha`, `CompoundOrAlpha`, `CompoundAndAlpha` |
| **Glossary** | `TermValueAlpha` |
| **Attribute** | `NodeAttributeAlpha` (immutable: `is_subdivision`, `is_child_of_root`) |
| **Dependency** | `DependencyCompletionAlpha` (reactive `gfm_completed` tracking) |
---
## 4. Scheduler Loop
```
╔════════════════════════════════════════════╗
║ MAIN SCHEDULER LOOP ║
╚════════════════════════════════════════════╝
│
▼
┌─ Collect ──────────────────────────────────┐
│ Merge init + RETE pending activations │
│ Separate final-phase → deferred │
└──────────────────────┬─────────────────────┘
│
▼
┌─ Execute Batch ────────────────────────────┐
│ Sort: (DAG priority, name, │
│ FlowNode-first, insert order) │
│ For each: filter → spawn → run │
│ Each GFM: begin → run → flush → done │
│ On done: assert gfm_completed fact │
└──────────────────────┬─────────────────────┘
│
▼
┌─ Process New Nodes ────────────────────────┐
│ Scan added nodes → initial RETE eval │
└──────────────────────┬─────────────────────┘
│
▼
┌─ Retry Deferred (fallback) ────────────────┐
│ Re-check depends_on_gfms for deferred │
│ (primary path is reactive via RETE) │
└──────────────────────┬─────────────────────┘
│
▼
┌─ Final Phase (if queue empty) ─────────────┐
│ Sort by final_phase_priority │
│ Execute one-by-one │
│ Pause if new regular work triggered │
└──────────────────────┬─────────────────────┘
│
▼
◇ More work? ──yes──▶ (back to Collect)
│no
▼
DONE
```
The scheduler uses **double batching** when `_use_parallel=True`: an outer batch wraps the entire `_execute_batch()`, while each GFM gets an inner batch for its mutations. Inner flushes trigger selective alpha re-evaluation; the outer flush consolidates new activations for the next iteration.
---
## 5. Filter Activation Pipeline
Every `(gfm, uid)` pair passes through `_filter_activation()` before execution — a 5-step decision pipeline:
```
(gfm, uid) ──▶ final-phase? ──yes──▶ defer to L3
│no
▼
finished? ──yes──▶ skip
│no
▼
attr_cond? ──fail──▶ cancel (permanent)
│pass
▼
deps_met? ──no──▶ defer (fallback)
│yes
▼
can_run_now() ──cancel──▶ cancel
│ ──resched──▶ requeue (limit 250/batch)
│ready
▼
EXECUTE
```
Steps 1–3 are fast checks on static state. Step 4 (`depends_on_gfms`) is primarily handled reactively by `DependencyCompletionAlpha` — the filter step is a fallback safety net. Step 5 spawns the worker and calls `can_run_now()` — only relevant for the 9 hybrid GFMs.
---
## 6. Execution Layers
```
┌─ L1: Static DAG Priority ───────────────────────────────────────────────────┐
│ Built once at startup from GFMCondition + GFMOutput declarations. │
│ 4 inference rules: │
│ required_props → sets_props │ PTC.prop → sets_props │
│ forbidden_props → reverse │ depends_on_gfms → explicit │
│ Kahn's topological sort (alphabetic tie-breaking) → _gfm_priority │
└─────────────────────────────────────────────────────────────────────────────┘
▼ feeds sort key
┌─ L2: Reactive RETE + Guards ────────────────────────────────────────────────┐
│ 15 alpha types evaluate on mutation. AND-join → production → ready. │
│ _filter_activation(): attr check → dep check → can_run_now() │
└─────────────────────────────────────────────────────────────────────────────┘
▼ when queue drains
┌─ L3: Final Phase ───────────────────────────────────────────────────────────┐
│ Perishability(10) → WaterScarcity(20) → Aggregation(100) → MatrixCalc(110) │
│ → Vitascore(120) → PostAgg(130) → ReportAgg(200) │
│ One-by-one. Pause if new regular work triggered. │
└─────────────────────────────────────────────────────────────────────────────┘
```
L1 determines **batch ordering** (which GFM runs first within a cycle). L2 determines **readiness** (whether a GFM should run at all). L3 handles **barrier semantics** — GFMs that need the entire graph stabilized before executing (e.g., Perishability needs all leaf FPFs tagged, WaterScarcity needs all BW node parents created).
---
## 7. GFM Landscape
| Pure RETE (21) | Hybrid (9) — `can_run_now()` reason |
|---|---|
| AddClientNodesGFM | DailyFoodUnitGFM — `gfms_to_skip` flag |
| AggregationGFM (abstract) | IngredientAmountEstGFM — parent nutrient cache |
| AttachFoodTagsGFM | LinkTermToActivityGFM — sub_nodes TOCTOU guard |
| ConservationGFM | LocationGFM — BW node edge case |
| FoodLossGFM | OriginGFM — subdivision/LTAN state |
| GreenhouseGFM | ProcessingGFM — sub_node type check |
| ImpactAssessmentGFM | TransportDecisionGFM — cheapest_mode cancel |
| IngredientSplitterGFM | TransportModeDistGFM — location/origin wait |
| InventoryConnectorGFM | UnitWeightConvGFM — production_amount cascade |
| LinkFoodCategoriesGFM | |
| MatchProductNameGFM | |
| MatrixCalculationGFM | |
| MergeLinkedNodesGFM | |
| NutrientSubdivisionGFM | |
| PerishabilityGFM | |
| PostNutrientsAggregGFM | |
| RainforestGFM | |
| ReportAggregationGFM | |
| VitascoreGFM | |
| WaterScarcityGFM | |
| (DailyFoodUnitGFM pending) | |
The **21 pure RETE** GFMs have no `can_run_now()` override — RETE conditions fully express their scheduling logic. The **9 hybrid** GFMs use RETE for primary conditions plus a runtime guard for checks that can't (yet) be expressed declaratively.
---
## Simplification Opportunities
### ~~Opportunity 1: Reactive `depends_on_gfms`~~ — DONE
`DependencyCompletionAlpha` implemented. `depends_on_gfms` is now **reactive** via `gfm_completed` working memory facts. The `_deferred_dependency_activations` retry loop remains as fallback.
### Opportunity 2: Batch-level TOCTOU resolution
Re-check `rete.is_gfm_ready()` with fresh working memory after each GFM execution within a batch. Could eliminate many simple `can_run_now()` overrides; complex ones (LocationGFM, TransportDecisionGFM) would remain.
### Opportunity 3: `PropertyValuePredicateAlpha` for sub-attribute checks
Wire up the existing but unused `PropertyValuePredicateAlpha` for sub-attribute guards (e.g., `transport_prop.cheapest_mode is not None`). Could convert 3–4 more GFMs to pure RETE.
### Opportunity 4: Simplify `_filter_activation()` — PARTIALLY DONE
Pipeline reduced from 6→5 steps (custom_predicate removed). If Opportunities 2–3 land, step 5 could be eliminated for most GFMs, leaving a 3-step pipeline for pure-RETE GFMs.
### Opportunity 5: Developer documentation & onboarding
- [ ] Architecture overview in `docs/rete-architecture.md`
- [ ] "How to write a new GFM" guide with condition examples
- [ ] Condition type reference with examples
- [ ] Debugging guide: how to trace why a GFM isn't firing
---
## Tracking
- [x] ~~Reactive depends_on_gfms~~ — **DONE**: `DependencyCompletionAlpha` implemented
- [x] ~~Pipeline simplification~~ — **PARTIAL**: 6→5 steps, custom_predicate removed
- [ ] Architecture documentation in `docs/`
- [ ] Investigate batch-level TOCTOU resolution
- [ ] Investigate PropertyValuePredicateAlpha usage
- [ ] Developer onboarding guide
issue