RETE Scheduling Architecture: Documentation & Simplification Plan
# RETE Scheduling Architecture: Documentation & Simplification Plan **Context:** Issue #10 (Proposals A–F) is complete. This issue documents the current architecture for developer onboarding and tracks simplification opportunities. **Related:** #10 (completed), MR !1655 (`feature/rete-pure-conditions`) · **Last updated:** 2026-03-09 --- ## 1. System Overview The RETE scheduling system reactively schedules Gap Filling Modules (GFMs) when their preconditions are met, replacing the old poll-every-GFM-every-batch approach. 30 GFMs total: 21 pure RETE, 9 hybrid. ``` ┌───────────┐ ┌─────────┐ ┌────────────┐ ┌──────────┐ ┌────────────┐ ┌───────────┐ │ CalcGraph │───▶│ Working │───▶│ Alpha Net │───▶│ AND-Join │───▶│ Production │───▶│ Scheduler │ │ Mutation │ │ Memory │ │ (15 types) │ │ │ │ Node │ │ Loop │ └───────────┘ └─────────┘ └────────────┘ └──────────┘ └────────────┘ └─────┬─────┘ │ ┌────────────────────────────────────────────────────────────────────────────────────┘ ▼ ┌───────────┐ │ GFM.run() │── new mutations ──▶ (back to CalcGraph ── reactive feedback loop) └───────────┘ ``` Mutations flow left-to-right through the RETE network. When a GFM executes, its output mutations feed back into CalcGraph, triggering new RETE activations — a **reactive feedback loop** that continues until the graph stabilizes. --- ## 2. Fact Lifecycle CalcGraph mutations are converted to working memory facts by `MutationToFactConverter` (in `mutation_adapter.py`), then asserted into `ReteWorkingMemory`. ``` CalcGraph.apply_mutation() ──▶ ReteSubscriber ──▶ WorkingMemory.assert_fact() ``` | Mutation Type | Facts Produced | |---|---| | `PropMutation` | `property_value` + `property_type` | | `PropListMutation` | `property_value` + `property_type` (for list props like glossary_tags) | | `RawPropMutation` | `property_value` | | `AddNodeMutation` | `node_exists` + `node_type_hierarchy` + initial property facts | | `AddEdgeMutation` | `has_parent` (on child) + `has_child` (on parent) | | `RemoveEdgeMutation` | retract `has_parent` + `has_child` | | `DuplicateNodeMutation` | `node_exists` + all property facts | | GFM completion | `gfm_completed` (key=gfm_name, value=True) | Each fact is a `WorkingMemoryFact(node_uid, fact_type, key, value, timestamp)`. Working memory maintains four indexes: `_facts` (primary), `_parent_idx` / `_child_idx` (structural), `_type_idx` (node types), and `_gfm_completed_index` (dependency tracking). --- ## 3. The RETE Network Alpha nodes test individual conditions. When ALL alphas for a GFM pass for a given node, the AND-join fires and the production node queues a `(gfm, uid)` activation. **Example: ConservationGFM** (needs FlowNode + is_subdivision=False + OriginGFM done) ``` ┌────────────────────┐ │ NodeTypeAlpha │──┐ │ (FlowNode) │ │ └────────────────────┘ │ │ ┌──────────┐ ┌────────────────┐ ┌────────────────────┐ ├──▶│ AND-Join │───▶│ ProductionNode │──▶ "ready" │ NodeAttributeAlpha │──┤ │ (all 3) │ │ Conservation │ │ (is_subdiv=False) │ │ └──────────┘ └────────────────┘ └────────────────────┘ │ │ ┌────────────────────┐ │ │ DepCompletionAlpha │──┘ │ (OriginGFM done) │ └────────────────────┘ ``` Each alpha maintains a `memory: set[uid]` of passing nodes. On mutation, only **subscribed** alphas are re-evaluated (via `_property_to_alphas` and `_fact_type_to_alphas` indexes). Relational alphas trigger directional fan-out to parents/children. ### Alpha Node Catalog (15 types) | Category | Alpha Types | |---|---| | **Type checks** | `NodeTypeAlpha`, `NegatedNodeTypeAlpha` | | **Property checks** | `PropertyExistsAlpha`, `PropertyNotExistsAlpha`, `PropertyTypeAlpha`, `PropertyTypeOrAlpha`, `PropertyValuePredicateAlpha` | | **Relational** | `RelatedNodeAlpha`, `CrossNodePropertyAlpha` | | **Compound** | `OrAlpha`, `CompoundOrAlpha`, `CompoundAndAlpha` | | **Glossary** | `TermValueAlpha` | | **Attribute** | `NodeAttributeAlpha` (immutable: `is_subdivision`, `is_child_of_root`) | | **Dependency** | `DependencyCompletionAlpha` (reactive `gfm_completed` tracking) | --- ## 4. Scheduler Loop ``` ╔════════════════════════════════════════════╗ ║ MAIN SCHEDULER LOOP ║ ╚════════════════════════════════════════════╝ │ ▼ ┌─ Collect ──────────────────────────────────┐ │ Merge init + RETE pending activations │ │ Separate final-phase → deferred │ └──────────────────────┬─────────────────────┘ │ ▼ ┌─ Execute Batch ────────────────────────────┐ │ Sort: (DAG priority, name, │ │ FlowNode-first, insert order) │ │ For each: filter → spawn → run │ │ Each GFM: begin → run → flush → done │ │ On done: assert gfm_completed fact │ └──────────────────────┬─────────────────────┘ │ ▼ ┌─ Process New Nodes ────────────────────────┐ │ Scan added nodes → initial RETE eval │ └──────────────────────┬─────────────────────┘ │ ▼ ┌─ Retry Deferred (fallback) ────────────────┐ │ Re-check depends_on_gfms for deferred │ │ (primary path is reactive via RETE) │ └──────────────────────┬─────────────────────┘ │ ▼ ┌─ Final Phase (if queue empty) ─────────────┐ │ Sort by final_phase_priority │ │ Execute one-by-one │ │ Pause if new regular work triggered │ └──────────────────────┬─────────────────────┘ │ ▼ ◇ More work? ──yes──▶ (back to Collect) │no ▼ DONE ``` The scheduler uses **double batching** when `_use_parallel=True`: an outer batch wraps the entire `_execute_batch()`, while each GFM gets an inner batch for its mutations. Inner flushes trigger selective alpha re-evaluation; the outer flush consolidates new activations for the next iteration. --- ## 5. Filter Activation Pipeline Every `(gfm, uid)` pair passes through `_filter_activation()` before execution — a 5-step decision pipeline: ``` (gfm, uid) ──▶ final-phase? ──yes──▶ defer to L3 │no ▼ finished? ──yes──▶ skip │no ▼ attr_cond? ──fail──▶ cancel (permanent) │pass ▼ deps_met? ──no──▶ defer (fallback) │yes ▼ can_run_now() ──cancel──▶ cancel │ ──resched──▶ requeue (limit 250/batch) │ready ▼ EXECUTE ``` Steps 1–3 are fast checks on static state. Step 4 (`depends_on_gfms`) is primarily handled reactively by `DependencyCompletionAlpha` — the filter step is a fallback safety net. Step 5 spawns the worker and calls `can_run_now()` — only relevant for the 9 hybrid GFMs. --- ## 6. Execution Layers ``` ┌─ L1: Static DAG Priority ───────────────────────────────────────────────────┐ │ Built once at startup from GFMCondition + GFMOutput declarations. │ │ 4 inference rules: │ │ required_props → sets_props │ PTC.prop → sets_props │ │ forbidden_props → reverse │ depends_on_gfms → explicit │ │ Kahn's topological sort (alphabetic tie-breaking) → _gfm_priority │ └─────────────────────────────────────────────────────────────────────────────┘ ▼ feeds sort key ┌─ L2: Reactive RETE + Guards ────────────────────────────────────────────────┐ │ 15 alpha types evaluate on mutation. AND-join → production → ready. │ │ _filter_activation(): attr check → dep check → can_run_now() │ └─────────────────────────────────────────────────────────────────────────────┘ ▼ when queue drains ┌─ L3: Final Phase ───────────────────────────────────────────────────────────┐ │ Perishability(10) → WaterScarcity(20) → Aggregation(100) → MatrixCalc(110) │ │ → Vitascore(120) → PostAgg(130) → ReportAgg(200) │ │ One-by-one. Pause if new regular work triggered. │ └─────────────────────────────────────────────────────────────────────────────┘ ``` L1 determines **batch ordering** (which GFM runs first within a cycle). L2 determines **readiness** (whether a GFM should run at all). L3 handles **barrier semantics** — GFMs that need the entire graph stabilized before executing (e.g., Perishability needs all leaf FPFs tagged, WaterScarcity needs all BW node parents created). --- ## 7. GFM Landscape | Pure RETE (21) | Hybrid (9) — `can_run_now()` reason | |---|---| | AddClientNodesGFM | DailyFoodUnitGFM — `gfms_to_skip` flag | | AggregationGFM (abstract) | IngredientAmountEstGFM — parent nutrient cache | | AttachFoodTagsGFM | LinkTermToActivityGFM — sub_nodes TOCTOU guard | | ConservationGFM | LocationGFM — BW node edge case | | FoodLossGFM | OriginGFM — subdivision/LTAN state | | GreenhouseGFM | ProcessingGFM — sub_node type check | | ImpactAssessmentGFM | TransportDecisionGFM — cheapest_mode cancel | | IngredientSplitterGFM | TransportModeDistGFM — location/origin wait | | InventoryConnectorGFM | UnitWeightConvGFM — production_amount cascade | | LinkFoodCategoriesGFM | | | MatchProductNameGFM | | | MatrixCalculationGFM | | | MergeLinkedNodesGFM | | | NutrientSubdivisionGFM | | | PerishabilityGFM | | | PostNutrientsAggregGFM | | | RainforestGFM | | | ReportAggregationGFM | | | VitascoreGFM | | | WaterScarcityGFM | | | (DailyFoodUnitGFM pending) | | The **21 pure RETE** GFMs have no `can_run_now()` override — RETE conditions fully express their scheduling logic. The **9 hybrid** GFMs use RETE for primary conditions plus a runtime guard for checks that can't (yet) be expressed declaratively. --- ## Simplification Opportunities ### ~~Opportunity 1: Reactive `depends_on_gfms`~~ — DONE `DependencyCompletionAlpha` implemented. `depends_on_gfms` is now **reactive** via `gfm_completed` working memory facts. The `_deferred_dependency_activations` retry loop remains as fallback. ### Opportunity 2: Batch-level TOCTOU resolution Re-check `rete.is_gfm_ready()` with fresh working memory after each GFM execution within a batch. Could eliminate many simple `can_run_now()` overrides; complex ones (LocationGFM, TransportDecisionGFM) would remain. ### Opportunity 3: `PropertyValuePredicateAlpha` for sub-attribute checks Wire up the existing but unused `PropertyValuePredicateAlpha` for sub-attribute guards (e.g., `transport_prop.cheapest_mode is not None`). Could convert 3–4 more GFMs to pure RETE. ### Opportunity 4: Simplify `_filter_activation()` — PARTIALLY DONE Pipeline reduced from 6→5 steps (custom_predicate removed). If Opportunities 2–3 land, step 5 could be eliminated for most GFMs, leaving a 3-step pipeline for pure-RETE GFMs. ### Opportunity 5: Developer documentation & onboarding - [ ] Architecture overview in `docs/rete-architecture.md` - [ ] "How to write a new GFM" guide with condition examples - [ ] Condition type reference with examples - [ ] Debugging guide: how to trace why a GFM isn't firing --- ## Tracking - [x] ~~Reactive depends_on_gfms~~ — **DONE**: `DependencyCompletionAlpha` implemented - [x] ~~Pipeline simplification~~ — **PARTIAL**: 6→5 steps, custom_predicate removed - [ ] Architecture documentation in `docs/` - [ ] Investigate batch-level TOCTOU resolution - [ ] Investigate PropertyValuePredicateAlpha usage - [ ] Developer onboarding guide
issue