perf(db): batch required_matching node-prop writes (final per-flow DB roundtrip)

Removes the last un-batched per-flow DB roundtrip.
MatchProductNameGFM.run() issued one UPDATE per unmatched flow via
update_node_prop(required_matching, append=True): ~7997 calls in the
2000-ingredient stress run. These writes are now buffered in
NodeService, keyed by root_node_uid, and flushed in one bulk UPDATE at
scheduler quiescence via the update_node_prop_bulk DB layer shipped in
v0.5.432. The in-memory PropListMutation is still applied immediately.
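
Illustrative sketch of the buffer-then-flush pattern described above.
This is NOT the NodeService code: RequiredMatchingBuffer, its method
names, and the bulk_writer callable (standing in for
update_node_prop_bulk, whose real signature is not shown in this
message) are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

# Hypothetical shape of one deferred write: (node_uid, value to append).
PendingWrite = Tuple[str, str]
# Hypothetical stand-in for the bulk DB layer: one call, many writes.
BulkWriter = Callable[[List[PendingWrite]], None]


class RequiredMatchingBuffer:
    """Collects required_matching appends and writes them in one batch."""

    def __init__(self, bulk_writer: BulkWriter) -> None:
        self._bulk_writer = bulk_writer
        self._pending: DefaultDict[str, List[PendingWrite]] = defaultdict(list)

    def append(self, root_node_uid: str, node_uid: str, value: str) -> None:
        # No DB access here; the write is only recorded in memory.
        self._pending[root_node_uid].append((node_uid, value))

    def flush(self, root_node_uid: str) -> int:
        # Called once per root (e.g. at scheduler quiescence).
        writes = self._pending.pop(root_node_uid, [])
        if writes:
            self._bulk_writer(writes)  # single roundtrip for the batch
        return len(writes)
```

With this shape, N appends for one root cost one bulk_writer call
instead of N, and a second flush for the same root is a no-op.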

This is the required_matching half that was reverted in v0.5.432. That
revert was decided on an N=10 A/B of the already-flaky
test_matching_and_cache_invalidation_complete_workflow. Re-verified
independently at N=20: baseline 13/20 pass, this change 14/20 pass.
The difference is within noise, so the change does NOT measurably
worsen the flake.

required_matching is safe to defer: it is consumed only by FUTURE
calculations (graph reload) and the cleanup CLI, never within the
same request; /apply and /update-automatching invalidate by explicit
node_uid, not by reading required_matching.

Validation:
- two_origins CO2=1.2568, subrecipe CO2=0.0930 invariants hold
- 120/120 broad gauntlet
- 41/44 legacy_recipe_router (3 pre-existing batch flakes only)
- N=20 flake A/B independently re-run: 13/20 baseline vs 14/20 change
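
One way to sanity-check the 13/20 vs 14/20 flake comparison is a
two-sided Fisher exact test. The sketch below is an assumption about
how one might check it, not the procedure actually used for this
commit; it needs only the standard library.

```python
from math import comb


def fisher_exact_two_sided(pass_a: int, n_a: int,
                           pass_b: int, n_b: int) -> float:
    """Two-sided Fisher exact p-value for two pass/fail samples."""
    total = n_a + n_b
    total_pass = pass_a + pass_b
    total_fail = total - total_pass
    denom = comb(total, n_a)

    def prob(k: int) -> float:
        # Hypergeometric probability of k passes landing in sample A.
        return comb(total_pass, k) * comb(total_fail, n_a - k) / denom

    lo = max(0, n_a - total_fail)
    hi = min(n_a, total_pass)
    p_obs = prob(pass_a)
    # Sum every outcome at least as unlikely as the observed one.
    p = sum(q for q in (prob(k) for k in range(lo, hi + 1))
            if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Baseline 13/20 vs this change 14/20.
p_value = fisher_exact_two_sided(13, 20, 14, 20)
```

A large p_value here means the two pass rates cannot be told apart at
N=20; it does not prove the flake rates are equal.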

NOTE ON CLUSTER MEASUREMENT: the dagster cluster benchmark is too
noisy for single-run per-tag comparison. Two runs of the identical
v0.5.432 build 5h apart differed 2-4x (combined_40: 5.34s vs 9.60s;
the develop baseline itself shifted 4.60s -> 3.64s). All perf claims
in the v0.5.426-v0.5.433 series rest on deterministic local cProfile
call-count reductions and stable correctness invariants, NOT on
single cluster wall-time runs. A rigorous cluster A/B needs N>=10
interleaved runs per build.
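
A minimal sketch of what "interleaved" could mean in practice: each
block runs every build once, in shuffled order, so slow drift in
cluster load hits all builds roughly equally. The function name and
the build labels are hypothetical; no such harness exists in this
change.

```python
import random
from typing import List, Sequence


def interleaved_schedule(builds: Sequence[str], runs_per_build: int,
                         seed: int = 0) -> List[str]:
    """Return a run order with each build once per shuffled block."""
    rng = random.Random(seed)  # fixed seed keeps the order reproducible
    schedule: List[str] = []
    for _ in range(runs_per_build):
        block = list(builds)
        rng.shuffle(block)  # randomize order inside the block
        schedule.extend(block)
    return schedule


# Two builds, N=10 each: 20 runs in 10 blocks of 2.
order = interleaved_schedule(["baseline", "candidate"], 10)
```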

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>