[PREP] GKGaaS Graph Engine - Performance Strategy & SLOs
## Summary

Define formal SLOs, establish performance baselines, and fix known failures. Maps to PREP categories: performance_strategy/performance_requirements, performance_strategy/performance_testing, performance_strategy/performance_validation.

## Context

The simulator benchmarks 29 queries against 11M nodes + 100M edges (~954 MiB memory limit). Current results: mean 159ms, median 26ms, max 1389ms. Three queries fail with MEMORY_LIMIT_EXCEEDED. Resource controls exist (3-hop max, 1000-row limit, 30s timeout) but no formal SLOs are defined.

## Tasks

### SLO definition

- [ ] Define compilation latency SLOs: p95 and p99 targets for compile() itself
- [ ] Define end-to-end query latency SLOs per query type (search, traversal, aggregation, path_finding, neighbors) for .com and self-managed
- [ ] Define acceptable failure rate under normal load

### Fix known failures

- [ ] Fix "Shortest path between projects" (path_finding): MEMORY_LIMIT_EXCEEDED
- [ ] Fix "Vulnerabilities fixed by MRs" (traversal): MEMORY_LIMIT_EXCEEDED
- [ ] Fix "Long-running pipelines" (search): MEMORY_LIMIT_EXCEEDED

### Benchmarking

- [ ] Establish performance baseline: capture current simulator results as the official regression baseline
- [ ] Compilation-only benchmarks: measure compile() latency separately from ClickHouse execution
- [ ] Query type-specific benchmarks: separate p50/p95/p99 per query type
- [ ] Worst-case query pattern analysis: within allowed limits (3 hops, 1000 rows, 5 relationship steps), find the most expensive valid query

### Continuous validation

- [ ] Concurrent query testing: multiple queries compiling and executing simultaneously
- [ ] Scale testing: run simulator at larger scales (500M+ edges) to find degrading patterns
- [ ] Nightly regression tests in CI: automated performance checks to catch regressions
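
As a rough illustration of the per-query-type benchmarks and the nightly regression check, the sketch below groups latency samples by query type, computes p50/p95/p99 with the nearest-rank method, and reports any percentile that exceeds a target. The function names, the `(query_type, latency_ms)` input shape, and the target values are all assumptions made for illustration. No SLO targets have been defined yet, and the simulator's actual output format may differ.

```python
import math
from collections import defaultdict


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(results):
    """Group (query_type, latency_ms) pairs and compute p50/p95/p99 per query type."""
    by_type = defaultdict(list)
    for query_type, latency_ms in results:
        by_type[query_type].append(latency_ms)
    return {
        query_type: {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
        for query_type, latencies in by_type.items()
    }


def slo_violations(summary, targets):
    """Return (query_type, metric, observed_ms, target_ms) for every exceeded target."""
    violations = []
    for query_type, limits in targets.items():
        observed = summary.get(query_type, {})
        for metric, limit_ms in limits.items():
            if metric in observed and observed[metric] > limit_ms:
                violations.append((query_type, metric, observed[metric], limit_ms))
    return violations


if __name__ == "__main__":
    # Synthetic samples and placeholder targets, not real benchmark data or agreed SLOs.
    samples = [("search", 10), ("search", 20), ("search", 30), ("search", 400),
               ("neighbors", 5), ("neighbors", 7)]
    placeholder_targets = {"search": {"p95": 100}, "neighbors": {"p95": 50}}
    for violation in slo_violations(summarize(samples), placeholder_targets):
        print("SLO exceeded:", violation)
```

A nightly CI job could run a check like this against the recorded baseline and fail the pipeline when the violation list is non-empty. Nearest-rank is used here because it always returns an observed latency, which is easier to trace back to a specific query than an interpolated value.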