[PREP] GKGaaS Graph Engine - Performance Strategy & SLOs
## Summary
Define formal SLOs, establish performance baselines, and fix known failures. Maps to PREP categories: performance_strategy/performance_requirements, performance_strategy/performance_testing, performance_strategy/performance_validation.
## Context
The simulator benchmarks 29 queries against a graph of 11M nodes and 100M edges with a ~954 MiB memory limit. Current results: mean 159 ms, median 26 ms, max 1389 ms. The mean sits well above the median, which points to a long tail of slow queries. Three queries fail with MEMORY_LIMIT_EXCEEDED. Resource controls exist (3-hop max, 1000-row limit, 30 s timeout), but no formal SLOs are defined.
## Tasks
### SLO definition
- [ ] Define compilation latency SLOs — p95, p99 targets for compile() itself
- [ ] Define end-to-end query latency SLOs per query type (search, traversal, aggregation, path_finding, neighbors) for .com and self-managed
- [ ] Define acceptable failure rate under normal load
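
One possible way to record the SLOs once they are agreed is a small typed structure keyed by query type and deployment. This is only a sketch: `LatencySLO`, its field names, and the numbers in `example_slo` are hypothetical placeholders for illustration, not proposed targets.

```python
from dataclasses import dataclass

# Query types and deployment targets named in the tasks above.
QUERY_TYPES = ("search", "traversal", "aggregation", "path_finding", "neighbors")
DEPLOYMENTS = (".com", "self-managed")


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical SLO record; the field names are assumptions."""
    query_type: str
    deployment: str
    p95_ms: float
    p99_ms: float
    max_failure_rate: float  # fraction of queries allowed to fail under normal load

    def __post_init__(self):
        if self.query_type not in QUERY_TYPES:
            raise ValueError(f"unknown query type: {self.query_type}")
        if self.deployment not in DEPLOYMENTS:
            raise ValueError(f"unknown deployment: {self.deployment}")

    def is_met(self, p95_ms: float, p99_ms: float, failure_rate: float) -> bool:
        """Return True when all measured values are within the targets."""
        return (
            p95_ms <= self.p95_ms
            and p99_ms <= self.p99_ms
            and failure_rate <= self.max_failure_rate
        )


# PLACEHOLDER values only; the real targets are what this issue has to define.
example_slo = LatencySLO("traversal", ".com", p95_ms=500.0, p99_ms=1500.0,
                         max_failure_rate=0.01)
```

Keeping compilation SLOs in the same shape (for example, with a separate record per stage) would let one check cover both `compile()` and end-to-end latency.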
### Fix known failures
- [ ] Fix "Shortest path between projects" (path_finding): fails with MEMORY_LIMIT_EXCEEDED
- [ ] Fix "Vulnerabilities fixed by MRs" (traversal): fails with MEMORY_LIMIT_EXCEEDED
- [ ] Fix "Long-running pipelines" (search): fails with MEMORY_LIMIT_EXCEEDED
### Benchmarking
- [ ] Establish performance baseline — capture current simulator results as official regression baseline
- [ ] Compilation-only benchmarks — measure compile() latency separate from ClickHouse execution
- [ ] Query type-specific benchmarks — separate p50/p95/p99 per query type
- [ ] Worst-case query pattern analysis — within allowed limits (3 hops, 1000 rows, 5 relationship steps), find the most expensive valid query
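
As a rough illustration of the per-query-type breakdown, the sketch below groups latency samples by query type and reports p50/p95/p99 using the nearest-rank method. The input shape (a list of `(query_type, latency_ms)` pairs) and the function names are assumptions; the simulator's actual output format may differ.

```python
from collections import defaultdict


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in [0, 100]."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # ceil(p * n / 100) in integer arithmetic, clamped to at least rank 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(results):
    """Group (query_type, latency_ms) pairs and compute p50/p95/p99 per type."""
    by_type = defaultdict(list)
    for query_type, latency_ms in results:
        by_type[query_type].append(latency_ms)
    return {
        query_type: {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
        for query_type, latencies in by_type.items()
    }
```

With only 29 queries in the current benchmark, each query type has few samples, so p95 and p99 will often equal the per-type maximum until each query is run repeatedly.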
### Continuous validation
- [ ] Concurrent query testing — multiple queries compiling and executing simultaneously
- [ ] Scale testing — run simulator at larger scales (500M+ edges) to find degrading patterns
- [ ] Nightly regression tests in CI — automated performance checks to catch regressions
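
A minimal sketch of how the concurrency and regression checks could fit together, assuming a callable `run_query` that compiles and executes one query (hypothetical; the real simulator entry point is not specified here). The 10% tolerance is likewise an arbitrary placeholder.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_ms(run_query, query):
    """Run one query and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    run_query(query)
    return (time.perf_counter() - start) * 1000.0


def run_concurrently(run_query, queries, workers=8):
    """Execute queries on a thread pool; latencies are returned in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: timed_ms(run_query, q), queries))


def find_regressions(baseline, current, tolerance=0.10):
    """Compare current metrics to a baseline.

    Both arguments map a metric key, e.g. ("search", "p95"), to milliseconds.
    A metric regresses when it exceeds the baseline by more than `tolerance`,
    or when it is missing from the current run.
    """
    regressions = []
    for key, base_ms in baseline.items():
        cur_ms = current.get(key)
        if cur_ms is None or cur_ms > base_ms * (1 + tolerance):
            regressions.append((key, base_ms, cur_ms))
    return regressions
```

A nightly CI job could fail when `find_regressions` returns a non-empty list and print the offending metrics in the job log.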