feat(compiler): truncate Date/DateTime group_by keys to bucket units
Summary
group_by on a Date or DateTime property used to return one bucket per raw microsecond, so any time-windowed aggregation (e.g. "failed jobs per month over the last 6 months") either hit the row limit immediately or returned thousands of one-row groups.
This MR adds an optional truncate field on Property group keys accepting minute, hour, day, week, month, quarter, or year. The lowering pass wraps the column in toStartOf<Unit>(...) for both the SELECT projection and the GROUP BY key.
Relates to gitlab-org/gitlab#599752 and #601 (closed).
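The wrapping step described above can be sketched roughly as follows. This is a minimal illustration, not the code in lower/aggregation.rs: TruncateUnit is named in this MR, but the variant spelling and the helpers clickhouse_fn, as_str, wrap_truncate, and default_output_name are hypothetical names chosen for the example.

```rust
/// Bucket units accepted by the `truncate` field (variant names assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TruncateUnit {
    Minute,
    Hour,
    Day,
    Week,
    Month,
    Quarter,
    Year,
}

impl TruncateUnit {
    /// ClickHouse function that rounds a date/time down to this unit.
    pub fn clickhouse_fn(self) -> &'static str {
        match self {
            TruncateUnit::Minute => "toStartOfMinute",
            TruncateUnit::Hour => "toStartOfHour",
            TruncateUnit::Day => "toStartOfDay",
            TruncateUnit::Week => "toStartOfWeek",
            TruncateUnit::Month => "toStartOfMonth",
            TruncateUnit::Quarter => "toStartOfQuarter",
            TruncateUnit::Year => "toStartOfYear",
        }
    }

    /// Lowercase unit name as written in the query DSL.
    pub fn as_str(self) -> &'static str {
        match self {
            TruncateUnit::Minute => "minute",
            TruncateUnit::Hour => "hour",
            TruncateUnit::Day => "day",
            TruncateUnit::Week => "week",
            TruncateUnit::Month => "month",
            TruncateUnit::Quarter => "quarter",
            TruncateUnit::Year => "year",
        }
    }
}

/// Hypothetical helper: wrap a column expression when a unit is set.
/// The same string would be used for the SELECT projection and the GROUP BY key.
pub fn wrap_truncate(column: &str, unit: Option<TruncateUnit>) -> String {
    match unit {
        Some(u) => format!("{}({})", u.clickhouse_fn(), column),
        None => column.to_string(),
    }
}

/// Hypothetical helper: default output name `{property}_{unit}`.
pub fn default_output_name(property: &str, unit: TruncateUnit) -> String {
    format!("{}_{}", property, unit.as_str())
}
```

With these assumptions, wrap_truncate("j.finished_at", Some(TruncateUnit::Month)) yields toStartOfMonth(j.finished_at), matching the SQL fragment quoted in the verification table.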
Cardinality guard
minute and hour truncations can produce a very large number of buckets when the query spans the full data-retention window. Validation rejects those two units unless the truncated node has node_ids set or the query has at least one filter on the truncated property. Coarser units (day, week, month, quarter, year) are always allowed, since their bucket count stays bounded under typical retention.
Error example: group_by[0]: truncate "minute" on "finished_at" requires either node_ids on "j" or at least one filter on "finished_at" to bound bucket cardinality.
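The guard can be sketched as a small pure function. This is an illustration under assumptions, not the code in validate.rs: the function name check_truncate_selectivity, its parameters, and the reduced TruncateUnit enum are hypothetical. The message text follows the error example above, without a trailing period.

```rust
/// Reduced copy of the unit enum so this sketch stands alone (names assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TruncateUnit {
    Minute,
    Hour,
    Day,
    Week,
    Month,
    Quarter,
    Year,
}

impl TruncateUnit {
    fn as_str(self) -> &'static str {
        match self {
            TruncateUnit::Minute => "minute",
            TruncateUnit::Hour => "hour",
            TruncateUnit::Day => "day",
            TruncateUnit::Week => "week",
            TruncateUnit::Month => "month",
            TruncateUnit::Quarter => "quarter",
            TruncateUnit::Year => "year",
        }
    }
}

/// Hypothetical guard: reject minute/hour truncation unless the node has
/// node_ids or the truncated property has at least one filter.
pub fn check_truncate_selectivity(
    index: usize,
    unit: TruncateUnit,
    node_alias: &str,
    property: &str,
    node_has_ids: bool,
    filters_on_property: usize,
) -> Result<(), String> {
    // Only the two finest units are guarded; coarser units always pass.
    let fine_grained = matches!(unit, TruncateUnit::Minute | TruncateUnit::Hour);
    if !fine_grained || node_has_ids || filters_on_property > 0 {
        return Ok(());
    }
    Err(format!(
        "group_by[{}]: truncate \"{}\" on \"{}\" requires either node_ids on \"{}\" \
         or at least one filter on \"{}\" to bound bucket cardinality",
        index,
        unit.as_str(),
        property,
        node_alias,
        property
    ))
}
```

Keeping the check free of I/O means it can run before any SQL is generated, which is consistent with the "rejected pre-execution" result in the verification table.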
Why arrow.rs changed
toStartOfMonth/Quarter/Year/Week/Day return ClickHouse Date, which Arrow streams as Date32Array. ArrowUtils::extract_value had no Date32/Date64 arm, so bucket keys arrived at the formatter as null. Added the arms; bucket keys now serialize as YYYY-MM-DD strings.
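For reference, Arrow's Date32 stores days since 1970-01-01 and Date64 stores milliseconds since the same epoch. The conversion to a YYYY-MM-DD string can be sketched without the arrow crate as below. The function names are hypothetical and this is not the body of ArrowUtils::extract_value; it only illustrates the arithmetic, using the well-known days-to-civil-date algorithm.

```rust
/// Convert days since 1970-01-01 to (year, month, day) in the
/// proleptic Gregorian calendar.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    // Shift the epoch to 0000-03-01 so leap days fall at the end of a year.
    let z = days + 719_468;
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // 0..=399
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // 0..=365, March-based
    let mp = (5 * doy + 2) / 153; // March-based month, 0..=11
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    // January and February belong to the following civil year.
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Hypothetical helper: format an Arrow Date32 value (days since epoch).
pub fn date32_to_string(days: i32) -> String {
    let (y, m, d) = civil_from_days(days as i64);
    format!("{:04}-{:02}-{:02}", y, m, d)
}

/// Hypothetical helper: format an Arrow Date64 value (milliseconds since epoch).
/// Any time-of-day component is discarded by flooring to whole days.
pub fn date64_to_string(millis: i64) -> String {
    let (y, m, d) = civil_from_days(millis.div_euclid(86_400_000));
    format!("{:04}-{:02}-{:02}", y, m, d)
}
```

Flooring with div_euclid rather than plain division keeps pre-1970 Date64 values on the correct calendar day.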
Verification
| Check | Result |
|---|---|
| Compiler unit tests | 332 / 332 pass |
| Local integration tests | 191 / 191 pass |
| Data-correctness subtests (ClickHouse testcontainer) | 83 / 83 pass, including a new monthly-bucket subtest |
| mise lint:code (clippy + fmt) | clean |
| Prod aggregation: truncate=month on Job, 3-month window | 3 buckets returned, SQL contains toStartOfMonth(j.finished_at) |
| Prod aggregation: truncate=day on Job, ~3-week window | 19 buckets returned, SQL contains toStartOfDay(j.finished_at) |
| Prod aggregation: truncate=minute without selectivity | rejected pre-execution by validate with clear error |
Touched files
- input.rs: new TruncateUnit enum, truncate field on InputGroupByKey::Property, output name defaults to {property}_{unit}.
- validate.rs: data-type check and selectivity guard.
- lower/aggregation.rs: wraps the column expression.
- arrow.rs: Date32/Date64 extraction.
- graph_query.schema.json: truncate enum on the Property variant.
QUERY_DSL_VERSION bumped 2.0.0 -> 2.1.0.