Track the number of unarchived jobs which should have been archived already
Summary
Discussed in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6405#note_152133805 and https://gitlab.slack.com/archives/CA5NVTFFB/p1553013244116000?thread_ts=1553002763.113400&cid=CA5NVTFFB.
We should track the number of Ci::Build.finished.with_live_trace.where("finished_at < ?", 30.minutes.ago).count. This is the number of unarhicved live traces, which should have been archived already.
This number could increase by an incident on dependent services, such as NFS, Object Storage, PostgreSQL and Redis.
Database Query
This database query looks like this
SELECT COUNT(*) FROM "ci_builds" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."status" IN ('success', 'failed', 'canceled') AND (EXISTS (SELECT 1 FROM "ci_build_trace_chunks" WHERE (ci_builds.id = ci_build_trace_chunks.build_id))) AND "ci_builds"."finished_at" < '2019-03-20 01:00:40.652986';
The query plan
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=15529.74..15529.75 rows=1 width=8) (actual time=111.441..111.441 rows=1 loops=1)
-> Nested Loop (cost=1160.72..15523.62 rows=2445 width=0) (actual time=11.697..111.312 rows=1017 loops=1)
-> HashAggregate (cost=1160.15..1199.89 rows=3974 width=4) (actual time=11.527..13.502 rows=3964 loops=1)
Group Key: ci_build_trace_chunks.build_id
-> Index Only Scan using index_ci_build_trace_chunks_on_build_id_and_chunk_index on ci_build_trace_chunks (cost=0.41..1142.83 rows=6925 width=4) (actual time=0.037..10.048 rows=6862 loops=1)
Heap Fetches: 6238
-> Index Scan using ci_builds_pkey on ci_builds (cost=0.57..3.59 rows=1 width=4) (actual time=0.023..0.024 rows=0 loops=3964)
Index Cond: (id = ci_build_trace_chunks.build_id)
Filter: ((finished_at < '2019-03-20 01:00:40.652986'::timestamp without time zone) AND ((type)::text = 'Ci::Build'::text) AND ((status)::text = ANY ('{success,failed,canceled}'::text[])))
Rows Removed by Filter: 1
Planning time: 4.744 ms
Execution time: 111.614 ms
(12 rows)