Expose unrecovered import failures in /import REST endpoint

Matthias Käppler requested to merge 210522-api-expose-import-failures into master

What does this MR do?

Refs #210522 (closed)

In the first MR we started by tracking and exposing correlation IDs in ProjectImportState. This MR completes the story by adding the ability to fetch "hard failures" (those relations that failed to import due to some unrecoverable error, or after retries were exhausted), correlate them to import state, and expose them via the /import REST endpoints.

Note the new failed_relations element:

{
  "id": 39,
  "description": "Ut soluta dignissimos sapiente atque nam eius perferendis.",
  "name": "test-import",
  "name_with_namespace": "Administrator / test-import",
  "path": "test-import",
  "path_with_namespace": "root/test-import",
  "created_at": "2020-04-03T12:48:12.314Z",
  "import_status": "finished",
  "correlation_id": "305413002760596bde01209d46a83cec",
  "failed_relations": [
    {
      "id": 324,
      "created_at": "2020-04-02T14:48:59.526Z",
      "exception_class": "RuntimeError",
      "exception_message": "blah",
      "source": "process_relation_item!",
      "relation_name": "ci_cd_settings"
    }
  ]
}

I deliberately did not map ImportFailure to a REST resource 1:1, since it is a fairly technical/internal concept. Instead, I defined a resource called ProjectImportFailedRelation, which is a narrower concept than just any failure (for instance, a failure can be "soft" in the sense that it triggered a retry but ultimately succeeded; we are not interested in exposing these via the API, as they have no bearing on the outcome of the import). A failed relation as a REST resource is therefore always a "hard failure", i.e. a relation that failed to be imported even though the import overall finished successfully.

For MVC reasons, I also decided to simply cap the nested array at 100 items. Until there is a product requirement that these failures will actually be surfaced inside the product itself and be paginated or even fetched separately, I don't think we need endpoints to fetch them in full. For now we will instead report them via a pipeline into Slack as "hey look, we have an import that finished but had 100+ individual failures, maybe you should go look at this."
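The selection logic described above can be sketched in plain Ruby. The Struct and sample data here are illustrative stand-ins, not the actual ImportFailure model; the key point is that hard failures are the records left with retry_count = 0, and the API caps the result at 100:

```ruby
require 'time'

# Illustrative stand-in for an ImportFailure record (not the real model).
Failure = Struct.new(:relation_name, :retry_count, :created_at, keyword_init: true)

MAX_FAILED_RELATIONS = 100

# Sketch of the "hard failure" selection: per this MR, hard failures are
# recorded with retry_count = 0 (unrecoverable, or retries exhausted),
# while soft failures carry the retry count that eventually succeeded.
def hard_failures(failures)
  failures
    .select { |f| f.retry_count.zero? }
    .sort_by(&:created_at)
    .reverse
    .first(MAX_FAILED_RELATIONS)
end

failures = [
  Failure.new(relation_name: 'ci_cd_settings', retry_count: 0,
              created_at: Time.parse('2020-04-02T14:48:59Z')),
  Failure.new(relation_name: 'labels', retry_count: 2,
              created_at: Time.parse('2020-04-02T14:50:00Z'))
]

hard_failures(failures).map(&:relation_name)
# => ["ci_cd_settings"]
```

In the real code this filtering happens in SQL (see the query and partial index below), not in Ruby; the snippet only illustrates the predicate and the 100-item cap.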

Database

After some experimentation, we decided to add a partial index on import_failures covering rows with retry_count = 0. This speeds up the selection of hard failures.

  • db/migrate/20200409085956_add_partial_index_on_import_failures_retry_count.rb
  • db/structure.sql
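Based on the migration output below, the migration presumably looks something like this. This is a sketch only: it assumes GitLab's standard add_concurrent_index/remove_concurrent_index migration helpers, and the index name is inferred from the EXPLAIN plan further down, so details may differ from the actual file:

```ruby
# db/migrate/20200409085956_add_partial_index_on_import_failures_retry_count.rb
# Sketch -- assumes GitLab's Gitlab::Database::MigrationHelpers; the index
# name is inferred from the EXPLAIN output and may not match the real one.
class AddPartialIndexOnImportFailuresRetryCount < ActiveRecord::Migration[6.0]
  include Gitlab::Database::MigrationHelpers

  DOWNTIME = false
  INDEX_NAME = 'idx_import_failures_on_hard_failures'

  disable_ddl_transaction!

  def up
    add_concurrent_index :import_failures, [:project_id, :correlation_id_value],
      where: 'retry_count = 0', name: INDEX_NAME
  end

  def down
    remove_concurrent_index :import_failures, [:project_id, :correlation_id_value],
      name: INDEX_NAME
  end
end
```

The WHERE clause makes this a partial index: it only covers hard-failure rows, so it stays small even on instances with many retried (soft) failures.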
== 20200409085956 AddPartialIndexOnImportFailuresRetryCount: migrating ========
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:import_failures, [:project_id, :correlation_id_value], {:where=>"retry_count = 0", :algorithm=>:concurrently})
   -> 0.0035s
-- execute("SET statement_timeout TO 0")
   -> 0.0003s
-- add_index(:import_failures, [:project_id, :correlation_id_value], {:where=>"retry_count = 0", :algorithm=>:concurrently})
   -> 0.0065s
-- execute("RESET ALL")
   -> 0.0003s
== 20200409085956 AddPartialIndexOnImportFailuresRetryCount: migrated (0.0108s)

git@f3c2b5e07624:~/gitlab$ bin/rake db:migrate:down VERSION=20200409085956

== 20200409085956 AddPartialIndexOnImportFailuresRetryCount: reverting ========
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:import_failures, [:project_id, :correlation_id_value], {:algorithm=>:concurrently})
   -> 0.0038s
-- execute("SET statement_timeout TO 0")
   -> 0.0003s
-- remove_index(:import_failures, {:algorithm=>:concurrently, :column=>[:project_id, :correlation_id_value]})
   -> 0.0062s
-- execute("RESET ALL")
   -> 0.0003s
== 20200409085956 AddPartialIndexOnImportFailuresRetryCount: reverted (0.0109s)

Query:

explain
  SELECT "import_failures".* 
  FROM "import_failures" 
  WHERE "import_failures"."project_id" = <snip> 
    AND "import_failures"."correlation_id_value" = <snip> 
    AND "import_failures"."retry_count" = 0
  ORDER BY "import_failures"."created_at" DESC 
  LIMIT 100

Before index:

Limit  (cost=6.30..6.30 rows=1 width=160) (actual time=2.056..2.056 rows=0 loops=1)
   Buffers: shared hit=3 read=2
   I/O Timings: read=1.834
   ->  Sort  (cost=6.30..6.30 rows=1 width=160) (actual time=2.053..2.055 rows=0 loops=1)
         Sort Key: import_failures.created_at DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=3 read=2
         I/O Timings: read=1.834
         ->  Index Scan using index_import_failures_on_correlation_id_value on public.import_failures  (cost=0.29..6.29 rows=1 width=160) (actual time=1.907..1.907 rows=0 loops=1)
               Index Cond: ((import_failures.correlation_id_value)::text = 'cid'::text)
               Filter: ((import_failures.project_id = <snip>) AND (import_failures.retry_count = 0))
               Rows Removed by Filter: 0
               Buffers: shared read=2
               I/O Timings: read=1.834

After index:

Limit  (cost=4.32..4.32 rows=1 width=160) (actual time=0.266..0.266 rows=0 loops=1)
   Buffers: shared hit=6 read=2
   I/O Timings: read=0.169
   ->  Sort  (cost=4.32..4.32 rows=1 width=160) (actual time=0.264..0.264 rows=0 loops=1)
         Sort Key: import_failures.created_at DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=6 read=2
         I/O Timings: read=0.169
         ->  Index Scan using idx_import_failures_on_hard_failures on public.import_failures  (cost=0.29..4.31 rows=1 width=160) (actual time=0.198..0.198 rows=0 loops=1)
               Index Cond: ((import_failures.project_id = <snip>) AND ((import_failures.correlation_id_value)::text = 'cid'::text))
               Buffers: shared hit=3 read=2
               I/O Timings: read=0.169

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Tested locally against a puma container as well.
