Add columns and indices on web hooks [RUN ALL RSPEC] [RUN AS-IF-FOSS]

Merged Alex Kalderimis requested to merge ajk-disable-broken-webhooks into master

What does this MR do?

This MR makes changes to the backend representation of webhooks so that we can mark repeatedly failing webhooks as failing, and avoid executing them in the future.

Follow-up work will surface this information in the UI, and allow the web-hooks to be reset.

see: #329213

Backoff Strategy

See this comment for an explanation of the logic behind the back-off strategy: https://gitlab.com/gitlab-org/gitlab/-/issues/329213#note_564836459

The backoff strategy implemented is:

  • regular 4xx failures: 3 strikes policy. Once we have seen 4 failures then we consider the hook disabled, and never try it again
  • 5xx failures, and more low level stuff (network errors, etc): exponential backoff.

The cooldown for 500 errors starts at 10 minutes, and is triggered on the first internal server error. The backoff factor is 2.0. The backoff never exceeds 24 hours.

The reasoning is that 4xx indicates that the webhook was probably set up incorrectly, but 5xx suggests an unhealthy service, one which we may just need to give some cool-down time.

429 TOO MANY REQUESTS is a possible exception - we might want to handle this differently.

Manual QA

To play around with this feature, you need to enable the feature-flag:

Feature.enable(:web_hooks_disable_failed)

Then we can verify that this work by:

  • creating a web hook to an endpoint (I like to use https://webhook.site)
  • verify that the endpoint receives data
  • modify the endpoint to return failing statuses (e.g. 400)
  • make at least 4 changes - check that the endpoint receives 4 messages
  • verify that the endpoint stops receiving data

We can then double check in the console:

p = Project.find_by_full_path('some-path/to-project')
hook = ProjectHook.where(project: p).first
hook.executable?
=> false
hook.web_hook_logs.pluck(:response_status)
=> [... "400", "400", "400", "400"]

We can now restore the hook. First in the console:

p = Project.find_by_full_path('some-path/to-project')
hook = ProjectHook.where(project: p).first
hook.enable!
hook.executable?
=> true

Now make any further change, and verify that the endpoint receives a new message.

Migration output:

Up

== 20210429181325 AddFailureTrackingToWebHooks: migrating =====================
-- change_table(:web_hooks, {:bulk=>true})
   -> 0.0052s
== 20210429181325 AddFailureTrackingToWebHooks: migrated (0.0053s) ============

== 20210504085144 AddIndexOnWebHookProjectIdRecentFailures: migrating =========
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:web_hooks, [:project_id, :recent_failures], {:name=>"index_web_hooks_on_project_id_recent_failures", :algorithm=>:concurrently})
   -> 0.0074s
-- add_index(:web_hooks, [:project_id, :recent_failures], {:name=>"index_web_hooks_on_project_id_recent_failures", :algorithm=>:concurrently})
   -> 0.0186s
== 20210504085144 AddIndexOnWebHookProjectIdRecentFailures: migrated (0.0283s)

Down

== 20210504085144 AddIndexOnWebHookProjectIdRecentFailures: reverting =========
-- transaction_open?()
   -> 0.0000s
-- indexes(:web_hooks)
   -> 0.0039s
-- remove_index(:web_hooks, {:algorithm=>:concurrently, :name=>"index_web_hooks_on_project_id_recent_failures"})
   -> 0.0028s
== 20210504085144 AddIndexOnWebHookProjectIdRecentFailures: reverted (0.0078s)

== 20210429181325 AddFailureTrackingToWebHooks: reverting =====================
-- remove_column(:web_hooks, :disabled_until, :timestamptz, {})
   -> 0.0029s
-- remove_column(:web_hooks, :backoff_count, :integer, {:null=>false, :limit=>2, :default=>0})
   -> 0.0013s
-- remove_column(:web_hooks, :recent_failures, :integer, {:null=>false, :limit=>2, :default=>0})
   -> 0.0015s
== 20210429181325 AddFailureTrackingToWebHooks: reverted (0.0071s) ============

### Changes to queries

This MR changes select_active to add a new where clause:

SELECT "web_hooks".* FROM "web_hooks"
  WHERE "web_hooks"."project_id" = 278964
    AND (recent_failures <= 3
    AND (disabled_until IS NULL OR disabled_until < '2021-05-04 06:23:39.829460'))

Plan:

 Index Scan using web_hooks_project_id_recent_failures_disabled_until_idx on public.web_hooks  (cost=0.42..3.45 rows=1 width=282) (actual time=0.151..0.170 rows=2 loops=1)
   Index Cond: ((web_hooks.project_id = 278964) AND (web_hooks.recent_failures <= 3))
   Filter: ((web_hooks.disabled_until IS NULL) OR (web_hooks.disabled_until < '2021-05-04 06:23:39.82946+00'::timestamp with time zone))
   Rows Removed by Filter: 0
   Buffers: shared hit=2 read=3
   I/O Timings: read=0.078

Timings:

Time: 2.317 ms
  - planning: 2.109 ms
  - execution: 0.208 ms
    - I/O read: 0.078 ms
    - I/O write: N/A

Shared buffers:
  - hits: 2 (~16.00 KiB) from the buffer pool
  - reads: 3 (~24.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

This MR changes the query in Group.execute_hooks to include the executable scope:

SQL:

SELECT "web_hooks".* FROM "web_hooks"
  WHERE "web_hooks"."type" = 'GroupHook'
    AND "web_hooks"."group_id" IN (SELECT "namespaces"."id" FROM "namespaces" WHERE "namespaces"."type" = 'Group' AND "namespaces"."id" = 22)
    AND (recent_failures <= 3 AND (disabled_until IS NULL OR disabled_until < '2021-05-04 10:27:08.292342'))

Plan:

 Nested Loop Semi Join  (cost=0.71..6.77 rows=1 width=282) (actual time=2.163..2.164 rows=0 loops=1)
   Buffers: shared read=2
   I/O Timings: read=2.115
   ->  Index Scan using index_web_hooks_on_group_id on public.web_hooks  (cost=0.28..3.31 rows=1 width=282) (actual time=2.162..2.162 rows=0 loops=1)
         Index Cond: (web_hooks.group_id = 22)
         Filter: ((web_hooks.recent_failures <= 3) AND ((web_hooks.disabled_until IS NULL) OR (web_hooks.disabled_until < '2021-05-04 10:27:08.292342+00'::timestamp with time zone)))
         Rows Removed by Filter: 0
         Buffers: shared read=2
         I/O Timings: read=2.115
   ->  Index Only Scan using index_namespaces_on_type_and_id_partial on public.namespaces  (cost=0.43..3.45 rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=0)
         Index Cond: ((namespaces.type = 'Group'::text) AND (namespaces.id = 22))
         Heap Fetches: 0

Timings:

Time: 5.736 ms
  - planning: 3.533 ms
  - execution: 2.203 ms
    - I/O read: 2.115 ms
    - I/O write: N/A

Shared buffers:
  - hits: 0 from the buffer pool
  - reads: 2 (~16.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

The main risks of this change is that we may disable a user's web hooks in a way that prevents them from doing anything to remediate the situation. For this reason, we should not enable the feature flag until UI elements have been introduced.

Edited by Alex Kalderimis