Archive old authentication_events

What does this MR do and why?

Enforces a 1 year retention policy on records in the authentication_events table. This table is currently over the large table limit, and a 1 year retention policy will bring it well under the limit.

What

Enforce the 1 year retention policy by:

  • Adding authentication_event_archived_records table and basic model
    • The table has no FKs and minimal constraints because it is meant to be a simple dumping ground for data
    • The Rails model has no real functionality since it won't be used throughout the application; it will only be used in one data migration
  • Adding ArchiveAuthenticationEvents batched background migration.
    • Records created more than 1 year ago in authentication_events are copied to authentication_event_archived_records and then deleted from authentication_events
    • In effect, the records are soft deleted: they leave the operational table but remain available in the archive.
  • Enqueue the BBM only for gitlab.com. We will enqueue for self-managed in a follow-up, after we've confirmed the BBM successfully ran for .com.
  • We'll drop the authentication_event_archived_records table in a few milestones, once we've confirmed the retention policy is being enforced with no issues
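The copy-then-delete step described above runs as a single CTE statement. A minimal Ruby sketch of how that statement could be built (the helper name is hypothetical, and the real BBM would bind values rather than interpolating them):

```ruby
# Illustrative helper, not the actual BBM code. Builds the atomic
# DELETE ... RETURNING + INSERT statement that moves expired rows from
# authentication_events into the archive table in one round trip.
def archive_batch_sql(start_id:, end_id:, cutoff:, limit:)
  <<~SQL
    WITH deleted_records AS (
      DELETE FROM authentication_events
      WHERE id IN (
        SELECT id FROM authentication_events
        WHERE id BETWEEN #{Integer(start_id)} AND #{Integer(end_id)}
          AND created_at <= '#{cutoff}'
        LIMIT #{Integer(limit)}
      )
      RETURNING *
    )
    INSERT INTO authentication_event_archived_records
    (SELECT * FROM deleted_records)
  SQL
end
```

Because the DELETE and INSERT execute in one statement, a row is never deleted without its copy landing in the archive.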

Data

The migration will delete 77.1% of rows from this table. See this comment for more details.

Why?

authentication_events is now over the large table limit. This work supports pruning the operational table and thus getting it back under the table limit.

References

See this issue for detailed planning discussion, where multiple approaches are considered: #555382

Issue: #545007

Is this reversible?

  • The migration to queue the batched background migration is reversible.
  • The operations performed by the batched background migration are not automatically reversible.
    • The BBM is deleting records from authentication_events and inserting them into authentication_event_archived_records
      • The DELETE+INSERT is atomic, so every row deleted from authentication_events has a copy stored in authentication_event_archived_records
      • To restore deleted records, we'd want to run queries to transfer records in the opposite direction, from authentication_event_archived_records to authentication_events.
        • Likely as its own BBM. Or spot transfer via console.

For example, this query will restore 3 records from the archive:

INSERT INTO authentication_events
  (id, created_at, user_id, result, ip_address, provider, user_name)
SELECT
  id, created_at, user_id, result, ip_address, provider, user_name
FROM authentication_event_archived_records archive
WHERE archive.id IN (1, 2, 3)
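For bulk restores, the column list above could be generated rather than hand-written; a minimal sketch (helper and constant names are illustrative, not code from this MR):

```ruby
# Illustrative only. Builds the reverse-direction statement: copy archived
# rows back into authentication_events. The column list excludes
# archived_at, which exists only on the archive table.
RESTORE_COLUMNS = %w[id created_at user_id result ip_address provider user_name].freeze

def restore_sql(ids)
  <<~SQL
    INSERT INTO authentication_events (#{RESTORE_COLUMNS.join(', ')})
    SELECT #{RESTORE_COLUMNS.join(', ')}
    FROM authentication_event_archived_records
    WHERE id IN (#{ids.map { |id| Integer(id) }.join(', ')})
  SQL
end
```

A restore BBM would batch over id ranges instead of explicit id lists, but the column mapping stays the same.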

Possible impact of error?

No critical flows would be blocked in case of error. Any errors would largely result in under-reporting data:

  • Show fewer auth logs to users than we should
  • Consider fewer auth log entries in "most recent IP address" calculation, causing e-mail verification to trigger more frequently than normal
  • "Most used IP address" on abuse reports detail page considers fewer IPs than normal

All these impacts would be fixed as soon as deleted records are restored from archive.

Screenshots or screen recordings

Batched background migration queries

Queries for 1 batch run, from rspec output (see this comment for details):

SELECT "authentication_events"."id" FROM "authentication_events" WHERE "authentication_events"."id" BETWEEN 1 AND 3 ORDER BY "authentication_events"."id" ASC LIMIT 1

-- sub-batch 1

SELECT "authentication_events"."id" FROM "authentication_events" WHERE "authentication_events"."id" BETWEEN 1 AND 3 AND "authentication_events"."id" >= 1 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 2

WITH deleted_records AS (
    DELETE FROM authentication_events
    WHERE id IN (SELECT "authentication_events"."id" FROM "authentication_events" WHERE "authentication_events"."id" BETWEEN 1 AND 3 AND "authentication_events"."id" >= 1 AND "authentication_events"."id" < 3 AND "authentication_events"."created_at" <= '2023-01-01 00:00:00' LIMIT 2)
    RETURNING *
)
INSERT INTO authentication_event_archived_records (SELECT * FROM deleted_records)

-- sub-batch 2

SELECT "authentication_events"."id" FROM "authentication_events" WHERE "authentication_events"."id" BETWEEN 1 AND 3 AND "authentication_events"."id" >= 3 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 2 /*application:test,correlation_id:fd560589b624884ddf7351e28075a1f1,db_config_database:gitlabhq_test,db_config_name:main,line:/app/models/concerns/each_batch.rb:84:in `block in each_batch'*/

WITH deleted_records AS (
    DELETE FROM authentication_events
    WHERE id IN (SELECT "authentication_events"."id" FROM "authentication_events" WHERE "authentication_events"."id" BETWEEN 1 AND 3 AND "authentication_events"."id" >= 3 AND "authentication_events"."created_at" <= '2023-01-01 00:00:00' LIMIT 2)
    RETURNING *
)
INSERT INTO authentication_event_archived_records (SELECT * FROM deleted_records)
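The LIMIT 1 OFFSET n selects above locate each sub-batch's upper bound. A pure-Ruby simulation of that boundary logic (illustrative only; in GitLab this iteration is handled by the each_batch helper, as the query comments show):

```ruby
# Simulates the sub-batch boundary queries above: within a batch
# [start_id, end_id], each sub-batch's upper bound is the id found at
# OFFSET sub_batch_size from the current lower bound; nil means the
# remaining rows fit in the final sub-batch (no "id < upper" condition).
def sub_batches(ids, start_id, end_id, sub_batch_size)
  in_batch = ids.select { |id| id.between?(start_id, end_id) }.sort
  bounds = []
  lower = start_id
  loop do
    remaining = in_batch.select { |id| id >= lower }
    break if remaining.empty?
    upper = remaining[sub_batch_size] # mirrors LIMIT 1 OFFSET sub_batch_size
    bounds << [lower, upper]
    break if upper.nil?
    lower = upper
  end
  bounds
end
```

With ids 1..3 and a sub-batch size of 2, this yields [[1, 3], [3, nil]], matching the two sub-batches in the rspec output: the first bounded by id >= 1 AND id < 3, the second by id >= 3 alone.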

Query plans

Batched selects

Query plans
explain SELECT authentication_events.id FROM authentication_events WHERE authentication_events.id BETWEEN 20069040 AND 20757411 AND authentication_events.id >= 20069040 ORDER BY authentication_events.id ASC LIMIT 1 OFFSET 1000

Limit  (cost=27.92..27.94 rows=1 width=8) (actual time=3.047..3.048 rows=1 loops=1)  
   Buffers: shared hit=14 read=13  
   I/O Timings: read=2.791 write=0.000  
   ->  Index Only Scan using authentication_events_pkey on public.authentication_events  (cost=0.57..18620.51 rows=680929 width=8) (actual time=1.004..3.016 rows=1001 loops=1)  
         Index Cond: ((authentication_events.id >= 20069040) AND (authentication_events.id <= 20757411) AND (authentication_events.id >= 20069040))

Time: 4.032 ms  
  - planning: 0.951 ms  
  - execution: 3.081 ms  
    - I/O read: 2.791 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 14 (~112.00 KiB) from the buffer pool  
  - reads: 13 (~104.00 KiB) from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0  
explain SELECT authentication_events.* FROM authentication_events WHERE authentication_events.id BETWEEN 20069040 AND 20757411 AND authentication_events.id >= 20069040 AND authentication_events.id < 20070040 AND authentication_events.created_at <= '2024-08-27 00:00:00'

Index Scan using authentication_events_pkey on public.authentication_events  (cost=0.57..86.67 rows=663 width=60) (actual time=0.876..72.699 rows=1000 loops=1)
   Index Cond: ((authentication_events.id >= 20069040) AND (authentication_events.id <= 20757411) AND (authentication_events.id >= 20069040) AND (authentication_events.id < 20070040))
   Filter: (authentication_events.created_at <= '2024-08-27 00:00:00+00'::timestamp with time zone)
   Rows Removed by Filter: 0
   Buffers: shared hit=879 read=123
   I/O Timings: read=70.453 write=0.000
Settings: seq_page_cost = '4', work_mem = '100MB', effective_cache_size = '472585MB', jit = 'off', random_page_cost = '1.5'

Time: 73.986 ms  
  - planning: 1.140 ms  
  - execution: 72.846 ms  
    - I/O read: 70.453 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 879 (~6.90 MiB) from the buffer pool  
  - reads: 123 (~984.00 KiB) from the OS file cache, including disk I/O  
  - dirtied: 0  
  - writes: 0  

CTE DELETE+INSERT

Query plan
explain WITH deleted_records AS (
    DELETE FROM authentication_events
        WHERE id IN (SELECT "authentication_events"."id" FROM "authentication_events" WHERE "authentication_events"."id" BETWEEN 20069040 AND 20757411 AND "authentication_events"."id" >= 20069040 AND "authentication_events"."id" < 20070040 AND "authentication_events"."created_at" <= '2024-08-27 00:00:00' LIMIT 1000)
        RETURNING *
)
INSERT INTO authentication_event_archived_records
  (id, created_at, user_id, result, ip_address, provider, user_name, archived_at)
SELECT
    id, created_at, user_id, result, ip_address, provider, user_name, CURRENT_TIMESTAMP as archived_at
FROM deleted_records

 ModifyTable on public.authentication_event_archived_records  (cost=3025.81..3043.97 rows=0 width=0) (actual time=111.700..111.703 rows=0 loops=1)
   Buffers: shared hit=9019 read=134 dirtied=143 written=16
   WAL: records=2001 fpi=124 bytes=1191788
   I/O Timings: read=99.223 write=0.364
   CTE deleted_records
     ->  ModifyTable on public.authentication_events  (cost=121.18..3025.81 rows=807 width=38) (actual time=100.768..108.690 rows=1000 loops=1)
           Buffers: shared hit=7991 read=134 dirtied=129
           WAL: records=1000 fpi=123 bytes=1056683
           I/O Timings: read=99.223 write=0.000
           ->  Nested Loop  (cost=121.18..3025.81 rows=807 width=38) (actual time=100.088..103.170 rows=1000 loops=1)
                 Buffers: shared hit=5874 read=128
                 I/O Timings: read=95.678 write=0.000
                 ->  HashAggregate  (cost=120.61..128.68 rows=807 width=40) (actual time=100.059..100.335 rows=1000 loops=1)
                       Group Key: "ANY_subquery".id
                       Buffers: shared hit=874 read=128
                       I/O Timings: read=95.678 write=0.000
                       ->  Subquery Scan on ANY_subquery  (cost=0.57..118.59 rows=807 width=40) (actual time=5.359..99.063 rows=1000 loops=1)
                             Buffers: shared hit=874 read=128
                             I/O Timings: read=95.678 write=0.000
                             ->  Limit  (cost=0.57..110.52 rows=807 width=8) (actual time=5.340..98.506 rows=1000 loops=1)
                                   Buffers: shared hit=874 read=128
                                   I/O Timings: read=95.678 write=0.000
                                   ->  Index Scan using authentication_events_pkey on public.authentication_events authentication_events_1  (cost=0.57..110.52 rows=807 width=8) (actual time=5.337..98.231 rows=1000 loops=1)
                                         Index Cond: ((authentication_events_1.id >= 20069040) AND (authentication_events_1.id <= 20757411) AND (authentication_events_1.id >= 20069040) AND (authentication_events_1.id < 20070040))
                                         Filter: (authentication_events_1.created_at <= '2024-08-27 00:00:00+00'::timestamp with time zone)
                                         Rows Removed by Filter: 0
                                         Buffers: shared hit=874 read=128
                                         I/O Timings: read=95.678 write=0.000
                 ->  Index Scan using authentication_events_pkey on public.authentication_events  (cost=0.57..3.59 rows=1 width=14) (actual time=0.002..0.002 rows=1 loops=1000)
                       Index Cond: (authentication_events.id = "ANY_subquery".id)
                       Buffers: shared hit=5000
                       I/O Timings: read=0.000 write=0.000
   ->  CTE Scan on deleted_records  (cost=0.00..18.16 rows=807 width=130) (actual time=100.777..109.522 rows=1000 loops=1)
         Buffers: shared hit=7991 read=134 dirtied=129
         WAL: records=1000 fpi=123 bytes=1056683
         I/O Timings: read=99.223 write=0.000
Settings: effective_cache_size = '472585MB', jit = 'off', work_mem = '100MB', random_page_cost = '1.5', seq_page_cost = '4'

Time: 113.705 ms  
  - planning: 1.782 ms  
  - execution: 111.923 ms  
    - I/O read: 99.223 ms  
    - I/O write: 0.364 ms  
  
Shared buffers:  
  - hits: 9019 (~70.50 MiB) from the buffer pool  
  - reads: 134 (~1.00 MiB) from the OS file cache, including disk I/O  
  - dirtied: 143 (~1.10 MiB)  
  - writes: 16 (~128.00 KiB) 

Compare operational and archive table schemas:

`authentication_events` schema
-- structure.sql 

CREATE TABLE authentication_events (
    id bigint NOT NULL,
    created_at timestamp with time zone NOT NULL,
    user_id bigint,
    result smallint NOT NULL,
    ip_address inet,
    provider text NOT NULL,
    user_name text NOT NULL,
    CONSTRAINT check_45a6cc4e80 CHECK ((char_length(user_name) <= 255)),
    CONSTRAINT check_c64f424630 CHECK ((char_length(provider) <= 64))
);

-- psql \d+ authentication_events

   Column   |           Type           | Collation | Nullable |                      Default                      | Storage  | Compression | Stats target | Description
------------+--------------------------+-----------+----------+---------------------------------------------------+----------+-------------+--------------+-------------
 id         | bigint                   |           | not null | nextval('authentication_events_id_seq'::regclass) | plain    |             |              |
 created_at | timestamp with time zone |           | not null |                                                   | plain    |             |              |
 user_id    | bigint                   |           |          |                                                   | plain    |             |              |
 result     | smallint                 |           | not null |                                                   | plain    |             |              |
 ip_address | inet                     |           |          |                                                   | main     |             |              |
 provider   | text                     |           | not null |                                                   | extended |             |              |
 user_name  | text                     |           | not null |                                                   | extended |             |              |
Indexes:
    "authentication_events_pkey" PRIMARY KEY, btree (id)
    "index_authentication_events_on_created_at" btree (created_at)
    "index_authentication_events_on_provider" btree (provider)
    "index_authentication_events_on_user_and_ip_address_and_result" btree (user_id, ip_address, result)
    "index_successful_authentication_events_for_metrics" btree (user_id, provider, created_at) WHERE result = 1
Check constraints:
    "check_45a6cc4e80" CHECK (char_length(user_name) <= 255)
    "check_c64f424630" CHECK (char_length(provider) <= 64)
Foreign-key constraints:
    "fk_rails_b204656a54" FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE SET NULL
Access method: heap
`authentication_event_archived_records` schema
-- structure.sql

CREATE TABLE authentication_event_archived_records (
    id bigint NOT NULL,
    created_at timestamp with time zone NOT NULL,
    user_id bigint,
    result smallint NOT NULL,
    ip_address inet,
    provider text NOT NULL,
    user_name text NOT NULL,
    archived_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT check_0a55f02112 CHECK ((char_length(provider) <= 64)),
    CONSTRAINT check_3b5f782c5d CHECK ((char_length(user_name) <= 255))
);

-- psql \d+ authentication_event_archived_records

                                          Table "public.authentication_event_archived_records"
   Column    |           Type           | Collation | Nullable |      Default      | Storage  | Compression | Stats target | Description
-------------+--------------------------+-----------+----------+-------------------+----------+-------------+--------------+-------------
 id          | bigint                   |           | not null |                   | plain    |             |              |
 created_at  | timestamp with time zone |           | not null |                   | plain    |             |              |
 user_id     | bigint                   |           |          |                   | plain    |             |              |
 result      | smallint                 |           | not null |                   | plain    |             |              |
 ip_address  | inet                     |           |          |                   | main     |             |              |
 provider    | text                     |           | not null |                   | extended |             |              |
 user_name   | text                     |           | not null |                   | extended |             |              |
 archived_at | timestamp with time zone |           |          | CURRENT_TIMESTAMP | plain    |             |              |
Indexes:
    "authentication_event_archived_records_pkey" PRIMARY KEY, btree (id)
Check constraints:
    "check_0a55f02112" CHECK (char_length(provider) <= 64)
    "check_3b5f782c5d" CHECK (char_length(user_name) <= 255)
Access method: heap

How to set up and validate locally

Run the migrations (e.g. `bundle exec rails db:migrate`), then run the specs covering the new table, model, and batched background migration.

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #545007

Edited by Jason Knabl
