Archive and delete authentication_events once per day

What does this MR do and why?

Add a Sidekiq worker that archives and deletes authentication_events records created more than one year ago. This enforces the one-year retention period we have set for this table.

The job logic is mostly the same as the Batched Background Migration we already ran successfully in !202447 (merged), so the archive and delete queries have already been validated in production.

Set up the cron infrastructure to enqueue the worker daily at 6:10 AM server time. I've chosen 6:10 AM because it's a low-traffic time and 5 minutes after we enqueue the related job that archives oauth_access_tokens (!202767 (merged)).
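
For reference, GitLab registers cron workers in config/initializers/1_settings.rb; the entry looks roughly like this (the settings key name here is illustrative):

# Sketch of the cron registration; the settings key name is illustrative
Settings.cron_jobs['authentication_event_archive_worker'] ||= {}
Settings.cron_jobs['authentication_event_archive_worker']['cron'] ||= '10 6 * * *' # daily at 6:10 AM
Settings.cron_jobs['authentication_event_archive_worker']['job_class'] = 'Authn::DataRetention::AuthenticationEventArchiveWorker'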

The worker's internal logic is skipped when the archive_authentication_events feature flag is disabled, which is the default. We'll have to enable the feature flag before the job starts deleting records.
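
The gate is a simple guard clause at the top of perform, roughly:

# Sketch of the feature-flag gate; the helper it calls is a hypothetical name
def perform
  return unless Feature.enabled?(:archive_authentication_events)

  archive_expired_records # the batching logic described below
end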

References

Feature flag rollout issue: #571771
Issue: #545007 (closed)
Parent issue: #555382

Queries and performance

  • The job now proceeds using a sub-batching strategy (sketched below)
    • Batches first on id, then applies the created_at filter to sub-batches
  • Optimizes by only running the delete-and-archive CTE when matching records exist
    • Avoids firing the most expensive query when a sub-batch is empty, which most sub-batches will be
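
A simplified sketch of that flow, mirroring the query sequence below (next_boundary, RETENTION_CUTOFF, and archive_and_delete are hypothetical names for the real helpers):

# Sketch of one outer batch; helper and constant names are illustrative
sub_batch_start = outer_batch_start
while sub_batch_start < outer_batch_end
  sub_batch_end = next_boundary(sub_batch_start) || outer_batch_end # ORDER BY id LIMIT 1 OFFSET 1000

  expired = AuthenticationEvent
    .where(id: sub_batch_start...sub_batch_end)
    .where('created_at <= ?', RETENTION_CUTOFF)

  # Cheap existence check first: the expensive delete-and-archive CTE
  # only fires for sub-batches that actually contain expired records
  archive_and_delete(expired) if expired.exists?

  sub_batch_start = sub_batch_end
end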

We also needed a query to find the maximum record ID in the outer batch. We memoize this value because, if the job's runtime goes over the limit, we re-enqueue the job with the maximum record ID as the cursor value. If we didn't memoize it and the last sub-batch before the time limit was empty, the call to batch.last.id would fail.
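
In code, that looks roughly like this (method names are hypothetical):

# Compute and memoize the batch's max ID up front (the ORDER BY id DESC
# LIMIT 1 query in the sequence below); hypothetical sketch
def max_record_id
  @max_record_id ||= batch.last.id
end

# When the runtime limit is hit, re-enqueue using the memoized cursor.
# Calling batch.last.id at this point instead could raise if the final
# sub-batch before the limit was empty.
def requeue
  self.class.perform_async(max_record_id)
end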

Query sequence

I seeded a few million records locally and ran the job; this is the query sequence for a single batch:

SQL query sequence:
-- Find the upper ID boundary for the outer batch (batch size 10,000)
SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 10000

-- Get the ID of the last record in the batch (needed if the job has to re-enqueue itself)
SELECT "authentication_events".* FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 ORDER BY "authentication_events"."id" DESC LIMIT 1

-- Find sub-batch start ID
SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 ORDER BY "authentication_events"."id" ASC LIMIT 1
-- Find sub-batch end ID
SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52221278 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 1000 

-- CTE short-circuit optimization: check if any records exist
AuthenticationEvent Exists? (0.4ms)  SELECT 1 AS one FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52224969 AND "authentication_events"."created_at" <= '2024-10-08 00:00:00' LIMIT 1

-- Find next sub-batch ID
AuthenticationEvent Load (1.2ms)  SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52224969 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 1000
AuthenticationEvent Exists? (0.6ms)  SELECT 1 AS one FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52224969 AND "authentication_events"."id" < 52228605 AND "authentication_events"."created_at" <= '2024-10-08 00:00:00' LIMIT 1

-- Find next sub-batch ID
AuthenticationEvent Load (1.3ms)  SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52228605 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 1000
AuthenticationEvent Exists? (0.5ms)  SELECT 1 AS one FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52228605 AND "authentication_events"."id" < 52232345 AND "authentication_events"."created_at" <= '2024-10-08 00:00:00' LIMIT 1

-- Find next sub-batch ID
AuthenticationEvent Load (1.6ms)  SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52232345 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 1000 
AuthenticationEvent Exists? (0.7ms)  SELECT 1 AS one FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52232345 AND "authentication_events"."id" < 52235387 AND "authentication_events"."created_at" <= '2024-10-08 00:00:00' LIMIT 1

-- Records exist: delete and archive CTE
WITH deleted_records AS MATERIALIZED (
    DELETE FROM authentication_events
    WHERE id IN (SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52232345 AND "authentication_events"."id" < 52235387 AND "authentication_events"."created_at" <= '2024-10-08 00:00:00' LIMIT 1000)
    RETURNING *
)
INSERT INTO authentication_event_archived_records
  (id, created_at, user_id, result, ip_address, provider, user_name, archived_at)
SELECT
  id, created_at, user_id, result, ip_address, provider, user_name, CURRENT_TIMESTAMP as archived_at
FROM deleted_records

-- Find next sub-batch ID
AuthenticationEvent Load (1.0ms)  SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52235387 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 1000

-- ...etc.

CTE performance

Other query performance

Is this reversible?

  • The worker deletes and archives rows as an atomic operation, so any deleted row can be restored from archive.
  • To restore rows from archive, we'd need to INSERT back into authentication_events, likely with a data migration.

For example, this query will restore 3 records from the archive:

INSERT INTO authentication_events
  (id, created_at, user_id, result, ip_address, provider, user_name)
SELECT
  id, created_at, user_id, result, ip_address, provider, user_name
FROM authentication_event_archived_records archive
WHERE archive.id IN (1, 2, 3)

Possible impact of error?

No critical flows would be blocked in case of error. Any errors would largely result in under-reporting data:

  • Show fewer auth logs to users than we should
  • Consider fewer auth log entries in the "most recent IP address" calculation, causing email verification to trigger more frequently than normal
  • "Most used IP address" on abuse reports detail page considers fewer IPs than normal

All these impacts would be fixed as soon as deleted records are restored from archive.

How to set up and validate locally

  • Seed some authentication_events records. Try a mix of records over a year old and under a year old. The seed script in the "Generate seed data" section below will do this for you.
  • Run the worker via Rails console:
Authn::DataRetention::AuthenticationEventArchiveWorker.new.perform
  • Observe that the worker didn't delete anything, because the feature flag is disabled.
  • Enable the feature flag via Rails console:
Feature.enable(:archive_authentication_events)
  • Run the worker again via Rails console:
Authn::DataRetention::AuthenticationEventArchiveWorker.new.perform
  • Observe that records older than 1 year are deleted and records newer than 1 year are not. You can check this with a psql query:
gdk psql
SELECT * FROM authentication_events WHERE created_at < (now() - interval '1 year'); -- records out of retention, will decrease as the job runs
SELECT * FROM authentication_events WHERE created_at >= (now() - interval '1 year'); -- records in retention, should not change
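
You can also check the same counts from the Rails console:

# Records out of retention; should drop to 0 as the job runs
AuthenticationEvent.where('created_at < ?', 1.year.ago).count

# Records in retention; should stay constant
AuthenticationEvent.where('created_at >= ?', 1.year.ago).count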

Generate seed data

To generate seed data, you can use this script:

Create file `lib/tasks/seed_authn_events.rake`
# frozen_string_literal: true
namespace :gitlab do
  namespace :seed do
    desc 'Seed a whole lot of authentication_events'
    task :authentication_events, [:num_batches, :batch_size] => :gitlab_environment do |_t, args|
      username = 'root'
      user = User.find_by(username: username)
      last_user_id = User.maximum(:id)

      num_batches = (args[:num_batches] || 10).to_i
      batch_size = (args[:batch_size] || 100).to_i
      num_processes = 12

      puts "Creating #{num_batches} batches of #{batch_size} authentication events each using #{num_processes} processes."

      # Distribute batches evenly, handling remainder
      base_batches_per_process = num_batches / num_processes
      remainder = num_batches % num_processes


      pids = (0...num_processes).map do |i|
        # Give extra batch to first 'remainder' processes
        batches_for_this_process = base_batches_per_process + (i < remainder ? 1 : 0)
        start_batch = i * base_batches_per_process + [i, remainder].min + 1
        end_batch = start_batch + batches_for_this_process - 1

        fork do
          # Reconnect to database in child process
          ActiveRecord::Base.establish_connection

          puts "Process #{Process.pid}: handling batches #{start_batch}..#{end_batch}"

          (start_batch..end_batch).each do |batch_number|
            puts "Process #{Process.pid}: Starting batch #{batch_number}"
            # ~5% of batches simulate events from a since-deleted user (nonexistent user_id)
            use_deleted_user = rand(100) < 5

            attrs_to_insert = (1..batch_size).map do |_i|
              {
                user_id: use_deleted_user ? last_user_id + 1 : user.id,
                user_name: user.username,
                provider: 'standard',
                ip_address: FFaker::Internet.ip_v4_address,
                result: [:failed, :success].sample,
                # Spread records over the past 4 years, some in and some out of retention
                created_at: rand(48).months.ago
              }
            end

            begin
              # Disable triggers (including FK enforcement) so the bulk insert
              # accepts rows that reference a nonexistent user
              ActiveRecord::Base.connection.execute("ALTER TABLE authentication_events DISABLE TRIGGER ALL")
              AuthenticationEvent.insert_all(attrs_to_insert)
            ensure
              ActiveRecord::Base.connection.execute("ALTER TABLE authentication_events ENABLE TRIGGER ALL")
            end

            puts "Process #{Process.pid}: Completed batch #{batch_number}"
          end

          # Clean exit from child process
          exit 0
        end
      end

      # Wait for all child processes
      pids.each { |pid| Process.wait(pid) }

      total_records = num_batches * batch_size
      puts "Successfully created #{total_records} authentication events."
    end
  end
end

Run it with:

bundle exec rake "gitlab:seed:authentication_events[20, 100000]"

...which would create 2,000,000 records (20 batches of 100k).

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #545007 (closed)
