Change AuthenticationEventArchiveWorker deletion strategy from soft deletion to hard deletion

What does this MR do and why?

Previous context

  • In !207500 (merged) we introduced a background worker that enforces a data retention policy on authentication_events
  • The worker runs once a day
  • The worker stores old rows in an archive table (authentication_event_archived_records) and then deletes the old rows from authentication_events
  • Archiving was a temporary measure so that records were soft deleted rather than destroyed outright
  • We've verified no negative impact from deleting authentication_events and no longer need the archive table

This MR

  • Change AuthenticationEventArchiveWorker to hard delete records instead of soft deleting them

  • With this change the job no longer references authentication_event_archived_records, which gets us closer to being able to drop the archive table.

  • The worker is scheduled and runs via cron once a day

  • The worker gets enqueued in all environments, but its delete logic is gated behind a feature flag check, so in practice the job only deletes records on GitLab.com

  • Therefore this change is only expected to impact GitLab.com and not other users
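
The gating pattern described above can be sketched in isolation. This is an illustrative sketch, not the actual GitLab implementation: `Feature` is stubbed here so the example runs standalone (in GitLab it is the real feature-flag framework), and the `deleter:` callable stands in for the batched delete logic.

```ruby
# Minimal stub of GitLab's feature-flag framework (assumption, for
# illustration only).
module Feature
  @enabled = {}

  def self.enable(name)
    @enabled[name] = true
  end

  def self.enabled?(name)
    @enabled.fetch(name, false)
  end
end

class AuthenticationEventArchiveWorker
  def initialize(deleter:)
    # Stand-in for the batched hard-delete logic.
    @deleter = deleter
  end

  # The cron job is enqueued in every environment, but deletion only
  # runs where the flag is enabled (in practice, GitLab.com).
  def perform
    return unless Feature.enabled?(:archive_authentication_events)

    @deleter.call
  end
end
```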

References

SQL Queries

Set up outer batch
SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) ORDER BY "authentication_events"."id" ASC LIMIT 1
SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 56100052 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 10000
Get last ID in the batch (needed for job re-enqueue)
SELECT "authentication_events".* FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 56100052 AND "authentication_events"."id" < 56110052 ORDER BY "authentication_events"."id" DESC LIMIT 1
Set up inner batch
SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 56100052 AND "authentication_events"."id" < 56110052 ORDER BY "authentication_events"."id" ASC LIMIT 1
SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 56100052 AND "authentication_events"."id" < 56110052 AND "authentication_events"."id" >= 56100052 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 1000
Check existence before deleting
SELECT 1 AS one FROM "authentication_events" WHERE ... LIMIT 1
Delete rows
WITH batch AS MATERIALIZED (
      SELECT "authentication_events"."id", "authentication_events"."created_at" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 56100052 AND "authentication_events"."id" < 56110052 AND "authentication_events"."id" >= 56100052 AND "authentication_events"."id" < 56101052 LIMIT 1000
  ),
  filtered_batch AS MATERIALIZED (
    SELECT id, created_at FROM batch
    WHERE created_at <= '2024-11-19 00:00:00' LIMIT 1000
  )
  DELETE FROM authentication_events
  WHERE id IN (SELECT id FROM filtered_batch)
Get next inner batch (repeats...)
SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 56100052 AND "authentication_events"."id" < 56110052 AND "authentication_events"."id" >= 56101052 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 1000
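
The query sequence above follows a nested-batching pattern: an outer batch of 10,000 IDs, inner batches of 1,000, a filter on `created_at`, then a keyed `DELETE`. A pure-Ruby sketch of that shape, under the assumption that `rows` is a hash of `id => created_at` standing in for the table (the names `delete_expired!`, `rows`, and `cutoff` are illustrative, not from the worker):

```ruby
OUTER_BATCH = 10_000
INNER_BATCH = 1_000

# Deletes rows whose created_at is at or before the cutoff, walking the
# table in outer batches of 10k split into inner batches of 1k, mirroring
# the SQL above. Returns the number of rows deleted.
def delete_expired!(rows, cutoff)
  deleted = 0
  rows.keys.sort.each_slice(OUTER_BATCH) do |outer_ids|      # outer batch
    outer_ids.each_slice(INNER_BATCH) do |inner_ids|         # inner batch
      # Corresponds to the filtered_batch CTE.
      expired = inner_ids.select { |id| rows[id] <= cutoff }
      # Corresponds to DELETE ... WHERE id IN (SELECT id FROM filtered_batch).
      expired.each { |id| rows.delete(id) }
      deleted += expired.size
    end
  end
  deleted
end
```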

CTE Query Plan

Screenshots or screen recordings

How to set up and validate locally

Instructions from the original MR: !207500 (merged)

  • Seed some authentication_events records. Try a mix of records over a year old and under a year old. The seed script below will do this for you.
  • Run the worker via Rails console:
Authn::DataRetention::AuthenticationEventArchiveWorker.new.perform
  • Observe that the worker didn't delete anything, because the feature flag is disabled.
  • Enable the feature flag via Rails console:
Feature.enable(:archive_authentication_events)
  • Re-enqueue the job via Rails console:
Authn::DataRetention::AuthenticationEventArchiveWorker.new.perform
  • Observe that records older than 1 year are deleted and records newer than 1 year are not. You can verify this with psql queries:
gdk psql
SELECT * FROM authentication_events WHERE created_at < (now() - interval '1 year'); -- records out of retention, will decrease as the job runs
SELECT * FROM authentication_events WHERE created_at >= (now() - interval '1 year'); -- records in retention, should not change

Generate seed data

To generate seed data, you can use this script:

Create file `lib/tasks/seed_authn_events.rake`
# frozen_string_literal: true
namespace :gitlab do
  namespace :seed do
    desc 'Seed a whole lot of authentication_events'
    task :authentication_events, [:num_batches, :batch_size] => :gitlab_environment do |_t, args|
      username = 'root'
      user = User.find_by(username: username)
      last_user_id = User.maximum(:id)

      num_batches = (args[:num_batches] || 10).to_i
      batch_size = (args[:batch_size] || 100).to_i
      num_processes = 12

      puts "Creating #{num_batches} batches of #{batch_size} authentication events each using #{num_processes} processes."

      # Distribute batches evenly, handling remainder
      base_batches_per_process = num_batches / num_processes
      remainder = num_batches % num_processes


      pids = (0...num_processes).map do |i|
        # Give extra batch to first 'remainder' processes
        batches_for_this_process = base_batches_per_process + (i < remainder ? 1 : 0)
        start_batch = i * base_batches_per_process + [i, remainder].min + 1
        end_batch = start_batch + batches_for_this_process - 1

        fork do
          # Reconnect to database in child process
          ActiveRecord::Base.establish_connection

          puts "Process #{Process.pid}: handling batches #{start_batch}..#{end_batch}"

          (start_batch..end_batch).each do |batch_number|
            puts "Process #{Process.pid}: Starting batch #{batch_number}"
            use_deleted_user = rand(100) < 5

            attrs_to_insert = (1..batch_size).map do |_i|
              {
                user_id: use_deleted_user ? last_user_id + 1 : user.id,
                user_name: user.username,
                provider: 'standard',
                ip_address: FFaker::Internet.ip_v4_address,
                result: [:failed, :success].sample,
                created_at: rand(48).months.ago
              }
            end

            begin
              ActiveRecord::Base.connection.execute("ALTER TABLE authentication_events DISABLE TRIGGER ALL")
              AuthenticationEvent.insert_all(attrs_to_insert)
            ensure
              ActiveRecord::Base.connection.execute("ALTER TABLE authentication_events ENABLE TRIGGER ALL")
            end

            puts "Process #{Process.pid}: Completed batch #{batch_number}"
          end

          # Clean exit from child process
          exit 0
        end
      end

      # Wait for all child processes
      pids.each { |pid| Process.wait(pid) }

      total_records = num_batches * batch_size
      puts "Successfully created #{total_records} authentication events."
    end
  end
end

Run it with:

bundle exec rake "gitlab:seed:authentication_events[20, 100000]"

...which would create 2,000,000 records (20 batches of 100k).
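
The batch-distribution arithmetic in the rake task above can be checked in isolation. This sketch (the helper name `batch_ranges` is mine, not from the script) returns the inclusive `[start_batch, end_batch]` range each process handles, giving the remainder batches to the earlier processes:

```ruby
# Spread num_batches across num_processes as evenly as possible,
# giving one extra batch to the first (num_batches % num_processes)
# processes, as the rake task does.
def batch_ranges(num_batches, num_processes)
  base = num_batches / num_processes
  remainder = num_batches % num_processes

  (0...num_processes).map do |i|
    count = base + (i < remainder ? 1 : 0)
    start_batch = i * base + [i, remainder].min + 1
    [start_batch, start_batch + count - 1]
  end
end
```

With 20 batches over 12 processes, the first 8 processes take 2 batches each and the remaining 4 take 1, covering batches 1 through 20 with no gaps or overlap.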

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #562154

Edited by Jason Knabl
