Archive and delete authentication_events once per day
## What does this MR do and why?

Add a Sidekiq worker that archives and deletes `authentication_events` records created over one year ago. This enforces the one-year retention period we have set for this table.

The job logic is mostly the same as the Batched Background Migration we already ran successfully in !202447 (merged), so the archive and delete queries have already been validated in production.
Set up the cron infrastructure to enqueue the worker daily at 6:10 AM server time. I chose 6:10 AM because it is a low-traffic time and 5 minutes after we enqueue the related job that archives `oauth_access_tokens` (!202767 (merged)).
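For context, here is a minimal sketch of how such a cron worker is conventionally registered in `config/initializers/1_settings.rb`; the `cron_jobs` key name is an assumption for illustration, not necessarily what this MR uses:

```ruby
# Sketch only: 'authentication_event_archive_worker' is an assumed key name.
Settings.cron_jobs['authentication_event_archive_worker'] ||= {}
# '10 6 * * *' = minute 10 of hour 6, every day, i.e. 6:10 AM server time.
Settings.cron_jobs['authentication_event_archive_worker']['cron'] ||= '10 6 * * *'
Settings.cron_jobs['authentication_event_archive_worker']['job_class'] = 'Authn::DataRetention::AuthenticationEventArchiveWorker'
```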
The worker's internal logic is skipped while the `archive_authentication_events` feature flag is disabled, which is the default. We'll have to enable the feature flag before the job starts deleting records.
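A minimal sketch of that guard, using the worker class from the validation steps below; the body method name is illustrative:

```ruby
module Authn
  module DataRetention
    class AuthenticationEventArchiveWorker
      include ApplicationWorker # standard GitLab worker mixin

      def perform
        # No-op while the flag is disabled (the default), so shipping the
        # worker is safe before the rollout starts.
        return unless Feature.enabled?(:archive_authentication_events)

        archive_and_delete_expired_records # illustrative method name
      end
    end
  end
end
```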
### References

- Feature flag rollout issue: #571771
- Issue: #545007 (closed)
- Parent issue: #555382
## Queries and performance
- The job now proceeds using a sub-batching strategy (sketched below)
  - Batches first on `id` and then applies a `created_at` filter on sub-batches
- Optimizes by only calling the CTE when records exist
  - Avoids firing the most expensive query when the sub-batch is empty, which most sub-batches will be
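Roughly, the per-batch flow looks like this; the method names (`each_sub_batch`, `archive_and_delete`) and the cutoff constant are illustrative, not the worker's actual API:

```ruby
# Illustrative sketch of the sub-batching flow, not the actual worker code.
def process_batch(batch)
  batch.each_sub_batch(of: 1000) do |sub_batch|
    # The created_at filter is applied only at the sub-batch level.
    expired = sub_batch.where(created_at: ..1.year.ago)

    # Short-circuit: skip the expensive delete-and-archive CTE when the
    # sub-batch contains no expired records (the common case).
    next unless expired.exists?

    archive_and_delete(expired)
  end
end
```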
We also needed a query to find the maximum record ID in the outer batch. We memoize this value for the case when the job runtime goes over the limit and we need to re-enqueue the job, passing in the maximum record ID as the cursor value. If we don't memoize the value and the last batch before the time limit is empty, the call to `batch.last.id` will fail.
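A sketch of that cursor memoization, with the same caveat that the names and the re-enqueue mechanism shown are illustrative:

```ruby
# Illustrative sketch: remember the batch's maximum ID before processing.
def process_batch(batch)
  # SELECT ... ORDER BY id DESC LIMIT 1, memoized across batches so an
  # empty final batch doesn't leave us without a cursor.
  @max_record_id = batch.last&.id || @max_record_id

  # ...process sub-batches as sketched above...

  return unless runtime_limit_exceeded? # illustrative helper

  # Re-enqueue with the memoized cursor; calling batch.last.id here would
  # fail if the last batch before the time limit was empty.
  self.class.perform_async(@max_record_id)
end
```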
### Query sequence
I seeded a few million records locally and ran the job; this is the query sequence for one batch:
<details><summary>SQL query sequence</summary>

```sql
-- Find maximum boundary for outer batch
SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 10000
-- Get the last ID in the batch (needed when re-enqueueing the job)
SELECT "authentication_events".* FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 ORDER BY "authentication_events"."id" DESC LIMIT 1
-- Find sub-batch start ID
SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 ORDER BY "authentication_events"."id" ASC LIMIT 1
-- Find sub-batch end ID
SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52221278 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 1000
-- CTE short-circuit optimization: check if any records exist
AuthenticationEvent Exists? (0.4ms) SELECT 1 AS one FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52224969 AND "authentication_events"."created_at" <= '2024-10-08 00:00:00' LIMIT 1
-- Find next sub-batch ID
AuthenticationEvent Load (1.2ms) SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52224969 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 1000
AuthenticationEvent Exists? (0.6ms) SELECT 1 AS one FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52224969 AND "authentication_events"."id" < 52228605 AND "authentication_events"."created_at" <= '2024-10-08 00:00:00' LIMIT 1
-- Find next sub-batch ID
AuthenticationEvent Load (1.3ms) SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52228605 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 1000
AuthenticationEvent Exists? (0.5ms) SELECT 1 AS one FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52228605 AND "authentication_events"."id" < 52232345 AND "authentication_events"."created_at" <= '2024-10-08 00:00:00' LIMIT 1
-- Find next sub-batch ID
AuthenticationEvent Load (1.6ms) SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52232345 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 1000
AuthenticationEvent Exists? (0.7ms) SELECT 1 AS one FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52232345 AND "authentication_events"."id" < 52235387 AND "authentication_events"."created_at" <= '2024-10-08 00:00:00' LIMIT 1
-- Records exist: delete and archive CTE
WITH deleted_records AS MATERIALIZED (
DELETE FROM authentication_events
WHERE id IN (SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52232345 AND "authentication_events"."id" < 52235387 AND "authentication_events"."created_at" <= '2024-10-08 00:00:00' LIMIT 1000)
RETURNING *
)
INSERT INTO authentication_event_archived_records
(id, created_at, user_id, result, ip_address, provider, user_name, archived_at)
SELECT
id, created_at, user_id, result, ip_address, provider, user_name, CURRENT_TIMESTAMP as archived_at
FROM deleted_records
AuthenticationEvent Load (1.0ms) SELECT "authentication_events"."id" FROM "authentication_events" WHERE (id > 0) AND "authentication_events"."id" >= 52221278 AND "authentication_events"."id" < 52241387 AND "authentication_events"."id" >= 52235387 ORDER BY "authentication_events"."id" ASC LIMIT 1 OFFSET 1000
-- ...etc.
```

</details>
### CTE performance
- Query plans of CTE runs simulating a batch of records:
### Other query performance

- Outer batch limit ID select: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/44309/commands/135788
- Batch maximum ID select: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/44309/commands/135789
- Sub-batch start ID select: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/44309/commands/135790
- `exists?` query
## Is this reversible?

- The worker deletes and archives rows as an atomic operation, so any deleted row can be restored from the archive.
- To restore rows from the archive, we'd need to `INSERT` back into `authentication_events`, likely with a data migration.

For example, this query will restore 3 records from the archive:
```sql
INSERT INTO authentication_events
  (id, created_at, user_id, result, ip_address, provider, user_name)
SELECT
  id, created_at, user_id, result, ip_address, provider, user_name
FROM authentication_event_archived_records archive
WHERE archive.id IN (1, 2, 3)
```
## Possible impact of error?

No critical flows would be blocked in case of error. Any errors would largely result in under-reporting data:

- Show fewer auth logs to users than we should
- Consider fewer auth log entries in the "most recent IP address" calculation, causing email verification to trigger more frequently than normal
- "Most used IP address" on the abuse report detail page considers fewer IPs than normal

All of these impacts would be fixed as soon as the deleted records are restored from the archive.
## How to set up and validate locally

- Seed some `authentication_events` records. Try a mix of records over a year old and under a year old. The seed script below will do this for you.
- Run the worker via Rails console:

  ```ruby
  Authn::DataRetention::AuthenticationEventArchiveWorker.new.perform
  ```

- Observe that the worker didn't delete anything, because the feature flag is disabled.
- Enable the feature flag via Rails console:

  ```ruby
  Feature.enable(:archive_authentication_events)
  ```

- Run the worker again via Rails console:

  ```ruby
  Authn::DataRetention::AuthenticationEventArchiveWorker.new.perform
  ```

- Observe that records older than one year are deleted, and records newer than one year are not. You can check with psql queries:

  ```shell
  gdk psql
  ```

  ```sql
  SELECT * FROM authentication_events WHERE created_at < (now() - interval '1 year'); -- records out of retention, will decrease as the job runs
  SELECT * FROM authentication_events WHERE created_at >= (now() - interval '1 year'); -- records in retention, should not change
  ```
### Generate seed data
To generate seed data, you can use this script. Create the file `lib/tasks/seed_authn_events.rake`:
```ruby
# frozen_string_literal: true

namespace :gitlab do
  namespace :seed do
    desc 'Seed a whole lot of authentication_events'
    task :authentication_events, [:num_batches, :batch_size] => :gitlab_environment do |_t, args|
      username = 'root'
      user = User.find_by(username: username)
      last_user_id = User.maximum(:id)
      num_batches = (args[:num_batches] || 10).to_i
      batch_size = (args[:batch_size] || 100).to_i
      num_processes = 12

      puts "Creating #{num_batches} batches of #{batch_size} authentication events each using #{num_processes} processes."

      # Distribute batches evenly, handling remainder
      base_batches_per_process = num_batches / num_processes
      remainder = num_batches % num_processes

      pids = (0...num_processes).map do |i|
        # Give extra batch to first 'remainder' processes
        batches_for_this_process = base_batches_per_process + (i < remainder ? 1 : 0)
        start_batch = i * base_batches_per_process + [i, remainder].min + 1
        end_batch = start_batch + batches_for_this_process - 1

        fork do
          # Reconnect to database in child process
          ActiveRecord::Base.establish_connection

          puts "Process #{Process.pid}: handling batches #{start_batch}..#{end_batch}"

          (start_batch..end_batch).each do |batch_number|
            puts "Process #{Process.pid}: Starting batch #{batch_number}"

            use_deleted_user = rand(100) < 5

            attrs_to_insert = (1..batch_size).map do |_i|
              {
                user_id: use_deleted_user ? last_user_id + 1 : user.id,
                user_name: user.username,
                provider: 'standard',
                ip_address: FFaker::Internet.ip_v4_address,
                result: [:failed, :success].sample,
                created_at: rand(48).months.ago
              }
            end

            begin
              ActiveRecord::Base.connection.execute("ALTER TABLE authentication_events DISABLE TRIGGER ALL")
              AuthenticationEvent.insert_all(attrs_to_insert)
            ensure
              ActiveRecord::Base.connection.execute("ALTER TABLE authentication_events ENABLE TRIGGER ALL")
            end

            puts "Process #{Process.pid}: Completed batch #{batch_number}"
          end

          # Clean exit from child process
          exit 0
        end
      end

      # Wait for all child processes
      pids.each { |pid| Process.wait(pid) }

      total_records = num_batches * batch_size
      puts "Successfully created #{total_records} authentication events."
    end
  end
end
```
Run it with:

```shell
bundle exec rake "gitlab:seed:authentication_events[20, 100000]"
```

...which would create 2,000,000 records (20 batches of 100k).
## MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #545007 (closed)