Shard spam_logs table
What does this MR do and why?
This change implements database sharding for the spam logs table, which stores records of users flagged by anti-spam systems. The main goal is to organize spam log data by organization to improve database performance and scalability.
The changes include:
- Configuration update: The spam logs table is now configured to use organization_id as its sharding key, replacing the previous "desired" sharding setup with an active one.
- Database trigger: A new automatic trigger is added that ensures every spam log record gets the correct organization_id by looking it up from the associated user record when a spam log is created or updated.
- Data validation: A constraint is added to require that all spam log records must have an organization_id (no empty values allowed).
- Data backfill: A migration script updates existing spam log records to populate any missing organization_id values by triggering the automatic lookup mechanism.
- Database structure: The underlying database schema is updated to include the new trigger function and constraint.
These changes ensure that spam logs are properly organized by organization, which will help the system handle larger amounts of data more efficiently by distributing the workload across different database partitions based on organization boundaries.
select * from spam_logs where user_id is null;
Index Scan using index_spam_logs_on_user_id on public.spam_logs (cost=0.43..2.78 rows=1 width=545) (actual time=0.005..0.005 rows=0 loops=1)
Index Cond: (spam_logs.user_id IS NULL)
Buffers: shared hit=3
I/O Timings: read=0.000 write=0.000
Settings: seq_page_cost = '4', work_mem = '100MB', random_page_cost = '1.5', jit = 'off', effective_cache_size = '472585MB'
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/45438/commands/139305
UPDATE "spam_logs"
SET
"organization_id" = 1
WHERE spam_logs.organization_id IS NULL
AND spam_logs.user_id IS NULL;
Time: 0.603 ms
- planning: 0.527 ms
- execution: 0.076 ms
- I/O read: 0.000 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 3 (~24.00 KiB) from the buffer pool
- reads: 0 from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/45438/commands/139323
Worst case scenario without batching:
UPDATE "spam_logs"
SET
updated_at = updated_at
WHERE spam_logs.organization_id IS NULL
AND spam_logs.user_id IS NOT NULL;
Time: 19.142 s
- planning: 1.999 ms
- execution: 19.140 s
- I/O read: 1.358 s
- I/O write: 1.441 s
Shared buffers:
- hits: 19028296 (~145.20 GiB) from the buffer pool
- reads: 104122 (~813.50 MiB) from the OS file cache, including disk I/O
- dirtied: 212550 (~1.60 GiB)
- writes: 97400 (~760.90 MiB)
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/45438/commands/139329
References
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.