Skip to content

Data backfill migration for vulnerability_reads.owasp_top_10

Bala Kumar requested to merge 419092-store-owasp-top-10-data-backfill into master

What does this MR do and why?

This MR introduces a post-deployment migration to schedule the background jobs for backfilling older vulnerability_reads.owasp_top_10 records. For the new records ingestion is happening through !138594 (merged)

Note: Do not merge this one till the other !140067 (merged) is merged

Database

Output of the post-deployment migration

bundle exec rake db:migrate:up:main VERSION=20231214111617

main: == [advisory_lock_connection] object_id: 182100, pg_backend_pid: 14758
main: == 20231214111617 QueueBackfillOwaspTopTenOfVulnerabilityReads: migrating =====
main: == 20231214111617 QueueBackfillOwaspTopTenOfVulnerabilityReads: migrated (0.0567s)

main: == [advisory_lock_connection] object_id: 182100, pg_backend_pid: 14758

bundle exec rake db:migrate:down:main VERSION=20231214111617

main: == [advisory_lock_connection] object_id: 182620, pg_backend_pid: 14300
main: == 20231214111617 QueueBackfillOwaspTopTenOfVulnerabilityReads: reverting =====
main: == 20231214111617 QueueBackfillOwaspTopTenOfVulnerabilityReads: reverted (0.0450s)

main: == [advisory_lock_connection] object_id: 182620, pg_backend_pid: 14300
Select query used in background migration
SELECT vulnerability_reads.vulnerability_id  FROM vulnerability_reads WHERE vulnerability_reads.vulnerability_id BETWEEN 30000 AND 40000 ORDER BY vulnerability_reads.vulnerability_id ASC LIMIT 10000
 Limit  (cost=0.57..759.44 rows=8674 width=8) (actual time=4.142..678.554 rows=10000 loops=1)
   Buffers: shared hit=63 read=884 dirtied=309
   I/O Timings: read=638.531 write=0.000
   ->  Index Only Scan using index_vulnerability_reads_on_vulnerability_id on public.vulnerability_reads  (cost=0.57..759.44 rows=8674 width=8) (actual time=4.140..677.327 rows=10000 loops=1)
         Index Cond: ((vulnerability_reads.vulnerability_id >= 30000) AND (vulnerability_reads.vulnerability_id <= 40000))
         Heap Fetches: 2791
         Buffers: shared hit=63 read=884 dirtied=309
         I/O Timings: read=638.531 write=0.000

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/25017/commands/79513

Update query used in each sub-batch
UPDATE 
  vulnerability_reads vr 
SET 
  owasp_top_10 = CASE selected_ids.external_id WHEN 'A1:2017-Injection' THEN 1 WHEN 'A1:2017' THEN 1 WHEN 'A2:2017-Broken Authentication' THEN 2 WHEN 'A2:2017' THEN 2 WHEN 'A3:2017-Sensitive Data Exposure' THEN 3 WHEN 'A3:2017' THEN 3 WHEN 'A4:2017-XML External Entities (XXE)' THEN 4 WHEN 'A4:2017' THEN 4 WHEN 'A5:2017-Broken Access Control' THEN 5 WHEN 'A5:2017' THEN 5 WHEN 'A6:2017-Security Misconfiguration' THEN 6 WHEN 'A6:2017' THEN 6 WHEN 'A7:2017-Cross-Site Scripting (XSS)' THEN 7 WHEN 'A7:2017' THEN 7 WHEN 'A8:2017-Insecure Deserialization' THEN 8 WHEN 'A8:2017' THEN 8 WHEN 'A9:2017-Using Components with Known Vulnerabilities' THEN 9 WHEN 'A9:2017' THEN 9 WHEN 'A10:2017-Insufficient Logging & Monitoring' THEN 10 WHEN 'A10:2017' THEN 10 WHEN 'A1:2021-Broken Access Control' THEN 11 WHEN 'A1:2021' THEN 11 WHEN 'A2:2021-Cryptographic Failures' THEN 12 WHEN 'A2:2021' THEN 12 WHEN 'A3:2021-Injection' THEN 13 WHEN 'A3:2021' THEN 13 WHEN 'A4:2021-Insecure Design' THEN 14 WHEN 'A4:2021' THEN 14 WHEN 'A5:2021-Security Misconfiguration' THEN 15 WHEN 'A5:2021' THEN 15 WHEN 'A6:2021-Vulnerable and Outdated Components' THEN 16 WHEN 'A6:2021' THEN 16 WHEN 'A7:2021-Identification and Authentication Failures' THEN 17 WHEN 'A7:2021' THEN 17 WHEN 'A8:2021-Software and Data Integrity Failures' THEN 18 WHEN 'A8:2021' THEN 18 WHEN 'A9:2021-Security Logging and Monitoring Failures' THEN 19 WHEN 'A9:2021' THEN 19 WHEN 'A10:2021-Server-Side Request Forgery' THEN 20 WHEN 'A10:2021' THEN 20 ELSE vr.owasp_top_10 END 
FROM 
  (
    SELECT 
      vr.id, 
      vi.external_id 
    FROM 
      vulnerability_reads vr 
      JOIN vulnerability_occurrences vo ON vr.vulnerability_id = vo.vulnerability_id 
      JOIN vulnerability_occurrence_identifiers voi ON vo.id = voi.occurrence_id 
      JOIN vulnerability_identifiers vi ON voi.identifier_id = vi.id 
    WHERE 
      (
        LOWER(vi.external_type) = 'owasp'
      ) 
      AND "vi"."external_id" IN (
        'A1:2017-Injection', 'A1:2017', 'A2:2017-Broken Authentication', 
        'A2:2017', 'A3:2017-Sensitive Data Exposure', 
        'A3:2017', 'A4:2017-XML External Entities (XXE)', 
        'A4:2017', 'A5:2017-Broken Access Control', 
        'A5:2017', 'A6:2017-Security Misconfiguration', 
        'A6:2017', 'A7:2017-Cross-Site Scripting (XSS)', 
        'A7:2017', 'A8:2017-Insecure Deserialization', 
        'A8:2017', 'A9:2017-Using Components with Known Vulnerabilities', 
        'A9:2017', 'A10:2017-Insufficient Logging & Monitoring', 
        'A10:2017', 'A1:2021-Broken Access Control', 
        'A1:2021', 'A2:2021-Cryptographic Failures', 
        'A2:2021', 'A3:2021-Injection', 
        'A3:2021', 'A4:2021-Insecure Design', 
        'A4:2021', 'A5:2021-Security Misconfiguration', 
        'A5:2021', 'A6:2021-Vulnerable and Outdated Components', 
        'A6:2021', 'A7:2021-Identification and Authentication Failures', 
        'A7:2021', 'A8:2021-Software and Data Integrity Failures', 
        'A8:2021', 'A9:2021-Security Logging and Monitoring Failures', 
        'A9:2021', 'A10:2021-Server-Side Request Forgery', 
        'A10:2021'
      ) 
      AND vr.id IN (
        SELECT 
          vulnerability_reads.id 
        FROM 
          vulnerability_reads 
        WHERE 
          vulnerability_reads.vulnerability_id BETWEEN 100050 
          AND 100100 
          AND vulnerability_reads.vulnerability_id >= 100050 
          AND vulnerability_reads.vulnerability_id < 100100
      )
  ) selected_ids 
WHERE 
  vr.id = selected_ids.id
 ModifyTable on public.vulnerability_reads vr  (cost=3.38..319.81 rows=0 width=0) (actual time=210.679..210.684 rows=0 loops=1)
   Buffers: shared hit=774 read=71 dirtied=3
   I/O Timings: read=207.446 write=0.000
   ->  Nested Loop  (cost=3.38..319.81 rows=1 width=38) (actual time=210.676..210.680 rows=0 loops=1)
         Buffers: shared hit=774 read=71 dirtied=3
         I/O Timings: read=207.446 write=0.000
         ->  Nested Loop  (cost=2.81..319.09 rows=1 width=67) (actual time=210.675..210.678 rows=0 loops=1)
               Buffers: shared hit=774 read=71 dirtied=3
               I/O Timings: read=207.446 write=0.000
               ->  Nested Loop  (cost=2.27..277.25 rows=74 width=48) (actual time=44.453..176.757 rows=42 loops=1)
                     Buffers: shared hit=619 read=58 dirtied=3
                     I/O Timings: read=174.232 write=0.000
                     ->  Nested Loop  (cost=1.70..244.24 rows=43 width=42) (actual time=34.308..124.971 rows=42 loops=1)
                           Buffers: shared hit=425 read=42 dirtied=3
                           I/O Timings: read=123.277 write=0.000
                           ->  Nested Loop  (cost=1.14..206.11 rows=43 width=36) (actual time=19.920..83.524 rows=42 loops=1)
                                 Buffers: shared hit=227 read=30 dirtied=3
                                 I/O Timings: read=82.404 write=0.000
                                 ->  Index Scan using index_vulnerability_reads_on_vulnerability_id on public.vulnerability_reads  (cost=0.57..51.95 rows=43 width=14) (actual time=16.023..79.078 rows=42 loops=1)
                                       Index Cond: ((vulnerability_reads.vulnerability_id >= 100050) AND (vulnerability_reads.vulnerability_id <= 100100) AND (vulnerability_reads.vulnerability_id >= 100050) AND (vulnerability_reads.vulnerability_id < 100100))
                                       Buffers: shared hit=18 read=29 dirtied=3
                                       I/O Timings: read=78.570 write=0.000
                                 ->  Index Scan using vulnerability_reads_pkey on public.vulnerability_reads vr_1  (cost=0.57..3.58 rows=1 width=22) (actual time=0.102..0.102 rows=1 loops=42)
                                       Index Cond: (vr_1.id = vulnerability_reads.id)
                                       Buffers: shared hit=209 read=1
                                       I/O Timings: read=3.834 write=0.000
                           ->  Index Scan using index_vulnerability_occurrences_on_vulnerability_id on public.vulnerability_occurrences vo  (cost=0.57..0.88 rows=1 width=22) (actual time=0.984..0.985 rows=1 loops=42)
                                 Index Cond: (vo.vulnerability_id = vr_1.vulnerability_id)
                                 Buffers: shared hit=198 read=12
                                 I/O Timings: read=40.873 write=0.000
                     ->  Index Scan using index_vulnerability_occurrence_identifiers_on_unique_keys on public.vulnerability_occurrence_identifiers voi  (cost=0.57..0.71 rows=6 width=22) (actual time=1.230..1.231 rows=1 loops=42)
                           Index Cond: (voi.occurrence_id = vo.id)
                           Buffers: shared hit=194 read=16
                           I/O Timings: read=50.955 write=0.000
               ->  Index Scan using vulnerability_identifiers_pkey on public.vulnerability_identifiers vi  (cost=0.54..0.57 rows=1 width=35) (actual time=0.805..0.805 rows=0 loops=42)
                     Index Cond: (vi.id = voi.identifier_id)
                     Filter: (((vi.external_id)::text = ANY ('{A1:2017-Injection,A1:2017,"A2:2017-Broken Authentication",A2:2017,"A3:2017-Sensitive Data Exposure",A3:2017,"A4:2017-XML External Entities (XXE)",A4:2017,"A5:2017-Broken Access Control",A5:2017,"A6:2017-Security Misconfiguration",A6:2017,"A7:2017-Cross-Site Scripting (XSS)",A7:2017,"A8:2017-Insecure Deserialization",A8:2017,"A9:2017-Using Components with Known Vulnerabilities",A9:2017,"A10:2017-Insufficient Logging & Monitoring",A10:2017,"A1:2021-Broken Access Control",A1:2021,"A2:2021-Cryptographic Failures",A2:2021,A3:2021-Injection,A3:2021,"A4:2021-Insecure Design",A4:2021,"A5:2021-Security Misconfiguration",A5:2021,"A6:2021-Vulnerable and Outdated Components",A6:2021,"A7:2021-Identification and Authentication Failures",A7:2021,"A8:2021-Software and Data Integrity Failures",A8:2021,"A9:2021-Security Logging and Monitoring Failures",A9:2021,"A10:2021-Server-Side Request Forgery",A10:2021}'::text[])) AND (lower((vi.external_type)::text) = 'owasp'::text))
                     Rows Removed by Filter: 1
                     Buffers: shared hit=155 read=13
                     I/O Timings: read=33.214 write=0.000
         ->  Index Scan using vulnerability_reads_pkey on public.vulnerability_reads vr  (cost=0.57..0.62 rows=1 width=16) (actual time=0.000..0.000 rows=0 loops=0)
               Index Cond: (vr.id = vr_1.id)
               I/O Timings: read=0.000 write=0.000
Time: 225.270 ms
  - planning: 14.283 ms
  - execution: 210.987 ms
    - I/O read: 207.446 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 774 (~6.00 MiB) from the buffer pool
  - reads: 71 (~568.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 3 (~24.00 KiB)
  - writes: 0

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/25017/commands/79523

Estimated Time to complete: 1 week, 4 days, 19 hours, and 16 minutes
  • Interval: 120s
  • Total tuple count: 84984390
  • Max batch size: 0
  • Estimated seconds to complete: 1019760s
  • Estimated number of batches: 8498
  • Average batch time: 53.83s
  • Batch size: 10000
  • N. of batches sampled: 33
  • N. of failed batches: 0

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #419092 (closed)

Edited by Bala Kumar

Merge request reports