
Optimise getting the pipeline status of commits

Yorick Peterse requested to merge ci-pipeline-status-query into master

This optimises how we get the CI pipeline status of commits. Currently this optimisation is only applied on the dashboard (when showing the latest commit) and on the project commits page.

This works by scoping the pipeline status query not only to a ref but also to a limited set of SHAs, reducing the number of rows we have to process.

See https://gitlab.com/gitlab-org/gitlab-ce/issues/36878 for more information.
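
In code, the shape of the change is roughly the following. This is only a sketch of the call pattern, given a project and a list of commits; latest_status_per_commit is the method discussed below, and calling it through project.pipelines (to scope it to the project) is an assumption on my part:

# Old: Commit#status issues one query per commit.
commits.each(&:status)

# New (sketch): a single query returns a { sha => status } hash for the
# whole set of SHAs.
statuses = project.pipelines.latest_status_per_commit(commits.map(&:sha))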

Query Example

An example of the new query (as produced by Ci::Pipeline.latest_status_per_commit) is as follows:

SELECT ci_pipelines.sha, ci_pipelines.status
FROM ci_pipelines
LEFT OUTER JOIN ci_pipelines ci_pipelines_2
ON ci_pipelines.sha = ci_pipelines_2.sha
AND ci_pipelines.project_id = ci_pipelines_2.project_id
AND ci_pipelines.id < ci_pipelines_2.id
WHERE ci_pipelines.project_id = 13083
AND ci_pipelines.sha IN (
    '185707fb58fba489af3aef011b4eed6d81f4902f',
    '22c66789f62d07cc6f729320de2f7fef0b29a414',
    '2e13dafbe822077280a5e30d9807963af457e20e',
    '32c2688e8284760a20b3bdb7d645c978c6fc9268',
    '3c12105a3bf46181ae61d5052b407966d638f072',
    '3d24c01a1925475b35d072e0cad971939a9e8564',
    '419aac178a559c2cf3b86fe681df79e4cf3583b1',
    '45e1939e190ef6a173fa5fd4f9a20f99f3a839d0',
    '6510d96c42c8db534615175abbb6a765710c9d25',
    '9b1e9da986267e3d5c910ca523d67f40be7b4878',
    'ade23934a17daac397beadc0b7346440cba41926',
    'bf3a8f6463451928a3efc5e6fba2295fcb6b41b2',
    'c608c25003906da8c934b99fc0d1bdf6f8e21c90',
    'cf6ff601656b4de368cc9fe17543e8d611368a46',
    'f63178b598b839b72af57a025eee6d50c21794e8'
)
AND ci_pipelines_2.id IS NULL;

On production this will produce the following plan (as per EXPLAIN (BUFFERS, ANALYZE)):

 Nested Loop Anti Join  (cost=1.12..49.18 rows=1 width=48) (actual time=0.037..0.304 rows=15 loops=1)
   Buffers: shared hit=150
   ->  Index Scan using index_ci_pipelines_on_project_id_and_sha on ci_pipelines  (cost=0.56..40.56 rows=1 width=56) (actual time=0.025..0.218 rows=15 loops=1)
         Index Cond: ((project_id = 13083) AND ((sha)::text = ANY ('{185707fb58fba489af3aef011b4eed6d81f4902f,22c66789f62d07cc6f729320de2f7fef0b29a414,2e13dafbe822077280a5e30d9807963af457e20e,32c2688e8284760a20b3bdb7d645c978c6fc9268,3c12105a3bf46181ae61d5052b407966d638f072,3d24c01a1925475b35d072e0cad971939a9e8564,419aac178a559c2cf3b86fe681df79e4cf3583b1,45e1939e190ef6a173fa5fd4f9a20f99f3a839d0,6510d96c42c8db534615175abbb6a765710c9d25,9b1e9da986267e3d5c910ca523d67f40be7b4878,ade23934a17daac397beadc0b7346440cba41926,bf3a8f6463451928a3efc5e6fba2295fcb6b41b2,c608c25003906da8c934b99fc0d1bdf6f8e21c90,cf6ff601656b4de368cc9fe17543e8d611368a46,f63178b598b839b72af57a025eee6d50c21794e8}'::text[])))
         Buffers: shared hit=75
   ->  Index Scan using index_ci_pipelines_on_project_id_and_sha on ci_pipelines ci_pipelines_2  (cost=0.56..4.59 rows=1 width=49) (actual time=0.005..0.005 rows=0 loops=15)
         Index Cond: ((ci_pipelines.project_id = project_id) AND (project_id = 13083) AND ((ci_pipelines.sha)::text = (sha)::text))
         Filter: (ci_pipelines.id < id)
         Rows Removed by Filter: 1
         Buffers: shared hit=75
 Planning time: 0.535 ms
 Execution time: 0.365 ms
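
For reference, a scope along these lines can generate that kind of anti-join in ActiveRecord. This is a sketch of the technique only, reusing the method name from above; the actual implementation in the MR may differ:

module Ci
  class Pipeline < ActiveRecord::Base
    # Sketch: returns { sha => status } for the latest pipeline of each given
    # SHA. The LEFT JOIN pairs every pipeline with any newer pipeline for the
    # same project and SHA; rows without a newer pipeline
    # (ci_pipelines_2.id IS NULL) are the latest ones.
    def self.latest_status_per_commit(shas)
      join_sql = <<-SQL
        LEFT OUTER JOIN ci_pipelines ci_pipelines_2
          ON ci_pipelines.sha = ci_pipelines_2.sha
          AND ci_pipelines.project_id = ci_pipelines_2.project_id
          AND ci_pipelines.id < ci_pipelines_2.id
      SQL

      joins(join_sql)
        .where(sha: shas)
        .where('ci_pipelines_2.id IS NULL')
        .pluck(:sha, :status)
        .to_h
    end
  end
end

Called through a project's pipelines association, e.g. project.pipelines.latest_status_per_commit(shas), this yields SQL of the shape shown in the query example.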

Performance Difference

The old approach required one query per commit, though technically it might have been possible to use a single query for multiple commits. Regardless, even getting the data for a single commit performed poorly.

In the best case it would take around 20 milliseconds per commit to get the status, meaning we'd need 390 milliseconds just to get the status of all commits on the commits list (https://gitlab.com/gitlab-org/gitlab-ce/commits/master) page. To measure this I ran the following benchmark on production:

shas = [
  '185707fb58fba489af3aef011b4eed6d81f4902f',
  '22c66789f62d07cc6f729320de2f7fef0b29a414',
  '2e13dafbe822077280a5e30d9807963af457e20e',
  '32c2688e8284760a20b3bdb7d645c978c6fc9268',
  '3c12105a3bf46181ae61d5052b407966d638f072',
  '3d24c01a1925475b35d072e0cad971939a9e8564',
  '419aac178a559c2cf3b86fe681df79e4cf3583b1',
  '45e1939e190ef6a173fa5fd4f9a20f99f3a839d0',
  '6510d96c42c8db534615175abbb6a765710c9d25',
  '9b1e9da986267e3d5c910ca523d67f40be7b4878',
  'ade23934a17daac397beadc0b7346440cba41926',
  'bf3a8f6463451928a3efc5e6fba2295fcb6b41b2',
  'c608c25003906da8c934b99fc0d1bdf6f8e21c90',
  'cf6ff601656b4de368cc9fe17543e8d611368a46',
  'f63178b598b839b72af57a025eee6d50c21794e8'
]

project = Project.find(13083)
commits = shas.map { |sha| project.commit(sha) }
timing = Benchmark.measure { commits.each(&:status) }.real * 1000.0

puts "Time for #{shas.length} commits: #{timing} ms"
puts "Average: #{(timing / shas.length).round(2)} ms"

This outputs:

Time for 15 commits: 241.09070608392358 ms
Average: 16.07 ms

This new approach introduces two improvements:

  1. The new query is faster.
  2. The new query can check the status of multiple commits at once, meaning we only need one SQL query for the whole list instead of one query per commit.

If we were to re-run the above benchmark using the new code (without performing any eager loading of the commit statuses), the output would be:

Time for 15 commits: 44.11310190334916 ms
Average: 2.94 ms

If we eager load the commit statuses (as is done on project dashboards and the commits list page), the output, including the time needed to query the status of all commits, is:

Time for 15 commits: 9.885445702821016 ms
Average: 0.66 ms

That's roughly a 24x improvement over the old code.
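
For illustration, an eager-loaded variant of the earlier benchmark could look roughly like this. It assumes latest_status_per_commit returns a SHA-to-status hash and is callable through project.pipelines; both are assumptions, since the MR description doesn't show the call site:

# A single query fetches the latest status for every SHA up front, so
# reading a commit's status no longer triggers one query per commit.
timing = Benchmark.measure do
  statuses = project.pipelines.latest_status_per_commit(shas)
  commits.each { |commit| statuses[commit.sha] }
end.real * 1000.0

puts "Time for #{shas.length} commits: #{timing} ms"
puts "Average: #{(timing / shas.length).round(2)} ms"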

Database Checklist

When adding or modifying queries to improve performance:

  • Included the raw SQL queries of the relevant queries
  • Included the output of EXPLAIN ANALYZE and execution timings of the relevant queries
  • Added tests for the relevant changes

General Checklist

