Seed dev with production like data by default

What

Use postgres generate_series as described in https://gitlab.com/gitlab-org/gitlab-ce/snippets/33946#note_23319454 to insert millions of rows into dev databases during initial setup.

Why

The performance of dev instances is so far off of production that it makes it hard to properly consider performance. Having production-like data on dev isn't as good as the real thing, but allows for much quicker iteration on performance problems and makes some types of issue less likely to slip through.

Considerations

  • Using generate_series via the active_record-pg_generate_series gem is an order of magnitude faster than bulk_insert, which in turn is an order of magnitude faster than doing this in rails.
  • Data must be similar enough to production that it forces similar query plans.
    • https://gitlab.com/gitlab-com/infrastructure/issues/44#note_20352749 "For example having 2 million rows made little difference to my query until I set changed a number of the projects to Gitlab::VisibilityLevel::PUBLIC."
    • https://gitlab.com/gitlab-com/infrastructure/issues/44#note_20537425 "Of course, it is also difficult to create a real-life data distribution in a seeded database. There will always be lots of outliers that may be difficult to reproduce."
  • Seeds must be kept up to date
  • As creating rows with generate_series bypasses rails, relations like ProjectFeature do not get created.
  • Which tables do we need to seed? How many rows are there in production? Which values do typical queries check for?
  • While some things will be improved just by having data in the tables, other things will need related data to actually be used. E.g. having a large number of issues with labels in the projects we actually have open, and not viewing things as admin

Related

  • Snippet used to slow dev down ($33946)
  • Sanitized snapshot of production database (gitlab-com/infrastructure#44)
  • Randomly delay SQL queries & git operations (#25185)
  • Cross functional teams so performance is considered during feature development (#27842)
Edited Dec 18, 2023 by Chad Woolley
Assignee Loading
Time tracking Loading