Seed dev with production like data by default
What
Use postgres generate_series
as described in https://gitlab.com/gitlab-org/gitlab-ce/snippets/33946#note_23319454 to insert millions of rows into dev databases during initial setup.
Why
The performance of dev instances is so far off of production that it makes it hard to properly consider performance. Having production-like data on dev isn't as good as the real thing, but allows for much quicker iteration on performance problems and makes some types of issue less likely to slip through.
Considerations
- Using
generate_series
via the active_record-pg_generate_series gem is an order of magnitude faster than bulk_insert, which in turn is an order of magnitude faster than doing this in rails. - Data must be similar enough to production that it forces similar query plans.
-
https://gitlab.com/gitlab-com/infrastructure/issues/44#note_20352749 "For example having 2 million rows made little difference to my query until I set changed a number of the projects to
Gitlab::VisibilityLevel::PUBLIC
." - https://gitlab.com/gitlab-com/infrastructure/issues/44#note_20537425 "Of course, it is also difficult to create a real-life data distribution in a seeded database. There will always be lots of outliers that may be difficult to reproduce."
-
https://gitlab.com/gitlab-com/infrastructure/issues/44#note_20352749 "For example having 2 million rows made little difference to my query until I set changed a number of the projects to
- Seeds must be kept up to date
- As creating rows with
generate_series
bypasses rails, relations likeProjectFeature
do not get created. - Which tables do we need to seed? How many rows are there in production? Which values do typical queries check for?
- While some things will be improved just by having data in the tables, other things will need related data to actually be used. E.g. having a large number of issues with labels in the projects we actually have open, and not viewing things as admin