The performance of dev instances is so far from production that it's hard to reason about performance properly. Having production-like data on dev isn't as good as the real thing, but it allows much quicker iteration on performance problems and makes some classes of issue less likely to slip through.
As creating rows with generate_series bypasses Rails, related records like ProjectFeature do not get created.
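One way to work around that is to backfill the missing rows with plain SQL after seeding. A hedged sketch, assuming the GitLab `project_features` schema and `20` (`ProjectFeature::ENABLED`) as the default access level:

```ruby
# Hypothetical backfill for projects created via generate_series: insert the
# missing project_features rows in a single statement instead of one
# ProjectFeature.create! per project.
backfill_sql = <<~SQL
  INSERT INTO project_features
    (project_id, issues_access_level, merge_requests_access_level,
     wiki_access_level, snippets_access_level, builds_access_level)
  SELECT p.id, 20, 20, 20, 20, 20
  FROM projects p
  LEFT JOIN project_features pf ON pf.project_id = p.id
  WHERE pf.id IS NULL
SQL

# Run from a Rails console:
# ActiveRecord::Base.connection.execute(backfill_sql)
```

The anti-join (`LEFT JOIN … WHERE pf.id IS NULL`) makes the statement idempotent, so it only touches projects that are still missing their feature row.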
Which tables do we need to seed? How many rows are there in production? Which values do typical queries check for?
While some things will improve just by having data in the tables, other things will need related data to actually be exercised, e.g. having a large number of issues with labels in the projects we actually open, and not viewing everything as an admin.
This took approximately 2m7s to seed 1.5m projects, including setting default values like created_at.
```ruby
# Disable database insertion logs so speed isn't limited by ability to print to console
old_logger = ActiveRecord::Base.logger
ActiveRecord::Base.logger = nil

author = FactoryGirl.create(:user)

Project.insert_using_generate_series(1, 1_500_000) do |sql|
  project_name = raw("'seed_project_' || seq") # raw("md5(random()::text)")
  sql.name = project_name
  sql.path = project_name
  sql.creator_id = author.id
  sql.namespace_id = author.namespace_id
end

# Force a different/slower query plan by updating project visibility
Project.where(visibility_level: Gitlab::VisibilityLevel::PRIVATE)
  .limit(200_000)
  .update_all(visibility_level: Gitlab::VisibilityLevel::PUBLIC)
Project.where(visibility_level: Gitlab::VisibilityLevel::PRIVATE)
  .limit(20_000)
  .update_all(visibility_level: Gitlab::VisibilityLevel::INTERNAL)

# Reset logging
ActiveRecord::Base.logger = old_logger
```
Of course, it is also difficult to create a real-life data distribution in a seeded database. There will always be lots of outliers that may be difficult to reproduce.
Agreed, but we should do the best we can now, and then improve when we find these outliers.
@omame can we use this w.r.t. spinning up production-like topologies / environments in a separate Azure account? If I understand correctly, @rymai's team is working on this for dev, but once we have it, there's no reason not to use it more widely, right?
@ernstvn My plan, if possible, is to use db snapshots for the environments creation. This way we'd have all production data available for testing. This is already in our pipeline, as I explained to you in a call last week.
@zj suggested we could expand on this by uploading the seeded database to a shared location. This would make it easy to set up/reset the database and allow us to seed things the slow way with models, so we have relations between tables.
Steps:

- Every Sunday, seed the whole gdk/database from scratch
- Seed thousands of rows for most models with SeedFu
- Seed millions of projects/issues/etc. with generate_series
- Generate a backup as a tar archive
- Store the backup in S3, as we need a reliable and predictable URL
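The steps above could be wired into a weekly job roughly like this; a minimal sketch, assuming hypothetical rake task names (only `gitlab:backup:create` is a real GitLab task) and an illustrative S3 bucket:

```ruby
require 'date'

# Hypothetical weekly seed-and-publish pipeline. Each step is a shell
# command; the dev:mass_insert task and the bucket name are assumptions.
def weekly_seed_backup(bucket: 'gitlab-seed-backups')
  steps = [
    'bundle exec rake db:drop db:create db:schema:load', # fresh database from scratch
    'bundle exec rake db:seed_fu',                       # thousands of rows via SeedFu
    'bundle exec rake dev:mass_insert',                  # millions of projects/issues via generate_series
    'bundle exec rake gitlab:backup:create'              # writes a tar under tmp/backups/
  ]
  archive = "seed-#{Date.today.iso8601}.tar"
  steps << "aws s3 cp tmp/backups/#{archive} s3://#{bucket}/#{archive}"
end
```

Dating the archive name keeps each Sunday's snapshot at a predictable URL while still letting old snapshots stick around.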
But that's really slow, so sharing the result would be nice. (It's hard to use the same techniques as above because MRs connect to the Git repo, but if the repo is known, we could just generate the data once.)
@meks mentioned this in our Epic - we need to take it into account in our future plans. Currently the only work going on is related to canary. I would hope we work on this in Q4.
I'll take a stab at https://gitlab.com/gitlab-org/gitlab-ce/compare/master...28149-improve-seed. I wonder if we could start small (with just projects mass insertion, and users for instance) to serve as an example for further extension of the seeder. That way we don't take a huge compromise upfront and people can add extra seeders when needed. Thoughts @rymai?
Currently, seeding in general is very slow (mainly label insertion), so I believe we could use mass-insertion strategies to make things faster as well.
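For the label case, one mass-insertion strategy is a single multi-row `INSERT` instead of one `Label.create!` per row. A sketch, assuming the GitLab `labels` schema and illustrative color values:

```ruby
# Build one multi-row INSERT for a project's labels. The hex colors are
# generated purely for illustration; real seeds would pick sensible values.
def bulk_label_insert_sql(project_id, names)
  values = names.each_with_index.map do |name, i|
    "(#{project_id}, '#{name}', '#0000#{format('%02x', i)}', now(), now())"
  end
  <<~SQL
    INSERT INTO labels (project_id, title, color, created_at, updated_at)
    VALUES #{values.join(', ')};
  SQL
end

# One round-trip instead of N:
# ActiveRecord::Base.connection.execute(bulk_label_insert_sql(project.id, names))
```

This trades Rails validations and callbacks for speed, which is the same trade-off the generate_series approach already makes.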
@oswaldo That sounds good to me, and I think that was @jamedjo's original plan, which I'm totally supportive of (before advanced/complex strategies were suggested).
GitLab is moving all development for both GitLab Community Edition
and Enterprise Edition into a single codebase. The current
gitlab-ce repository will become a read-only mirror, without any
proprietary code. All development is moved to the current
gitlab-ee repository, which we will rename to just gitlab in the
coming weeks. As part of this migration, issues will be moved to the
current gitlab-ee project.
If you have any questions about all of this, please ask them in our
dedicated FAQ issue.
Using "gitlab" and "gitlab-ce" would be confusing, so we decided to
rename gitlab-ce to gitlab-foss to make the purpose of this FOSS
repository clearer.
I created a merge request for CE, and it got closed. What do I
need to do?
Everything in the ee/ directory is proprietary. Everything else is
free and open source software. If your merge request does not change
anything in the ee/ directory, the process of contributing changes
is the same as when using the gitlab-ce repository.
Will you accept merge requests on the gitlab-ce/gitlab-foss project
after it has been renamed?
No. Merge requests submitted to this project will be closed automatically.
Will I still be able to view old issues and merge requests in
gitlab-ce/gitlab-foss?
Yes.
How will this affect users of GitLab CE using Omnibus?
No changes will be necessary, as the packages built remain the same.
How will this affect users of GitLab CE that build from source?
Once the project has been renamed, you will need to change your Git
remotes to use this new URL. GitLab will take care of redirecting Git
operations so there is no hard deadline, but we recommend doing this
as soon as the projects have been renamed.
Where can I see a timeline of the remaining steps?