Direct relationship between Pipeline Job and Environment
Problem
GitLab has many features that fetch the environments related to a particular object, and one of the biggest users is CI/CD pipelines. Specifically, Ci::Build is tightly coupled to the Environment model. For example, when you visit a job detail page, it shows which environment the job will deploy to. As another example, when rendering the play button on a manual job, GitLab checks whether the user has access to the target environment (for the ProtectedEnvironment optimization, please see this issue). Whatever the process is, we fetch the related environment in the following way:
build.persisted_environment
The internal process flow is:
- Fetch a corresponding ci_build_metadata row.
- Read ci_build_metadata.expanded_environment_name attribute.
- Fetch a matching environments row for the environment name.
- Return the AR object.
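The flow above can be sketched in plain Ruby to make the two round trips visible. This is only an illustration, not GitLab's implementation: the in-memory rows, the QueryCounter class, and the method signature are assumptions made for the example, and the real lookup is also scoped to the build's project.

```ruby
# Hypothetical in-memory stand-ins for the ci_build_metadata and environments tables.
METADATA_ROWS = {
  1 => { expanded_environment_name: "production" },
  2 => { expanded_environment_name: nil } # a job without an environment
}.freeze

ENVIRONMENT_ROWS = [
  { id: 10, name: "production" },
  { id: 11, name: "staging" }
].freeze

# Counts simulated database round trips.
class QueryCounter
  attr_reader :count

  def initialize
    @count = 0
  end

  def record!
    @count += 1
  end
end

# Mirrors the four steps: metadata row, expanded name, environments row, result.
def persisted_environment(build_id, counter)
  counter.record! # query 1: fetch the ci_build_metadata row
  metadata = METADATA_ROWS[build_id]
  name = metadata && metadata[:expanded_environment_name]
  return nil if name.nil?

  counter.record! # query 2: fetch the environments row by name
  ENVIRONMENT_ROWS.find { |row| row[:name] == name }
end

counter = QueryCounter.new
environment = persisted_environment(1, counter)
puts environment[:id] # 10
puts counter.count    # 2
```

Calling this once per build costs two queries each time, which is the cost the rest of this issue tries to remove.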
Because there is no direct relationship between Ci::Build and Environment, even this simple process has to execute two queries. The biggest problem with this architecture, however, is that we can't preload the associated environments for multiple builds in a batch or a single query. A temporary workaround such as BatchLoader (lazy loading) or Gitlab::SafeRequestStore (short-lived caching) might mitigate the issue a little, but it's a fragile approach that will likely require maintenance effort in the future.
Technically, the related environment can be fetched via the Deployment model. However, not all jobs are meant to deploy: some stop an environment, and others only prepare artifacts for one. In those cases, the deployment relationship is insufficient to cover all related environments.
This is a long-standing issue. In the past, this problematic architecture has caused performance and scalability issues from time to time, and every time we deferred the optimal solution due to lack of capacity. Here are a few of the recent discussions with the groupmemory team. We should fix the architectural problem first in order to reduce the feature maintenance cost.
Proposal: Introduce Ci::Build::Environment model
Currently, we don't persist any information about "which build interacts/interacted with which environment". What's persisted in the ci_builds table is just metadata from the environment: keyword in .gitlab-ci.yml, and this requires the system to reconstruct the actual relationship data every time. We should have a dedicated model for managing this information, and move all environment-related logic from Ci::Build to the new place.
Database table
ci_build_environments
- build_id: (FK, NOT NULL)
- environment_id: (FK, NOT NULL)
- name: (string, expanded name)
- options: (jsonb, the whole `environment` hash from .gitlab-ci.yml)
Unique index on build_id, environment_id
We also move all environment-related methods, e.g. starts_environment?, to the new place. We'll try to remove Ci::Build#persisted_environment at the same time (or maybe after a transition period).
This also addresses the scalability problem that the ci_builds.options column silently grows as we add new keywords to .gitlab-ci.yml. It's better to decouple the environment-specific part into a separate table. See the Slack discussion: https://gitlab.slack.com/archives/C0SFP840G/p1614929838210400?thread_ts=1614845826.189300&cid=C0SFP840G. (We may still leave the ci_builds.environment column as-is, because it lets us check whether an environment relationship exists without joining another table.)
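The decoupling of the environment-specific part can be pictured with a small plain-Ruby sketch. The helper name split_environment_options and the sample job options are hypothetical; the only assumption taken from the proposal is that the environment hash from .gitlab-ci.yml is what would move to the new options column.

```ruby
# Splits a job's options hash into the environment-specific part
# (a candidate for ci_build_environments.options) and everything else
# (which would stay in ci_builds.options). Hypothetical helper.
def split_environment_options(options)
  environment = options[:environment]
  remaining = options.reject { |key, _value| key == :environment }
  [environment, remaining]
end

job_options = {
  script: ["./deploy.sh"],
  environment: { name: "production", action: "start" }
}

environment_part, remaining_part = split_environment_options(job_options)
p environment_part
p remaining_part
```

With a split like this, new environment keywords would grow only the dedicated table rather than every row of ci_builds.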
Examples:
Model Relationship
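# class_name is spelled out because Rails cannot infer Ci::Build::Environment from the association name.
class Environment
  has_many :build_environments, class_name: 'Ci::Build::Environment'
  has_many :builds, through: :build_environments
  has_many :deployments
end

class Deployment
  belongs_to :environment
end

class Ci::Build
  has_one :build_environment, class_name: 'Ci::Build::Environment'
  has_one :environment, through: :build_environment
end

class Ci::Pipeline
  has_many :environments, through: :builds
end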
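With a direct relationship like this, the environments for many builds can be loaded in a fixed number of queries instead of two per build. The plain-Ruby sketch below simulates that batching without ActiveRecord; the row data and the preload_environments helper are hypothetical and only illustrate the shape of the two IN (...) lookups a preload would issue.

```ruby
# Hypothetical in-memory rows standing in for ci_build_environments and environments.
BUILD_ENVIRONMENT_ROWS = [
  { build_id: 1, environment_id: 10 },
  { build_id: 2, environment_id: 11 },
  { build_id: 3, environment_id: 10 }
].freeze

ENVIRONMENTS_BY_ID = {
  10 => { id: 10, name: "production" },
  11 => { id: 11, name: "staging" }
}.freeze

# Returns [{ build_id => environment }, number_of_simulated_queries].
def preload_environments(build_ids)
  queries = 0

  # Simulated query 1: ci_build_environments WHERE build_id IN (...)
  queries += 1
  links = BUILD_ENVIRONMENT_ROWS.select { |row| build_ids.include?(row[:build_id]) }

  # Simulated query 2: environments WHERE id IN (...)
  queries += 1
  environment_ids = links.map { |row| row[:environment_id] }.uniq
  environments = ENVIRONMENTS_BY_ID.slice(*environment_ids)

  by_build = links.to_h { |row| [row[:build_id], environments[row[:environment_id]]] }
  [by_build, queries]
end

by_build, queries = preload_environments([1, 2, 3])
puts by_build[3][:name] # production
puts queries            # 2
```

The query count stays at two no matter how many builds are passed in, which is the property the current name-based lookup cannot offer.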
Preload in serializer
app/serializers/pipeline_serializer.rb
def preloaded_relations
[
:user,
{
manual_actions: :environment,
scheduled_actions: :environment,