Unverified Commit 05571984 authored by Jamie Maynard

Move infrastructure files in to place

parent 1951f234
1 merge request: !1712 Migrated infrastructure from www-gitlab-com to here
Showing
with 201 additions and 38 deletions
---
title: infrastructure
cascade:
type: docs
menu:
main:
name: infrastructure
pre: '<i class="fa-brands fa-gitlab"></i>'
title: "Infrastructure"
description: "The Infrastructure Department is responsible for the availability, reliability, performance, and scalability of GitLab.com and other supporting services"
---
## Mission
The Infrastructure Department enables GitLab (the company) to deliver a single DevOps application, and GitLab SaaS users to focus on generating value for their own businesses by ensuring that we operate an enterprise-grade SaaS platform.
The Infrastructure Department does this by focusing on **availability**, **reliability**, **performance**, and **scalability** efforts.
These responsibilities have cost efficiency as an additional driving force, reinforced by the properly prioritized [**dogfooding**](#dogfooding) efforts.
Many other teams also contribute to the success of the SaaS platform because [GitLab.com is not a role](/company/team/structure/#gitlabcom-isnt-a-role).
However, it is the responsibility of the Infrastructure Department to drive the ongoing evolution of the SaaS platform, enabled by platform observability data.
## Vision
The Infrastructure Department operates a fast, secure, and reliable SaaS platform to which (and with which) [everyone can contribute][contribute].
An integral part of this vision is to:
1. Build a highly performant team of engineers, combining operational and software development experience to influence the best in reliable infrastructure.
1. Work publicly in accordance with our [transparency] value.
1. [Use our own product](#dogfooding) to prepare, build, deliver work, and support [the company strategy][strategy].
1. Align our [strategy](#strategy) with the industry trends, company direction, and end customer needs.
## Direction
The direction is accomplished by using [Objectives and Key Results (OKRs)](https://about.gitlab.com/handbook/engineering/infrastructure-quality/okrs/).
Other strategic initiatives to achieve this vision are driven by the needs of enterprise customers looking to adopt GitLab.com. [The GitLab.com strategy](https://about.gitlab.com/direction/enablement/dotcom/) catalogs top customer requests for the SaaS offering and outlines strategic initiatives across both Infrastructure and Stage Groups needed to address these gaps.
<%= partial "includes/we-are-also-product-development.md" %>
## Organization structure
(click the boxes for more details)
```mermaid
flowchart LR
I[Infrastructure]
click I "/handbook/engineering/infrastructure/"
I --> TPM
I --> EP[Engineering Productivity]
click EP "/handbook/engineering/quality/engineering-productivity/"
I --> C[Core Platform]
click C "/handbook/engineering/infrastructure/core-platform/"
I --> EA[Engineering Analytics]
click EA "/handbook/engineering/quality/engineering-analytics/"
I --> TP[Test Platform]
click TP "https://about.gitlab.com/handbook/engineering/infrastructure/test-platform/"
I --> SP[SaaS Platforms]
C --> SS[Systems Stage]
click SS "/handbook/engineering/infrastructure/core-platform/systems/"
SS --> GC[Gitaly::Cluster]
click GC "/handbook/engineering/infrastructure/core-platform/systems/gitaly/"
SS --> GG[Gitaly::Git]
click GG "/handbook/engineering/infrastructure/core-platform/systems/gitaly/"
SS --> Geo
click Geo "/handbook/engineering/infrastructure/core-platform/systems/geo/"
SS --> DB[Distribution::Build]
click DB "/handbook/engineering/infrastructure/core-platform/systems/distribution/"
SS --> DD[Distribution::Deploy]
click DD "/handbook/engineering/infrastructure/core-platform/systems/distribution/"
SS --> CC[Cloud Connector]
click CC "/handbook/engineering/infrastructure/core-platform/systems/cloud-connector/"
C --> DS[Data Stores Stage]
click DS "/handbook/engineering/infrastructure/core-platform/data_stores/"
DS --> TS[Tenant Scale]
click TS "/handbook/engineering/infrastructure/core-platform/data_stores/tenant-scale/"
DS --> Database
click Database "/handbook/engineering/infrastructure/core-platform/data_stores/database/"
DS --> GS[Global Search]
click GS "/handbook/engineering/infrastructure/core-platform/data_stores/search/"
SP --> DE[Delivery]
click DE "/handbook/engineering/infrastructure/team/delivery/"
DE --> Deployments
DE --> Releases
SP --> Ops
SP --> Foundations
SP --> Scalability
click Scalability "/handbook/engineering/infrastructure/team/scalability/"
Scalability --> Observability
Scalability --> Practices
SP --> D[Dedicated]
click D "/handbook/engineering/infrastructure/team/gitlab-dedicated/"
D --> E[Environment Automation]
click E "/handbook/engineering/infrastructure/team/gitlab-dedicated/"
D --> PSS[Public Sector Services]
click PSS "/handbook/engineering/infrastructure/team/gitlab-dedicated/us-public-sector-services/"
D --> Switchboard
click Switchboard "/handbook/engineering/infrastructure/team/gitlab-dedicated/switchboard/"
TP --> TTI[Test and Tools Infrastructure]
click TTI "/handbook/engineering/infrastructure/test-platform/dev-qe-team/"
TP --> SMP[Self-Managed Platform]
click SMP "/handbook/engineering/infrastructure/test-platform/enablement-saas-platforms-qe-team/"
TP --> TE[Test Engineering]
click TE "/handbook/engineering/infrastructure/test-platform/fulfillment-growth-qe-team/"
```
## Design
The [**Infrastructure Library**][library] contains documents that outline our thinking about the problems we are solving and represents the ***current state*** for any topic, playing a significant role in how we produce technical solutions to meet the challenges we face.
## Dogfooding
The Infrastructure department uses GitLab and GitLab features extensively as the main tool for operating many [environments](/handbook/engineering/infrastructure/environments/), including GitLab.com.
We follow the same [dogfooding process](/handbook/engineering/development/principles/#dogfooding) as part of the Engineering function, while keeping the [department mission statement](#mission) as the primary prioritization driver. The prioritization process is aligned to [the Engineering function level prioritization process](/handbook/engineering/#prioritizing-technical-decisions), which defines where the priority of dogfooding lies with regard to other technical decisions the Infrastructure department makes.
When we consider building tools to help us operate GitLab.com, we follow the [`5x rule`](/handbook/product/product-processes/dogfooding-for-product-mgt/#dogfooding-process) to determine whether to build the tool as a feature in GitLab or outside of GitLab. To track Infrastructure's contributions back into the GitLab product, we tag those issues with the appropriate [Dogfooding](https://gitlab.com/groups/gitlab-com/-/labels?utf8=%E2%9C%93&subscribed=&search=dogfooding) label.
## Handbook use at the Infrastructure department
At GitLab, we have a [handbook first policy](/handbook/handbook-usage/#why-handbook-first). It is how we communicate process changes, and how we build up a single source of truth for work that is being delivered every day.
The [handbook usage page guide](/handbook/handbook-usage/) lists a number of general tips. The ones encountered most frequently in the Infrastructure department are:
1. The wider community can benefit from training materials, architectural diagrams, technical documentation, and how-to documentation. A good place for this detailed information is in the related project documentation. A handbook page can contain a high level overview, and link to more in-depth information placed in the project documentation.
1. Think about the audience consuming the material in the handbook. A detailed run through of a GitLab.com operational runbook in the handbook might provide information that is not applicable to self-managed users, potentially causing confusion. Additionally, the handbook is not a go-to place for operational information, and grouping operational information together in a single place while explaining the general context with links as a reference will increase visibility.
1. Ensure that the handbook pages are easy to consume. Checklists, onboarding, and repeatable tasks should either be automated or created in the form of a template that can be linked from the handbook.
1. The handbook is the process. The handbook describes our principles, and our epics and issues are our principles put into practice.
## Projects
Classification of the Infrastructure department projects is described on the [infrastructure department projects page](/handbook/engineering/infrastructure/projects).
The [infrastructure issue tracker](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues) is the backlog and a catch-all project for the infrastructure teams. It tracks the work our teams are doing that is unrelated to an ongoing change or incident.
In addition to tracking the backlog, Infrastructure Department projects are captured in our [Infrastructure Department Epic](https://gitlab.com/groups/gitlab-com/-/epics/1049) as well as in our [Quarterly Objectives & Key Results](https://gitlab.com/groups/gitlab-com/-/epics/1420).
## Supporting Product Features
We have a model that we use to help us support product features. [This model](/handbook/engineering/infrastructure/feature-support.html) provides details on how we collaborate to ship new features to Production.
## Ownership
The Infrastructure team maintains responsibility for the underlying infrastructure on which customer-facing services run. Specific ownership details are in the [GitLab Service Ownership Policy](./service-ownership/index.html).
## Stable Counterparts
Infrastructure SREs may be aligned with [stage groups](/handbook/product/categories/#categories-a-z) as [stable counterparts](https://about.gitlab.com/blog/2018/10/16/an-ode-to-stable-counterparts/).
[Stable Counterparts](./team/stable-counterpart.html) are used as a framework for managing reliable services at GitLab. The framework provides guidelines for collaboration between [Stage Groups](/handbook/product/categories/#categories-a-z) and [Infrastructure Teams](/handbook/engineering/infrastructure-quality/#engaging-with-the-infrastructure-teams).
## Interviewing
The Infrastructure department hires for a number of different technical specialisms and positions across its teams. This [Infrastructure Interviewing Guide](https://about.gitlab.com/handbook/hiring/interviewing/infrastructure-interview/) offers more detail on some of our regular openings, interview process and other useful information related to applying to jobs with us. More information on our current openings can be found on the [careers page](https://about.gitlab.com/jobs/).
<%= partial "handbook/engineering/infrastructure/_common_links.html" %>
[library]: https://gitlab.com/gitlab-com/gl-infra/readiness/-/tree/master/library
[strategy]: /company/strategy/
[transparency]: /handbook/values/#transparency
[contribute]: /company/mission/#everyone-can-contribute
---
aliases: /handbook/engineering/infrastructure/core-platform/data_stores/database-reliability/dbre-escalation/process.html
title: DBRE Escalation Process
---
......@@ -36,7 +36,7 @@ The expectation for the DBRE engineers is to be a database consultant and collab
1. This process is **NOT** a path to reach the DBRE team for non-urgent issues that the Development, Security, and Support teams run into. Such issues can be moved forward by:
1. Labelling with `team::Database Reliability` and following the [Reliability General Workflow](/handbook/engineering/infrastructure/team/reliability/#how-we-work--general-workflow)
1. Raising to the `#g_infra_database_reliability` Slack channel, or
1. Asking in the infrastructure-lounge Slack channel and mentioning the `@dbre` user group
1. This process provides weekday coverage only.
#### Example of qualified issue
......
---
aliases: /handbook/engineering/infrastructure/core-platform/data_stores/database/activity-log.html
title: Database Group Activity Log
---
......
---
aliases: /handbook/engineering/infrastructure/core-platform/data_stores/database/doc/citus.html
title: Sharding GitLab with CitusDB
---
......@@ -12,11 +12,11 @@ title: Sharding GitLab with CitusDB
This is a working document to outline the decision making process with respect to using CitusDB as a database sharding solution for GitLab on GitLab.com.
### Citus Community
We were exploring the [Citus Community](https://www.citusdata.com/product/community) offering as part of our efforts to [explore CitusDB as a sharding solution](https://gitlab.com/gitlab-org/gitlab/issues/207833). Citus Community is licensed under the [GNU Affero General Public License v3.0 (GNU AGPLv3)](https://github.com/citusdata/citus/blob/master/LICENSE). GNU AGPLv3 is listed in our handbook as an [unacceptable license](https://docs.gitlab.com/ee/development/licensing.html#unacceptable-licenses) requiring legal approval for use.
On April 15, 2020, we sought advice from our legal counsel. There were enough questions and concerns about the GNU AGPLv3 license that we decided to discontinue our research into using Citus Community. *Notes and agenda can be found [here](https://docs.google.com/document/d/1wzcpd10uOgY41fub8HZBN0E5VusrRKIgWiS9X-klJpY/edit) (only accessible to GitLab team-members).*
### Citus Enterprise
April 20, 2020 - Due to licensing costs we have decided not to pursue Citus Enterprise for a sharding solution. We will focus our efforts on PostgreSQL Partitioning with foreign data wrappers (FDW).
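As a rough illustration of what "partitioning with foreign data wrappers" can look like, the sketch below combines declarative partitioning with `postgres_fdw` so that one partition lives on a remote server. The server name `shard1`, the host, the credentials, and the table and column names are all hypothetical and chosen only for this example. The choice of hash partitioning is also an assumption. This is not the schema or the design used on GitLab.com.

```sql
-- Hypothetical names throughout; a sketch only, not the GitLab.com schema.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- A remote PostgreSQL server that will hold one shard.
CREATE SERVER shard1
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'shard1.example.internal', dbname 'gitlab_shard');

-- Credentials used when the local server connects to shard1.
CREATE USER MAPPING FOR CURRENT_USER
  SERVER shard1
  OPTIONS (user 'app', password 'secret');

-- Parent table, partitioned by an assumed sharding key.
CREATE TABLE issues_sharded (
  id                bigint NOT NULL,
  root_namespace_id bigint NOT NULL,
  title             text
) PARTITION BY HASH (root_namespace_id);

-- One partition stored locally ...
CREATE TABLE issues_sharded_p0
  PARTITION OF issues_sharded
  FOR VALUES WITH (MODULUS 2, REMAINDER 0);

-- ... and one partition that is a foreign table on the remote server.
-- A matching table named issues_sharded_p1 must already exist on shard1.
CREATE FOREIGN TABLE issues_sharded_p1
  PARTITION OF issues_sharded
  FOR VALUES WITH (MODULUS 2, REMAINDER 1)
  SERVER shard1
  OPTIONS (table_name 'issues_sharded_p1');
```

Queries against `issues_sharded` are then routed to the local or the remote partition depending on the partitioning key. Reads and writes that touch the remote partition go through the foreign data wrapper.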
---
aliases: /handbook/engineering/infrastructure/core-platform/data_stores/database/doc/container-registry.html
title: Container Registry on PostgreSQL
---
......@@ -36,7 +36,7 @@ The "dedup factor" is basically how much more entries we expect if we didn't ded
| Blob | Layer | 1.53 |
| Blob | Repository | 1.17 |
Example for Manifest:
```sql
select (select count(*) from repository_manifests) / (select count(*)::numeric from manifests);
```
......@@ -119,7 +119,7 @@ CREATE FUNCTION public.track_blob_usage() RETURNS trigger
AS $$
BEGIN
IF (TG_OP = 'DELETE') THEN
-- TODO: We can do more stuff here, this is just for illustrative purposes.
-- Note: This doesn't have to be a trigger, it can also be application logic
IF (SELECT COUNT(*) FROM blobs_layers WHERE id <> OLD.id AND digest = OLD.digest AND layer_id IS NOT NULL) = 0 THEN
INSERT INTO blob_review_queue (digest) VALUES (OLD.digest) ON CONFLICT (digest) DO NOTHING;
......@@ -156,7 +156,7 @@ For example, when we delete a layer - we can determine the affected blobs easily
###### Benefit 2: No GC needed for entities other than blobs
Blob management has a need for (some) GC algorithm, because we effectively deduplicate data in object storage. However, other entities like manifests and layers don't have a need to perform GC in this model.
This is in contrast to model 1 where we effectively allow a record to become "dangling" because we deduplicate all entities in the database, too.
......
---
aliases: /handbook/engineering/infrastructure/core-platform/data_stores/database/doc/fdw-sharding.html
title: PostgreSQL 11 sharding with foreign data wrappers and partitioning
---
......@@ -75,7 +75,7 @@ We can continue to work with the `issues` table as usual. If we added more immut
### Schema migrations
In order to change the existing schema, we've discussed two examples: [Adding](https://gitlab.com/gitlab-org/gitlab/-/tree/611126f9ea4f27be3e597aa2384a801319db1793/db/sharding/add_column) and [dropping](https://gitlab.com/gitlab-org/gitlab/-/tree/611126f9ea4f27be3e597aa2384a801319db1793/db/sharding/drop_column) a column.
<figure class="video_container">
<iframe src="https://www.youtube.com/watch?v=nt4Khi9Gr3o" frameborder="0" allowfullscreen="true"> </iframe>
......@@ -123,7 +123,7 @@ However, this setup comes with a lot of complexity and limitations compared to a
10. Updates to reference tables still need to go to the main cluster, creating a bottleneck for write scalability and a single point of failure.
11. No global transaction management.
In order to really benefit from this approach, we'd have to shard relevant tables by the same dimension and make it possible to execute queries directly on shards.
With a lot of different access patterns in GitLab, this does not seem feasible before we agree on an application-wide sharding key and deal with conflicting access patterns (e.g. by means of service extraction or isolation).
......
---
aliases: /handbook/engineering/infrastructure/core-platform/data_stores/database/doc/gitlab-com-database.html
title: Working with the GitLab.com database for developers
---
......
---
aliases: /handbook/engineering/infrastructure/core-platform/data_stores/database/doc/issue-group-search-partitioning.html
title: Partitioning - Issue group search
---
......@@ -64,7 +64,7 @@ The results are:
### Table and index sizes
As expected, individual [sizes of partitions and their indexes](https://gitlab.com/gitlab-org/gitlab/-/issues/201871#note_301733476) are greatly reduced. Notably, [indexes on partitions are between 10x and 100x smaller](https://gitlab.com/gitlab-org/database-team/team-tasks/snippets/1947682#table-and-index-sizes) than on the non-partitioned table. This is where the main benefit from partitioning comes from: If we're able to statically exclude most of the partitions for a query, we're only going to deal with small indexes and small tables.
### Planning times
......@@ -79,21 +79,21 @@ Furthermore we're looking at two cases here:
1. Query with partitioning key: `WHERE project_id = ? AND root_namespace_id = ?`
2. Query without partitioning key: `WHERE project_id = ?`
The second case obviously doesn't benefit from partitioning during execution as it scans all partitions.
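To make the two cases concrete, here is a minimal sketch on a hypothetical hash-partitioned table. The table `issues_part`, its columns, and the partition count are assumptions for illustration and do not reproduce the schema used in the experiment. With the partitioning key in the filter, the planner can exclude all but one partition. Without it, every partition appears in the plan.

```sql
-- Hypothetical table for illustration only.
CREATE TABLE issues_part (
  id                bigint NOT NULL,
  project_id        bigint NOT NULL,
  root_namespace_id bigint NOT NULL
) PARTITION BY HASH (root_namespace_id);

CREATE TABLE issues_part_0 PARTITION OF issues_part
  FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE issues_part_1 PARTITION OF issues_part
  FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE issues_part_2 PARTITION OF issues_part
  FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE issues_part_3 PARTITION OF issues_part
  FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Case 1: the partitioning key is part of the filter, so the plan
-- should reference a single partition.
EXPLAIN
SELECT * FROM issues_part
WHERE project_id = 1 AND root_namespace_id = 42;

-- Case 2: no partitioning key in the filter, so the plan has to
-- include a scan of every partition.
EXPLAIN
SELECT * FROM issues_part
WHERE project_id = 1;
```

Comparing the two `EXPLAIN` outputs is a quick way to check whether a given query shape can benefit from pruning.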
Starting with the simple query, this shows a rather expected result. As we can see the planning time depends on the number of attached partitions and increases slightly the more partitions we attach. The first attempt shows elevated planning times due to a cold cache (table metadata, statistics).
![simple-query-stats1000](issue-group-search-partitioning/simple-query-stats1000.png)
![simple-query-stats1000](simple-query-stats1000.png)
In the second example, we employ the same analysis but look at the complex issue group search example. This yielded quite unexpected planning times for the case without a partitioning key. This would drastically harm queries that don't have a partitioning key as a filter.
![simple-query-stats1000](issue-group-search-partitioning/complex-query-stats1000.png)
![simple-query-stats1000](complex-query-stats1000.png)
We suspected this might be due to gathering rather large statistics. On GitLab.com, we currently have `default_statistics_target = 1000`, which is 10x the default PostgreSQL setting. It directly controls the amount of detail the table histograms are going to have and therefore has a direct impact on the data that is relevant for query planning.
After dialing this down to `default_statistics_target = 100` (the default setting), we arrive at more reasonable query timings. Luckily, this setting can be controlled on a per-table basis as well.
![simple-query-stats1000](issue-group-search-partitioning/complex-query-stats100.png)
![simple-query-stats1000](complex-query-stats100.png)
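For reference, a minimal sketch of how the statistics target can be lowered without changing the server-wide setting. In PostgreSQL the override is set per column, so controlling it for a table means setting it on the relevant columns. The table `issues` and the column `title` are used here purely as an example and are not taken from the experiment.

```sql
-- Show the current default for this session.
SHOW default_statistics_target;

-- Override the target for a single column (example names only).
ALTER TABLE issues ALTER COLUMN title SET STATISTICS 100;

-- The new target only takes effect once statistics are gathered again.
ANALYZE issues;

-- Setting the value to -1 reverts the column to the system default.
ALTER TABLE issues ALTER COLUMN title SET STATISTICS -1;
```

A per-column override keeps detailed statistics where they help and reduces them on columns where larger histograms mainly add planning cost.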
All data for these graphs can be found in a [public sheet](https://docs.google.com/spreadsheets/d/1MUc-Ogkal5XI2KKSeyn8m3nbdXuzNZ_h1-AHK0Ql3JE/edit?usp=sharing).
......@@ -101,7 +101,7 @@ All data for these graphs can be found in a [public sheet](https://docs.google.c
1. Table partitioning helps to reduce execution times as expected.
2. It increases planning times for cases where most partitions get pruned. This is expected too and pays off when the subsequent execution time can be drastically reduced.
3. We need to be careful with queries that don't have a partitioning key.
##### Queries without a partitioning key
......@@ -117,7 +117,7 @@ This means we should make sure that - for a partitioned table - as many queries
2. We may want to find mechanics to detect and find queries early in the development cycle (i.e. on CI) that don't contain the partitioning key.
3. There are going to be cases where we can't use a partitioning key at all.
For example, finding issues assigned to a user doesn't come with a notion of a namespace but rather from a user's perspective. We will have to identify these cases and resolve them, for example by extracting their features into a separate service or duplicating data internally (e.g. the assignments table, to support both perspectives efficiently). In some cases, it might be possible to accept the increase in planning time.
### Working page for PostgreSQL table partitioning
......
---
aliases: /handbook/engineering/infrastructure/core-platform/data_stores/database/doc/lexicon.html
title: Database Lexicon - terms and definitions relating to our database
description: "This is not a comprehensive list of all of the commonly used terms, but rather it is a list of terms that are commonly confused or conflated with other terms. In each section we will identify common phrases, define our specific usage and list external references for the term in question."
---
......@@ -37,4 +37,4 @@ Foreign Data Wrappers allow for accessing remote data from multiple data sources
### Partitioning with FDW = Sharding
With PostgreSQL, partitioning is the first implementation step taken on the path to sharding using Foreign Data Wrappers (FDW). More information and implementation details can be found in the Database Team Handbook Entry: [PostgreSQL 11 sharding with foreign data wrappers and partitioning](/handbook/engineering/infrastructure/core-platform/data_stores/database/doc/fdw-sharding.html)
---
aliases: /handbook/engineering/infrastructure/core-platform/data_stores/database/doc/multidb-bg-migrations.html
title: Multi-database Background migrations
---
......
---
aliases: /handbook/engineering/infrastructure/core-platform/data_stores/database/doc/partitioning.html
title: Database Partitioning
---
......@@ -22,7 +22,7 @@ We can see here that there are individual tables larger than 100 GB, some even g
There is one notable table that we can ignore in this discussion: `merge_request_diff_files`, which is the largest table and makes up more than 30% of the total database size today. This is being addressed by [externalizing this data](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7356) and moving it into cloud storage.
The other tables shown here are where GitLab's most used features have their data:
* CI: Build jobs, trace sections, artifacts and pipelines
* Merge requests: Descriptions, commits and notes
......@@ -108,7 +108,7 @@ We analyzed how issues distribute across partitions using a hash-based partition
### Implementation roadmap
This section discusses several topics we already identified that have to be tackled along the way.
##### Support for advanced PostgreSQL features
......@@ -174,4 +174,4 @@ Hash-based partitioning is a static approach in context of partition creation an
------
Author: [Andreas Brandl](https://gitlab.com/abrandl)
---
aliases: /handbook/engineering/infrastructure/core-platform/data_stores/database/doc/root-namespace-sharding.html
title: Sharding GitLab by top-level namespace
---
......