[Meta] - Getting to canary deployment, review apps, autoscale, life, the universe and everything else
Reasoning
We need to move faster every day. To do that, we need to enable development to test things in smaller batches and at a higher frequency, so that when we reach production we can enable new features in a safe way, increasing the quality of the software we release.
QA in production can be achieved through canary deployments or feature toggles. Either one can increase the velocity of our development and, in turn, our deployments. The path to either of them can be used to deliver multiple intermediate steps that will shorten the feedback cycle: first during development, then in staging, and finally in production. This will let us discover that we are delivering the wrong thing, or that something is not performant, way sooner in the process: while it is still in development.
To get there we will need to provide a dynamic platform where people can deploy in a self-service way, so that we (production) are not the bottleneck. This is a fantastic opportunity to fix some long-standing issues that we have been dealing with in infrastructure for quite some time, and also to future-proof our scaling ability by creating the building blocks we need to scale the infrastructure without having to scale the team.
Steps required to get to canary deployment
Bear in mind that this is a high-level overview and that we may be missing intermediate steps. Still, I'll try to describe the added value of each step.
Setting up the building blocks of the underlying infrastructure
- Use this to improve our current deployment procedure (#1739 (closed)): instead of installing the package on the current hosts, create a new fleet and switch the traffic over.
- Deploy Consul and use it for service discovery and DNS
- #1859 (closed) Move our consul instance to the new VNET and make it secure
- #1860 (closed) Install consul agent in all the fleet nodes (using chef)
- #1861 (closed) Use it with Prometheus to ditch our Chef-based service discovery
- #1485 (closed) Use it for discovering the database primary and secondaries through DNS
- #1723 (closed) Use it generally as an asset for internal DNS resolution; also used by Gitaly
- Use consul-template to dynamically configure and reconfigure the fleet instead of using chef-client with crontab.
- #1863 (closed) The first use I can think of is reconfiguring HAProxy dynamically (see the consul-template sketch after this list)
- Use vault to store secrets instead of chef-vault
- #1864 (closed) Install hashicorp vault and secure it
- #1212 (closed) Deprecate chef-vault: it has proven not to scale, being cumbersome, mysterious and error prone, stealing multiple working hours just waiting for the vault to re-encrypt whenever we need to add someone to it (like onboarding a release manager)
- #1865 (closed) Wire credentials to consul-template so we can rotate them in real time (see the sketch after this list)
- gitlab-cog/postgres#1 We would like to also use this to store TOTP seeds for given users to have strong authentication for specific operations with Marvin
- We will then roll credentials, document the process, and schedule rolling them much more often than we do now. Bonus points if we automate this credential rolling.
- Use chef-solo to detach a VM configuration from the chef server lifecycle
- #1211 (closed) Generate and configure machines with chef-solo
- This will enable us to move to branch-based development of Chef recipes and configurations while still being able to apply them to VMs, detaching the configuration of the application itself from the configuration of the hosts
- Detach the application deployment lifecycle from chef-client itself to prevent upgrading secondary hosts while we are running migrations on the blessed node (this has caused confusion and downtime multiple times already)
- #1816 (closed) Automate staging database snapshotting to allow on-demand generation of production-size staging environments.
- Staging will become a secondary database of production's primary; we will manually refresh this staging secondary on a regular schedule.
- Automate refreshing and snapshotting staging storage to keep it in sync with the database snapshots as part of the refresh process
- Use Packer to build VM and Docker images with chef-solo as a provisioner.
- This will allow us to remove multiple snowflake VMs that are just being expensive right now.
- Also, this will allow us to keep different already-configured, containerized versions of the application, which will enable us to later roll deployments of configurations forward and back.
- Use the previously Packer-built images in conjunction with Terraform
- This will allow us to provision statically configured VMs so we can replace them live by provisioning new hosts and then changing the load balancer configuration in real time, effectively reaching zero-downtime deploys (see the Terraform sketch after this list).
- Using this, if we need to roll back, we just invert the LB configuration and keep using the old hosts.
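To make the HAProxy reconfiguration above a bit more concrete, here is a minimal consul-template sketch. It assumes a local Consul agent on 127.0.0.1:8500 and a hypothetical service named `gitlab-web`; the web backend is re-rendered from Consul's catalog and HAProxy is reloaded whenever the fleet changes.

```hcl
# consul-template configuration sketch (hypothetical paths and service name)
consul {
  address = "127.0.0.1:8500"
}

template {
  # Inline template: the backend server list comes straight from Consul's catalog
  contents = <<EOT
defaults
  mode http
  timeout connect 5s
  timeout client  30s
  timeout server  30s

frontend http-in
  bind *:80
  default_backend gitlab_web

backend gitlab_web
  balance roundrobin{{ range service "gitlab-web" }}
  server {{ .Node }} {{ .Address }}:{{ .Port }} check{{ end }}
EOT

  destination = "/etc/haproxy/haproxy.cfg"
  command     = "systemctl reload haproxy"
}
```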
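For the credential wiring, the idea would be along these lines: consul-template reads a secret from Vault and re-renders the application's database credentials whenever the secret changes. The Vault address, secret path, keys and reload command below are assumptions for the sake of the example.

```hcl
# Sketch: render database credentials from Vault
# (hypothetical secret path, keys and reload command)
vault {
  address = "https://vault.service.consul:8200"
}

template {
  contents = <<EOT
{{ with secret "secret/gitlab/production/database" }}
production:
  username: "{{ .Data.username }}"
  password: "{{ .Data.password }}"
{{ end }}
EOT

  destination = "/opt/gitlab/etc/database-credentials.yml"
  command     = "gitlab-ctl restart unicorn" # placeholder for whatever reload hook fits
}
```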
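And for the Terraform part, a heavily simplified sketch of the fleet switch: both fleets are described in code and built from Packer images, and a single variable decides which one the load balancer points at. The module, variables and outputs are invented for illustration; the real code would use whatever provider we end up being hosted on.

```hcl
# Sketch: blue/green fleets built from Packer images. Flipping `active_fleet`
# and re-applying moves the traffic; flipping it back is the rollback.
variable "active_fleet" {
  default = "blue" # or "green"
}

variable "image_blue" {}
variable "image_green" {}

module "fleet_blue" {
  source     = "./modules/gitlab-fleet" # hypothetical module wrapping the VM resources
  image_id   = var.image_blue
  fleet_name = "web-blue"
  node_count = 20
}

module "fleet_green" {
  source     = "./modules/gitlab-fleet"
  image_id   = var.image_green
  fleet_name = "web-green"
  node_count = 20
}

# Only the active fleet ends up in the load balancer's backend pool.
locals {
  lb_backends = var.active_fleet == "blue" ? module.fleet_blue.private_ips : module.fleet_green.private_ips
}
```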
Using the newly built infrastructure to accelerate development
- Build a container orchestration infrastructure for front-end staging, probably using Kubernetes for this.
- It would be better if we could use a managed service to reduce complexity, but this depends on where we are being hosted.
- Use the images previously built with Packer, but this time as containers, allowing fast deployments in staging and also allowing us to start building a release process around a container orchestration tool.
- Using the staging snapshot plus the container orchestration infrastructure, and once we have a strict lifecycle for review apps, wire it all together to generate multiple staging environments, building images for each branch that will run as containers against a full-sized production database from GitLab.com itself.
- This will allow us to test database migrations per branch before even merging into master.
- Register the staging environment containers in Consul when they start running.
- Build a specific staging Prometheus server that will be connected to Consul to scrape these staging instances.
- Add black box monitoring to this Prometheus server so that when a staging instance is created we automatically start scraping it to detect performance regressions.
- We should keep one staging instance running the current production image to use as a baseline; this way we can directly compare for regressions in a single dashboard.
- Adding new URLs to this black box monitoring has to be an extremely simple process, so that developers can just add new URLs to probe before deploying a review app.
- Add alerting based on this black box monitoring so developers get notified when their review-app contains a performance regression.
At this stage, developers will have a specific dashboard where they can compare live how the feature they are working on will impact production, while it is still a branch. Additionally, on each deploy of a new review app based on a branch, we will be executing the migrations on a production-sized database.
Staging Deployment Process
With an automated staging deployment procedure we will reshape our current deployment to be like this:
- Starting from the Docker image we are provided with, we will use Packer to append everything we need to move faster on deployment. For example: OpenSSH-specific configurations, consul-template for live changes, etc. We need to own image changes because this is something we already do in production, so we will need a way to keep doing it in a containerized world (see the Packer sketch after this list).
- We will depend on having a build infrastructure that can be triggered to build from a given branch of GitLab.com, so that we can reuse our omnibus-generated images. If this is not possible, we will need to find an alternative with the build team.
- When a developer requests a review app deploy, we will pick the latest database snapshot, clone it, attach it to a staging server, mount it, and then start a new Postgres instance on a different port.
- We will then spawn a Kubernetes job to run the migrations on this production-sized database instance. If the job fails, we gather the log output and tear down the instance as it is not necessary anymore (solving the current migration testing issue).
- Once the migrations are finished (we should measure how long they took), we will push a pod with all the required containers and register this service in the service discovery system (Consul).
- With the registration in Consul, consul-template will be automatically triggered on a front-end LB (HAProxy), where the branch name will become a new subdomain pointing at this instance of GitLab staging, making the service reachable through a public URL (a sketch of such a registration follows this list).
- Additionally, with the Consul registration we will start scraping Prometheus metrics from the pod and probing it with black box monitoring, adding a label to these metrics so the developer can find them.
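To make the registration step more concrete, this is roughly what the Consul service definition for one review app could look like. The service name, id, tags, port and health check endpoint are assumptions; the branch tag is what consul-template would use to build the subdomain, and what Prometheus relabeling could turn into the per-branch label.

```hcl
# Sketch: Consul service definition registered when a review-app pod starts
# (hypothetical names, tags and ports)
service {
  name = "review-app"
  id   = "review-app-1234-fix-login" # unique per branch/pod
  tags = ["branch-1234-fix-login"]   # consumed by consul-template and Prometheus relabeling
  port = 8080

  check {
    http     = "http://localhost:8080/-/readiness"
    interval = "15s"
    timeout  = "5s"
  }
}
```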
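And for the image preparation mentioned in the first point, a Packer sketch of how we could extend the provided image with our own tweaks before pushing it to staging. The base image, file paths, consul-template version and registry are all assumptions for illustration.

```hcl
# Sketch: extend the omnibus-built Docker image with our production tweaks
# (base image, paths, versions and registry are placeholders)
source "docker" "gitlab" {
  image  = "gitlab/gitlab-ee:latest" # the omnibus-built base image
  commit = true
}

build {
  sources = ["source.docker.gitlab"]

  # Drop in our own configuration, e.g. OpenSSH hardening
  provisioner "file" {
    source      = "files/sshd_config"
    destination = "/etc/ssh/sshd_config"
  }

  # Install consul-template for live reconfiguration inside the container
  provisioner "shell" {
    inline = [
      "apt-get update && apt-get install -y unzip curl",
      "curl -sLo /tmp/ct.zip https://releases.hashicorp.com/consul-template/0.19.5/consul-template_0.19.5_linux_amd64.zip",
      "unzip /tmp/ct.zip -d /usr/local/bin",
    ]
  }

  post-processor "docker-tag" {
    repository = "registry.example.com/gitlab-staging"
    tags       = ["branch-1234-fix-login"]
  }
}
```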
This can also be used as a preliminary step to run GitLab-QA.
Production Deployment Process
We will use the previous process, except that instead of registering a pod we will just push a deployment file.
This way we will simply run the migrations in a job and then perform a rolling upgrade, removing all the current complexity from the process. If the migrations fail (they should have failed way earlier, in the staging environment), we will simply stop the deployment or run a new job to revert them.
We will have multiple opportunities to test migrations, as we will no longer have a single staging environment like we do today.
When we reach this point, we can easily add autoscale capabilities to our infrastructure.
Canary Deployment
At this stage, I have multiple concerns regarding canary deployment, and I think that we should move in the direction of feature toggles instead.
If we want to go down the path of canary deployment, we just need to reuse our staging deployment model: spawn a pod with a specific image version and direct traffic to it by using an HTTP header with a token, for example (see the routing sketch below).
But given how our development process works right now, I think that something like this would increase complexity. That's why I think we need to move to a continuous delivery model rather than canary deploys: we just deploy things to production and activate them once we know they are stable, which is when we release them in the package turned on by default.
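For reference, the header-based routing could be expressed as another consul-template-rendered HAProxy fragment along these lines (the backend names, service names and header are made up for the example):

```hcl
# Sketch: route requests carrying a canary token header to the canary pod,
# everything else to the regular fleet (hypothetical names)
template {
  contents = <<EOT
frontend http-in
  bind *:80
  acl is_canary req.hdr(X-GitLab-Canary-Token) -m found
  use_backend gitlab_canary if is_canary
  default_backend gitlab_web

backend gitlab_canary
  balance roundrobin{{ range service "gitlab-canary" }}
  server {{ .Node }} {{ .Address }}:{{ .Port }} check{{ end }}

backend gitlab_web
  balance roundrobin{{ range service "gitlab-web" }}
  server {{ .Node }} {{ .Address }}:{{ .Port }} check{{ end }}
EOT

  destination = "/etc/haproxy/haproxy.cfg"
  command     = "systemctl reload haproxy"
}
```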
Questions
- How does omnibus-gitlab fit into this picture?
We will need to move away from the omnibus-gitlab Debian package and start using our Docker image for whatever is cattle. Pets will keep working with the Debian package for now.
- For the first iteration, are we just going to take the giant Docker image and connect Redis/PostgreSQL separately as we do now?
Yes, we will start using our giant image, but we will use it as a base image that we tweak to make it production-ready, the same way we do with Chef today by applying changes to the deployed package for things like OpenSSH. We can't lose this ability, which is why we will work with Packer to control the images we push, to staging first and then to production.
Redis and PostgreSQL will remain on virtual hosts for now, mainly because of the size of these services. If we can reduce the size of Redis by splitting it into multiple instances, then we may start running Redis as a specific pod.
I don't think that Postgres will be containerized anytime soon, at least not during the initial iterations.