Netflix Ops for Everyone Else: NOFEE

Problem to solve

Operations is hard. Building world-class infrastructure that can withstand whole-region outages is incredibly hard, yet everyone deserves this kind of reliability. Let's make it easy to get Netflix-level operations just by using default best practices built into GitLab.

Target audience

Operations manager, CTO

Further details

Riffing on https://github.com/GIFEE/GIFEE.

This isn't a hard research problem. It's a large integration problem that takes an investment most companies aren't prepared to make.

One primary goal would be to survive region outages. This implies things like database replication. We might have to be prescriptive about which services you can rely on, e.g. Postgres and Redis only.

GitLab already has great CI and CD, which is a foundation for this, but there's more to it than that. For example, Netflix is known for a robust microservices architecture that degrades gracefully when an individual service fails: if the recommendation engine goes down, the UI just silently drops that component.
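
To make the graceful-degradation idea concrete, here is a minimal sketch in Python. It is not how Netflix or GitLab implements this; `render_page`, `fetch_recommendations`, and the component names are hypothetical, and the only assumption is that each UI component is produced by a call that may fail.

```python
from typing import Callable, Dict, List


def render_page(components: Dict[str, Callable[[], str]]) -> List[str]:
    """Render every component that succeeds; silently skip the ones that fail."""
    rendered = []
    for name, fetch in components.items():
        try:
            rendered.append(fetch())
        except Exception:
            # A failing dependency removes one component, not the whole page.
            # A real system would also log the failure and emit a metric here.
            continue
    return rendered


def fetch_recommendations() -> str:
    # Hypothetical stand-in for a call to a recommendation service that is down.
    raise ConnectionError("recommendation engine unavailable")


page = render_page({
    "catalog": lambda: "<catalog>",
    "recommendations": fetch_recommendations,
    "continue_watching": lambda: "<continue watching>",
})
```

With the recommendation call failing, `page` contains only the catalog and continue-watching components. The point of the sketch is that the failure is isolated per component, which is the property we would want the defaults to encourage.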

Make a list of features. Also look at Spinnaker and link to its white papers:

Canary
Chaos Monkey
Environment
Incremental deployments
Feature flags
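
As an illustration of how feature flags and incremental deployments could share one mechanism, here is a rough Python sketch of a percentage rollout. It is an assumption about one possible approach, not an existing GitLab or Spinnaker API; `rollout_bucket` and `is_enabled` are hypothetical names.

```python
import hashlib


def rollout_bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: str, user_id: str, percentage: int) -> bool:
    """Return True if the user falls inside the first `percentage` buckets."""
    return rollout_bucket(flag, user_id) < percentage
```

Because the bucket depends only on the flag name and the user ID, a user who sees the feature at 10% still sees it at 50%, so raising the percentage step by step behaves like an incremental deployment, and setting it back to 0 acts as a kill switch.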

Proposal

What does success look like, and how can we measure that?

Links / references