Define an initial policy for Runway service size

added Category:Runway, Service::Runway, group::scalability, team::Scalability-Practices and workflow-infra::Triage labels

added to epic gitlab-com/gl-infra&969 (closed)

Shared codebase: if two services share the majority of their codebase, this may indicate that they should

@andrewn - Is this sentence complete? You might have been caught by a save issue here.

I think this makes sense and ties to the larger piece of work that we want to do on Discussion: Introduce an 'Infrastructure Well-A... (gitlab-com/gl-infra&1350) and Establish MPR (Minimum Paved Roads) for Scalabi... (gitlab-com/gl-infra&1272).

The doc could be hosted either in the Handbook or the Runway docs (either option should cross-reference the other)

FWIW I think we should continue to colocate all of our Runway-relevant info in the Runway docs, as that has been well received so far.

Fixed, thanks @swiskow.

changed the description

I was thinking that this could be calculated with a score:

| Criteria | Score |
| --- | --- |
| Self Managed | 10 |
| Internal Only | 0 |
| Team already has Runway services? | 10 |
| Shared codebase with existing Runway service | 0 for no shared code, 10 for completely shared |
| Interdependent | 20 |
| Has Runway dependencies | 5 |
| Coupling | 0 for no coupling to other services, 10 for tight coupling |
| gRPC vs HTTPS | 0 |

... and any score over 20 may be an indicator that an existing service should be selected.
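To make the arithmetic concrete, here is a minimal sketch of how the table could be turned into a score. The field names, the `ServiceProfile` type and the choice to model the two graded criteria (shared codebase, coupling) as fractions between 0 and 1 are assumptions for illustration only; the weights and the threshold of 20 come from the table above.

```python
from dataclasses import dataclass


@dataclass
class ServiceProfile:
    """Hypothetical description of a proposed service (names are illustrative)."""
    self_managed: bool              # True: Self Managed (10), False: Internal Only (0)
    team_has_runway_services: bool  # team already has Runway services (10)
    shared_codebase: float          # 0.0 = no shared code, 1.0 = completely shared
    interdependent: bool            # interdependent with an existing service (20)
    has_runway_dependencies: bool   # depends on other Runway services (5)
    coupling: float                 # 0.0 = no coupling, 1.0 = tight coupling


# "any score over 20 may be an indicator that an existing service should be selected"
EXISTING_SERVICE_THRESHOLD = 20


def placement_score(profile: ServiceProfile) -> float:
    """Sum the weights from the table; gRPC vs HTTPS contributes 0 and is omitted."""
    score = 0.0
    score += 10 if profile.self_managed else 0
    score += 10 if profile.team_has_runway_services else 0
    score += 10 * profile.shared_codebase
    score += 20 if profile.interdependent else 0
    score += 5 if profile.has_runway_dependencies else 0
    score += 10 * profile.coupling
    return score


def suggests_existing_service(profile: ServiceProfile) -> bool:
    """True when the score is strictly over the threshold."""
    return placement_score(profile) > EXISTING_SERVICE_THRESHOLD


example = ServiceProfile(
    self_managed=True,
    team_has_runway_services=True,
    shared_codebase=0.5,
    interdependent=False,
    has_runway_dependencies=False,
    coupling=0.0,
)
print(placement_score(example))            # 25.0
print(suggests_existing_service(example))  # True
```

The result is only an indicator, as the comment says: a score over 20 "may be" a reason to pick an existing service, not a hard rule.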

I think a score is a sensible approach that will be easy to understand. It also looks like there are a few key dimensions that I can pull out from your table:

Customer Group
- Internal
- Self Managed
- Dedicated
Runway Maturity
- New Service, new team
- Addition to existing service
- New service, established team
Technical Complexity
- Interdependency
- Coupling
- Complex non-functional requirements - e.g. no circular dependency

I think with something like that we could guide that a score over a certain number in any of those top level areas should mandate a new service and also that a score above a certain total should also mandate a new service.

I don't think we have to get into anywhere near the depth that CVSS does, but it is a fairly objective measurement and could be easy to translate into a simpler version for this. We have done this for a breaking changes calculator, so there is a precedent: calculator prototype, calculator on gl-pages.
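As a rough sketch of the two-level rule described above (a limit per top-level dimension plus a limit on the total), something like the following could work. The dimension keys mirror the three areas listed in this comment; every numeric limit is a placeholder, since no values have been proposed yet.

```python
from typing import Dict

# Placeholder limits, purely for illustration; no actual values have been agreed.
DIMENSION_LIMITS: Dict[str, int] = {
    "customer_group": 10,
    "runway_maturity": 10,
    "technical_complexity": 15,
}
TOTAL_LIMIT = 25  # placeholder


def mandates_new_service(scores: Dict[str, int]) -> bool:
    """Return True if any dimension, or the overall total, exceeds its limit."""
    unknown = set(scores) - set(DIMENSION_LIMITS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    over_dimension = any(
        scores.get(name, 0) > limit for name, limit in DIMENSION_LIMITS.items()
    )
    over_total = sum(scores.values()) > TOTAL_LIMIT
    return over_dimension or over_total
```

Checking each dimension separately means one very high area (for example technical complexity) cannot be averaged away by low scores elsewhere, while the total catches proposals that are moderately high across the board.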

I had not seen that calculator before and it's great! I like that you need to click on the various options and that the calculator adds an explanation as to the result.

Conway's Law considerations: are the services built by two teams or a single team? If two teams, having multiple services may be more reasonable.

In a 2016 talk, Martin Fowler said that this was the most important point to him: a service architecture allows relatively small cross-functional teams to work (mostly) independently.

I'd argue our guiding principle should be

"Each service is owned by exactly one team. A team can own multiple services, but not vice versa."

P.S.: Martin Fowler's Microservices article also contains many great considerations.
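The ownership principle quoted above is easy to check mechanically. A minimal sketch, assuming ownership is available as a list of (service, team) pairs; the data shape and the service and team names are hypothetical and not taken from any existing catalog:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def ownership_violations(claims: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return services claimed by more than one team, with the sorted team names.

    A team owning several services is allowed; a service with several
    owning teams is reported as a violation.
    """
    owners = defaultdict(set)
    for service, team in claims:
        owners[service].add(team)
    return {
        service: sorted(teams)
        for service, teams in owners.items()
        if len(teams) > 1
    }


# Hypothetical example data.
claims = [
    ("service-a", "team-x"),
    ("service-b", "team-x"),  # fine: one team, two services
    ("service-c", "team-x"),
    ("service-c", "team-y"),  # violation: one service, two teams
]
print(ownership_violations(claims))  # {'service-c': ['team-x', 'team-y']}
```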

"A team can own multiple services"

For this comment, I'm focusing on this piece of the statement only.

If we have multiple services deployed through Runway, I expect this will place an additional burden on Dedicated or Self Managed instances, which will need to be able to deploy these additional services. But if it's necessary, we should not prevent a team from having multiple services. Should we add something to the readiness reviews asking the team to record why multiple services were required?

"Each service is owned by exactly one team."

In the event that a service cannot be owned by one team, it needs to be on a different Paved Road. I expect that there will be some services where, for whatever reason, ownership must be shared. We've experienced that shared ownership is much harder to manage, and if it is absolutely required, those teams should be guided down a "shared ownership" road. I don't know exactly what that looks like, but it should make it clear to the owners that this path comes with very specific challenges.

@swiskow Are there any candidate services for Runway where it looks like shared ownership might be required?

@rnienaber - AI Gateway is starting to become pretty complex. Much of the direction to move the third-party model request logic into a centralised location makes a lot of sense.

However, this has also created some inertia for adding absolutely everything into AI Gateway and there are now active discussions about pulling things out of there - Separate AI Gateway and Duo Workflow projects (gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#527 - closed)

As far as I know, Duo Workflow is owned by a separate team and group, but works inside the AI Gateway service.

I think @fforster captured this well in #306 (comment 1995205624)

The guidance we should give is to split services along business functions.

Other than that example, there is nothing that comes to mind or is listed in Candidate Services for Runway Deployment (#48) that would require a shared ownership model. I think the services that really require this are the large core services that power our monolith and we have a good handle on those today. Although we do need to do more.

Requirement to support both gRPC and HTTP(S): this should not be a good reason to separate services, as Runway supports multiple deployments

It depends on the business function. Separating (HTTP) front-end and (gRPC) back-end can make sense, especially if we need to scale the front-end and the back-end independently. Separating one logical back-end into a REST service and a gRPC service (sharing 99% of code) makes no sense.

The guidance we should give is to split services along business functions.

In my personal experience, teams have a tendency to add (unrelated) functionality to existing services instead of creating a brand new service. Teams creating an unreasonable number of way too small services is not a risk I'm expecting to see.

Side note: the restriction that a service offers either an HTTP endpoint or a gRPC one is a Cloud Run limitation. The Kubernetes runtime doesn't put such a limitation on us, though it may be easier if we restrict services to one serving port initially.

cc @DylanGriffith who raised this as a potential concern in a sync discussion.

I'd also like to see consideration for avoiding a proliferation of different programming languages and architecture styles. I personally still operate under the philosophy that if something can be added to the Rails monolith then it should be.

For a long time we pretty much only ran Ruby and Go. And there were very strong reasons that certain things had to be written in Go (not Ruby). Then the AI stuff came along and it became apparent that the industry and open source tooling was so heavily built around Python that it wasn't worth us trying to go out on our own and build stuff in Ruby. I still think that might be debatable if we ever see the AI stuff settle down a bit as a lot of these AI systems are fairly simple wrappers around API calls to REST APIs and at that point in time it might appear that Rails and the monolith are better choices.

But I'd definitely be concerned about deploying new services in new programming languages just because it's the team's preference, especially where those teams may be new to GitLab and not know all of the sophisticated stuff we have built into our release processes and monitoring for the monolith.

Prior to Runway there was a natural pressure to add stuff to the monolith (and write it in Ruby) because it was just too difficult to get new services deployed to production otherwise. Runway might change that, and we need to be deliberate about this before it gets out of our control.

@DylanGriffith just thinking out loud: one way of controlling this is by enforcing the use of LabKit(-Go), LabKit-Ruby and LabKit-Python (TBD) in services, as an "in-process" component of the Platform.

In other words, all services need to ship with LabKit compiled in, providing in-process Platform integration of logs, metrics, traces, exception-logging, etc etc. This could become a mandatory component. Without a LabKit library, the service would fail to obtain certification to operate.

cc @fforster as there has been some discussion about aligning LabKit more closely with Runway.

(it is worth pointing out that LabKit should also run in non-Runway services, but perhaps LabKit should become mandatory for Runway services?)

@andrewn I think you're onto the right idea here. Basically we should be encouraging uniformity across GitLab by building re-usable components that make it much easier to deploy and manage your application if you are using our standard languages/frameworks/databases and so on. LabKit is definitely in this category and I think this could extend way further to a set of standard tools that make up a "mature" production service. This would be Prometheus metrics, Grafana dashboards, Tamland forecasting, alerting in Slack and the most important human element of "SREs are actually able to support it".

the service would fail to obtain certification to operate

I don't know how successful we will be (especially given what I've seen historically) in setting up these "required" certification processes. Product teams will always seek to get exceptions based on "important deadlines" and ultimately it won't really be something that is enforced.

That's why I'd prefer to stress what benefits you get by sticking to our conventions. The biggest thing I see where we can start to be more explicit is the benefit of having SRE be on-call for your service. I think it's very reasonable that we would say that SREs won't be on-call and won't respond to alerts for a service that is written in Haskell and running in a random VM on GCP in some random account. But I haven't seen us being explicit about what that means for product teams. We should still expect that some teams might still want to "launch" their product and there might be reasons we still want to allow teams to operate with this level of autonomy but it should be clear that those teams can only offer customers the on-call and support guarantees that those team members are willing to take on.

LabKit is definitely in this category and I think this could extend way further to a set of standard tools that make up a "mature" production service. This would be Prometheus metrics, Grafana dashboards, Tamland forecasting, alerting in Slack and the most important human element of "SREs are actually able to support it".

I think there is general agreement on this point: for instance in gitlab-com/gl-infra/scalability#2612 (comment 1986420072), the place in which Stage-Group Owned, In-Application-Defined, User-Journey SLIs will be defined is LabKit-*.

I don't know how successful we will be (especially given what I've seen historically) in setting up these "required" certification processes. Product teams will always seek to get exceptions based on "important deadlines" and ultimately it won't really be something that is enforced. That's why I'd prefer to stress what benefits you get by sticking to our conventions.

Agreed: we need to focus on the benefits and accept that, in a customer-focused company, these deadlines are important, and sometimes stage group teams need to do things ahead of the infrastructure being developed. It would be the wrong approach for us (in Infrastructure) to think that we can hold teams back until we've come up with ideal solutions. We have to be flexible, and highlight the advantages.

A good case study is gitlab-org/omnibus-gitlab#8467 (closed), where there was very strong customer demand for the AI Gateway, ahead of the infrastructure support for Self-Managed Runway services. In this case, the best approach was to assist the stage group in delivering what they needed for the customer, while also being clear that these special snowflake deployments will need to be supported by the stage-group teams (no on-call etc). Additionally, these deployments will not be supported in Dedicated.

This gives early adopters some flexibility while infrastructure catches up to support what the application development teams are doing.

I think you're onto the right idea here. Basically we should be encouraging uniformity across GitLab by building re-usable components that make it much easier to deploy and manage your application if you are using our standard languages/frameworks/databases and so on. LabKit is definitely in this category and I think this could extend way further to a set of standard tools that make up a "mature" production service. This would be Prometheus metrics, Grafana dashboards, Tamland forecasting, alerting in Slack and the most important human element of "SREs are actually able to support it".

@DylanGriffith You've described well what is written in our Product Direction - Scalability-Practices Page.

We're very much at the start of this journey, but we envision paved roads that describe exactly what you have laid out and provide a series of options for teams at GitLab to build and innovate. Even making space for totally independent services where that is what is required to meet the market demand. There's still some work to do on this but I have written some initial thoughts in Establish MPR (Minimum Paved Roads) for Scalabi... (gitlab-com/gl-infra&1272)

added team::Runway label and removed team::Scalability-Practices label

added sectionplatforms label
