We have seen an urgent need to define more application limits within GitLab, to ensure that noisy neighbors don't impact performance and availability of instances. (For example gitlab.com)
While we have stated that these limits should be configurable, we have yet to define where and how they will be configured, and how both users and operators will know when they hit them.
Proposed solution
Since there is a sense of urgency around implementation of these issues, we should have a crawl/walk/run approach.
MVC
Goal: deliver limits on GitLab.com, where they are most urgently needed
Configuration: Limits defined in a Plan, with defaults applying to only GitLab.com (Plan's do not exist on self-managed today)
Changed via rails console
Audit log event when value is changed
Documentation: Limits and their defaults defined in documentation, along with how to change
Operators: When a limit is hit, it should be logged. These can then be consumed in the customers logging solution, like Kibana.
End users:
For API's, standard response code with additional information on why it failed due to the limit.
For interactive limits, work with your UX person
Next iteration
Bring Plan to self-managed, with a Plan.default which applies globally
I don't think that gitlab.yml is easier, as you need to add this config as well to Omnibus.
There's a number of ways to avoid having to enable the interface. I would rather expect something that is part of DB.
@ayufan would this be a rails console command? That's going to be pretty hard to work with as a typical operator, comparatively. (Task runner action on the helm charts, etc.)
I proposed about the Plans. What about them? This seems like a natural place to put, and we already have some limits there. I would rather focus on resolving Plans than introducing something that is completely new.
If we would add default plan in DB, this effectively resolves all the problems, as we have single place to put limits, and can fine tune them.
I'm sorry I misunderstood what I think you were proposing. I had thought you meant this should be something devopsplan should pick up, but clearly I was wrong
Are plans something that is exposed in the UI, or just a construct in the DB? I don't think I am familiar with them.
Note - if there is a path to these Plan settings living somewhere more accessible (e.g. gitlab.yml) I think it's okay to be a console command / SQL query change for now in the interest of iteration.
One of the outputs of the Customer Advisory Board is that nearly everyone is managing their GitLab instances via terraform, and is driving towards all automating all of their runbooks. We could consider an API to set these versus a configuration file, but it them becomes less clear on what current state is without the UI in place.
With my as-yet-limited knowledge of GitLab, gitlab.yml seems like the obvious place to configure limits, but I don't have knowledge of Plans that Kamil is referencing.
For notifications when the user is a bot, I'm assuming those notifications would go to the administrator?
Maybe not first iteration, but does it make sense to have notifications when limits are being approached rather than wait until they are reached?
@joshlambert I definitely think making these limits API-configurable-and-readable makes sense to me. Being able to observe and act upon the current state of limits allows operators to hook this in to other systems and adjust them reactively (which is awesome).
Some example stories:
As a GitLab Operator, I'd like to be able to see the current state of a limit (both it's current and max value), so that I can understand the health of the system.
As a GitLab Operator, I'd like to be able to set limit programmatically from the API, so that I can dynamically adjust it during periods of high usage.
(This is probably another great place where an event stream would be valuable for operators)
From an implementation standpoint, I think setting these in gitlab.yml is a fine first step. I don't think that prevents us from also setting those in the DB in a later iteration -- we can think of it as the gitlab.yml file providing the default/fallback state, and the value in the DB being the user-configurable state that can then be modified in a (later-built) UI.
I agree with @deuley - an MVC to configure limits in gitlab.yml provides a simple mechanism to manage limits while we suss out a robust API solution for better integration with other tooling.
My concerns with Plan is that it is a GitLab.com unique concept, and the defaults we ship for GitLab.com should also apply to self-managed users as well.
From an implementation standpoint, I think setting these in gitlab.yml is a fine first step. I don't think that prevents us from also setting those in the DB in a later iteration -- we can think of it as the gitlab.yml file providing the default/fallback state, and the value in the DB being the user-configurable state that can then be modified in a (later-built) UI.
If we're going to do this configuration as code, via the config file, it doesn't seem like a good idea to also allow configuration via the API/UI.
I'm not super convinced gitlab.yml is indeed the right way to configure and store these limits. Storing in DB and adding an API to manage also feels like a good MVC. @ayufan@craig-gomes@fabiopitino WDYT?
Yes. I mention the gitlab.com Plans. We could likely migrate that to always define a default plan, and have more plans for GitLab.com. With very little effort we could backport and add Plan.default to GitLab CE, without requiring to use our GitLab.com tiers.
Adding to gitlab.yml is not really:
better: it increases maintenance burden, as it needs to be updated in multiple places and supported extra by production team,
faster: the omnibus-gitlab, helm charts, and cloud-native needs to be updated to support new flags,
future-proof: we know that this is a burden to use for production, as they will have to map all limits into chef-repo, and each change of limits will result in rolling-update of the application across all fleet,
future-proof: we want this to be easily configurable in the future and present the limits to the user and operators,
quick to respond: if we find a temporary outage, we might want to temporary put more restrictive limits, like a number of pipelines running, this cannot be done without rolling update, changing state in DB has immediate effect on all nodes.
That way, we retain an ability to fine-define limits for different groups, and store them in database as well.
I think that all limits should be dynamically adjustable, otherwise it does require application restart which is quite troublesome for infrastructure team. We also moved quite some time from using gitlab.yml for fine-grained application configuration, and instead we tend to use application_settings where is possible, as it provides much better interface for users.
And this could be very easily define a defaults and modify defaults via DB migration. We don't really need to make this exposed to users. We expect to define a sane defaults for all our features, but if needed for us it could be like this (make a change only to the free users, not a paid users):
Plan.default.update!(web_hooks_limit: 100)
Now, if we add a new column to the plan, we would by default add it with a default value. It means that limits would be automatically enforced.
It is very small and incremental change to have something working. As, we have these ways to move forward:
Add Plan.default: this allows us enable pipeline limits (linked issue), I know that everyone will be super happy about that,
Add Plan.web_hooks_limit: we cannot enforce it for free users, but we build a logic to provide some limits, we use validation to check the limit for the given namespace,
Backport Plan.default to CE, move just Plan model, and request a Plan.default by default unless running Gitlab.com?, this makes it consistent as for the behaviour.
As for UX: we already have validations today that for a lot of purposes serve very well as far for user responses: the validation messages are presented to users when in as above example: user tries to create webhook via Web or API. The validation message is presented. I don't really think that we need anything more at this point until we find where it does fail.
Anyway, I think that this covers pretty much all usage-patterns that I can think
of from the perspective of CE/EE/GitLab.com and ability to dynamically adjust limits.
Plan.default is basically faster to implement and easier to deploy and already represents a design that is future-proof.
I don't think that prevents us from also setting those in the DB in a later iteration -- we can think of it as the gitlab.yml file providing the default/fallback state, and the value in the DB being the user-configurable state that can then be modified in a (later-built) UI.
However, this means that we will have to migrate and support two sources of information, figure out which one wins and is more important, and at some point deprecate it. If there's an better way to make it extensible by default I don't see a strong reason to start with solution that we know that is deficient already.
@ayufan@joshlambert I've just started working on #32823 (closed) which essentially implements the concept of Plan.default that @ayufan mentioned before. This let's us define limits for free plans which currently we are not able to. Next step would be to propagate this to CE.
To add more context on Why is this implemented under Plan?:
In this issue #3493 (closed) (2y old now) we added initial support for CI/CD limits. Limits where implemented at Plan level which made sense for Gitlab.com to control them per plan. Next, we added another limit following the same approach gitlab-foss#51401 (closed) and now we want to enforce those limits: #32823 (closed).
There is an issue to propagate these limits to self-hosted: #32824 (closed). I'm not sure the concept of Plan.default makes sense for self-hosted but it's a very good MVC as we can propagate the limits with minimal change.
This however would need to be consolidate IMO under ApplicationSettings and invert the relationship with Plans: on Gitlab.com use Plans otherwise default to ApplicationSettings limits; on self-hosted there won't be Plans so limits can only be defined in ApplicationSettings.
I would say that we always should have Plan.default also on self-hosted. There's no really a strong reason to not create a default entry for every install. This makes things much easier. Maybe this at some point evolves into user-creatable plans, that administrators could assign to their own projects, and then it is no longer GitLab.com thing anymore. It is possible, not required, but having .default seems like a great simplification.
Thanks @ayufan and @fabiopitino for the discussion. I do think there is value in trying to create a standard way to implement these, as we will be adding quite a few in the very near future. If we all do something a bit different, it's going to be very hard to manage.
To zoom out a little bit, the requirements I think would be:
Limits and their defaults would apply to GitLab.com and self-managed
They should be relatively easy to adjust by a GitLab administrator
It should be clear what limits are in place, and their current values
It should be clear to both administrator and end users when limits are being hit
If we can accomplish this with Plan and Plan.default, great. The benefit that this is extensible is good, although I'm not sure we have a concrete need for this right now.
Two follow up questions:
Would instance-level limits be applicable with this design? Or are Plans scoped to a project/group?
What would the path forward to bringing Plan.default to self-managed look like?
Would instance-level limits be applicable with this design? Or are Plans scoped to a project/group?
Plan today is assigned to namespace. We have Plan.default that is used if no plan is assigned to namespace.
What would the path forward to bringing Plan.default to self-managed look like?
We would backport the Plan model to CE, but leave all gold/bronze/etc. to be part of EE. The CE would only have Plan.default support, nothing else.
I think it is good enough to define limits per-namespace. The current design does not allow to change some aspects of plan to have different limits if needed for namespace. One case would be to have plan and make it more contained for the given user, but likely this could be modelled on top if really needed.
I would actually expect that we would introduce at some point one extra plan: abuse (or a bulk of plans abuse_gold), that namespaces that do perform a lot of harm would be attached to, to reduce a strain on system.
Maybe, we could allow to define behaviours and exposed features as part of plans and database, this could allow more fine-grained control of plans, and allow to create multiple gold plans with different limits based on abuse patterns.
@ayufan I think at some point we would need to account for overriding limits for specific namespaces for fine tuning.
Plan.default would work as default namespace limits on self-hosted
Plan.<specific> is defined at namespace level for specific plans on Gitlab.com
Admins should be able to override limits for specific namespaces that are prone to abuses (values must not be higher than Plan.default limits)
I'm a fan of moving limits out of namespaces and plans tables into dedicated namespace_limits and plan_limits having the same schema. Then, we could have namespace.actual_limits to be namespace.limits || namespace.plan_limits
Thanks @fabiopitino and @ayufan, appreciate you helping guide me through this.
I updated the description with the most iterative approach I could think of, with an initial focus on delivering limits on GitLab.com as that is where we are seeing the availability challenges.
Does that look reasonable? Please feel free to edit as you see fit.
I agree with @ayufan comment in #34634 (comment 234524010) . The only item I would add is that if we end up doing something like that we have to have an audit log.
Right now our audit log is commit in chef-repo whenever we change a setting in gitlab.rb (gitlab.yml). Written trail of who changed a setting and at what time is absolutely necessary.
Right now our audit log is commit in chef-repo whenever we change a setting in gitlab.rb (gitlab.yml). Written trail of who changed a setting and at what time is absolutely necessary.
I've added it to the description under MVC. @ayufan can you do a quick review to ensure it lines up with your thinking as well?
Note I've still left this change-able by rails console to reduce scope and avoid having to add an API for this in the first iteration. I imagine we can add an API call which simply invokes the rails console command as a subsequent iteration.
On a broader note, it is clear that our enterprise customers expect to be able to provision and configure GitLab via code. At the most recent CAB, nearly everyone was maintaining automated runbooks for most maintenance tasks.
I hear you both on the downsides of gitlab.yml but this does reinforce the need for a CLI, terraform provider, or other method of easily changing these values outside of gitlab.yml.
I've added it to the description under MVC. @ayufan can you do a quick review to ensure it lines up with your thinking as well?
We can easily produce an audit log trail on AR object save. This way it would as well create a trail when used from rails/console.
On a broader note, it is clear that our enterprise customers expect to be able to provision and configure GitLab via code. At the most recent CAB, nearly everyone was maintaining automated runbooks for most maintenance tasks.
I still battle to understand why our customers want to change application limits that we know that application is tuned for and performing at given level. This is not a standard type of the configuration, like object storage that should be configurable for all nodes, or the application endpoint, or a gitaly shards to use.
Can you describe in more detail what is the usage pattern to modify for example a number of allowed webhooks for other instances? or any other limit that it should be exposed via gitlab.yml or allowed to be configured via Terraform?
These limits are meant to prevent misuse of the application, and define by US how well we know that application can handle given workload. I would be very reserved to allow too easy configuration of them, as easy configuring them via Terraform, as there's a big chance of introducing availability issues if configured wrongly.
I would rather expect that we might expose UX for configuring some of them from GitLab at some point rather than require to us Terraform based configuration for such things. I assume that application limits needs to be consistent across all nodes, it is impossible to make some nodes to have different limits than others, nor I see this to make sense.
@joshlambert should the notifications be broken out in a separate issue or are the two completely linked together?
Not linked together, but I'm trying to drive to a consensus on what should be done for the limits already in flight for %12.5. We realistically have ~2 weeks to decide and implement, so these can get into review. I'm concerned having the discussion split across multiple issues will make achieving and following the discussion harder.
For next steps we likely can, and should, promote this to an epic and use separate issues.
I think they may be tightly linked. If we just gave the ability to configure limits via a .yml file and a user hit the limit, someone would need to get a notification to understand what's going on. Could either be the user, or the admin, ideally both.
I think they may be tightly linked. If we just gave the ability to configure limits via a .yml file and a user hit the limit, someone would need to get a notification to understand what's going on. Could either be the user, or the admin, ideally both.
Agreed we need to implement both. Limits without some way to monitor them, as well as a reason why something isn't working from an end user perspective, is a recipe for frustration.
In the description I've proposed using logging for operators. For end users, API responses similar to what we do for rate limits today, and we should involve UX for any interactive limits that are displayed.
I commented here, because it seems that we are running in circles, and a lot of is already done that makes this much smaller of the task from UX, API, development and production access: #34634 (comment 234524010)
Thanks for noting, this issue is also part of that epic.
I commented here, because it seems that we are running in circles, and a lot of is already done that makes this much smaller of the task from UX, API, development and production access: #34634 (comment 234524010)
Thanks @ayufan for continuing to discuss. As noted above (#34634 (comment 234648330)) I do think it's important that we get this both correct, as well as standardized, as individual stages are going to be implementing quite a few limits in the near future.
If these get implemented differently in the code, if they have different interaction patterns from an operator/user standpoint, I think this will become quite unmanageable and confusing.
We are very close to merging pre-seeded DB plans for EE that can be used for that purpose. It seems that next step will be to add additional table linked to the plan that will be used for that purpose.
This would effectively then mean that we have a way to define and configure limits for on-premise EE (one for all), and GitLab.com (for given Plan).
@joshlambert will this issue become owned by a certain stage group? This issue will benefit from being turned into an epic and going through an exploratory cross-functional validation track.
@joshlambert will this issue become owned by a certain stage group? This issue will benefit from being turned into an epic and going through an exploratory cross-functional validation track.
@dimitrieh - AFAIK there isn't a good stage to own this, this is one of those foundational things we will need to work on together. There is also urgency here, as this can impact availability on GitLab.com. Some teams are already moving ahead with implementation.
As we get further along, it would probably be a good idea to start to think about a design where an administrator can view and set these, but for a first iteration I think the rails console followed by the API is a good start given the urgency.
@ayufan@fabiopitino - I just wanted to check in if we were able to make progress on documenting this. If not, I can try to write something up if you point me to where it would be best located.
@joshlambert we are currently observing how easy would be a for another engineer to add limits and document that by consequence. We are doing it with !20730 (merged) - cc @arturoherrero
@timzallmann this is a starter for the configuration limits being done. Please use this as a reference and reconcile with your current action from availability meeting.
So in the meantime, the issue around the documentation was closed #38250 (closed) as the last part was merged which as far as I have seen should be enough to get everyone aligned on this.
Based on this I would send out pings to all EM's who were coordinating actual limit implementation based on &1737 so that they revisit all the build limits to follow the now documented model and add them to the documentation.
@deuley is working on documenting the PM side of implementing limits here: gitlab-com/www-gitlab-com!37687 (merged). This includes notifying the field, communicating to potentially impacted customers, and more.
Sorry, I think I'm a bit unclear on the ask. Are you asking if I'm supportive of adding application limits for audit events traffic? Funny enough, I just asked a customer about their use case that triggered such a high usage rate.
I'm generally supportive, but I think we should implement a data stream service of some kind to stream data to customer systems, like they prefer, which should also greatly reduce this traffic in many cases.
My opinion is: yes, that's likely an auditable event likely for security teams or someone else investigating incidents or performance issues. For the typical auditor/compliance persona, I'm not sure it qualifies as an important event.
I would err on the side of caution and say: yes, we should generate an audit event for those events.
I'm going to close this - as we have largely achieved the main goals of providing a framework for application limits. The next steps we can take based on further demand/validation, such as a consolidated UI versus the state today where most limits are in their respective settings areas.