Deploy Prometheus through Kubernetes and autoconnect to cluster
What does this MR do?
- Configures the Prometheus service automatically when installed through Kubernetes cluster integration.
- Updates UX to suggest installing via cluster integration in the environment monitoring "getting-started" page.
- Adds a link from the cluster Prometheus app install section to the Prometheus integration config.
- Updates the Prometheus service page to indicate whether it is automatically or manually configured.
Are there points in the code the reviewer needs to double check?
Why was this MR needed?
Screenshots (if relevant)
| monitoring getting-started page | cluster applications page |
|---|---|

| service page (no config) | service page (automatic config) | service page (manual config) |
|---|---|---|
| ![]() | ![]() | ![]() |
Does this MR meet the acceptance criteria?
- Changelog entry added, if necessary
- Documentation created/updated
- API support added
- Tests added for this feature/bug
- Review
  - Has been reviewed by UX
  - Has been reviewed by Frontend
  - Has been reviewed by Backend
  - Has been reviewed by Database
- Conform by the merge request performance guides
- Conform by the style guides
- Squashed related commits together
- Internationalization required/considered
What are the relevant issue numbers?
Closes: #30480 (closed), #28916 (closed)
Merge request reports
Activity
added 1 commit
- b38b5ceb - Move client creation to Prometheus Application, manufacture proper rest client
added 1 commit
- 0c802f4f - Manual Configuration instead of Activation. Prometheus Service just got a bit weirder
added 1 commit
- 249c9a8c - Auto enable prometheus service if Prometheus is Installed
added 1 commit
- 80d4c067 - Add test checking if prometheus integration is enabled after prometheus is installed
added 1 commit
- e308bb0c - Cleanup implementation and add cluster finding tests
added 1 commit
- ae9c8277 - add tests for Manual configuration override and service activation synchronization
The UI still needs some polishing; the current Prometheus Service page is pretty rudimentary. @mikegreiling will you be able to work on that part?
I'm quite happy with the backend. It's now ready for review.
I think @ayufan you would probably want to take a first look, as I'm interested in what you think about the current multi-Prometheus-server support in PrometheusService.
assigned to @ayufan
added ~2391246 Category:Kubernetes Management labels
changed milestone to %10.4
Moving to 10.5; I agree this probably should not make 10.4 at this point. Let's try to get this reviewed and merged ASAP though, as this forms the foundation for the alerting issue.
Edited by Joshua Lambert
changed milestone to %10.5
@dosuken123 @grzesiek Can you help review this MR? I'm looking for feedback on the approach and code implementation.
added 1093 commits
- 1c7a61af...ad5ce890 - 1092 commits from branch master
- b0d784a4 - Merge remote-tracking branch 'upstream/master' into pawel/connect_to_prometheus_through_proxy-30480
added 1 commit
- 3727dfcd - add manual_configuration to prometheus_service factory to enable it by default
- Resolved by Joshua Lambert
- Resolved by Joshua Lambert
- Resolved by Joshua Lambert
- Resolved by Joshua Lambert
@pchojnacki I reviewed this from the point of view of multiple clusters.
There has been a lot of back-and-forth in https://gitlab.com/gitlab-org/gitlab-ce/issues/28916 and I don't fully understand the final specification of this MR, but I assumed we wanted Cluster (Prometheus) and Environment to have a 1-to-1 relationship.
However, the current implementation gives Cluster (Prometheus) and Environment a many-to-1 relationship.
This is only my concern; feel free to ignore it if this is intended.
/cc @ayufan
@dosuken123 @pchojnacki
I'm worried that `PrometheusService` starts to have multiple responsibilities, and thus there is a lot of complexity connected with that. Currently, we have quite a strange situation where:
- We have the Cluster Prometheus application, which provides a configuration.
- We have the Prometheus Service, which is both a provider and a consumer of configuration: it provides a manual configuration, or performs a cross-join for configuration (from PrometheusService -> Project -> Clusters::Cluster -> Cluster::Platforms::Kubernetes -> Cluster::Applications::Prometheus) that is environment dependent. Technically this means the Prometheus Service can have an unlimited number of different configurations. Since the Prometheus Application is not aware of the Prometheus Service and all the business logic related to metrics, I worry we might find strange edge cases in the future with support for multiple clusters, data caching, and invalidation.

This makes me slightly worried about the complexity introduced by this MR :) I wonder if we can split all the logic for getting metrics into a different entity, and make the decision about which provider we use at a higher level than the Service? That way we would be able to make `PrometheusService` very thin, and not dependent on `Cluster::Applications::Prometheus`; instead we would make this decision in another place.

assigned to @grzesiek
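To make the cross-join above concrete, here is a minimal, self-contained Ruby sketch of that lookup chain. Every class, attribute, and method name below is an illustrative stand-in, not the actual GitLab model or API.

```ruby
# Sketch only: the environment-dependent configuration cross-join
# (PrometheusService -> Project -> Clusters::Cluster ->
#  Cluster::Platforms::Kubernetes -> Cluster::Applications::Prometheus).
PrometheusApp = Struct.new(:installed, :proxy_url)
Platform      = Struct.new(:prometheus_app)
Cluster       = Struct.new(:platform_kubernetes)
Project       = Struct.new(:clusters)

class PrometheusServiceSketch
  def initialize(project, manual_api_url: nil)
    @project = project
    @manual_api_url = manual_api_url
  end

  # Manual configuration wins; otherwise follow the cross-join to the first
  # cluster whose Kubernetes platform has Prometheus installed.
  def api_url
    return @manual_api_url if @manual_api_url

    app = @project.clusters
                  .map { |c| c.platform_kubernetes && c.platform_kubernetes.prometheus_app }
                  .compact
                  .find(&:installed)
    app && app.proxy_url
  end
end

project = Project.new([Cluster.new(Platform.new(PrometheusApp.new(true, 'http://kube/api/v1/proxy/prometheus')))])
PrometheusServiceSketch.new(project).api_url # => "http://kube/api/v1/proxy/prometheus"
```

The real code in this MR is more involved (environment scoping, caching), but the shape of this dependency chain is the concern being raised.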
I can't really comment on the best engineering practices, but we do need to ensure this makes it into %10.5.
So if it means some level of tech debt but we still ship, that seems OK to me for now. We can always improve the architecture and design in future releases. If the re-engineering work is relatively straightforward, let's get it done.
@pchojnacki what are your thoughts on the level of effort needed to address @ayufan's and @dosuken123's comments?
In my testing Prometheus failed to install. @joshlambert believed this was related to gitlab-org/gitlab-ce#42496, so I'm going to try applying the changes in !16749 (merged) locally and re-testing.
Edited by Mike Greiling
added 580 commits
- 4b1d42bb...44edd111 - 579 commits from branch master
- f87afeff - Merge remote-tracking branch 'upstream/master' into pawel/connect_to_prometheus_through_proxy-30480
mentioned in merge request !16777 (merged)
mentioned in merge request !16786 (closed)
@joshlambert I've explored a bit what @ayufan's proposed code layout would look like, and while I can see some benefits, I think the given timeframe is insufficient to refactor the code, adjust the tests, and pass code review.

As for the complexity, `PrometheusService` has around 160 lines of code. Where it does add up, however, is the current duplication of finding the right cluster for a given environment: `deployment_platform` looks for the last available Kubernetes platform, while the code above looks for the first cluster with Prometheus installed.

It would also be good to clarify some edge cases here. E.g. if we have two clusters, one with `environment_scope = '*'` and the other with `environment_scope = 'dev'`, which cluster should provide metrics for the `dev` environment?

> Although, the current implementation is Cluster (Prometheus) and Environment having a many-to-1 relationship.

@dosuken123 At the time of writing this MR there was no way to directly link a single environment with a specific cluster.

The best algorithm I came up with was to match `environment_slug` against `environment_scope`, with the special case that `*` matches all slugs. I then pick the cluster with the longest `environment_scope`, so that it is the most specific match. When implementing this I had in mind the case where actual wildcards could be used in `environment_scope`, similarly to `File.fnmatch`.

It's one of the things I was interested in discussing in this MR (I could have pointed to it directly, though).

So effectively this MR maintains a 1-to-1 relation, although there are some edge cases, like single-letter environments.
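For illustration, a minimal Ruby sketch of the matching rule described above, assuming each cluster simply exposes an `environment_scope` string; this is a simplified stand-in, not the implementation in this MR:

```ruby
# Sketch only: '*' (or any fnmatch-style pattern) in environment_scope can
# match an environment slug, and the longest matching scope wins as the most
# specific one. `Cluster` here is an illustrative stand-in, not the real model.
Cluster = Struct.new(:name, :environment_scope)

def best_cluster_for(clusters, environment_slug)
  clusters
    .select { |c| File.fnmatch(c.environment_scope, environment_slug) }
    .max_by { |c| c.environment_scope.length }
end

clusters = [Cluster.new('catch-all', '*'), Cluster.new('dev cluster', 'dev')]
best_cluster_for(clusters, 'dev').name        # => "dev cluster" (most specific match)
best_cluster_for(clusters, 'production').name # => "catch-all"
```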
> I've explored a bit what @ayufan's proposed code layout would look like, and while I can see some benefits, I think the given timeframe is insufficient to refactor the code, adjust the tests, and pass code review.

Are we OK with proceeding with the existing implementation so we can still hit %10.5? This is a must-deliver for %10.6, and we already have to deliver the cluster monitoring solution in that release.

> It would also be good to clarify some edge cases here. E.g. if we have two clusters, one with `environment_scope = '*'` and the other with `environment_scope = 'dev'`, which cluster should provide metrics for the `dev` environment?

@ayufan @bikebilly have we identified the logic for this? Or is it currently a recommendation of "don't do that"?
added 1 commit
- 89eb6222 - enable manual configuration property for all test prometheus services
> I wonder if we can split all the logic for getting metrics into a different entity, and make the decision about which provider we use at a higher level than the Service?

I agree with this. I wonder if we can only use `project.deployment_service(env)` for now, instead of greedily looking for an available Prometheus instance. This cross-reference for Prometheus should be maintained under the group-level cluster or something similar. That way the complexity is reduced and it is always searched/connected from top to bottom.

> It would also be good to clarify some edge cases here. E.g. if we have two clusters, one with `environment_scope = '*'` and the other with `environment_scope = 'dev'`, which cluster should provide metrics for the `dev` environment?
>
> @ayufan @bikebilly have we identified the logic for this? Or is it currently a recommendation of "don't do that"?
@joshlambert @pchojnacki I think the problem is that we cannot predict or force the user to install Prometheus in the "proper" cluster. The simplest idea is to consider the best-matching cluster (based on environment scope) that also has Prometheus available (sketched in code after this comment):
- environment is production
- we have a cluster with env scope production
  - if this has Prometheus available, use it
  - if not, continue
- we have a cluster with env scope *
  - if this has Prometheus available, use it
  - if not, continue
- if we still don't have a matching cluster, give up
This has some advantages:
- this is probably what a user may expect to have
- you can leverage a specific cluster for a specific env
- you can leverage a "catchall" cluster without installing one specifically for each environment
- you can choose to provide metrics only for some environments, not for others (edge case)
Disadvantages:
- if you have a cluster with a Prometheus server and a specific scope, you cannot use it as a "catchall" (but it seems quite weird to me that a specifically scoped cluster would also be used by other environments for metrics, since we define scope as a cluster property not strictly related to deployments only)
We can strongly suggest installing Prometheus in the * cluster, since that is probably the standard case, and it will ensure coverage for every environment that doesn't have a specific one.
What do you think? I don't have visibility on the engineering implementation, so this is subject to effort/complexity evaluation on that side. I totally agree we should have it done by %10.5 if possible and then iterate.
cc @ayufan
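A minimal Ruby sketch of the fallback order proposed above (most specific scope first, then the `*` catch-all, using the first cluster that actually has Prometheus available); the names here are assumptions for illustration only, not GitLab's implementation:

```ruby
# Sketch only: walk the matching clusters from the most specific scope to the
# '*' catch-all and pick the first with Prometheus available; return nil
# ("give up") if none has it. `Cluster` is an illustrative stand-in.
Cluster = Struct.new(:environment_scope, :prometheus_available)

def metrics_cluster_for(clusters, environment_name)
  clusters
    .select { |c| [environment_name, '*'].include?(c.environment_scope) }
    .sort_by { |c| c.environment_scope == '*' ? 1 : 0 } # specific scope before catch-all
    .find(&:prometheus_available)
end

clusters = [Cluster.new('*', true), Cluster.new('production', false)]
metrics_cluster_for(clusters, 'production') # => the '*' cluster (fallback)
metrics_cluster_for([], 'production')       # => nil (give up)
```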
@bikebilly what behavior do the existing features use to handle this situation currently? For example, if I have a cluster with `*` and one with `production`, which cluster is used for the CI environment variables for jobs?

Looking at the documentation, it's not really spelled out what the options are here. For example, can I do something like `production, staging`? Is something like `review/*` an option?

In isolation I would recommend (sketched in code below):
- Check to see which clusters match the environment in question.
- In the event there are multiple, select the cluster with the most specific environment definition (the logic could be: no wildcard, then longest wildcard).
- For the selected cluster, check to see if Prometheus has been installed.
  - If not, show the empty state.
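And a minimal Ruby sketch of this alternative, where the single most specific matching cluster is selected first and Prometheus availability is only checked afterwards; again, the names are illustrative stand-ins rather than the actual GitLab code:

```ruby
# Sketch only: pick the most specific matching cluster (no wildcard first,
# then the longest wildcard), and only then check whether Prometheus is
# installed on it; otherwise the caller shows the empty state.
Cluster = Struct.new(:environment_scope, :prometheus_installed)

def selected_cluster(clusters, environment_slug)
  clusters
    .select { |c| File.fnmatch(c.environment_scope, environment_slug) }
    .min_by { |c| [c.environment_scope.include?('*') ? 1 : 0, -c.environment_scope.length] }
end

clusters = [Cluster.new('review/*', true), Cluster.new('*', true)]
cluster  = selected_cluster(clusters, 'review/feature-x')  # => the 'review/*' cluster
cluster && cluster.prometheus_installed ? cluster : nil    # nil would mean "show the empty state"
```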
added 168 commits
- e25f383d...560c93e6 - 167 commits from branch master
- 3a565d5d - Merge branch 'master' into pawel/connect_to_prometheus_through_proxy-30480
added 1 commit
- d9c8c995 - add link to cluster configuration from prometheus "getting started" state
added 1 commit
- 0b47134e - change prometheus service description text to match design
I believe I've completed the frontend work for this feature and it is now ready for review.
@pedroms would you like to take a look at it?
Here are some screenshots:
- monitoring getting-started page
- cluster applications page
- service page (no config)
- service page (automatic config)
- service page (manual config)

assigned to @pedroms
marked the checklist item Changelog entry added, if necessary as completed
added 1 commit
- bb79e3cb - fix broken karma specs due to removed function
added 1 commit
- 9cdec947 - fix rubocop errors and ensure prometheus_installed? doesn't fail in CI tests where project is nil
By the way I am fine with the logic noted by @pchojnacki here: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/16182#note_56661324
We can adjust as needed in the future, but should proceed with this merge. For nearly all use cases it will work as intended until we get a clearer picture on how it is implemented elsewhere.
@sarrahvesselov, it looks like @pedroms may be at FOSDEM. This is a critical MR to ship in 10.5; is there another UX person who could take the review? I imagine unless there are showstoppers, we'd create follow-up issues at this point in time.
- Resolved by Jose Ivan Vargas
- Resolved by Mike Greiling
- Resolved by Mike Greiling
- Resolved by Mike Greiling
- Resolved by Mike Greiling
While there are no JavaScript tests (and two scenarios were removed from the current ones), I don't think the changes are significant enough to warrant tests at the moment.

@mikegreiling Just left you some comments regarding i18n and some `_.escape` things; other than that LGTM. Should be good to go to a maintainer once those comments are taken care of.
Edited by Jose Ivan Vargas

@mikegreiling @joshlambert it looks great from the screenshots. Setting this up locally is a pretty lengthy process, so I'm trusting @mikegreiling's screenshots (thanks btw! 🙇). Marking UX reviewed…

assigned to @mikegreiling
Thanks @pedroms!
@ayufan @dosuken123, since this is a critical item to deliver in %10.5 I'd like to suggest that we create a follow-up issue for the backend refactoring items and allow this to proceed as-is.
Does that sound like a workable plan?
cc @bikebilly
Edited by Joshua Lambert
added 1 commit
- a1d10bee - use find_or_initialize to fetch prometheus_service instance
added 1 commit
- e71a27f0 - fix tests for find or initialize service 'prometheus'
added 351 commits
- e71a27f0...e776096e - 350 commits from branch master
- 47f2754a - Merge branch 'master' into pawel/connect_to_prometheus_through_proxy-30480
added 1 commit
- eac8ad6a - add i8n to the Prometheus integration settings page
I've resolved the conflicts and added i18n to the integration page based on @jivanvl's feedback. I think this is ready for a final review. @pchojnacki what do you think?
mentioned in issue #42819 (moved)
LGTM on the i18n side of things; will leave the `_.escape` comment for another issue here:

@rspeicher Can you do a final backend review here?
assigned to @rspeicher
mentioned in issue #42820 (closed)
I created https://gitlab.com/gitlab-org/gitlab-ce/issues/42820 to address the refactoring comments, and set to %10.6.
- Resolved by Robert Speicher
- Resolved by Robert Speicher
- Resolved by Robert Speicher
- Resolved by Robert Speicher
- Resolved by Robert Speicher
- Resolved by Robert Speicher
- Resolved by Robert Speicher
- Resolved by Robert Speicher
- Resolved by Robert Speicher
- Resolved by Robert Speicher
- Resolved by Robert Speicher
- Resolved by Robert Speicher
Thanks @rspeicher, I've addressed most comments. The backend looks a little nicer now, and it will now catch cases where the server unexpectedly returns non-JSON responses.
assigned to @rspeicher
Since @rspeicher should be available quite soon, I'm not reassigning this to a local reviewer.
mentioned in issue #42850 (closed)
mentioned in issue #42852 (closed)
mentioned in commit 1b748440
mentioned in merge request !17336 (merged)
mentioned in merge request !17377 (merged)
added devopsmonitor label