Skip to content
Snippets Groups Projects

Deploy prometheus through kubernetes and autoconnect to cluster

Merged Paweł Chojnacki requested to merge pawel/connect_to_prometheus_through_proxy-30480 into master
All threads resolved!

What does this MR do?

  • Configures the Prometheus service automatically when installed through Kubernetes cluster integration.
  • Updates UX to suggest installing via cluster integration in the environment monitoring "getting-started" page.
  • Adds link from cluster prometheus app install section to the prometheus integration config.
  • Updates prometheus service page to indicate whether it is automatically or manually configured

Are there points in the code the reviewer needs to double check?

Why was this MR needed?

Screenshots (if relevant)

monitoring getting-started page cluster applications page

| monitoring-getting-started | cluster-app-manage-link |

service page (no config) service page (automatic config) service page (manual config)
service-no-config service-auto-config service-manual-config

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?

Closes: #30480 (closed), #28916 (closed)

Edited by Robert Speicher

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Shinya Maeda
  • Shinya Maeda
  • @pchojnacki I reviewed this from the point of view of multiple clusters.

    There are many back-and-forth in https://gitlab.com/gitlab-org/gitlab-ce/issues/28916 and I don't fully understand what the final specification of this MR, though, I assumed we wanted to keep Cluster(Prometehus) and Environment having 1-to-1 relationship.

    Untitled__7_

    Although, the current implementation is Cluster(Prometehus) and Environment having many-to-1 relationship.

    Untitled_1_

    This is only my concern. Feel free to ignore if this is intended.

    /cc @ayufan

  • @dosuken123 @pchojnacki

    I'm worried that PrometheusService starts to have multiple responsibilities and thus there's a big complexity connected with that. Currently, we have quite a strange situation where:

    1. We have Cluster Prometheus application which provides a configuration,
    2. We have Prometheus Service which is provider and consumer of configuration, it provides a manual configuration or performs a cross-join for configuration (from PrometheusService -> Project -> Clusters::Cluster -> Cluster::Platforms::Kubernetes -> Cluster::Applications::Prometheus) that is environment dependent. Technically it makes Prometheus Service have an unlimited amount of different configurations. This makes me worried that since Prometheus Application is not aware of Prometheus Service and all the business logic related to metrics we might find in the future some strange edge cases with support for multiple clusters, and data caching and invalidation.

    This makes me slightly worried about the complexity introduced by this MR :) I wonder if we can split all the logic for getting metrics into a different entity, and make a decision what provider we use at a higher level than Service? That way we would be able to make PrometheusService be very thin, and not dependent on Cluster::Applications::Prometheus, instead make this decision in another place.

  • Who should I assign next?

  • I can't really comment on the best engineering practices, but we do need to ensure this makes it into %10.5.

    So if it means some level of tech debt but we still ship, that seems OK to me for now. We can always improve the architecture and design in future releases. If the re-engineering work is relatively straight forward, let's get it done.

    @pchojnacki what are your thoughts on effort level to address @ayufan and @dosuken123's comments?

  • In my testing Prometheus failed to install. @joshlambert believed this was related to gitlab-org/gitlab-ce#42496 so I'm going to try applying the changes in !16749 (merged) locally and re-testing.

    Edited by Mike Greiling
  • Paweł Chojnacki added 580 commits

    added 580 commits

    • 4b1d42bb...44edd111 - 579 commits from branch master
    • f87afeff - Merge remote-tracking branch 'upstream/master' into pawel/connect_to_prometheus_through_proxy-30480

    Compare with previous version

  • Mike Greiling mentioned in merge request !16777 (merged)

    mentioned in merge request !16777 (merged)

  • @pchojnacki it looks like we have some build failures to sort out

  • Mike Greiling added 2 commits

    added 2 commits

    • 25e13458 - Fix values.yaml for Prometheus
    • 81384a6d - Merge branch '42496-prometheus-deployment-fails-after-helm-chart-upgrade-to-2-1'…

    Compare with previous version

  • Paweł Chojnacki mentioned in merge request !16786 (closed)

    mentioned in merge request !16786 (closed)

  • @joshlambert I've explored a bit how @ayufan proposed code layout would look like and while I can see some benefits, I think the timeframe given is insufficient to refactor the code and adjust the tests as well as pass the code review.

    As for the complexity PrometheusService has around 160 lines of code. Where it does, however, add up its the current duplication of finding the right cluster for given environment. Where 'deployment_platform' looks for last available Kubernetes platform, the above code looks for first cluster with installed Prometheus.

    Would be good to clarify some edge cases here as well. I.e. what if we have two clusters one with environment_scope = '*' and the other with environment_scope = 'dev' which cluster should provide metrics for dev environment?

  • Although, the current implementation is Cluster(Prometehus) and Environment having many-to-1 relationship.

    @dosuken123 At the time of writing this MR there was no way to directly link single environment with specific cluster.

    The best algo I came up with was to match environment_slug with environment_scope with special case where * matches all slugs. Then I pick the cluster with longest environment_scope so that its the most specific match. When implementing that I had in mind the case where actual wildcards could be used in environment_scope similarily to File.fnmatch

    Its one of the things I was interested in discussing in this MR (I could have pointed to that directly though).

    So effectively this MR maintains 1-1 relation however there are some edge cases - like single letter environments.

  • I've explored a bit how @ayufan proposed code layout would look like and while I can see some benefits, I think the timeframe given is insufficient to refactor the code and adjust the tests as well as pass the code review.

    Are we OK with proceeding with the existing implementation so we can still hit %10.5? This is a must deliver for %10.6, and we already have to deliver the cluster monitoring solution in that release.

    Would be good to clarify some edge cases here as well. I.e. what if we have two clusters one with environment_scope = '*' and the other with environment_scope = 'dev' which cluster should provide metrics for dev environment?

    @ayufan @bikebilly have we identified the logic for this? Or is it currently a recommendation of "Don't do that".

  • added 1 commit

    Compare with previous version

  • added 1 commit

    • 89eb6222 - enable manual configuration property for all test prometheus services

    Compare with previous version

  • added 1 commit

    Compare with previous version

  • I wonder if we can split all the logic for getting metrics into a different entity, and make a decision what provider we use at a higher level than Service?

    I agree with this. I wonder if we can only use project.deployment_service(env) for now, instead of looking for an available prometheus instance greedily. This cross reference for prometheus should be maintained under the group-level cluster or something similer. This way reduces the complexity and it's always searched/connected from top to bottom.

  • I'm going to assign this back to the author, because there are unresolved discussions. @dosuken123 can you review this merge request from the CI/CD perspective once your comments are addressed? Thanks!

  • Grzegorz Bizon assigned to @pchojnacki

    assigned to @pchojnacki

  • Mike Greiling added 2 commits

    added 2 commits

    • d2688b28 - add manage button to application rows in cluster integration
    • 40d93ce6 - add button to configure prometheus integration from cluster page

    Compare with previous version

  • Mike Greiling added 1 commit

    added 1 commit

    • 39d5adde - internationalize the "Manage" button label

    Compare with previous version

  • Would be good to clarify some edge cases here as well. I.e. what if we have two clusters one with environment_scope = '*' and the other with environment_scope = 'dev' which cluster should provide metrics for dev environment?

    @ayufan @bikebilly have we identified the logic for this? Or is it currently a recommendation of "Don't do that".

    @joshlambert @pchojnacki I think the problem is that we cannot predict or force the user to install prometheus in the "proper" cluster. The simplest idea is to consider the best-matching cluster (based on environment scope) that also has Prometheus available:

    • environment is production
    • we have a cluster with env scope production
      • if this has Prometheus available, use it
      • if not, continue
    • we have a cluster with env scope *
      • if this has Prometheus available, use it
      • if not, continue
    • if we still don't have a matching cluster, give up

    This has some advantages:

    • this is probably what a user may expect to have
    • you can leverage a specific cluster for a specific env
    • you can leverage a "catchall" cluster without installing one specifically for each environment
    • you can choose to provide metrics only for some environment, not for others (edge case)

    Disadvantages:

    • if you have a cluster with a Prometheus server and with a specific scope, you cannot use it as a "catchall" (but it looks quite weird to me that a cluster specifically with a scope will be used also by other environments for metrics, since we are defining scope as a cluster property not strictly related to deployments only)

    We can heavily suggest to install Prometheus in the * cluster since it is probably the standard case, and will ensure covering every environment that doesn't have a specific one.

    What do you think? I don't have visibility on the engineering implementation, so this is subject to effort/complexity evaluation on that side. I totally agree we should have it done by %10.5 if possible and then iterate.

    cc @ayufan

  • @bikebilly what behavior does the existing features utilize to handle this situation currently? For example if I have a cluster with * and production, what cluster is used in the CI environment variables for jobs?

    In looking at the documentation, it's not really spelled out what the options are here. For example, can I do something like production, staging? Is something like review/* an option?

    In isolation I would recommend:

    1. Check to see which clusters match the environment in question
    2. In the event there are multiple, select the cluster with the most specific environment definition (Logic could be: No wildcard, then longest wildcard)
    3. For the selected Cluster, check to see if Prometheus has been installed
    4. If not, show the empty state.
  • Mike Greiling added 168 commits

    added 168 commits

    • e25f383d...560c93e6 - 167 commits from branch master
    • 3a565d5d - Merge branch 'master' into pawel/connect_to_prometheus_through_proxy-30480

    Compare with previous version

  • Mike Greiling added 1 commit

    added 1 commit

    • d9c8c995 - add link to cluster configuration from prometheus "getting started" state

    Compare with previous version

  • Mike Greiling added 1 commit

    added 1 commit

    • 0b47134e - change prometheus service description text to match design

    Compare with previous version

  • Mike Greiling added 2 commits

    added 2 commits

    • 310ae687 - hide prometheus manual integration form when auto config is active
    • 06e7c157 - style the prometheus integration service page according to design

    Compare with previous version

  • I believe I've completed the frontend work for this feature and it is now ready for review.

    @pedroms would you like to take a look at it?

    Here are some screen shots:

    monitoring getting-started page cluster applications page

    | monitoring-getting-started | cluster-app-manage-link |

    service page (no config) service page (automatic config) service page (manual config)
    service-no-config service-auto-config service-manual-config
  • assigned to @pedroms

  • Mike Greiling marked the checklist item Changelog entry added, if necessary as completed

    marked the checklist item Changelog entry added, if necessary as completed

  • Mike Greiling added 1 commit

    added 1 commit

    • bb79e3cb - fix broken karma specs due to removed function

    Compare with previous version

  • Mike Greiling added 1 commit

    added 1 commit

    • 9cdec947 - fix rubocop errors and ensure prometheus_installed? doesn't fail in CI tests where project is nil

    Compare with previous version

  • Mike Greiling unmarked as a Work In Progress

    unmarked as a Work In Progress

  • By the way I am fine with the logic noted by @pchojnacki here: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/16182#note_56661324

    We can adjust as needed in the future, but should proceed with this merge. For nearly all use cases it will work as intended until we get a clearer picture on how it is implemented elsewhere.

  • Ping @pedroms for UX review.

    @dosuken123 and @pchojnacki can you resolve these open diff discussions? We have two days to get this merged. Who else needs to sign off on the backend ?

    @jivanvl can you review the frontend changes?

  • @sarrahvesselov, it looks like @pedroms may be at FOSDEM. This is a critical MR to ship in 10.5, is there another UX person who could take a review? I imagine unless there are showstoppers, we'd make follow up issues at this point in time.

  • Jose Ivan Vargas
  • While there's no javascript tests (and the current ones got two scenarios removed from them). I don't think the changes are that significant to warrant the use of tests at the moment.

    cc @mikegreiling

  • Just tested this locally and it works really well 😄

  • @mikegreiling Just left you some comments regarding i18n and some _.escape things, other than that LGTM, should be good to go to a maintainer once those comments are taken care of

    Edited by Jose Ivan Vargas
  • Pedro Moreira da Silva marked the checklist item Has been reviewed by UX as completed

    marked the checklist item Has been reviewed by UX as completed

  • @mikegreiling @joshlambert it looks great from the screenshots. Setting this up locally is a pretty lengthy process, so I'm trusting @mikegreiling's screenshots (thanks btw! 🙇). Marking UX reviewed…

  • Thank you @pedroms and @jivanvl for reviewing. I'm going to work on resolving @jivanvl's notes, but in the mean time I'm going to assign to @pchojnacki to continue resolving whatever is needed to get this approved for the backend changes.

  • Mike Greiling assigned to @pchojnacki

    assigned to @pchojnacki

  • @ayufan @dosuken123, since this is a critical item to deliver in %10.5 I'd like to suggest that we create a follow up issue for the backend refactoring items and allow this to proceed as-is.

    Does that sound like a workable plan?

    cc @bikebilly

    Edited by Joshua Lambert
  • added 1 commit

    Compare with previous version

  • added 1 commit

    • a1d10bee - use find_or_initialize to fetch prometheus_service instance

    Compare with previous version

  • added 1 commit

    • e71a27f0 - fix tests for find or initialize service 'prometheus'

    Compare with previous version

  • @pchojnacki have you looked into the EE conflicts? it looks like the compat-check job is failing now

  • Actually it looks like this branch itself has conflicts with master... the EE conflicts noted in CI are probably the same conflicts. I'll try merging master and see how it goes.

  • Mike Greiling added 351 commits

    added 351 commits

    • e71a27f0...e776096e - 350 commits from branch master
    • 47f2754a - Merge branch 'master' into pawel/connect_to_prometheus_through_proxy-30480

    Compare with previous version

  • Mike Greiling added 1 commit

    added 1 commit

    • eac8ad6a - add i8n to the Prometheus integration settings page

    Compare with previous version

  • I've resolved the conflicts and I've added I8n to the integration page based on @jivanvl's feedback. I think this is ready for a final review. @pchojnacki what do you think?

  • mentioned in issue #42819 (moved)

  • LGTM on the i18n side of things, will leave the _.escape comment for another issue here:

    https://gitlab.com/gitlab-org/gitlab-ce/issues/42819

  • @rspeicher Can you do a final backend review here?

  • assigned to @rspeicher

  • Joshua Lambert resolved all discussions

    resolved all discussions

  • mentioned in issue #42820 (closed)

  • I created https://gitlab.com/gitlab-org/gitlab-ce/issues/42820 to address the refactoring comments, and set to %10.6.

  • The EE conflicts ended up being related to gitlab-org/gitlab-ce!16937 which existed in CE-master, but not EE-master, so I created an EE branch which cherry-picks these changes in addition to our own changes. It should pass now.

  • Frontend Code LGTM!

  • Tim Zallmann marked the checklist item Has been reviewed by Frontend as completed

    marked the checklist item Has been reviewed by Frontend as completed

  • Robert Speicher
  • Robert Speicher
  • Robert Speicher
  • Robert Speicher
  • @pchojnacki Couple questions. Feel free to bounce to a maintainer in your timezone if changes are necessitated and we're under a time crunch.

  • Robert Speicher assigned to @pchojnacki

    assigned to @pchojnacki

  • Paweł Chojnacki added 2 commits

    added 2 commits

    • 277f7fef - Make prometheus service querying approach much nicer wrt to arity and default function params
    • 8cb105cf - Fix order of checks in editable? method.

    Compare with previous version

  • added 1 commit

    Compare with previous version

  • added 1 commit

    • 0e90284c - Catch json parsing error as PrometheusError

    Compare with previous version

  • Thanks @rspeicher, I've addressed most comments. Backend looks a little bit nicer now, and it now will catch cases where server unexpectedly returns non-json responses.

  • Since @rspeicher should be available quite soon, I'm not reassigning this to local reviewer.

  • Robert Speicher resolved all discussions

    resolved all discussions

  • Robert Speicher approved this merge request

    approved this merge request

  • Robert Speicher marked the checklist item Has been reviewed by Backend as completed

    marked the checklist item Has been reviewed by Backend as completed

  • yay 🎉

  • mentioned in issue #42850 (closed)

  • mentioned in issue #42852 (closed)

  • Robert Speicher mentioned in commit 1b748440

    mentioned in commit 1b748440

  • Paweł Chojnacki mentioned in merge request !17336 (merged)

    mentioned in merge request !17336 (merged)

  • Mayra Cabrera mentioned in merge request !17377 (merged)

    mentioned in merge request !17377 (merged)

  • Please register or sign in to reply
    Loading