gitlab-ce~992792 would be to also provide a series for requested CPU/Memory.
Pod limit (currently 100 per Node)
100% utilization in this case would be all Cores/RAM available within the cluster (the sum of the CPU/RAM of every Node).
If we can achieve this, we will be delivering more value than what can be obtained easily via the k8s console.
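As a minimal sketch of the "100% = everything available in the cluster" definition above: sum capacity and usage across nodes, then divide. The `NodeSample` type, its field names, and the sample values are hypothetical and only serve the illustration; they are not taken from any GitLab or Kubernetes API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeSample:
    """Hypothetical per-node snapshot; field names are illustrative only."""
    name: str
    cpu_cores: float          # cores available on the node
    cpu_used_cores: float     # cores currently consumed
    memory_bytes: int         # RAM available on the node
    memory_used_bytes: int    # RAM currently consumed


def cluster_utilization(nodes: List[NodeSample]) -> Dict[str, float]:
    """Return cluster-wide CPU and memory utilization as percentages.

    100% means every core / byte across all nodes is in use.
    """
    total_cpu = sum(n.cpu_cores for n in nodes)
    total_mem = sum(n.memory_bytes for n in nodes)
    if total_cpu == 0 or total_mem == 0:
        raise ValueError("cluster reports no capacity")
    return {
        "cpu_percent": 100.0 * sum(n.cpu_used_cores for n in nodes) / total_cpu,
        "memory_percent": 100.0 * sum(n.memory_used_bytes for n in nodes) / total_mem,
    }


nodes = [
    NodeSample("node-a", 4.0, 1.0, 16 * 2**30, 4 * 2**30),
    NodeSample("node-b", 4.0, 3.0, 16 * 2**30, 12 * 2**30),
]
print(cluster_utilization(nodes))  # {'cpu_percent': 50.0, 'memory_percent': 50.0}
```

Note that a lightly loaded node and a heavily loaded node average out here, which is the intended behaviour for a cluster-level overview.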
Proposal
We can gather the % used today, but to gather the amount requested we will need to deploy kube-state-metrics.
The general actions:
Deploy kube-state-metrics along with Prometheus for managed clusters. If a cluster already has Prometheus deployed, we should deploy kube-state-metrics there as well. We will need to think about the scaling needs.
Add the charts to the cluster management page. Will need UX input on this.
@sarrahvesselov can we get someone assigned from gitlab-ce~2024184 to work with us on this? This is a must-deliver for %10.6, and we'd ideally get some updated mockups to work from based on gitlab-ce#27890. The main updates:
Can you take a look at this asap @dimitrieh? It is a high-priority issue that needs UX eyes fast so that FE and BE can take action. Please let me know if you don't have the capacity. Thanks!
Dug into this a little more today, here are some findings:
kube-state-metrics provides a wealth of data, but most relevant for this feature are the Node metrics and Pod metrics. These provide the available memory and cores for each node, as well as the requested resources.
It is relatively easy to determine total CPU usage and Memory usage, with cAdvisor data.
From there, we should be able to leverage:
The Prometheus deployment and k8s proxy integration we are building in 10.5. (Note: we will need to enable kube-state-metrics in the Helm chart.)
The Prometheus adapter on the backend to generate Prometheus queries and hand the results off to the FE.
The existing d3 based charts we have built for our APM service.
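To make the backend-to-FE hand-off above a bit more concrete, here is a rough sketch that builds a range query against the Prometheus HTTP API (`/api/v1/query_range`) and reshapes a matrix response into chart-friendly points. The function names, the base URL, and the output shape are assumptions for illustration; this is not the existing GitLab Prometheus adapter, and the sample query assumes cAdvisor's root cgroup series (`id="/"`) is available.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode


def build_query_range_url(base_url: str, query: str,
                          start: int, end: int, step_seconds: int) -> str:
    """Build a Prometheus range-query URL (start/end are Unix timestamps)."""
    params = urlencode({"query": query, "start": start,
                        "end": end, "step": step_seconds})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def to_chart_series(response_body: str) -> List[List[Dict[str, float]]]:
    """Convert a Prometheus matrix response into one list of points per series."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    series_list = []
    for series in payload["data"]["result"]:
        # Each sample is [unix_timestamp, "value-as-string"].
        points = [{"time": float(ts), "value": float(val)}
                  for ts, val in series["values"]]
        series_list.append(points)
    return series_list


# Hypothetical Prometheus address; in practice it would be reached via the k8s proxy.
url = build_query_range_url(
    "http://prometheus.example:9090",
    'sum(container_memory_usage_bytes{id="/"})',
    1520000000, 1520003600, 60,
)

# Hand-written sample response in the documented matrix format.
sample = ('{"status":"success","data":{"resultType":"matrix","result":'
          '[{"metric":{},"values":[[1520000000,"1024"],[1520000060,"2048"]]}]}}')
print(url)
print(to_chart_series(sample))
```

A cluster-wide `sum(...)` query returns a single series, so the FE would typically read only the first list.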
@joshlambert If it's less effort, then it's certainly a good consideration! My initial assumption was that whatever the k8s dashboard was using would be easier, but it sounds like that might not be true and there are enough pieces to deliver the same or better data. And alerting... that seems like a great value-add to justify it right there. We'd have to feed the k8s API data somewhere to constantly monitor and that sounds a lot like Prometheus...
Obvious downside is that the user needs to install Prometheus. In the 10.6 timeframe, we'll at least make that easy, but it's not guaranteed they'll have it installed. Of course we can show a nice call to action to encourage it if they want to see node data.
This should be reasonable to implement with kube-state-metrics and container metrics from the cluster.
Since this is going to be an overview, and clusters could be very large, we will want to implement recording rules to summarize the cluster-wide metrics.
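A possible shape for those recording rules, sketched as a small script that emits a Prometheus 2.x rule group. The rule names (`cluster:...`) and the group name are hypothetical, and the metric names are the ones I believe cAdvisor and the kube-state-metrics releases of this timeframe expose; they may differ by version, so treat them as assumptions. The output is JSON, which YAML parsers generally accept, to avoid a third-party YAML dependency.

```python
import json
from typing import Dict, List, Tuple

# (record name, PromQL expression). Record names are hypothetical; metric
# names are assumed and should be checked against the deployed exporters.
CLUSTER_RULES: List[Tuple[str, str]] = [
    ("cluster:cpu_usage_cores:sum",
     'sum(rate(container_cpu_usage_seconds_total{id="/"}[5m]))'),
    ("cluster:memory_usage_bytes:sum",
     'sum(container_memory_usage_bytes{id="/"})'),
    ("cluster:cpu_requested_cores:sum",
     "sum(kube_pod_container_resource_requests_cpu_cores)"),
    ("cluster:cpu_allocatable_cores:sum",
     "sum(kube_node_status_allocatable_cpu_cores)"),
]


def build_rule_group(name: str, rules: List[Tuple[str, str]]) -> Dict:
    """Return a rule-file structure: groups -> rules -> record/expr."""
    return {
        "groups": [{
            "name": name,
            "rules": [{"record": record, "expr": expr}
                      for record, expr in rules],
        }]
    }


rule_file = build_rule_group("cluster_overview", CLUSTER_RULES)
print(json.dumps(rule_file, indent=2))
```

With rules like these in place, the overview charts would query the pre-aggregated `cluster:*` series instead of summing every container on each page load.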
We are talking purely about showing complete cluster health, correct? Not application-specific?
@dimitrieh that is correct. This will show CPU/Memory consumed, available, and optionally requested for the cluster. There could be any number of applications running on it.
Two additional asks if you can:
Can you mock up an empty state, in the event Prometheus hasn't been deployed to the cluster? This would serve to tell people why metrics aren't showing yet, and direct them how to install it. (Should be on the same page, just further down, maybe an anchor link?)
Can you also do a mockup of the chart with three series? This is a stretch goal, but would be great to get a mockup ready in the event we can deliver. I think the main action items would be simply another line on the chart, and then a simple legend to indicate what these values mean.
CPU/Memory consumed
CPU/Memory requested (this is how much of the resource pods have asked for, which is distinct from what they use)
CPU/Memory available (this would just be the link, as you note)
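For the three-series variant, a rough sketch of the data the chart would need: each series expressed as a percentage of what is available, plus a label for the legend. The function name, keys, and labels are hypothetical and do not reflect the existing charting component's input format.

```python
from typing import Dict, List


def build_three_series(consumed: List[float], requested: List[float],
                       available: List[float]) -> List[Dict]:
    """Express consumed/requested/available as percentages of available."""
    if not (len(consumed) == len(requested) == len(available)):
        raise ValueError("all series must have the same number of samples")

    def percent(values: List[float]) -> List[float]:
        # Guard against a sample where no capacity was reported.
        return [100.0 * v / a if a else 0.0 for v, a in zip(values, available)]

    return [
        {"key": "consumed", "label": "Consumed", "values": percent(consumed)},
        {"key": "requested", "label": "Requested", "values": percent(requested)},
        {"key": "available", "label": "Available", "values": percent(available)},
    ]


# Two samples of CPU cores for an 8-core cluster (made-up numbers).
series = build_three_series([2.0, 3.0], [4.0, 4.0], [8.0, 8.0])
for s in series:
    print(s["label"], s["values"])
```

The "available" series is constant at 100% by construction, so it only needs to be drawn as a reference.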
@pchojnacki can you take a look at this issue and provide feedback?
In particular, to unblock gitlab-ce~3412464, is there any gitlab-ce~24926493 work required first?
I think we can manually create a cluster using the Prometheus Helm Chart with kube-state-metrics enabled, so that should not be a problem. We'd just connect it for testing.
What about any gitlab-ce~24926493 work required though to handle the queries? Are these defined in the FE and are then just passed through the Prometheus adapter?
@joshlambert by adding appropriate queries into additional_metrics.yml, some initial FE work can be unblocked. Eventually, however, backend changes are needed to provide a clean interface, as we don't yet have any cluster-wide metrics; so far we've only focused on per-environment/deployment metrics.
Thanks for the clarification @pchojnacki + @joshlambert. It's assigned to @mikegreiling's pipeline for 10.6 as he is already working in that area right now.
> Can you mock up an empty state, in the event Prometheus hasn't been deployed to the cluster? This would serve to tell people why metrics aren't showing yet, and direct them how to install it. (Should be on the same page, just further down, maybe an anchor link?)
@joshlambert this mockup is already in the description of this issue? The button can directly install the Prometheus service and acts as a second "install" button on the same page, part of the empty state.
> Can you also do a mockup of the chart with three series? This is a stretch goal, but would be great to get a mockup ready in the event we can deliver. I think the main action items would be simply another line on the chart, and then a simple legend to indicate what these values mean.
> CPU/Memory consumed
> CPU/Memory requested (this is how much of the resource pods have asked for, which is distinct from what they use)
> CPU/Memory available (this would just be the link, as you note)
I imagine it would be something like this:
With an additional dynamic user controlled legend functioning similar to
Dimitrie Hoekstra added gitlab-ce~2024186 and removed gitlab-ce~2024184 labels
Discussed this on the team call today, and we want to use a single charting component for both this screen and the environment dashboard.
This will help us to focus our efforts on a common solution:
Increase user familiarity so we don't have different styles across GitLab
Reduce initial up-front and ongoing maintenance effort
Improvements can be shared easily across all areas
@mikegreiling has been trying to adapt the current component to meet the above UX exactly, but that is not possible. Rather than re-develop an entirely new component, we will re-use the existing chart and post a screenshot of the result.
This approach also reduces risk for 10.6, as we have committed to shipping this item. Discussed this briefly with @tauriedavis, as @dimitrieh is currently out.
@pchojnacki & @mikegreiling, sorry for the conflicting signals. Hopefully this is easy to just target for EE, since the upstream merge had to be successful anyway?
@tauriedavis @dimitrieh - here is the current UX. Note that this is using our existing charting component that we have already built, to leverage that for both speed and efficiency going forward.
If you have concerns on using that, please let me know, but I think we'd want consistency here across the workflows.
The graphs Dimitrie made look really nice, and we may want to consider making another issue to improve the existing UI, but I think using our current charting component for this issue is a good idea. I believe @dimitrieh is back, so I will leave the final call to him.
> we may want to consider making another issue to improve the existing UI but I think using our current charting component for this issue is a good idea.
I agree here. Using our current charting component for this issue and creating an issue to iterate and get the UI closer to @dimitrieh mocks will help us move faster.
> I agree here. Using our current charting component for this issue and creating an issue to iterate and get the UI closer to @dimitrieh mocks will help us move faster.
Yes! Let's get this reviewed and merged ASAP, especially since this is a committed deliverable for %10.6. It has to ship.
@dimitrieh would you be willing to put together an issue with some ideas on how to unify the UX you had designed for this feature, with the larger environment dashboard?
One quick example to me would be to remove the line breaks around the charting component and move them into the outer page itself. This would allow us to embed just a chart or two without the extra breaks, which aren't necessary in smaller use cases.
As for the charts, I reckon that we need some solidified rulesets on how to display all kinds of different charts, correct?
Is there some existing documentation on what the extent/capabilities of the existing charting component are?
We need this figured out anyway, as we have charts all around the interface and will continue to have more now that we are exploring/growing the DevOps side.
> Is there some existing documentation on what the extent/capabilities of the existing charting component are?
@dimitrieh I know I haven't myself written up specific requirements around the charting component we use in the metrics dashboards. We have been coming at them from the perspective of monitoring user stories driving functionality, rather than building specific functional requirements for the chart itself.
I'll work on moving my MR from CE to EE, put it behind a feature flag for EE Ultimate, and polish it up for review. After that, I'll try to get in as much extra polish and UI harmonizing as possible before the 7th.
@joshlambert great, I propose we set some guidelines and rules with an issue in the GitLab design repository in order to approach this a bit more holistically.
@joshlambert have we looked at any existing charting component libraries we can leverage which have a lot of these rulesets already defined?
> @joshlambert have we looked at any existing charting component libraries we can leverage which have a lot of these rulesets already defined?
@dimitrieh we are using d3 for the charting library. I don't know the details, but it seems super flexible but requires a lot of effort.
While I think it's great that we take a more holistic view of where we want to end up with the charting component, I want to ensure we approach iterating on it based on the value we are attempting to deliver in a given release.
gitlab-design#123 (closed) is less about the implementation and more about the rules charts should adhere to throughout GitLab. That way not every chart feels like an implementation of its own and they will become easier to read.
@dimitrieh I'd like to do a focused refactoring of the graph library so that we can be more flexible on things like colors, legends, measures, axes, etc while still reusing the same components and not reinventing the wheel whenever we need to add another graph. There's also the burndown chart graph which could maybe be harmonized with these components as well (similar axes, colors, hover states, etc).
Should we open a gitlab-ce issue to discuss this, or should we keep the discussion in gitlab-design#123 (closed) for now?
There's also a good deal of ~"technical debt" in the current implementation (I don't like the way it handles scaling to different screen sizes at all). I think we ought to focus on fixing that to more closely resemble the designs as well.
I don't know what resources we have available for implementation, but based on that we could potentially prioritise that issue. cc: @joshlambert @sarrahvesselov