For some projects, the Security & Compliance > Threat Monitoring > Statistics page fails to load the statistics chart for Container Network Security, even when Prometheus and Cilium are installed via GMAv2.
It appears that there is no result (the backend returns an empty object with a 202) and the frontend is configured to keep polling the endpoint until it gets a 200, so it never stops loading. I wonder if that is because the backend requests this data asynchronously and is still waiting for a response.
Could it be a configuration issue elsewhere? 🤷 I used managed apps v1 for the cluster of https://staging.gitlab.com/defend-team-test/cnp-alert-demo (checked the box to install Prometheus instead of adding it to the config.yml) and that seemed to work.
Example Project
What is the current bug behavior?
The chart gets stuck in a loading state and never comes out of it.
What is the expected correct behavior?
The chart should load in a reasonable timeframe. If no data is available, then we should display an empty chart.
Relevant logs and/or screenshots
I'm having some difficulties understanding the internals of the PrometheusAdapter but this seems like it could happen if the Prometheus integration is working but isn't returning the metrics we are trying to query. CC: @gitlab-org/protect/container-security-backend in case anybody has a better understanding of how this works.
@aturinske @sam.white I think we probably ought to make a change on the frontend to show a message if we can't get any data after a certain number of tries. Something like "Statistics aren't available yet", but we could continue trying to query for them in the background.
we probably ought to make a change on the frontend to show a message if we can't get any data after a certain number of tries. Something like "Statistics aren't available yet", but we could continue trying to query for them in the background.
That sounds like a good solution. It would be good to understand the root of what is causing the error so we can give users instructions on how to resolve it. Right now any error message text we could provide would be unlikely to be helpful.
I'm having some difficulties understanding the internals of the PrometheusAdapter but this seems like it could happen if the Prometheus integration is working but isn't returning the metrics we are trying to query.
@bwill I agree, it looks like this is the case. We are using ReactiveCache here, so getting metrics from the Prometheus adapter happens in the background. In the controller we do not respond with an error message when it is impossible to get metrics from the adapter; we respond with a 400 Bad Request with no explanation. We need to verify whether this adapter responds with an error when metrics are not found, and then include an error message in the response:
```ruby
def summary
  return not_found unless environment.has_metrics?

  adapter = environment.prometheus_adapter
  return not_found unless adapter.can_query?

  result = adapter.query(
    :packet_flow,
    environment.deployment_namespace,
    params[:interval] || "minute",
    parse_time(params[:from], 1.hour.ago).to_s,
    parse_time(params[:to], Time.current).to_s
  )

  respond_to do |format|
    format.json do
      if result.nil?
        # background calculation has not finished yet
        render status: :accepted, json: {}
      elsif result[:success]
        render status: :ok, json: result[:data]
      else
        render status: :bad_request, json: { message: result[:result] }
      end
    end
  end
end
```
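For context, the nil checked above comes from the reactive cache. Here is a minimal sketch of the adapter-side read path, assuming it goes through ReactiveCaching's with_reactive_cache; the method body below is my illustration, not copied from PrometheusAdapter:

```ruby
# Sketch only, not the actual PrometheusAdapter implementation.
def query(query_name, *args)
  return unless can_query?

  # On a cache miss, with_reactive_cache enqueues a background worker and
  # returns nil; the controller above maps that nil to "202 Accepted".
  # Once the worker has written a { success:, data:/result: } hash into the
  # Rails cache, subsequent polls receive the real payload and the endpoint
  # answers with 200 or 400 instead.
  with_reactive_cache("Gitlab::Prometheus::Queries::#{query_name.to_s.classify}Query", *args) do |result|
    result
  end
end
```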
Is your expectation that we're going to continue supporting this integration until 15.0 or should we focus only on the agent integration? I don't remember our deprecation rules off the top of my head.
In any case, Dominic has accepted the challenge and will give it a go.
@thiagocsf that is a good point; however, unfortunately, as far as I know, we do not have an alternative to using GMAv2 with Cilium for our Container Network Security category.
If this is possible through the agent (including generating and viewing the Statistics page), then I would be all too happy to move over to that. If it is possible, we don't have it documented.
If this is not possible through the agent, then I believe we should continue to support this integration as we do not have an alternative to replace it with. We will probably need to invest in moving off GMAv2 and over to the agent by 15.0 if support for GMAv2 is actually going to be removed in that milestone.
tl;dr I think this affects all installation options equally
My understanding is that there are at least two hard requirements for the current Threat Monitoring dashboard:
A Prometheus service within the cluster:
Prometheus must be installed in your cluster in the gitlab-managed-apps namespace.
The Service resource for Prometheus must be named prometheus-prometheus-server.
-- Cluster integrations/Prometheus Prerequisites
The cluster must use the Cilium CNI and its Hubble observability services.¹
Then it should not matter how these requirements get fulfilled: whether Prometheus and Hubble get installed as Helm releases via GMAv2 or their manifests get synced by an Agent should be irrelevant.
¹ This appears less well defined, and I have yet to figure out how the Cilium/Hubble integration gets discovered by GitLab instances.
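As a rough illustration of why the hard-coded Service name and namespace matter: the integration reaches Prometheus by proxying through the Kubernetes API. The sketch below is my approximation of that lookup, with the names taken from the documented prerequisites; the exact call chain is an assumption, not verified against the source.

```ruby
# Sketch only: how the integration could build a client for the in-cluster
# Prometheus by proxying through the Kubernetes API.
def prometheus_client_for(cluster)
  proxy_url = cluster.kubeclient.proxy_url(
    'service',
    'prometheus-prometheus-server', # must match the Service resource name
    80,
    'gitlab-managed-apps'           # must match the installation namespace
  )

  Gitlab::PrometheusClient.new(proxy_url)
end
```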
The pipeline succeeds. All deployed pods are in a running state.
Navigate to Infrastructure > Kubernetes Clusters > Integrations and check Enable Prometheus integration.
Now when I navigate to the Threat Monitoring > Statistics tab, I see the same loading state for a couple of seconds. Then the UI turns into an empty state.
I now understand why my dashboard didn't render at first.
The dashboard renders only if network activity within the environment's deployment namespace has been recorded.
The backend queries a Prometheus metric by the deployment namespace label, e.g.:
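The exact PromQL is not reproduced here; the Ruby sketch below only illustrates the shape of such a query. The "namespace" label and the "by (verdict)" grouping are my assumptions, not the query actually used by PacketFlowQuery.

```ruby
# Illustrative only: a namespace-scoped query of the Hubble flow counter.
def packet_flow_promql(deployment_namespace, range: '1m')
  'sum(rate(hubble_flows_processed_total{namespace="%{ns}"}[%{range}])) by (verdict)' %
    { ns: deployment_namespace, range: range }
end

packet_flow_promql('my-project-123-production')
# => sum(rate(hubble_flows_processed_total{namespace="my-project-123-production"}[1m])) by (verdict)
```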
Without network activity in the namespace, no data points match (even though the hubble_flows_processed_total metric exists), which leads to my response from above (with status code 200):
I tried to replicate what could potentially go wrong with Prometheus/Hubble integrations, as my GMAv2 installation was functional.
I injected faults into the Prometheus installation and observed how the dashboard behaved in response.
I found a class of errors that breaks the background computation the dashboard relies on: unhandled network errors. MR that rectifies this.
I also saw background workers intermittently not executing as expected, independent of Prometheus state. I suspect there is Sidekiq middleware interfering, but I did not find the cause. I think this is the bigger problem, assuming my local Sidekiq behaves comparably to the production one.
How I could reproduce the error
The dashboard keeps loading indefinitely while the backend's summary.json endpoint keeps responding with 202. The backend enqueues a worker to query Prometheus in the background and responds with 202 until the query result is available.
There are three cases where the response code never changes from 202:
No background worker executes and Prometheus is not queried.
A background worker gets enqueued but dies because of an unhandled exception.
A background worker executes successfully, but it stores nil in the Rails cache as a result.
I could reliably trigger (2) and suspect there is also a trigger for (1) hidden somewhere in Sidekiq middleware. I found no path leading to (3).
With (2), the problem is that Gitlab::PrometheusClient uses Gitlab::HTTP internally. However, the async path of PrometheusAdapter does not rescue all of the possible network errors in Gitlab::HTTP::HTTP_ERRORS. For example, when it encounters persistent HTTP timeouts, the background worker raises, gets retried three times, and dies.
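To make that concrete, here is a minimal sketch of the kind of rescue that is missing, assuming the async path looks roughly like PrometheusAdapter#calculate_reactive_cache. The method body is my approximation, not the code from the linked MR.

```ruby
# Sketch only: rescuing low-level network errors in the background job so the
# failure is cached as { success: false, ... } instead of killing the worker.
def calculate_reactive_cache(query_class_name, *args)
  data = Object.const_get(query_class_name).new(prometheus_client).query(*args)
  { success: true, data: data, last_update: Time.current.utc }
rescue Gitlab::PrometheusClient::Error, *Gitlab::HTTP::HTTP_ERRORS => err
  # A cached failure lets the controller answer 400 with a message on the next
  # poll, instead of leaving the cache empty and the UI stuck on 202.
  { success: false, result: err.message }
end
```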
I simulated HTTP client timeouts by shelling into one of the Kubernetes VMs and pausing the Prometheus process:
… and sent the dashboard into the loading animation loop.
Sometimes, the worker doesn't execute
I noticed that, irregularly and independent of what I did to Prometheus, no ExternalServiceReactiveCachingWorker would get executed by Sidekiq. I could see the workers in Sidekiq's "Scheduled" tab. However, from the logs I added, I could tell these workers did not execute.
This caused the dashboard not to recover from failure:
I introduced a fault.
I removed the fault.
The dashboard remained in its loading animation loop.
I suspect there is some Sidekiq middleware which drops the worker jobs prior to execution.
How I could not reproduce the error
The following faults did not result in a loading animation loop.
Prometheus service does not exist
GMA might have failed to install the Prometheus release, or might have installed it into the wrong namespace.
Prometheus service exists, but Cilium time series are not present
The hubble_flows_processed_total metric might be missing from Prometheus because the Hubble exporter is dysfunctional, or because Prometheus fails to scrape it.
Faulted how: Renamed the metric in PacketFlowQuery.
Backend status code: 200
Prometheus service forwards to wrong container port
I have no idea how that would occur, but I tried it anyway.
Faulted how: Changed Prometheus servicePort from 9090 to 9091
Backend status code: 400
Possible next steps
Investigate error tracking: there should be a pattern where some ExternalServiceReactiveCachingWorker jobs fail continuously if, for example, timeouts are the primary cause.
Investigate the Sidekiq behaviour I don't understand.
Thanks @bauerdominic, it looks like you are getting really deep into the inner workings here! I did notice that https://staging.gitlab.com/defend-team-test/cnp-alert-demo/-/threat_monitoring is now loading the Statistics tab again. I don't know what changed or how, but unfortunately it means we no longer have a case of this actively erroring out. In any case, thanks for the deep troubleshooting. It sounds like you are on the right track to finding the root problem here.
What I'm trying to reproduce is the dashboard entering (1) and then never transitioning to another state. This results in a loading animation loop. It's also what's visible in the OP screenshot.
Without network activity recorded in the namespace, the Prometheus query still succeeds and returns a 200 response, but all metrics in the response are 0. The dashboard ends up in the empty state (2), not in the loading animation state (1).
My suspicion that Sidekiq middleware interferes with the job turned out to be wrong. Instead, I found the error to be local to my development environment. Adding debug output to the ReactiveCacheableWorker module breaks code reloading. The background jobs I was missing were stuck in the "Busy" state, blocked on the same code (verified with a Sidekiq thread dump). This affected all jobs equally and was not limited to the caching worker.
I connected my existing cluster to a hosted project on GitLab.com, and the dashboard behaved as expected there as well. Because stuck workers would also raise alerts in production environments, I would by now rule out possibility (1):
No background worker executes and Prometheus is not queried.
A background worker gets enqueued but dies because of an unhandled exception.
A background worker executes successfully, but it stores nil in the Rails cache as a result.
This would leave (3), for which I still cannot find any possible code paths.
Then I'm again left with (2), which I could reproduce in the case of network failures.
Given that the cnp-alert-demo dashboard also no longer exhibits the infinite loading behaviour, I'd conclude this for now, until my MR makes it into production and the behaviour is observable in some real-world cluster again.
@thiagocsf I wouldn't mind verifying it, except that the project where I encountered this error suddenly started working again! I would be happy to verify that it is still working, but I won't be able to verify that the error state is caught properly.