For some projects, the Security & Compliance > Threat Monitoring > Statistics page fails to load the statistics chart for Container Network Security, even when Prometheus and Cilium are installed via GMAv2.
It appears that there is no result (the backend returns an empty object with a 202) and the frontend is configured to keep polling the endpoint until it gets a 200, so it never stops loading. I wonder if that is because the backend requests this data asynchronously and is still waiting for a response.
Could it be a configuration issue elsewhere? 🤷 I used managed apps v1 for the cluster of https://staging.gitlab.com/defend-team-test/cnp-alert-demo (checked the box to install Prometheus instead of adding it to the config.yml) and that seemed to work.
Example Project
What is the current bug behavior?
The chart gets stuck in a loading state and never comes out of it.
What is the expected correct behavior?
The chart should load in a reasonable timeframe. If no data is available, then we should display an empty chart.
Relevant logs and/or screenshots
I'm having some difficulties understanding the internals of the PrometheusAdapter but this seems like it could happen if the Prometheus integration is working but isn't returning the metrics we are trying to query. CC: @gitlab-org/protect/container-security-backend in case anybody has a better understanding of how this works.
@aturinske @sam.white I think we probably ought to make a change on the frontend to show a message if we can't get any data after a certain number of tries. Something like "Statistics aren't available yet", but we could continue trying to query for them in the background.
we probably ought to make a change on the frontend to show a message if we can't get any data after a certain number of tries. Something like "Statistics aren't available yet", but we could continue trying to query for them in the background.
That sounds like a good solution. It would be good to understand the root of what is causing the error so we can give users instructions on how to resolve it. Right now any error message text we could provide would be unlikely to be helpful.
I'm having some difficulties understanding the internals of the PrometheusAdapter but this seems like it could happen if the Prometheus integration is working but isn't returning the metrics we are trying to query.
@bwill I agree, it looks like this is the case. We are using ReactiveCache here, so getting metrics from the Prometheus adapter happens in the background. In the controller we do not respond with an error message when it is impossible to get metrics from the adapter; we respond with a 400 Bad Request with no explanation. We need to verify whether this adapter responds with an error when metrics are not found, and then include an error message in the response:
```ruby
def summary
  return not_found unless environment.has_metrics?

  adapter = environment.prometheus_adapter
  return not_found unless adapter.can_query?

  result = adapter.query(
    :packet_flow,
    environment.deployment_namespace,
    params[:interval] || "minute",
    parse_time(params[:from], 1.hour.ago).to_s,
    parse_time(params[:to], Time.current).to_s
  )

  respond_to do |format|
    format.json do
      if result.nil?
        # background calculation has not finished yet
        render status: :accepted, json: {}
      elsif result[:success]
        render status: :ok, json: result[:data]
      else
        render status: :bad_request, json: { message: result[:result] }
      end
    end
  end
end
```
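For context, the nil checked above comes from the reactive cache. Here is a minimal sketch of the adapter-side read path, assuming it goes through ReactiveCaching's with_reactive_cache; the method body below is my illustration, not copied from PrometheusAdapter:

```ruby
# Sketch only, not the actual PrometheusAdapter implementation.
def query(query_name, *args)
  return unless can_query?

  # On a cache miss, with_reactive_cache enqueues a background worker and
  # returns nil; the controller above maps that nil to "202 Accepted".
  # Once the worker has written a { success:, data:/result: } hash into the
  # Rails cache, subsequent polls receive the real payload and the endpoint
  # answers with 200 or 400 instead.
  with_reactive_cache("Gitlab::Prometheus::Queries::#{query_name.to_s.classify}Query", *args) do |result|
    result
  end
end
```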
Is your expectation that we're going to continue supporting this integration until 15.0 or should we focus only on the agent integration? I don't remember our deprecation rules off the top of my head.
In any case, Dominic has accepted the challenge and will give it a go.
@thiagocsf that is a good point; however, unfortunately, as far as I know, we do not have an alternative to using GMAv2 with Cilium for our Container Network Security category.
If this is possible through the agent (including generating and viewing the Statistics page), then I would be all too happy to move over to that. If it is possible, we don't have it documented.
If this is not possible through the agent, then I believe we should continue to support this integration as we do not have an alternative to replace it with. We will probably need to invest in moving off GMAv2 and over to the agent by 15.0 if support for GMAv2 is actually going to be removed in that milestone.
tl;dr I think this affects all installation options equally
My understanding is that there are at least two hard requirements for the current Threat Monitoring dashboard:
A Prometheus service within the cluster:
Prometheus must be installed in your cluster in the gitlab-managed-apps namespace.
The Service resource for Prometheus must be named prometheus-prometheus-server.
-- Cluster integrations/Prometheus Prerequisites
The cluster must use the Cilium CNI and its Hubble observability services.¹
Then it should not matter how these requirements get fulfilled: whether Prometheus and Hubble get installed as Helm releases via GMAv2 or their manifests get synced by an Agent should be irrelevant.
¹ This appears less well defined, and I have yet to figure out how the Cilium/Hubble integration gets discovered by GitLab instances.
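As a rough illustration of why the hard-coded Service name and namespace matter: the integration reaches Prometheus by proxying through the Kubernetes API. The sketch below is my approximation of that lookup, with the names taken from the documented prerequisites; the exact call chain is an assumption, not verified against the source.

```ruby
# Sketch only: how the integration could build a client for the in-cluster
# Prometheus by proxying through the Kubernetes API.
def prometheus_client_for(cluster)
  proxy_url = cluster.kubeclient.proxy_url(
    'service',
    'prometheus-prometheus-server', # must match the Service resource name
    80,
    'gitlab-managed-apps'           # must match the installation namespace
  )

  Gitlab::PrometheusClient.new(proxy_url)
end
```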
The pipeline succeeds. All deployed pods are in a running state.
Navigate to Infrastructure > Kubernetes Clusters > Integrations and check Enable Prometheus integration.
Now when I navigate to the Threat Monitoring > Statistics tab, I see the same loading state for a couple of seconds. Then the UI turns into an empty state.
I now understand why my dashboard didn't render at first.
The dashboard renders only if network activity within the environment's deployment namespace has been recorded.
The backend queries a Prometheus metric by the deployment namespace label, e.g.:
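The exact PromQL is not reproduced here; the Ruby sketch below only illustrates the shape of such a query. The "namespace" label and the "by (verdict)" grouping are my assumptions, not the query actually used by PacketFlowQuery.

```ruby
# Illustrative only: a namespace-scoped query of the Hubble flow counter.
def packet_flow_promql(deployment_namespace, range: '1m')
  'sum(rate(hubble_flows_processed_total{namespace="%{ns}"}[%{range}])) by (verdict)' %
    { ns: deployment_namespace, range: range }
end

packet_flow_promql('my-project-123-production')
# => sum(rate(hubble_flows_processed_total{namespace="my-project-123-production"}[1m])) by (verdict)
```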
Without network activity in the namespace, no data points match (even though the hubble_flows_processed_total metric exists), which leads to my response from above (with status code 200):
I tried to replicate what could potentially go wrong with Prometheus/Hubble integrations, as my GMAv2 installation was functional.
I injected faults into the Prometheus installation and observed how the dashboard behaved in response.
I found a class of errors that breaks the background computation the dashboard relies on: unhandled network errors. MR that rectifies this.
I also saw background workers intermittently not executing as expected, independent of Prometheus state. I suspect there is Sidekiq middleware interfering, but I did not find the cause. I think this is the bigger problem, assuming my local Sidekiq behaves comparably to the production one.
How I could reproduce the error
The dashboard keeps loading indefinitely while the backend's summary.json endpoint keeps responding with 202. The backend enqueues a worker to query Prometheus in the background and responds with 202 until the query result is available.
There are three cases where the response code never changes from 202:
No background worker executes and Prometheus is not queried.
A background worker gets enqueued but dies because of an unhandled exception.
A background worker executes successfully, but it stores nil in the Rails cache as a result.
I could reliably trigger (2) and suspect there is also a trigger for (1) hidden somewhere in Sidekiq middleware. I found no path leading to (3).
With (2), the problem is that Gitlab::PrometheusClient uses Gitlab::HTTP internally. However, the async path of PrometheusAdapter does not rescue all of the possible network errors in Gitlab::HTTP::HTTP_ERRORS. For example, when it encounters persistent HTTP timeouts, the background worker raises, gets retried three times, and dies.
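To make that concrete, here is a minimal sketch of the kind of rescue that is missing, assuming the async path looks roughly like PrometheusAdapter#calculate_reactive_cache. The method body is my approximation, not the code from the linked MR.

```ruby
# Sketch only: rescuing low-level network errors in the background job so the
# failure is cached as { success: false, ... } instead of killing the worker.
def calculate_reactive_cache(query_class_name, *args)
  data = Object.const_get(query_class_name).new(prometheus_client).query(*args)
  { success: true, data: data, last_update: Time.current.utc }
rescue Gitlab::PrometheusClient::Error, *Gitlab::HTTP::HTTP_ERRORS => err
  # A cached failure lets the controller answer 400 with a message on the next
  # poll, instead of leaving the cache empty and the UI stuck on 202.
  { success: false, result: err.message }
end
```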
I simulated HTTP client timeouts by shelling into one of the Kubernetes VMs and pausing the Prometheus process:
… and sent the dashboard into the loading animation loop.
Sometimes, the worker doesn't execute
I noticed that, irregularly and independent of what I did to Prometheus, no ExternalServiceReactiveCachingWorker would get executed by Sidekiq. I could see the workers in Sidekiq's "Scheduled" tab. However, from the logs I added, I could tell these workers did not execute.
This caused the dashboard not to recover from failure:
I introduced a fault.
I removed the fault.
The dashboard remained in its loading animation loop.
I suspect there is some Sidekiq middleware which drops the worker jobs prior to execution.
How I could not reproduce the error
The following faults did not result in a loading animation loop.
Prometheus service does not exist
GMA might have failed to install the Prometheus release, or might have installed it into the wrong namespace.
Prometheus service exists, but Cilium time series are not present
The hubble_flows_processed_total metric might be missing from Prometheus because the Hubble exporter is dysfunctional, or because Prometheus fails to scrape it.
Faulted how: Renamed the metric in PacketFlowQuery.
Backend status code: 200
Prometheus service forwards to wrong container port
I have no idea how that would occur, but I tried it anyway.
Faulted how: Changed Prometheus servicePort from 9090 to 9091
Backend status code: 400
Possible next steps
Investigate error tracking: there should be a pattern where some ExternalServiceReactiveCachingWorker jobs fail continuously if, for example, timeouts are the primary cause.
Investigate the Sidekiq behaviour I don't understand.
Thanks @bauerdominic, it looks like you are getting really deep into the inner workings here! I did notice that https://staging.gitlab.com/defend-team-test/cnp-alert-demo/-/threat_monitoring is now loading the Statistics tab again. I don't know what changed or how, but unfortunately it means we no longer have a case of this actively erroring out. In any case, thanks for the deep troubleshooting. It sounds like you are on the right track to finding the root problem here.
What I'm trying to reproduce is the dashboard entering (1) and then never transitioning to another state. This results in a loading animation loop. It's also what's visible in the OP screenshot.
Without network activity recorded in the namespace, the Prometheus query still succeeds and returns a 200 response, but all metrics in the response are 0. The dashboard ends up in the empty state (2), not in the loading animation state (1).
My suspicion that Sidekiq middleware interferes with the job turned out to be wrong. Instead, I found the error to be local to my development environment. Adding debug output to the ReactiveCacheableWorker module breaks code reloading. The background jobs I was missing were stuck in the "Busy" state, blocked on the same code (verified with a Sidekiq thread dump). This affected all jobs equally and was not limited to the caching worker.
I connected my existing cluster to a hosted project on GitLab.com, and the dashboard behaved as expected there as well. Because stuck workers would also raise alerts in production environments, I would by now rule out possibility (1):
No background worker executes and Prometheus is not queried.
A background worker gets enqueued but dies because of an unhandled exception.
A background worker executes successfully, but it stores nil in the Rails cache as a result.
This would leave (3), for which I still cannot find any possible code paths.
Then I'm again left with (2), which I could reproduce in the case of network failures.
Given that the cnp-alert-demo dashboard also no longer exhibits the infinite loading behaviour, I'd conclude this for now, until my MR makes it into production and the behaviour is observable in some real-world cluster again.
@thiagocsf I wouldn't mind verifying it, except that the project where I encountered this error suddenly started working again! I would be happy to verify that it is still working, but I won't be able to verify that the error state is caught properly.