2025-08-27: Disk space utilization on Gitaly nodes exceeding SLO and nearing capacity
Disk space utilization on Gitaly nodes exceeding SLO and nearing capacity (Severity 3)
Problem: Disk space utilization on a Gitaly node exceeded the defined Service Level Objective (SLO) and was close to reaching full capacity due to a large performance capture file.
Impact: Disk space on the affected Gitaly node became saturated.
Causes: A large performance capture file consumed most of the disk. Deleting the file did not free the space because a 'perf' process still held its file handle open; on Linux, the blocks of a deleted file are reclaimed only once the last open handle to it is closed.
Response strategy: The resolution involved deleting the oversized performance capture file and forcibly terminating the 'perf' process that was preventing disk space from being freed. After these actions, disk usage on the affected node returned to normal.
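The condition described under Causes (a deleted file that still occupies disk space because a process holds it open) can be checked for directly. The following is a minimal Python sketch, not the tooling used during the incident. It assumes a Linux host, where each entry under /proc/<pid>/fd is a symlink whose target ends in " (deleted)" once the file has been unlinked. The function name and the proc_root parameter are hypothetical and exist only for illustration.

```python
import os

# Linux appends this suffix to the link target of an unlinked but still-open file.
DELETED_SUFFIX = " (deleted)"


def find_deleted_open_files(proc_root="/proc"):
    """Return (pid, fd, path, size_bytes) tuples for open files that were deleted.

    size_bytes is None when the size cannot be read.
    proc_root is a hypothetical parameter so the scan can be pointed at a test tree.
    """
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are process IDs
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            link = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(link)
            except OSError:
                continue  # descriptor closed between listdir and readlink
            if not target.endswith(DELETED_SUFFIX):
                continue
            try:
                # On a real /proc, stat follows the link to the still-open file.
                size = os.stat(link).st_size
            except OSError:
                size = None
            path = target[: -len(DELETED_SUFFIX)]
            results.append((int(entry), int(fd), path, size))
    # Largest files first; unknown sizes sort last.
    results.sort(key=lambda r: r[3] or 0, reverse=True)
    return results


if __name__ == "__main__":
    for pid, fd, path, size in find_deleted_open_files():
        print(f"pid={pid} fd={fd} size={size} path={path}")
```

Run as root, since the fd directories of other users' processes are otherwise unreadable and are silently skipped. Any process listed with a large size is still pinning that space; closing the handle, for example by stopping the process, is what lets the filesystem reclaim it.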
This ticket was created to track INC-3526, by incident.io