Improve KubeSOS log parsing and add toolbox pod commands support

Context

As part of Support's preparation for the Kubernetes-first deployment strategy, we need to improve KubeSOS to be on par with GitLabSOS.

Requirements

Based on feedback from @jessie:

  1. Improve log parsing - The logs KubeSOS collects are hard to parse, particularly when investigating performance issues. We need to explore ways to make log analysis more accessible and actionable for Support Engineers.

  2. Toolbox pod commands - Revisit whether to run rake commands and SQL queries in the toolbox pod, similar to what we currently run with GitLabSOS.

Related Issues

  • Parent issue: gitlab-com/support/support-team-meta#7283
  • Epic: gitlab-org&19744

Proposed Solutions

1. Improve Log Parsing (Performance Issues)

Option A: Structured Log Analysis with jq + Performance Metrics Extraction

  • Build on GitLab's JSON-structured logs (already available in Kubernetes deployments)
  • Create pre-built jq filters specifically for performance analysis:
    • Extract slow requests from production_json.log (duration_s, db_duration_s)
    • Parse Sidekiq performance metrics from subcomponent logs
    • Identify N+1 queries and memory issues
    • Filter by correlation_id to trace requests across pods
  • Package these as reusable scripts in KubeSOS
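
The filter logic behind Option A can be sketched as follows. This is a minimal illustration written in Python rather than jq, assuming each log line is one JSON object. The function names `slow_entries` and `with_correlation_id` and the 1-second default threshold are hypothetical; the field names `db_duration_s` and `correlation_id` are the ones listed above.

```python
import json


def parse_entries(lines):
    """Yield each line that parses as a JSON object; skip everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. plain-text lines mixed into the stream
        if isinstance(entry, dict):
            yield entry


def slow_entries(lines, field="db_duration_s", threshold=1.0):
    """Yield entries whose numeric `field` is at or above `threshold`."""
    for entry in parse_entries(lines):
        value = entry.get(field)
        if isinstance(value, (int, float)) and value >= threshold:
            yield entry


def with_correlation_id(lines, correlation_id):
    """Yield entries that belong to a single request."""
    for entry in parse_entries(lines):
        if entry.get("correlation_id") == correlation_id:
            yield entry
```

The same two filters are short enough to express as jq one-liners; the sketch only pins down the behavior they would need, in particular skipping lines that are not valid JSON instead of aborting.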

Option B: Integration with fast-stats

  • Leverage the existing fast-stats tool that Support already uses
  • Extend it to work with Kubernetes log formats
  • Provides statistical analysis and comparison capabilities already built for GitLab logs

Option C: Log Aggregation Helper

  • Create a KubeSOS command that automatically:
    • Collects logs from all relevant pods (webservice, sidekiq, gitaly, etc.)
    • Merges them by timestamp and correlation_id
    • Applies performance-focused filters
    • Outputs in a format easier to analyze (CSV, formatted tables, or highlighted JSON)
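
The merge step in Option C could look roughly like the sketch below. It assumes every pod emits JSON lines carrying an ISO 8601 timestamp in a `time` field and, where available, a `correlation_id`. The `time` field name, the `pod` tag added to each entry, and the function names are assumptions for illustration, not existing KubeSOS behavior.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def parse_time(value):
    """Parse an ISO 8601 timestamp; treat values without an offset as UTC."""
    ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    return ts


def merge_logs(logs_by_pod):
    """Merge {pod name: iterable of JSON lines} into one time-ordered list."""
    merged = []
    for pod, lines in logs_by_pod.items():
        for line in lines:
            try:
                entry = json.loads(line)
                ts = parse_time(entry["time"])
            except (ValueError, KeyError, TypeError, AttributeError):
                continue  # not JSON, not an object, or no usable timestamp
            entry["pod"] = pod  # remember where the line came from
            merged.append((ts, entry))
    merged.sort(key=lambda pair: pair[0])
    return [entry for _, entry in merged]


def group_by_correlation_id(entries):
    """Group merged entries per request, keeping their time order."""
    groups = defaultdict(list)
    for entry in entries:
        cid = entry.get("correlation_id")
        if cid:
            groups[cid].append(entry)
    return dict(groups)
```

Timestamps are parsed instead of compared as strings because components may log different fractional-second precision, which breaks plain lexical ordering.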

Recommended approach: Combine Options A and B. Use jq-based filters for quick analysis and integrate with fast-stats for deeper statistical analysis.

2. Toolbox Pod Commands Support

Option A: Direct Rake Task Execution

  • Add KubeSOS commands that wrap common rake tasks:
    • kubesos rake <task> → executes in toolbox pod
    • Pre-configure common diagnostics: gitlab:check, gitlab:env:info, gitlab:doctor:*
    • Support custom rake tasks with proper error handling
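
As a rough sketch of what the rake wrapper could do internally: validate the task name, build a `kubectl exec` argument list, and surface failures with the task's own error output. The `gitlab` namespace, the `toolbox` container name, the assumption that `gitlab-rake` is on the PATH inside the toolbox container, and the helper names are illustrative guesses, not confirmed KubeSOS or chart behavior.

```python
import re
import subprocess

# Accept only plain namespaced task names such as "gitlab:env:info".
TASK_PATTERN = re.compile(r"^[a-z0-9_]+(:[a-z0-9_]+)*$")


def rake_command(task, pod, namespace="gitlab", container="toolbox"):
    """Build the argv list that runs one rake task in the toolbox pod."""
    if not TASK_PATTERN.match(task):
        raise ValueError(f"refusing unexpected rake task name: {task!r}")
    return [
        "kubectl", "exec", "-n", namespace, pod, "-c", container,
        "--", "gitlab-rake", task,
    ]


def run_rake(task, pod, timeout_s=300, **kwargs):
    """Run the task and return its stdout; raise with stderr on failure."""
    argv = rake_command(task, pod, **kwargs)
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s
    )
    if result.returncode != 0:
        raise RuntimeError(f"{task} failed: {result.stderr.strip()}")
    return result.stdout
```

Passing an argument list to `subprocess.run` (rather than a shell string) keeps the task name from being interpreted by a local shell, and the pattern check rejects anything that is not a plain task name before `kubectl` is invoked.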

Option B: SQL Query Runner

  • Implement safe SQL execution via toolbox pod:
    • Read-only queries by default
    • Pre-built query library for common diagnostics
    • Query result formatting (table, CSV, JSON)
    • Connection pooling awareness
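
One way to approach "read-only by default" is sketched below: reject anything that is not a single read-style statement, then wrap it in a PostgreSQL read-only transaction so the database itself refuses writes. The function name and the keyword allow-list are hypothetical, the keyword check is deliberately naive, and how the resulting script reaches the database from the toolbox pod is left open.

```python
READ_KEYWORDS = ("select", "with", "explain", "show")


def read_only_script(query):
    """Wrap a single read-style statement in a read-only transaction."""
    statement = query.strip().rstrip(";").strip()
    if not statement:
        raise ValueError("empty query")
    # Conservative: any remaining semicolon is treated as a second statement,
    # even if it sits inside a string literal.
    if ";" in statement:
        raise ValueError("only a single statement is allowed")
    first_word = statement.split(None, 1)[0].lower()
    if first_word not in READ_KEYWORDS:
        raise ValueError(f"statement must start with one of {READ_KEYWORDS}")
    # The READ ONLY transaction is the actual safeguard; the checks above
    # only give an early, readable error.
    return f"BEGIN TRANSACTION READ ONLY;\n{statement};\nROLLBACK;"
```

For example, `read_only_script("SELECT 1;")` returns the statement wrapped between `BEGIN TRANSACTION READ ONLY;` and `ROLLBACK;`, while a `DELETE` or a multi-statement string raises `ValueError` before anything is sent.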

Option C: Diagnostic Command Library

  • Create a curated set of diagnostic commands similar to GitLabSOS:
    • Database statistics (table sizes, index usage, slow queries)
    • Redis info and key analysis
    • Gitaly storage checks
    • Background job queue analysis
  • Each command handles the kubectl exec complexity internally

Recommended approach: Start with Option C (diagnostic library) and Option A (rake wrapper). This provides immediate value while maintaining safety. Add Option B (SQL runner) later with appropriate safeguards.

Implementation Priorities

Phase 1 (Quick wins):

  • Create jq filter library for common performance patterns
  • Add basic rake task wrapper (kubesos rake)
  • Implement log collection/aggregation helper

Phase 2 (Enhanced diagnostics):

  • Build diagnostic command library (database, Redis, Gitaly checks)
  • Integrate with fast-stats for statistical analysis
  • Add correlation_id-based request tracing

Phase 3 (Advanced features):

  • Safe SQL query runner with read-only mode
  • Performance baseline comparison tools
  • Automated performance issue detection

References

cc @jessie @cms

Edited by Chris Stone