Improve KubeSOS log parsing and add toolbox pod commands support
Context
As part of Support's preparation for the Kubernetes-first deployment strategy, we need to improve KubeSOS to be on par with GitLabSOS.
Requirements
Based on feedback from @jessie:
- Improve log parsing: The logs are currently not the easiest to parse, especially when it comes to performance issues. We need to explore ways to make log analysis more accessible and actionable for Support Engineers.
- Toolbox pod commands: Revisit whether to run rake commands and SQL queries in the toolbox pod, similar to what we currently run with GitLabSOS.
Related Issues
- Parent issue: gitlab-com/support/support-team-meta#7283
- Epic: gitlab-org&19744
Proposed Solutions
1. Improve Log Parsing (Performance Issues)
Option A: Structured Log Analysis with jq + Performance Metrics Extraction
- Build on GitLab's JSON-structured logs (already available in Kubernetes deployments)
- Create pre-built jq filters specifically for performance analysis:
  - Extract slow queries from production_json.log (duration_ms, db_duration_s)
  - Parse Sidekiq performance metrics from subcomponent logs
  - Identify N+1 queries and memory issues
  - Filter by correlation_id to trace requests across pods
- Package these as reusable scripts in KubeSOS
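A minimal sketch of what one such filter could do, written in Python rather than jq for illustration. It assumes each log line is a single JSON object and that entries carry `time`, `path`, `correlation_id`, and a numeric duration field. The default field name `duration_s` and the 1-second threshold are assumptions and should be adjusted to whatever the collected logs actually contain.

```python
import json
from typing import Dict, Iterable, Iterator, List


def slow_requests(lines: Iterable[str], field: str = "duration_s",
                  threshold: float = 1.0) -> Iterator[Dict]:
    """Yield parsed log entries whose numeric `field` exceeds `threshold`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (stack traces, truncated output)
        if not isinstance(entry, dict):
            continue
        value = entry.get(field)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and value > threshold:
            yield entry


def summarize(entries: Iterable[Dict], field: str = "duration_s",
              top: int = 10) -> List[Dict]:
    """Return the `top` slowest entries, reduced to a few columns."""
    ranked = sorted(entries, key=lambda e: e[field], reverse=True)
    # Column names are assumptions about the log schema; missing keys become None.
    keys = ("time", "path", "correlation_id", field, "db_duration_s")
    return [{k: e.get(k) for k in keys} for e in ranked[:top]]
```

For example, `summarize(slow_requests(open("production_json.log")))` would list the ten slowest requests in a collected log file. The same selection could be packaged as a jq filter; the Python form is used here only to keep the example testable.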
Option B: Integration with fast-stats
- Leverage the existing fast-stats tool that Support already uses
- Extend it to work with Kubernetes log formats
- Provides statistical analysis and comparison capabilities already built for GitLab logs
Option C: Log Aggregation Helper
- Create a KubeSOS command that automatically:
  - Collects logs from all relevant pods (webservice, sidekiq, gitaly, etc.)
  - Merges them by timestamp and correlation_id
  - Applies performance-focused filters
  - Outputs in a format easier to analyze (CSV, formatted tables, or highlighted JSON)
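The merge step could be sketched roughly as follows. This assumes the logs have already been collected per pod, that each line is a JSON object with a `time` field, and that all timestamps share one ISO 8601 UTC format so they sort correctly as strings. The function names and the added `pod` key are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def merge_pod_logs(pod_logs: Dict[str, Iterable[str]]) -> List[dict]:
    """Merge JSON log lines from several pods into one list sorted by time."""
    merged = []
    for pod, lines in pod_logs.items():
        for line in lines:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore lines that are not JSON
            if isinstance(entry, dict) and "time" in entry:
                entry["pod"] = pod  # remember which pod produced the line
                merged.append(entry)
    # Assumption: same-format ISO 8601 UTC timestamps, so string order is time order.
    merged.sort(key=lambda e: str(e["time"]))
    return merged


def group_by_correlation(entries: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group already-sorted entries by correlation_id to trace one request."""
    groups = defaultdict(list)
    for entry in entries:
        cid = entry.get("correlation_id")
        if cid:
            groups[cid].append(entry)
    return dict(groups)
```

Grouping after sorting keeps each request's entries in chronological order, which makes it straightforward to see a request move from webservice to sidekiq or gitaly.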
Recommended approach: Combine A + B - use jq-based filters for quick analysis and integrate with fast-stats for deeper statistical analysis.
2. Toolbox Pod Commands Support
Option A: Direct Rake Task Execution
- Add KubeSOS commands that wrap common rake tasks:
  - kubesos rake <task> → executes in toolbox pod
  - Pre-configure common diagnostics: gitlab:check, gitlab:env:info, gitlab:doctor:*
  - Support custom rake tasks with proper error handling
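A rough sketch of how such a wrapper might build and run the command. It assumes the toolbox pod name is already known, that the pod exposes a `gitlab-rake` executable, and that the release lives in a namespace called `gitlab`. The function names, the task-name pattern, and the timeout are hypothetical.

```python
import re
import subprocess
from typing import List

# Conservative pattern for task names such as gitlab:check or gitlab:env:info.
_TASK_RE = re.compile(r"^[A-Za-z0-9_:\-\[\],.]+$")


def build_rake_command(pod: str, task: str, namespace: str = "gitlab") -> List[str]:
    """Build the kubectl argument list for running a rake task in the toolbox pod."""
    if not _TASK_RE.match(task):
        raise ValueError(f"refusing suspicious rake task name: {task!r}")
    # Assumption: `gitlab-rake` is available on the toolbox pod's PATH.
    return ["kubectl", "exec", "--namespace", namespace, pod, "--", "gitlab-rake", task]


def run_rake(pod: str, task: str, namespace: str = "gitlab", timeout: int = 600) -> str:
    """Run the task and return stdout, raising with stderr on failure."""
    cmd = build_rake_command(pod, task, namespace)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(f"{task} failed ({result.returncode}): {result.stderr.strip()}")
    return result.stdout
```

Passing the command as an argument list (no shell) avoids quoting problems, and the name check gives an early, readable error for malformed custom tasks.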
Option B: SQL Query Runner
- Implement safe SQL execution via toolbox pod:
  - Read-only queries by default
  - Pre-built query library for common diagnostics
  - Query result formatting (table, CSV, JSON)
  - Connection pooling awareness
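One possible way to get read-only behavior by default is to wrap each query in a PostgreSQL read-only transaction, sketched below. The function name is hypothetical, and the semicolon check is deliberately strict: it also rejects semicolons inside string literals. A dedicated read-only database role would be a stronger safeguard than this wrapper alone.

```python
def wrap_read_only(query: str) -> str:
    """Wrap a single SQL statement in a read-only transaction that is rolled back."""
    stmt = query.strip().rstrip(";").strip()
    if not stmt:
        raise ValueError("empty query")
    if ";" in stmt:
        # A second statement (e.g. COMMIT) could escape the read-only transaction.
        raise ValueError("multiple statements are not allowed")
    return f"BEGIN TRANSACTION READ ONLY;\n{stmt};\nROLLBACK;"
```

Inside such a transaction PostgreSQL rejects INSERT, UPDATE, DELETE, and DDL against regular tables, so a mistyped diagnostic query fails instead of modifying data.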
Option C: Diagnostic Command Library
- Create a curated set of diagnostic commands similar to GitLabSOS:
  - Database statistics (table sizes, index usage, slow queries)
  - Redis info and key analysis
  - Gitaly storage checks
  - Background job queue analysis
- Each command handles the kubectl exec complexity internally
Recommended approach: Start with Option C (diagnostic library) + Option A (rake wrapper). This provides immediate value while maintaining safety. Add Option B (SQL runner) later with appropriate safeguards.
Implementation Priorities
Phase 1 (Quick wins):
- Create jq filter library for common performance patterns
- Add basic rake task wrapper (kubesos rake)
- Implement log collection/aggregation helper
Phase 2 (Enhanced diagnostics):
- Build diagnostic command library (database, Redis, Gitaly checks)
- Integrate with fast-stats for statistical analysis
- Add correlation_id-based request tracing
Phase 3 (Advanced features):
- Safe SQL query runner with read-only mode
- Performance baseline comparison tools
- Automated performance issue detection