Add generic Linux observability tools to our hosts

As an SRE or DBRE, I want our Linux hosts to include tools for ad hoc observation, so that I can collect short-term metrics and investigate behaviors that are impractical or out of scope for our general-purpose monitoring.

Background: Prometheus provides a variety of metrics collected periodically. This serves us well for most purposes. However, sometimes we need more granular or narrowly scoped instrumentation for ad hoc investigations.

Examples:

  • Polling disk I/O statistics at 1-second intervals is helpful when analyzing suspected bursts of I/O contention that get smoothed over at less frequent polling intervals.
  • Measuring variation in memory access latency on a VM that runs an in-memory data store (e.g. Redis) can reveal an otherwise opaque root cause of transient query response time spikes.
  • Measuring trends in TCP connection open/close events can inform tuning of the kernel's TCP stack to handle traffic spikes more gracefully. It can also lead to improvements in our general-purpose diagnostic monitoring, so that we are alerted when approaching saturation of certain finite resources (e.g. TCP connection table, pool of available client ports, etc.).
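To make the first example concrete, here is a minimal Python sketch of 1-second disk I/O polling based on deltas between successive reads of /proc/diskstats. It is an illustration only, not a replacement for iostat: the names `DiskSample`, `parse_diskstats`, `interval_stats`, and `poll` are hypothetical, and the field positions assume the standard Linux /proc/diskstats layout (device name in column 3, followed by the read, write, and busy-time counters).

```python
import time
from typing import Dict, NamedTuple


class DiskSample(NamedTuple):
    reads: int     # reads completed since boot
    read_ms: int   # total milliseconds spent on reads
    writes: int    # writes completed since boot
    write_ms: int  # total milliseconds spent on writes
    io_ms: int     # milliseconds during which the device had I/O in flight


def parse_diskstats(text: str) -> Dict[str, DiskSample]:
    """Parse the contents of /proc/diskstats into per-device counters."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 13:
            continue  # skip blank or unexpectedly short lines
        samples[fields[2]] = DiskSample(
            reads=int(fields[3]),
            read_ms=int(fields[6]),
            writes=int(fields[7]),
            write_ms=int(fields[10]),
            io_ms=int(fields[12]),
        )
    return samples


def interval_stats(prev: DiskSample, curr: DiskSample, seconds: float) -> Dict[str, float]:
    """Turn two cumulative samples into rates and mean latencies for the interval."""
    d_reads = curr.reads - prev.reads
    d_writes = curr.writes - prev.writes
    return {
        "r/s": d_reads / seconds,
        "w/s": d_writes / seconds,
        "r_await_ms": (curr.read_ms - prev.read_ms) / d_reads if d_reads else 0.0,
        "w_await_ms": (curr.write_ms - prev.write_ms) / d_writes if d_writes else 0.0,
        "util_pct": 100.0 * (curr.io_ms - prev.io_ms) / (seconds * 1000.0),
    }


def poll(interval: float = 1.0, count: int = 5, path: str = "/proc/diskstats") -> None:
    """Print per-device statistics once per interval, count times."""
    with open(path) as fh:
        prev = parse_diskstats(fh.read())
    for _ in range(count):
        time.sleep(interval)
        with open(path) as fh:
            curr = parse_diskstats(fh.read())
        for dev, sample in curr.items():
            if dev in prev:
                print(dev, interval_stats(prev[dev], sample, interval))
        prev = curr
```

Calling `poll()` on a Linux host prints five 1-second samples per device. Because each sample is a delta over a single second, a short burst shows up as a spike in `util_pct` or `r_await_ms` rather than being averaged away over a longer scrape interval.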

In this Issue, let's build a wish list of tools we'd like to have available on our Linux hosts. To start us off, we have this list from the Slack discussion:

  • iostat: Per-block-device I/O statistics, including queue depth, %util (busy time), mean read/write latency, etc. Ships as part of the sysstat package.
  • sysstat: Provides the "sar" utility, which reports a wide variety of system-wide usage statistics, and "pidstat", which reports statistics for specific PIDs.
  • linux-tools: Provides perf, a lightweight profiling and tracing facility used via its subcommands (perf top, perf mem, perf trace, perf ftrace, etc.), with eBPF support on recent kernels.
  • iftop: "top" for network flows, showing which remote IPs are currently using the most network throughput.
  • ifstat: "vmstat" for network interfaces, polling network throughput on each interface.
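
As a starting point for the wish list, a small Python sketch like the following could report which of these tools are already present on a given host. The names `WISH_LIST`, `find_tools`, and `missing_tools` are hypothetical, and the binary names are assumptions about what each package installs on the PATH (for example, that sysstat provides `sar` and `iostat`, and that linux-tools provides `perf`).

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Assumed binary names for the tools in the list above.
WISH_LIST = ("iostat", "sar", "perf", "iftop", "ifstat")


def find_tools(names: Iterable[str] = WISH_LIST) -> Dict[str, Optional[str]]:
    """Map each tool name to its path on PATH, or None if it is not installed."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str] = WISH_LIST) -> List[str]:
    """Return the tool names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]
```

Running `missing_tools()` on each host class would give a quick view of the gap between the wish list and what is currently installed.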