Readiness review: OSQuery

OSQuery Readiness Review

Summary

OSQuery is an operating system instrumentation framework. It provides an agent that collects data from the underlying operating system and exposes it as a high-performance relational database, which makes it possible to explore operating system data with SQL queries. With osquery, SQL tables represent abstract concepts such as running processes, loaded kernel modules, open network connections, browser plugins, hardware events or file hashes.
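To make the "operating system data as SQL tables" idea concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the osquery shell. The `processes` table name and its `pid`, `name` and `resident_size` columns mirror osquery's schema, but the rows are made up for illustration and nothing here talks to a real osquery agent.

```python
import sqlite3

# Stand-in for osquery's `processes` table; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processes (pid INTEGER, name TEXT, resident_size INTEGER)"
)
conn.executemany(
    "INSERT INTO processes VALUES (?, ?, ?)",
    [
        (1, "systemd", 9_000_000),
        (6716, "osqueryd", 34_000_000),
        (812, "sshd", 5_200_000),
    ],
)

# The kind of question an operator might ask: which processes use the most memory?
top_by_memory = conn.execute(
    "SELECT pid, name FROM processes ORDER BY resident_size DESC LIMIT 2"
).fetchall()
print(top_by_memory)
```

On a host with the agent installed, the same SELECT statement could be run against the live `processes` table instead of this in-memory copy.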

Compliance requirements dictate that OSQuery has to be installed on each Virtual Machine running production workloads which have not yet been migrated to Kubernetes. This means that an OSQuery agent has to be installed on the part of GitLab's fleet still running directly on Google Compute Engine (called "Legacy VM Infrastructure" in the Production Architecture Handbook Page).

Google Compute Engine machines which are part of a Kubernetes managed node group are exempt from this requirement, since security visibility into Kubernetes will be obtained via Falco at a later stage.

The deployment work is tracked in the "OSQuery Deployment" epic.

Documentation

Architecture

Starting from the GitLab.com Architecture diagram, OSQuery will impact only the "Legacy VM Infrastructure".

Within the "Legacy VM Infrastructure", each Compute Instance has an OSQuery Agent running as a system service.

OSQuery is deployed via a custom Chef Cookbook (gitlab-osquery), which includes recipes and resources to install, configure, and start OSQuery. In particular, when run, the cookbook will:

Install the osquery package, obtained from https://pkg.osquery.io
Set up syslog ingestion
Create the relevant configuration files (e.g., /etc/osquery/osquery.conf)
Deploy query packs
Start/enable the osqueryd service
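As an illustration of the configuration step above, the following sketch builds a minimal osquery.conf-style document. The top-level `options`, `schedule` and `packs` keys are standard osquery configuration sections; the specific query, interval, pack name and pack path are hypothetical and are not taken from the gitlab-osquery cookbook.

```python
import json


def build_osquery_config(pack_paths):
    """Return a minimal osquery.conf-style dict (illustrative values only)."""
    return {
        # Daemon-level options; "hostname" is one of osquery's host_identifier modes.
        "options": {"host_identifier": "hostname"},
        # Scheduled queries: each entry has a SQL query and an interval in seconds.
        "schedule": {
            "kernel_modules": {
                "query": "SELECT name, size FROM kernel_modules;",
                "interval": 3600,
            }
        },
        # Query packs, mapped from pack name to the file that defines the pack.
        "packs": dict(pack_paths),
    }


# Hypothetical pack name and path, for illustration only.
config = build_osquery_config({"example-pack": "/etc/osquery/packs/example.conf"})
rendered = json.dumps(config, indent=2, sort_keys=True)
print(rendered)
```

In the real deployment the file is rendered by Chef rather than by a script like this; the sketch only shows the shape of the resulting JSON.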

Fluentd is used to collect the OSQuery output from the local filesystem (of the hosts running OSQuery) and to forward it to a Pub/Sub topic which is the entry point for Panther (the SIEM used by SIRT). Dedicated Pub/Sub Topics for Staging and Production are defined in the Terraform variables of the gitlab-com-infrastructure repository (see gstg and gprd).

The gitlab_fluentd configuration has also been extended for processing logs generated by OSQuery (see recipe and template).
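The forwarding step can be sketched roughly as follows. This is not the gitlab_fluentd configuration, which is a Fluentd template. It is a hypothetical Python equivalent that parses one line of the osquery results log and wraps it in an envelope ready to be published. The `name`, `hostIdentifier`, `unixTime`, `action` and `columns` fields follow osquery's result log format; the envelope layout, the function name and the sample values are assumptions.

```python
import json


def to_pubsub_payload(raw_line, environment):
    """Parse one osquery result line and wrap it for forwarding.

    Returns None for blank or malformed lines so one bad line cannot
    stop the rest of the log from being processed.
    """
    raw_line = raw_line.strip()
    if not raw_line:
        return None
    try:
        event = json.loads(raw_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    # Hypothetical envelope: a few routing fields plus the untouched event.
    envelope = {
        "environment": environment,
        "host": event.get("hostIdentifier"),
        "query": event.get("name"),
        "event": event,
    }
    return json.dumps(envelope, sort_keys=True).encode("utf-8")


# Invented sample line in the shape of an osquery differential result.
sample = (
    '{"name": "pack_example_listening_ports", '
    '"hostIdentifier": "console-01-sv-gstg", "unixTime": 1629273600, '
    '"action": "added", "columns": {"port": "22"}}'
)
payload = to_pubsub_payload(sample, "gstg")
print(payload.decode("utf-8"))
```

The returned bytes are what a Pub/Sub client would send as the message body. The actual publishing call is left out on purpose, since it depends on credentials and topic names not shown here.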

Performance

It is known that a previous rollout of OSQuery in 2019 caused performance issues on the Production hosts, which ultimately led to the decision to abandon the rollout.

The main differences with this rollout (2021) are:

We won't rely on a third-party vendor (Uptycs), but on the vanilla open source version of OSQuery, which gives us full control over the configuration and tuning of the OSQuery agent.
We won't enable file integrity monitoring for every file access. In fact, during the first iteration of the OSQuery rollout, file integrity monitoring will be completely disabled.
The configuration options related to performance have been tuned in alignment with the thresholds defined by Palantir in their open source configuration of OSQuery.

In addition, CPU and Memory usage have been checked in Staging, and are as follows:

console-01-sv-gstg.c.gitlab-staging-1.internal:~$ ps aux | grep osqueryd
root      6716  0.0  0.2 146192 33556 ?        SNsl 08:20   0:00 /usr/bin/osqueryd --flagfile /etc/osquery/osquery.flags --config_path /etc/osquery/osquery.conf

console-01-sv-gstg.c.gitlab-staging-1.internal:~$ ps -p 6716 -o %cpu,%mem
%CPU %MEM
 0.0  0.2

Availability

If it hits an internal error, the OSQuery daemon might stop running. This has no direct impact on the host it runs on, and the SIRT team will be able to notice the resulting lack of incoming logs.
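One way such a gap in incoming logs could be detected is sketched below, assuming the consumer keeps the timestamp of the last event seen per host. The function name, the 30-minute window and the host names are all hypothetical; the review does not say how SIRT actually implements this check.

```python
from datetime import datetime, timedelta, timezone


def silent_hosts(last_seen, now, max_silence=timedelta(minutes=30)):
    """Return the hosts whose most recent log is older than max_silence."""
    return sorted(
        host for host, seen_at in last_seen.items() if now - seen_at > max_silence
    )


# Invented timestamps: one host reported recently, the other went quiet.
now = datetime(2021, 8, 18, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "console-01-sv-gstg": now - timedelta(minutes=5),
    "example-02-sv-gstg": now - timedelta(hours=2),
}
print(silent_hosts(last_seen, now))
```

A host is flagged only once its silence exceeds the window, so the window should be longer than the longest query interval in the schedule to avoid false positives.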

Durability

OSQuery logs are shipped, via fluentd, to a dedicated Pub/Sub Topic. From there, the SIRT team can consume the logs and store them as needed.

Security/Compliance

Compliance requirements mandated the rollout of OSQuery itself.
OSQuery won't generate any data itself. Instead, it will collect system events and store them in a local log file.
From there, fluentd will parse that log file and forward the entries to a dedicated Pub/Sub topic.

Monitoring and Alerts

The most likely issue arising from the OSQuery rollout is a possible performance penalty on the underlying hosts.

Spikes in CPU and/or memory usage can be detected by standard monitoring already in place on such hosts. For example, this Grafana dashboard could be helpful to identify the hosts where osqueryd is using the most CPU, memory or IO.

In addition, an alert has been created to trigger whenever the osqueryd process is using more than 10% CPU.
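The alert condition can be illustrated with a small sketch that parses output in the same format as the `ps -p <pid> -o %cpu,%mem` check from the Performance section and compares it against the 10% CPU threshold. This is not the actual alert definition, which lives in the monitoring stack; the function and constant names are hypothetical.

```python
CPU_ALERT_THRESHOLD = 10.0  # percent, as stated for the osqueryd alert


def parse_ps_usage(ps_output):
    """Parse `ps -p <pid> -o %cpu,%mem` output into (cpu, mem) floats."""
    rows = [line.split() for line in ps_output.strip().splitlines()]
    header, values = rows[0], rows[1]
    usage = dict(zip(header, (float(value) for value in values)))
    return usage["%CPU"], usage["%MEM"]


def should_alert(cpu_percent, threshold=CPU_ALERT_THRESHOLD):
    """Return True when CPU usage is strictly above the threshold."""
    return cpu_percent > threshold


# Same values as the Staging measurement shown earlier.
staging_sample = "%CPU %MEM\n 0.0  0.2\n"
cpu, mem = parse_ps_usage(staging_sample)
print(cpu, mem, should_alert(cpu))
```

With the Staging figures the check stays well below the threshold, which matches the observation that osqueryd is close to idle there.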

Edited Aug 18, 2021 by Marco Lancini (GitLab)