Readiness review: OSQuery
Summary
OSQuery is an operating system instrumentation framework. It provides an agent that collects data from the underlying operating system and exposes that data as a high-performance relational database, which makes it possible to write SQL queries to explore operating system data. With osquery, SQL tables represent abstract concepts such as running processes, loaded kernel modules, open network connections, browser plugins, hardware events or file hashes.
Compliance requirements dictate that OSQuery be installed on each Virtual Machine running production workloads that have not yet been migrated to Kubernetes. This means that an OSQuery agent has to be installed on the part of GitLab's fleet still running directly on Google Compute Engine (called "Legacy VM Infrastructure" in the Production Architecture Handbook Page).
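As an illustration of the SQL interface described above, the following minimal sketch runs a one-shot query through the `osqueryi` shell with JSON output and parses the rows. The query itself and the helper names (`build_command`, `parse_rows`, `run_query`) are assumptions chosen for the example; they are not part of the deployment described in this review.

```python
import json
import subprocess

# Illustrative query against the built-in `processes` table:
# the five processes with the largest resident memory.
TOP_MEMORY_SQL = (
    "SELECT pid, name, resident_size "
    "FROM processes ORDER BY resident_size DESC LIMIT 5;"
)


def build_command(sql, binary="osqueryi"):
    """Return the argv list for a one-shot query with JSON output."""
    return [binary, "--json", sql]


def parse_rows(stdout):
    """Parse the JSON array printed by `osqueryi --json` (one object per row)."""
    rows = json.loads(stdout)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array of rows")
    return rows


def run_query(sql):
    """Run the query on the local host; requires osqueryi on PATH."""
    result = subprocess.run(
        build_command(sql), capture_output=True, text=True, check=True
    )
    return parse_rows(result.stdout)
```

On a host with osquery installed, `run_query(TOP_MEMORY_SQL)` would return a list of dictionaries whose values are strings, one per matching process.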
Google Compute Engine machines which are part of a Kubernetes managed node group are exempt from this requirement, since security visibility into Kubernetes will be obtained via Falco at a later stage.
Deployment work has been tracked in the "OSQuery Deployment" epic.
Documentation
Architecture
Starting from the GitLab.com Architecture diagram, OSQuery will impact only the "Legacy VM Infrastructure".
Within the "Legacy VM Infrastructure", each Compute Instance has an OSQuery Agent running as a system service.
OSQuery is deployed via a custom Chef Cookbook (gitlab-osquery), which includes recipes and resources to install, configure, and start OSQuery. In particular, when run, the cookbook will:
- Install the osquery package, obtained from https://pkg.osquery.io
- Set up syslog ingestion
- Create relevant configuration files (e.g., /etc/osquery/osquery.conf)
- Deploy query packs
- Start/enable the osqueryd service
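To make the configuration step above more concrete, here is a minimal sketch that assembles and renders an osquery.conf-style JSON document with the three top-level sections osquery reads (`options`, `schedule`, `packs`). Every value in it (the scheduled query, the interval, the pack name and path) is an assumption for illustration and does not reflect what the gitlab-osquery cookbook actually writes.

```python
import json


def build_osquery_config(packs):
    """Assemble a minimal osquery.conf structure.

    All values are illustrative assumptions, not the cookbook's settings.
    """
    return {
        "options": {
            # Write results to local files so a log shipper can pick them up.
            "logger_plugin": "filesystem",
            "logger_path": "/var/log/osquery",
        },
        "schedule": {
            # Hypothetical scheduled query: listening sockets, once per hour.
            "listening_ports": {
                "query": "SELECT pid, port, protocol, address FROM listening_ports;",
                "interval": 3600,
            }
        },
        # Map of pack name to the path of the pack file on disk.
        "packs": dict(packs),
    }


def render(config):
    """Serialize the configuration as stable, human-readable JSON."""
    return json.dumps(config, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    # Hypothetical pack name and path, used only for this example.
    print(render(build_osquery_config({"example": "/etc/osquery/packs/example.conf"})))
```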
Fluentd is used to collect the OSQuery output from the local filesystem (of the hosts running OSQuery) and to forward it to a Pub/Sub topic which is the entry point for Panther (the SIEM used by SIRT). Dedicated Pub/Sub Topics for Staging and Production are defined in the Terraform variables of the gitlab-com-infrastructure repository (see gstg and gprd).
The gitlab_fluentd configuration has also been extended for processing logs generated by OSQuery (see recipe and template).
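The data flow described above (local result log, then a Pub/Sub topic) can be sketched as follows. In the real pipeline this work is done by Fluentd; the Python below only illustrates the shape of the transformation. It assumes osquery's JSON result-log format, in which each line carries fields such as `name`, `hostIdentifier` and `columns`. The attribute names and the `environment` parameter are hypothetical.

```python
import json


def to_pubsub_message(log_line, environment):
    """Turn one osquery result-log line into a (data, attributes) pair.

    Pub/Sub messages carry a bytes payload plus string attributes;
    the attribute names used here are hypothetical.
    """
    record = json.loads(log_line)
    attributes = {
        "environment": environment,
        "query_name": str(record.get("name", "")),
        "host": str(record.get("hostIdentifier", "")),
    }
    data = json.dumps(record, separators=(",", ":")).encode("utf-8")
    return data, attributes


def iter_messages(lines, environment):
    """Yield messages for every well-formed line, skipping blank or malformed ones."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield to_pubsub_message(line, environment)
        except json.JSONDecodeError:
            # Skip malformed lines rather than stopping the stream.
            continue
```

A caller would pass the resulting pairs to whatever publisher client it uses; that step is deliberately left out because the review does not describe it.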
Performance
It is known that a previous rollout of OSQuery in 2019 caused performance issues on the Production hosts, which ultimately led to the decision to abandon the rollout.
The main differences with this rollout (2021) are:
- We won't rely on a 3rd party vendor (uptycs). Instead, we will use the vanilla open source version of OSQuery, which gives us full control over the configuration/tuning of the OSQuery agent.
- We won't enable file integrity monitoring for every file access. In fact, during the first iteration of the OSQuery rollout, file integrity monitoring will be completely disabled.
- The configuration options related to performance have been tuned in alignment with the thresholds defined by Palantir in their open source configuration of OSQuery.
In addition, CPU and Memory usage have been checked in Staging, and are as follows:
console-01-sv-gstg.c.gitlab-staging-1.internal:~$ ps aux | grep osqueryd
root 6716 0.0 0.2 146192 33556 ? SNsl 08:20 0:00 /usr/bin/osqueryd --flagfile /etc/osquery/osquery.flags --config_path /etc/osquery/osquery.conf
console-01-sv-gstg.c.gitlab-staging-1.internal:~$ ps -p 6716 -o %cpu,%mem
%CPU %MEM
0.0 0.2
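The manual check above can be repeated programmatically. The following is a minimal sketch that parses the two-line output of `ps -p <pid> -o %cpu,%mem` into numbers; the function name `parse_ps_usage` is an assumption made for this example.

```python
def parse_ps_usage(output):
    """Parse `ps -p <pid> -o %cpu,%mem` output into a dict of floats.

    Expects a header line (e.g. "%CPU %MEM") followed by one data line.
    """
    lines = [line.split() for line in output.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header line and one data line")
    header, values = lines[0], lines[1]
    if len(header) != len(values):
        raise ValueError("header and data line have different column counts")
    # "%CPU" becomes "cpu", "%MEM" becomes "mem".
    return {name.strip("%").lower(): float(value) for name, value in zip(header, values)}
```

Applied to the Staging output shown above, this yields 0.0 for CPU and 0.2 for memory, both expressed as percentages.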
Availability
In case of an internal error, the OSQuery daemon might stop its execution. This won't have any direct impact on the host on which it is running, and the SIRT team will be able to notice the resulting lack of incoming logs.
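One way to notice a lack of incoming logs is to track when each host last reported and flag the ones that have been silent for too long. The sketch below shows the idea only; the 30-minute window, the function name and the input shape are assumptions, not a description of how SIRT detects this.

```python
from datetime import datetime, timedelta, timezone


def silent_hosts(last_seen, now, max_silence=timedelta(minutes=30)):
    """Return hosts whose most recent log is older than max_silence.

    last_seen maps a host name to the timestamp of its latest log entry.
    The 30-minute default is a hypothetical value.
    """
    return sorted(host for host, seen in last_seen.items() if now - seen > max_silence)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical host names and timestamps.
    print(silent_hosts({"host-a": now, "host-b": now - timedelta(hours=2)}, now))
```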
Durability
OSQuery logs are shipped, via fluentd, to a dedicated Pub/Sub Topic.
From there, the SIRT team will be able to consume such logs and to store them accordingly.
Security/Compliance
- Compliance requirements mandated the rollout of OSQuery itself.
- OSQuery won't generate any data itself. Rather, it will collect system events and store them in a local log file.
- From there, fluentd will parse that log file and forward its entries to a dedicated Pub/Sub topic.
Monitoring and Alerts
The most likely issue arising from the OSQuery rollout is a potential performance penalty on the underlying hosts.
Spikes in CPU and/or memory usage can be detected by standard monitoring already in place on such hosts. For example, this Grafana dashboard could be helpful to identify the hosts where osqueryd is using the most CPU, memory or IO.
In addition, an alert has been created to trigger whenever the osqueryd process is using more than 10% CPU.
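The alert condition can be expressed as a simple threshold check. The sketch below is only an illustration of that condition in Python; the actual alert definition is not shown in this review, and the function name and input shape (host name mapped to the osqueryd CPU percentage) are assumptions.

```python
# Threshold taken from the text above: alert when osqueryd exceeds 10% CPU.
CPU_ALERT_THRESHOLD_PERCENT = 10.0


def hosts_over_threshold(cpu_by_host, threshold=CPU_ALERT_THRESHOLD_PERCENT):
    """Return the hosts where osqueryd CPU usage is strictly above the threshold."""
    return sorted(host for host, cpu in cpu_by_host.items() if cpu > threshold)
```

A value of exactly 10% does not fire in this sketch, matching the "more than 10%" wording.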