Develop static analysis for pipeline configs

Problem to solve

Analyzing complex pipeline configurations is challenging.

The best tool currently available for understanding pipeline configuration is the pipeline editor.

The pipeline editor is appropriate for interactive development within a "concrete" project, where external files (local, remote, or template) are available.

For other workflows, such as offline linting or security testing, the best tooling we have is purely syntactic, e.g. YAML schemas and yq.

Working with syntax alone is time-consuming, leaves blind spots, and is prone to semantic error.

Proposal

By modeling pipeline configuration as a high-level programming language and applying techniques from static analysis, we can enable faster, more complete, and more correct analysis.

The enumeration of image values is a good first goal, with useful intermediate results to iterate on.

For example:

Develop a command line tool that finds all image values that can be used by a given pipeline configuration.

  • image values should be resolved as far as possible, e.g. through include and !reference
  • if an external file cannot be resolved, analysis should continue, recording the circumstance
  • deficiencies in the analysis, like missing external files, should be indicated in the final report (see the report sketch after this list)
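
A minimal sketch, in Go, of the shape such a report might take; all type and field names here are hypothetical:

```go
package report

// ImageFinding records one image value reachable by the pipeline,
// along with how fully it could be resolved.
type ImageFinding struct {
	JobName  string // merged job the image belongs to
	Image    string // image value, resolved as far as possible
	Resolved bool   // false if includes or !reference could not be fully followed
}

// Deficiency records a gap in the analysis, e.g. an external file
// that could not be fetched, so the report can surface blind spots.
type Deficiency struct {
	Kind   string // e.g. "missing-include", "unresolved-reference"
	Detail string // human-readable description
}

// Report is the tool's final output: findings plus known deficiencies.
type Report struct {
	Images       []ImageFinding
	Deficiencies []Deficiency
}
```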

Implementation plan

  • parse pipeline configuration YAML into Go data types (minimal sketches of this and later steps follow the list)
    • to begin with, do not handle files with spec:inputs
  • (partially) resolve external file references
    • using YAML files available on the local filesystem
    • using remote YAML
  • compute a "merge plan" for each job name by resolving jobs overridden across includes (matched by name) and extends
    • determine the precedence for the set of jobs with a given name
    • provide a mechanism to "execute" a merge plan for a named job and query its members
    • parse and represent dynamic / "spec:inputs" config files but postpone interpolation
  • compute job control flow, e.g. the directed graph on (merged) jobs, based on stages and needs
  • compute data flow
    • use a symbol table keyed on variable name and kind: pipeline, job, environment
    • job inputs: identify where variables can be used, and what kind takes priority
    • job outputs: identify how variables can be defined, including dotenv artifacts and triggers
    • consider the representation of unevaluated expressions such as "spec:inputs" interpolations
  • emit use-def chains for each job's (implicit) image key
  • emit def-use chains for each image value
  • integrate the shell parser behind shfmt to analyze script values for variable usage and assignment (e.g. writes to dotenv artifacts); sketched below
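
Minimal sketches for several of the steps above follow. First, parsing: a sketch using gopkg.in/yaml.v3, with a deliberately simplified Job type (real configs allow richer shapes for image and extends, which this ignores):

```go
package config

import (
	"errors"
	"fmt"

	"gopkg.in/yaml.v3"
)

// Job models the subset of job keys the analysis needs first.
type Job struct {
	Image   string   `yaml:"image"`
	Stage   string   `yaml:"stage"`
	Extends []string `yaml:"extends"`
	Script  []string `yaml:"script"`
}

// parseConfig separates reserved top-level keys from job definitions
// and decodes each job. Files using spec:inputs are rejected for now.
func parseConfig(data []byte) (map[string]Job, error) {
	var raw map[string]yaml.Node
	if err := yaml.Unmarshal(data, &raw); err != nil {
		return nil, err
	}
	if _, ok := raw["spec"]; ok {
		return nil, errors.New("files with spec:inputs are not handled yet")
	}
	reserved := map[string]bool{
		"include": true, "stages": true, "variables": true,
		"default": true, "workflow": true,
	}
	jobs := make(map[string]Job)
	for name, node := range raw {
		if reserved[name] {
			continue
		}
		var j Job
		if err := node.Decode(&j); err != nil {
			return nil, fmt.Errorf("job %q: %w", name, err)
		}
		jobs[name] = j
	}
	return jobs, nil
}
```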
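
Next, partial resolution of external file references. This sketch reads local includes from a caller-supplied fs.FS and records a note for everything it cannot resolve, so analysis continues and the report can flag the blind spots:

```go
package config

import (
	"io/fs"
	"strings"
)

// Include models the common forms of the include keyword.
type Include struct {
	Local    string `yaml:"local"`
	Remote   string `yaml:"remote"`
	Template string `yaml:"template"`
}

// resolveIncludes loads what it can and records a note for the rest,
// rather than failing the whole analysis.
func resolveIncludes(fsys fs.FS, includes []Include) (bodies [][]byte, notes []string) {
	for _, inc := range includes {
		switch {
		case inc.Local != "":
			// include:local paths are project-root-relative and often
			// written with a leading slash, which fs.FS does not allow.
			b, err := fs.ReadFile(fsys, strings.TrimPrefix(inc.Local, "/"))
			if err != nil {
				notes = append(notes, "missing local include: "+inc.Local)
				continue
			}
			bodies = append(bodies, b)
		case inc.Remote != "":
			// Remote fetching is deferred in this sketch.
			notes = append(notes, "remote include not fetched: "+inc.Remote)
		case inc.Template != "":
			notes = append(notes, "template include not resolved: "+inc.Template)
		}
	}
	return bodies, notes
}
```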
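
For the merge plan, a sketch under two stated assumptions: precedence is modeled as a plain ordered list (lowest precedence first), and key merging is last-writer-wins, which approximates rather than reproduces the Rails merge semantics:

```go
package config

// jobDef is one definition of a job as a generic key/value map,
// tagged with its source so precedence can be reported.
type jobDef struct {
	source string         // e.g. file path
	keys   map[string]any // raw job keys
}

// mergePlan is the ordered list of definitions to merge for one job
// name, lowest precedence first (includes before the main file).
type mergePlan []jobDef

// execute merges front to back, so later (higher-precedence)
// definitions override earlier ones key by key.
func (p mergePlan) execute() map[string]any {
	merged := map[string]any{}
	for _, def := range p {
		for k, v := range def.keys {
			merged[k] = v
		}
	}
	return merged
}

// flattenExtends returns the job names to merge for one job, parents
// first, by walking extends recursively. Cycle detection is omitted.
func flattenExtends(extends map[string][]string, name string) []string {
	var order []string
	for _, parent := range extends[name] {
		order = append(order, flattenExtends(extends, parent)...)
	}
	return append(order, name)
}
```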
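
For job control flow, a sketch that builds the directed graph from stages and needs. It assumes that, absent needs, edges from every job in the immediately preceding stage are enough (a transitive reduction that is only faithful when every stage has at least one job), and that an explicit needs list replaces the stage default:

```go
package config

type graphJob struct {
	Name  string
	Stage string
	Needs []string // nil means "use stage ordering"
}

// controlFlow returns edges as predecessor -> successors.
func controlFlow(stages []string, jobs []graphJob) map[string][]string {
	stageIndex := map[string]int{}
	for i, s := range stages {
		stageIndex[s] = i
	}
	byStage := map[int][]string{}
	for _, j := range jobs {
		i := stageIndex[j.Stage]
		byStage[i] = append(byStage[i], j.Name)
	}
	edges := map[string][]string{}
	for _, j := range jobs {
		if j.Needs != nil {
			// Explicit needs replace the stage-based default.
			for _, n := range j.Needs {
				edges[n] = append(edges[n], j.Name)
			}
			continue
		}
		// Default: depend on every job in the previous stage.
		for _, prev := range byStage[stageIndex[j.Stage]-1] {
			edges[prev] = append(edges[prev], j.Name)
		}
	}
	return edges
}
```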
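
For data flow, a sketch of the symbol table keyed on variable name and kind. The precedence order in Lookup (job over pipeline over environment) is an assumption that should be checked against the Rails implementation:

```go
package config

type VarKind int

const (
	KindEnvironment VarKind = iota
	KindPipeline
	KindJob
)

type symbolKey struct {
	Name string
	Kind VarKind
}

// SymbolTable maps (name, kind) to a variable's (possibly
// unevaluated) value.
type SymbolTable map[symbolKey]string

// Lookup returns the highest-precedence definition of name.
func (t SymbolTable) Lookup(name string) (string, bool) {
	for _, kind := range []VarKind{KindJob, KindPipeline, KindEnvironment} {
		if v, ok := t[symbolKey{name, kind}]; ok {
			return v, true
		}
	}
	return "", false
}
```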
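
For the use-def chains on image keys, a sketch of the output shape together with the variable-use extraction it rests on; the regular expression covers only the $VAR and ${VAR} forms:

```go
package config

import "regexp"

// Definition is one reaching definition of a variable used by an
// image value, e.g. a job variables entry or a dotenv artifact.
type Definition struct {
	Variable string // e.g. "IMAGE_TAG"
	Site     string // e.g. `job "build", variables`
}

// ImageUseDef is the use-def chain for one job's (possibly implicit)
// image key.
type ImageUseDef struct {
	JobName string
	Image   string // raw value, e.g. "$REGISTRY/app:$IMAGE_TAG"
	Defs    []Definition
}

var varRef = regexp.MustCompile(`\$\{?([A-Za-z_][A-Za-z0-9_]*)\}?`)

// variableUses returns the variable names referenced by an image value.
func variableUses(image string) []string {
	var names []string
	for _, m := range varRef.FindAllStringSubmatch(image, -1) {
		names = append(names, m[1])
	}
	return names
}
```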
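
Finally, script analysis: shfmt is built on the mvdan.cc/sh/v3/syntax package, which can be used directly as a parser. A sketch that collects the names of variables a script assigns, i.e. candidates for dotenv artifact writes:

```go
package config

import (
	"strings"

	"mvdan.cc/sh/v3/syntax"
)

// scriptAssignments parses one script value as shell and returns the
// names of variables it assigns.
func scriptAssignments(script string) ([]string, error) {
	f, err := syntax.NewParser().Parse(strings.NewReader(script), "script")
	if err != nil {
		return nil, err
	}
	var names []string
	syntax.Walk(f, func(node syntax.Node) bool {
		if assign, ok := node.(*syntax.Assign); ok && assign.Name != nil {
			names = append(names, assign.Name.Value)
		}
		return true
	})
	return names, nil
}
```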

References

The Rails implementation is ground truth; see lib/gitlab/config and lib/gitlab/ci/config.

Hardening, in the GitLab handbook.
