Develop static analysis for pipeline configs

Problem to solve

Analyzing complex pipeline configurations is challenging.

The best tool currently available for understanding pipeline configuration is the pipeline editor.

The pipeline editor is appropriate for interactive development within a "concrete" project, where external files (local, remote, or template) are available.

For other workflows, such as offline linting or security testing, the best tooling we have is purely syntactic, e.g. YAML schemas and yq.

Working with syntax alone is time-consuming, leaves blind spots, and is prone to semantic error.

Proposal

By modeling pipeline configuration as a high-level programming language and applying techniques from static analysis, we can enable faster, more complete, and more correct analysis.

The enumeration of image values is a good first goal, with useful intermediate results to iterate on.

For example:

Develop a command line tool that finds all image values that can be used by a given pipeline configuration.

  • image values should be resolved as far as possible, e.g. through include and !reference
  • if an external file cannot be resolved, analysis should continue, recording the circumstance
  • deficiencies in the analysis, like missing external files, should be indicated in the final report (see the report sketch after this list)
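
A minimal sketch, in Go, of the shape such a report might take; all type and field names here are hypothetical:

```go
package report

// ImageFinding records one image value reachable by the pipeline,
// along with how fully it could be resolved.
type ImageFinding struct {
	JobName  string // merged job the image belongs to
	Image    string // image value, resolved as far as possible
	Resolved bool   // false if includes or !reference could not be fully followed
}

// Deficiency records a gap in the analysis, e.g. an external file
// that could not be fetched, so the report can surface blind spots.
type Deficiency struct {
	Kind   string // e.g. "missing-include", "unresolved-reference"
	Detail string // human-readable description
}

// Report is the tool's final output: findings plus known deficiencies.
type Report struct {
	Images       []ImageFinding
	Deficiencies []Deficiency
}
```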

Implementation plan

  • parse pipeline configuration YAML into Go data types (minimal sketches of this and later steps follow the list)
    • to begin with, do not handle files with spec:inputs
  • (partially) resolve external file references
    • using YAML files available on the local filesystem
    • using remote YAML
  • compute a "merge plan" for each job name by resolving jobs overridden across includes (matched by name) and extends
    • determine the precedence for the set of jobs with a given name
    • provide a mechanism to "execute" a merge plan for a named job and query its members
    • parse and represent dynamic / "spec:inputs" config files but postpone interpolation
  • compute job control flow, e.g. the directed graph on (merged) jobs, based on stages and needs
  • compute data flow
    • use a symbol table keyed on variable name and kind: pipeline, job, environment
    • job inputs: identify where variables can be used, and what kind takes priority
    • job outputs: identify how variables can be defined, including dotenv artifacts and triggers
    • consider the representation of unevaluated expressions such as "spec:inputs" interpolations
  • emit use-def chains for each job's (implicit) image key
  • emit def-use chains for each image value
  • integrate the shell parser behind shfmt to analyze script values for variable usage and assignment (e.g. writes to dotenv artifacts); sketched below
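
Minimal sketches for several of the steps above follow. First, parsing: a sketch using gopkg.in/yaml.v3, with a deliberately simplified Job type (real configs allow richer shapes for image and extends, which this ignores):

```go
package config

import (
	"errors"
	"fmt"

	"gopkg.in/yaml.v3"
)

// Job models the subset of job keys the analysis needs first.
type Job struct {
	Image   string   `yaml:"image"`
	Stage   string   `yaml:"stage"`
	Extends []string `yaml:"extends"`
	Script  []string `yaml:"script"`
}

// parseConfig separates reserved top-level keys from job definitions
// and decodes each job. Files using spec:inputs are rejected for now.
func parseConfig(data []byte) (map[string]Job, error) {
	var raw map[string]yaml.Node
	if err := yaml.Unmarshal(data, &raw); err != nil {
		return nil, err
	}
	if _, ok := raw["spec"]; ok {
		return nil, errors.New("files with spec:inputs are not handled yet")
	}
	reserved := map[string]bool{
		"include": true, "stages": true, "variables": true,
		"default": true, "workflow": true,
	}
	jobs := make(map[string]Job)
	for name, node := range raw {
		if reserved[name] {
			continue
		}
		var j Job
		if err := node.Decode(&j); err != nil {
			return nil, fmt.Errorf("job %q: %w", name, err)
		}
		jobs[name] = j
	}
	return jobs, nil
}
```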
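
Next, partial resolution of external file references. This sketch reads local includes from a caller-supplied fs.FS and records a note for everything it cannot resolve, so analysis continues and the report can flag the blind spots:

```go
package config

import (
	"io/fs"
	"strings"
)

// Include models the common forms of the include keyword.
type Include struct {
	Local    string `yaml:"local"`
	Remote   string `yaml:"remote"`
	Template string `yaml:"template"`
}

// resolveIncludes loads what it can and records a note for the rest,
// rather than failing the whole analysis.
func resolveIncludes(fsys fs.FS, includes []Include) (bodies [][]byte, notes []string) {
	for _, inc := range includes {
		switch {
		case inc.Local != "":
			// include:local paths are project-root-relative and often
			// written with a leading slash, which fs.FS does not allow.
			b, err := fs.ReadFile(fsys, strings.TrimPrefix(inc.Local, "/"))
			if err != nil {
				notes = append(notes, "missing local include: "+inc.Local)
				continue
			}
			bodies = append(bodies, b)
		case inc.Remote != "":
			// Remote fetching is deferred in this sketch.
			notes = append(notes, "remote include not fetched: "+inc.Remote)
		case inc.Template != "":
			notes = append(notes, "template include not resolved: "+inc.Template)
		}
	}
	return bodies, notes
}
```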
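
For the merge plan, a sketch under two stated assumptions: precedence is modeled as a plain ordered list (lowest precedence first), and key merging is last-writer-wins, which approximates rather than reproduces the Rails merge semantics:

```go
package config

// jobDef is one definition of a job as a generic key/value map,
// tagged with its source so precedence can be reported.
type jobDef struct {
	source string         // e.g. file path
	keys   map[string]any // raw job keys
}

// mergePlan is the ordered list of definitions to merge for one job
// name, lowest precedence first (includes before the main file).
type mergePlan []jobDef

// execute merges front to back, so later (higher-precedence)
// definitions override earlier ones key by key.
func (p mergePlan) execute() map[string]any {
	merged := map[string]any{}
	for _, def := range p {
		for k, v := range def.keys {
			merged[k] = v
		}
	}
	return merged
}

// flattenExtends returns the job names to merge for one job, parents
// first, by walking extends recursively. Cycle detection is omitted.
func flattenExtends(extends map[string][]string, name string) []string {
	var order []string
	for _, parent := range extends[name] {
		order = append(order, flattenExtends(extends, parent)...)
	}
	return append(order, name)
}
```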
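
For job control flow, a sketch that builds the directed graph from stages and needs. It assumes that, absent needs, edges from every job in the immediately preceding stage are enough (a transitive reduction that is only faithful when every stage has at least one job), and that an explicit needs list replaces the stage default:

```go
package config

type graphJob struct {
	Name  string
	Stage string
	Needs []string // nil means "use stage ordering"
}

// controlFlow returns edges as predecessor -> successors.
func controlFlow(stages []string, jobs []graphJob) map[string][]string {
	stageIndex := map[string]int{}
	for i, s := range stages {
		stageIndex[s] = i
	}
	byStage := map[int][]string{}
	for _, j := range jobs {
		i := stageIndex[j.Stage]
		byStage[i] = append(byStage[i], j.Name)
	}
	edges := map[string][]string{}
	for _, j := range jobs {
		if j.Needs != nil {
			// Explicit needs replace the stage-based default.
			for _, n := range j.Needs {
				edges[n] = append(edges[n], j.Name)
			}
			continue
		}
		// Default: depend on every job in the previous stage.
		for _, prev := range byStage[stageIndex[j.Stage]-1] {
			edges[prev] = append(edges[prev], j.Name)
		}
	}
	return edges
}
```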
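
For data flow, a sketch of the symbol table keyed on variable name and kind. The precedence order in Lookup (job over pipeline over environment) is an assumption that should be checked against the Rails implementation:

```go
package config

type VarKind int

const (
	KindEnvironment VarKind = iota
	KindPipeline
	KindJob
)

type symbolKey struct {
	Name string
	Kind VarKind
}

// SymbolTable maps (name, kind) to a variable's (possibly
// unevaluated) value.
type SymbolTable map[symbolKey]string

// Lookup returns the highest-precedence definition of name.
func (t SymbolTable) Lookup(name string) (string, bool) {
	for _, kind := range []VarKind{KindJob, KindPipeline, KindEnvironment} {
		if v, ok := t[symbolKey{name, kind}]; ok {
			return v, true
		}
	}
	return "", false
}
```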
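
For the use-def chains on image keys, a sketch of the output shape together with the variable-use extraction it rests on; the regular expression covers only the $VAR and ${VAR} forms:

```go
package config

import "regexp"

// Definition is one reaching definition of a variable used by an
// image value, e.g. a job variables entry or a dotenv artifact.
type Definition struct {
	Variable string // e.g. "IMAGE_TAG"
	Site     string // e.g. `job "build", variables`
}

// ImageUseDef is the use-def chain for one job's (possibly implicit)
// image key.
type ImageUseDef struct {
	JobName string
	Image   string // raw value, e.g. "$REGISTRY/app:$IMAGE_TAG"
	Defs    []Definition
}

var varRef = regexp.MustCompile(`\$\{?([A-Za-z_][A-Za-z0-9_]*)\}?`)

// variableUses returns the variable names referenced by an image value.
func variableUses(image string) []string {
	var names []string
	for _, m := range varRef.FindAllStringSubmatch(image, -1) {
		names = append(names, m[1])
	}
	return names
}
```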
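
Finally, script analysis: shfmt is built on the mvdan.cc/sh/v3/syntax package, which can be used directly as a parser. A sketch that collects the names of variables a script assigns, i.e. candidates for dotenv artifact writes:

```go
package config

import (
	"strings"

	"mvdan.cc/sh/v3/syntax"
)

// scriptAssignments parses one script value as shell and returns the
// names of variables it assigns.
func scriptAssignments(script string) ([]string, error) {
	f, err := syntax.NewParser().Parse(strings.NewReader(script), "script")
	if err != nil {
		return nil, err
	}
	var names []string
	syntax.Walk(f, func(node syntax.Node) bool {
		if assign, ok := node.(*syntax.Assign); ok && assign.Name != nil {
			names = append(names, assign.Name.Value)
		}
		return true
	})
	return names, nil
}
```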

References

The Rails implementation is ground truth; see lib/gitlab/config and lib/gitlab/ci/config.

Hardening, in the GitLab handbook.
