Capture helper service logs into the job/task main trace log
What does this MR do?
This MR takes everything from !3551 (closed) and !3564 (closed), except for the final 2 commits of each MR, and consolidates them into a single MR. The goal is to merge as much of this code as possible without actually enabling the feature; enabling it is blocked on gitlab!100349 (merged).
This MR adds (nearly) all the code necessary to capture and stream logs from helper service containers into the CI job/task's main trace log for the `docker` and `kubernetes` executors, but does not enable the feature. The single line that enables the feature for each of the `docker` and `kubernetes` executors, plus integration tests for each, will follow in a subsequent MR.
This functionality will be enabled when the variable `CI_DEBUG_SERVICES` is set to `true` in `.gitlab-ci.yml` or `config.toml`. While the logs are currently written to the job's main trace log, they could easily be written elsewhere (a file, a log aggregator service, syslog, ...) in the future.
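To make the enable/disable semantics concrete, here is a minimal sketch of how the variable's value might be interpreted. It assumes, without confirmation from this MR, that parsing uses Go's `strconv.ParseBool` (so `1`, `t`, `TRUE`, etc. would also be accepted); `debugServicesEnabled` is a hypothetical name, and only the error text is taken from this MR's test steps.

```go
package main

import (
	"fmt"
	"strconv"
)

// debugServicesEnabled is a hypothetical helper that interprets the value of
// CI_DEBUG_SERVICES. An empty (unset) value leaves the feature disabled; any
// value accepted by strconv.ParseBool (an assumption) is honoured; anything
// else produces the error message described in this MR.
func debugServicesEnabled(value string) (bool, error) {
	if value == "" {
		return false, nil
	}
	enabled, err := strconv.ParseBool(value)
	if err != nil {
		return false, fmt.Errorf("invalid value '%s' for CI_DEBUG_SERVICES variable", value)
	}
	return enabled, nil
}

func main() {
	for _, v := range []string{"", "true", "bogus"} {
		enabled, err := debugServicesEnabled(v)
		fmt.Printf("%q -> enabled=%v err=%v\n", v, enabled, err)
	}
}
```

Treating "unset" as a distinct, error-free case keeps the baseline behaviour unchanged for every job that never mentions the variable.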
docker
The approach in this MR relies on the `docker.Client.CaptureLogs()` API; I also considered `ContainerAttach()`, which was also suitable. For this use case, both APIs are very similar: both return a stream (an `io.Reader`) into which the container's `stdout` and `stderr` are multiplexed, and from which `stdcopy.StdCopy()` can read the demultiplexed contents. `CaptureLogs()` was a tad simpler, so I chose it.
kubernetes
The approach in this MR relies on the `kubernetes.Clientset().CoreV1().Pods().GetLogs()` API. Despite the known problems with that API, we're using it here as a first iteration, which may turn out to be good enough (see discussion #2119 (comment 1072311180)).
Why was this MR needed?
#2119 (closed) asks for service logs to be captured somehow. This is to help debug failing jobs when the failure is (at least in part) caused by the behaviour of one of the service containers (though not necessarily by a failure to start that container). A few possible approaches are mentioned in the issue; this MR takes the most "iteration friendly" (i.e. simplest) approach of copying the service logs inline into the main trace log, but leaves room for the logs to be written elsewhere (e.g. a file) in the future.
What's the best way to test this MR?
- BASELINE: Don't set `CI_DEBUG_SERVICES` and run a CI job with a service container. The main trace log output should be unchanged from `main`.
- Set `CI_DEBUG_SERVICES` to a bogus value. The error message `invalid value '<xxx>' for CI_DEBUG_SERVICES variable` should appear in the main trace log. Example: https://gitlab.com/avonbertoldi/test-project/-/jobs/2786325292
- Set `CI_DEBUG_SERVICES` to `true` in the CI configuration, then:
  - register a runner with the `docker` and `kubernetes` executors (one at a time, to test each executor separately)
  - create a job that includes a service which writes logs (example below)
  - run the job using a runner built from this branch
  - the service container's logs should appear in the job's main trace log in grey, with the container name prefixed to each log line.
`.gitlab-ci.yml`:

```yaml
stages:
  - test

variables:
  POSTGRES_PASSWORD: password
  CI_DEBUG_SERVICES: "true"

format:
  stage: test
  image:
    name: alpine
  services:
    - postgres:latest
    - redis:latest
  script:
    - sleep 30
```
Example https://gitlab.com/avonbertoldi/test-project/-/jobs/2841278059
What are the relevant issue numbers?
- The original issue #2119 (closed)
- Related #28063 (closed)
- gitlab#375180 (closed) is relevant only in that adding access control to job logs when this feature is enabled (as is currently done with `CI_DEBUG_TRACE`), as requested by @dcouture, is blocked on its solution.
- gitlab#290955 (closed), because gitlab#375180 (closed) was closed in favour of gitlab#290955 (closed). gitlab#290955 is a big piece of work, though, so as a first iteration to get this MR and related work merged, we'll do something like gitlab!100349 (merged) first, then get back to gitlab#290955.
Notes:
- Best reviewed commit by commit.
- @ratchade @ajwalker there is no new code here compared to !3551 (closed) and !3564 (closed). You have both approved the former, and @ratchade has approved the latter. The only difference between this MR and those two MRs is that the `k8s` MR moved some content around, and in this MR that happens earlier; the final content is exactly the same. All the changes required to resolve issues you raised in the other MRs were carried over. I did fix a couple of typos I found during a final self-review.