Verified Commit a57a0b69 authored by Michael Hofmann

Move OpenShift info from internal documentation


Signed-off-by: Michael Hofmann <mhofmann@redhat.com>
parent 357e287c
@@ -30,7 +30,8 @@ muffet http://localhost:1313/ \
-e 'https://datagrepper.engineering.redhat.com/.*' \
-e 'https://gitlab.corp.redhat.com/.*' \
-e 'https://datawarehouse.internal.cki-project.org' \
- -e 'https://rover.redhat.com/.*'
+ -e 'https://rover.redhat.com/.*' \
+ -e 'https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/resource-qos.md#requests-and-limits'
# if no dangling links were found, shut down normally
kill %1
---
title: "Working with Kubernetes"
description: >
How to access and work with production environments based on Kubernetes
---
## Problem
You want to access and work with any of the [Kubernetes-based environments]
used for deployment in [cee/deployment-all].
## Setup
The easiest way to get started with a Kubernetes environment is to access it
via the [URLs of the Web console].
You can also manage Kubernetes resources using the command line tools. On
Fedora, install the tools with: `dnf install origin-clients`.
There are two options for authentication:
- From the Web console, find your name at the top right and click it. Choose
*Copy login command* from the drop-down menu and paste the command into your
terminal. You will be logged in automatically with a one-time token.
- If the cluster is hooked up to Red Hat LDAP, run `oc login
https://api-server-url`. Enter your Kerberos username and password when
prompted.
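After either login method, it can help to confirm which user and API server the session points at. The following is a minimal sketch, assuming `oc` is on the `PATH`; `check_login` is a hypothetical helper name and not part of the CKI tooling:

```bash
#!/usr/bin/env bash
# check_login: hypothetical helper that reports the current oc session.
check_login() {
    local user
    # `oc whoami` exits non-zero when there is no valid session.
    if user=$(oc whoami 2>/dev/null); then
        echo "logged in as ${user} on $(oc whoami --show-server)"
    else
        echo "not logged in; run 'oc login https://api-server-url' first" >&2
        return 1
    fi
}
```

Calling `check_login` before running other `oc` commands avoids accidentally working against the wrong cluster.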
## OpenShift peculiarities to keep in mind
OpenShift has some strict rules for containers to maintain security:
- No packages can be installed once the container is running. All packages
that you need for the container must be installed into the container image
itself.
- You cannot be root inside the container and your Linux capabilities are
highly restricted. For example, ICMP pings and `chown` are not allowed.
- Each container starts with [an arbitrary UID/GID pair]. The pair is different
per project, but constant across invocations. Some applications, like `git`
and `ansible`, have issues with arbitrary UID/GID pairs, but there are
workarounds for this. See [Handling arbitrary UIDs and
GIDs](#handling-abitrary-uids-and-gids) below.
- The default resource allotments set by the namespace LimitRange might be very
low and it might be necessary to explicitly specify how much RAM and CPU the
container is allowed. Some applications may work with the defaults, but you
may experience strange issues or abrupt container restarts caused by
out-of-memory errors. See [Resource allocation](#resource-allocation) for details.
## Watching a running container
You can watch the logs from a deployment or container using the `oc` command
line tools. This can be very helpful if you are rapidly iterating a
DeploymentConfig and trying to see if the container runs properly.
Here's an example for the `irc-bot` DeploymentConfig: `oc logs -f dc/irc-bot`.
This will tail the logs as the container runs.
Occasionally, the connection will drop between you and OpenShift. You can keep
monitoring logs indefinitely by using something like this:
```bash
# the short sleep avoids a busy loop if oc fails immediately
while true; do oc logs --tail 5 -f dc/irc-bot; sleep 1; done
```
This will force a reconnection each time it disconnects.
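A slightly more defensive variant of the loop above waits between reconnection attempts and can optionally give up after a number of tries. This is only a sketch: `follow_logs` and its parameters are hypothetical names, and the delay and attempt defaults are arbitrary.

```bash
#!/usr/bin/env bash
# follow_logs TARGET [DELAY_SECONDS] [MAX_ATTEMPTS]
# Hypothetical helper: re-attaches to the log stream whenever `oc logs -f` exits.
# MAX_ATTEMPTS=0 (the default) means "retry forever".
follow_logs() {
    local target=$1 delay=${2:-2} max=${3:-0} attempt=0
    while true; do
        oc logs --tail 5 -f "$target"
        attempt=$((attempt + 1))
        if [ "$max" -gt 0 ] && [ "$attempt" -ge "$max" ]; then
            return 0
        fi
        echo "log stream ended, reconnecting in ${delay}s" >&2
        sleep "$delay"
    done
}
```

For example, `follow_logs dc/irc-bot 5` would reconnect indefinitely with a five second pause between attempts.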
## Handling abitrary UIDs and GIDs
When a container starts in OpenShift, it is assigned [an arbitrary UID/GID
pair]. This provides additional security for the host underneath the
container. However, it can make some applications misbehave because calls to
`id` or `groups` will fail or return strange information.
The following changes are implemented in the CKI project to make these changing
UID/GID combinations easier to handle.
### Writable /etc/passwd and /etc/group
The [cleanup include file] used during container image builds ensures that
container images have writable `/etc/passwd` and `/etc/group` files:
```dockerfile
# Make everybody happy again with arbitrary UID/GID in OpenShift
RUN chmod g=u /etc/passwd /etc/group
```
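To see what `chmod g=u` does, the effect can be reproduced on a scratch file. The sketch below assumes GNU coreutils (`stat -c`); the starting mode 644 is the usual mode of `/etc/passwd` and is set explicitly here. According to the OpenShift image guidelines linked above, the container user is a member of the root group, which is why group-writable files end up writable inside the container.

```bash
#!/usr/bin/env bash
# Demonstrate `chmod g=u` on a scratch file instead of the real /etc/passwd.
demo_file=$(mktemp)
chmod 644 "$demo_file"               # rw-r--r--
before=$(stat -c '%a' "$demo_file")
chmod g=u "$demo_file"               # copy the owner's permission bits to the group
after=$(stat -c '%a' "$demo_file")
echo "before: ${before}, after: ${after}"   # before: 644, after: 664
rm -f "$demo_file"
```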
### Current user/group added to /etc/passwd
The [default CKI container image entry point script] and [cronjob template] run the
following commands very early after container startup to ensure the current
user can be found in `/etc/passwd`:
```bash
if [ -w '/etc/passwd' ] && ! id -nu > /dev/null 2>&1; then
    echo "cki:x:$(id -u):$(id -g):,,,:${HOME}:/bin/bash" >> /etc/passwd
fi
```
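To experiment with this logic outside of a container, a variant that takes the file as a parameter can be handy. The following sketch is not the CKI entry point script: `ensure_passwd_entry` is a hypothetical name, and it checks for the UID with `grep` on the given file instead of `id -nu`, because `id` always consults the system user database.

```bash
#!/usr/bin/env bash
# ensure_passwd_entry [FILE]: hypothetical, testable variant of the snippet above.
# Assumption: the UID lookup uses grep on FILE rather than `id -nu`,
# so the function can be tried against a scratch file.
ensure_passwd_entry() {
    local passwd_file=${1:-/etc/passwd}
    local uid gid
    uid=$(id -u)
    gid=$(id -g)
    # The third colon-separated field of a passwd entry is the UID.
    if [ -w "$passwd_file" ] && ! grep -q "^[^:]*:[^:]*:${uid}:" "$passwd_file"; then
        echo "cki:x:${uid}:${gid}:,,,:${HOME}:/bin/bash" >> "$passwd_file"
    fi
}

# Try it against a scratch file; the second call adds nothing.
scratch=$(mktemp)
ensure_passwd_entry "$scratch"
ensure_passwd_entry "$scratch"
cat "$scratch"
rm -f "$scratch"
```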
## Resource allocation
By default, the namespace [LimitRange] might set very low default RAM and CPU
values for each container. Most applications will require higher limits to work
properly.
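One way to raise the values on an existing workload is `oc set resources`. The sketch below wraps it in a hypothetical helper, `raise_limits`, that sets requests and limits to the same value; the numbers in the usage comment are placeholders and not recommendations. The namespace defaults themselves can be inspected with `oc describe limitrange`.

```bash
#!/usr/bin/env bash
# raise_limits TARGET CPU MEMORY
# Hypothetical helper: sets identical requests and limits on a workload.
raise_limits() {
    local target=$1 cpu=$2 memory=$3
    oc set resources "$target" \
        --requests="cpu=${cpu},memory=${memory}" \
        --limits="cpu=${cpu},memory=${memory}"
}

# Usage (placeholder values):
#   raise_limits dc/irc-bot 500m 1Gi
```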
In most cases, paying attention to [Requests and Limits] for CPU and memory
should be good enough. While requests and limits are specified on a container
level, they are aggregated on a Pod level as
`max(...init containers, sum(containers))`.
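As a worked example of the `max(...init containers, sum(containers))` rule, the following sketch computes the effective value for one resource. `effective_request` is a hypothetical helper; values are plain integers in a single unit (for example MiB or millicores).

```bash
#!/usr/bin/env bash
# effective_request "INIT_VALUES" "CONTAINER_VALUES"
# Each argument is a space-separated list of integers in the same unit.
effective_request() {
    local init_values=$1 container_values=$2
    local max_init=0 sum=0 v
    # Init containers run one after another, so only the largest one counts.
    for v in $init_values; do
        if [ "$v" -gt "$max_init" ]; then max_init=$v; fi
    done
    # Regular containers run concurrently, so their values add up.
    for v in $container_values; do
        sum=$((sum + v))
    done
    if [ "$max_init" -gt "$sum" ]; then echo "$max_init"; else echo "$sum"; fi
}

effective_request "512" "256 128"   # prints 512 (the init container dominates)
effective_request "100" "256 128"   # prints 384 (the sum of containers dominates)
```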
Limits are strictly enforced, i.e. Pods can never use more resources. For CPU,
cgroups are used to limit resource consumption. For memory, Pods are killed
when exceeding the specified limit.
Requests are used for scheduling decisions, i.e. the total request for all Pods
on a node cannot exceed the available resources on that node. Also keep in mind
that some Pods on a node might not specify resource requests at all. For
resource-hungry Pods, make sure that nodes are available that have enough
resources to run the Pod.
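The scheduling rule for requests can be sketched as a simple capacity check. This is a deliberately simplified model with a hypothetical helper name, `fits_on_node`; the real scheduler takes more factors into account. Pods without requests contribute zero to the sum even though they do consume resources at runtime.

```bash
#!/usr/bin/env bash
# fits_on_node ALLOCATABLE NEW_REQUEST [EXISTING_REQUEST...]
# Succeeds if the new request still fits next to the existing requests.
fits_on_node() {
    local allocatable=$1 new_request=$2 requested=0 v
    shift 2
    for v in "$@"; do
        requested=$((requested + v))
    done
    [ $((requested + new_request)) -le "$allocatable" ]
}

# Node with 8192 MiB allocatable, existing Pods requesting 2048 and 1024 MiB:
if fits_on_node 8192 4096 2048 1024; then echo "fits"; else echo "does not fit"; fi
```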
## Cron jobs
Recurring jobs are deployed as CronJobs. Cron jobs are not visible in the
standard OpenShift Application Console, but can be found in the OpenShift
Cluster Console.
To get a list of all cron jobs from the command line, you can use
```text
$ oc get cronjob
NAME                               SCHEDULE     SUSPEND   ACTIVE   LAST SCHEDULE   AGE
acme-update-cluster-routes-daily   40 4 * * *   False     0        11h             63d
cronjobs-acme-certs-daily          30 4 * * *   False     0        11h             35d
cronjobs-acme-patch-remote-daily   40 4 * * *   False     0        11h             35d
...
```
As an example, the cronjobs-acme-certs-daily CronJob spawns a Job which spawns
a Pod with the actual containers. To get a list of everything related to one
schedule, you can use something like
```text
$ oc get job,pod -l schedule_job=cronjobs-acme-certs-daily
NAME                                             COMPLETIONS   DURATION   AGE
job.batch/cronjobs-acme-certs-daily-1630989000   1/1           22s        2d11h
job.batch/cronjobs-acme-certs-daily-1631075400   1/1           22s        35h
job.batch/cronjobs-acme-certs-daily-1631161800   1/1           24s        11h

NAME                                             READY   STATUS      RESTARTS   AGE
pod/cronjobs-acme-certs-daily-1630989000-rk4cp   0/1     Completed   0          2d11h
pod/cronjobs-acme-certs-daily-1631075400-267zw   0/1     Completed   0          35h
pod/cronjobs-acme-certs-daily-1631161800-2m8wh   0/1     Completed   0          11h
```
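The numeric suffix of the Job names in the listing above looks like the scheduled run time as a Unix timestamp in seconds. This is an assumption based on the listing, and other Kubernetes versions may use a different unit. Under that assumption, and with GNU `date`, the suffix can be decoded like this (`job_scheduled_time` is a hypothetical helper):

```bash
#!/usr/bin/env bash
# job_scheduled_time JOB_NAME
# Assumes the text after the last dash is a Unix timestamp in seconds.
job_scheduled_time() {
    local job_name=$1
    local epoch=${job_name##*-}     # strip everything up to the last dash
    date -u -d "@${epoch}" '+%Y-%m-%dT%H:%M:%SZ'
}

job_scheduled_time cronjobs-acme-certs-daily-1630989000   # 2021-09-07T04:30:00Z
```

The decoded time of 04:30 UTC is consistent with the `30 4 * * *` schedule of that CronJob.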
To see the output of a schedule, you can use `oc logs` with either the Job or the Pod:
```text
$ oc logs job.batch/cronjobs-acme-certs-daily-1631161800
...
Checking registration
...
$ oc logs pod/cronjobs-acme-certs-daily-1631161800-2m8wh
...
Checking registration
...
```
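The two steps above (finding the Jobs of a schedule, then reading their logs) can be combined. The following is a sketch with a hypothetical helper, `latest_job_logs`; it relies on the `schedule_job` label shown earlier and on the `--sort-by` and `-o name` options of `oc get`.

```bash
#!/usr/bin/env bash
# latest_job_logs SCHEDULE
# Hypothetical helper: prints the logs of the most recently created Job
# carrying the label schedule_job=SCHEDULE.
latest_job_logs() {
    local schedule=$1 job
    # Sorted by creation time, oldest first, so the last line is the newest Job.
    job=$(oc get job -l "schedule_job=${schedule}" \
        --sort-by=.metadata.creationTimestamp -o name | tail -n 1)
    if [ -z "$job" ]; then
        echo "no jobs found for schedule ${schedule}" >&2
        return 1
    fi
    oc logs "$job"
}

# Usage:
#   latest_job_logs cronjobs-acme-certs-daily
```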
[Kubernetes-based environments]: https://documentation.internal.cki-project.org/deployment/environments/
[URLs of the Web console]: https://documentation.internal.cki-project.org/deployment/environments/
[cee/deployment-all]: https://gitlab.cee.redhat.com/cki-project/deployment-all
[an arbitrary UID/GID pair]: https://docs.openshift.com/container-platform/3.11/creating_images/guidelines.html#openshift-specific-guidelines
[cleanup include file]: https://gitlab.com/cki-project/containers/-/blob/main/includes/cleanup
[default CKI container image entry point script]: https://gitlab.com/cki-project/cki-lib/-/blob/main/cki_entrypoint.sh
[cronjob template]: https://gitlab.cee.redhat.com/cki-project/deployment-all/-/blob/main/openshift/templates/40-cronjob.yml.j2
[LimitRange]: https://kubernetes.io/docs/concepts/policy/limit-range/
[Requests and Limits]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/resource-qos.md#requests-and-limits
---
title: Shutting down CKI
description: How to communicate a planned or unplanned shutdown of kernel testing
---
Internally, the One Portal should be used to announce planned and unplanned shutdowns.
For external clients, email is used instead.
## Outage
Using the [Create Outage] page, fill in the following form:
```text
* Title: { Outage description }
* Type: Incident
* Services: CKI
* Affected Locations/Data Center: { Affected locations }
* Impacted Users: { Use the preset value }
* Performer: { The person in charge of following the outage }
* Hostnames: { If applicable, e.g. gitlab.com }
* Status: Ongoing
* Priority: Critical
* Description: Description of the outage.
* Impact: All automated kernel testing is impacted. Testing that was running
when the outage started will have to be restarted and all new testing will be
delayed until the outage is over.
* Ticket: { Link to the SNOW Ticket }
```
Use this form when CKI is experiencing an outage or when a service CKI depends
on undergoes maintenance that cannot be worked around.
Make sure to also add the outage to the [outage spreadsheet].
## Maintenance
For CKI maintenance, fill in the form in [Create Outage] with the following data:
```text
* Title: { Maintenance description }
* Type: Maintenance
* Services: CKI
* Affected Locations/Data Center: { Affected locations }
* Impacted Users: { Use the preset value }
* Performer: { The person in charge of following the maintenance }
* Hostnames: { If applicable, e.g. gitlab.com }
* Status: Scheduled
* Date and Time, Estimated time required: { Date of the outage, including time
to check that everything is up and running. }
* Priority: Medium
* Description: { Description of the maintenance }
FAQ:
Q: What if a test is running for one of my patches when the triggers are
disabled?
A: All of the tests that are running when the triggers are disabled will
be allowed to finish.
Q: What if I submit patches or Brew builds after the triggers are disabled?
A: Those patches will be tested as soon as Beaker and the CKI pipelines
are back online. None of those patches will be lost.
Q: How can I stay informed about what is happening with the shutdown?
A: Follow along with updates in this thread or join #kernelci on Red Hat IRC.
Q: I have more questions about how this shutdown will affect me.
A: Email us at cki-project@redhat.com or join #kernelci on Red Hat IRC.
* Impact:
2019-12-23 16:00 UTC: CKI team disables all pipeline triggers.
2019-12-26 08:00 UTC: GitLab maintenance begins.
2019-12-30 12:00 UTC: CKI pipelines back online and testing kernels.
```
## External announcement email
```text
To: stable@vger.kernel.org
Cc: cki-project@redhat.com
Subject: CKI Project Shutdown: 2019-12-23 to 2019-12-30
Hello there,
The CKI team is planning to shut down the kernel testing pipelines,
including stable kernels, during the holidays.
Shutdown timeline:
2019-12-23 16:00 CET: CKI kernel testing pipelines are disabled.
2019-12-30 12:00 CET: CKI kernel testing pipelines back online and testing.
FAQ:
Q: What if a test is running for one of my commits when the pipelines
are disabled?
A: All of the tests that are running when the pipelines are disabled will
be allowed to finish.
Q: What if I commit patches to one of the tested kernel trees after the
pipelines are disabled?
A: The tip of those kernel trees will be tested as soon as the pipelines
are back online.
Q: I have more questions about how this shutdown will affect me.
A: Email us at cki-project@redhat.com.
Thank you! Michael Hofmann and the CKI Project Team 🤖
```
[Create Outage]: https://one.redhat.com/outages/create
[outage spreadsheet]: https://docs.google.com/spreadsheets/d/1jeFI_AY56JkAQ6MRhGE0h5ckQU47ew4UJPQ76il3N-g/edit#gid=0
---
title: PSI escalation procedure
description: How to escalate PSI infrastructure problems
---
## Problem
When PSI infrastructure fails, problems should be escalated in a structured way.
## Steps
1. File a [SNOW ticket]:
   - Search for 'PnT report an issue'
   - Impact: 3 - Affects multiple teams
   - Urgency: 2 - No workaround; blocks business-critical processes
   - Application: DevOps - PSI-OCP (or correct application)
   - Support group: Openshift PNT (or correct group)
1. If there is no response on the SNOW ticket after an hour:
   - Ping someone on the [exd-infra-escalation] Google Chat channel
This flow should only be used for real problems and not one-off failures.
A good rule of thumb is to verify that a problem is occurring consistently for
about 15 minutes (and that it is really caused by PSI OpenShift) before
submitting the initial ticket or pinging people on the Google Chat channel.
Make sure to also add the outage to the [outage spreadsheet].
[SNOW ticket]: https://helpdesk.redhat.com
[exd-infra-escalation]: https://chat.google.com/room/AAAA6BChWkY
[outage spreadsheet]: https://docs.google.com/spreadsheets/d/1jeFI_AY56JkAQ6MRhGE0h5ckQU47ew4UJPQ76il3N-g/edit