Unfortunately, there will always be some cases where OpenShift fails to install properly. When this happens, it is helpful to understand the likely failure modes as well as how to troubleshoot the failure.

When users create the OpenShift cluster using the installer, the installer has all the information it needs to automatically capture the logs from the bootstrap host in case of failure.
The installer uses the user's environment to discover the credentials for connecting to the bootstrap host over SSH. One of the following methods is used by the installer:

1. Use the user's already set up SSH agent. If the user has an `ssh-agent` running, the installer will use it for SSH authentication.
2. Load all the SSH private keys from the user's home directory, `~/.ssh` on Linux hosts, and use those for SSH authentication.

   a. The installer also configures the bootstrap host with a generated SSH key, and this private key will be used for SSH authentication if none of the user's keys are trusted. The installer only configures the bootstrap host to trust the generated key, and therefore in that case the log bundle will only contain the logs from the bootstrap host and not the control-plane hosts.
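For reference, a typical agent setup that the installer can then discover might look like the following sketch. The key path and the throwaway key are purely illustrative; in practice you would add the private key whose public half was provided in the install configuration.

```shell
# Generate a throwaway key purely for demonstration; in practice, add the
# key that matches the public key embedded in the cluster's configuration.
rm -f /tmp/demo_key /tmp/demo_key.pub
ssh-keygen -t ed25519 -N '' -f /tmp/demo_key -q

# Start an agent and load the key; the installer discovers the agent
# from the environment.
eval "$(ssh-agent -s)" > /dev/null
ssh-add /tmp/demo_key 2> /dev/null

# Confirm the key is loaded before running the installer.
ssh-add -l

# Clean up the demonstration agent.
ssh-agent -k > /dev/null
```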
When users create the infrastructure for the OpenShift cluster themselves and the cluster fails to bootstrap, they can use the `gather bootstrap` subcommand to gather the logs from the bootstrap host.
```console
$ openshift-install gather bootstrap --help
Gather debugging data for a failing-to-bootstrap control plane

Usage:
  openshift-install gather bootstrap [flags]

Flags:
      --bootstrap string    Hostname or IP of the bootstrap host
  -h, --help                help for bootstrap
      --key stringArray     Path to SSH private keys that should be used for authentication. If no key was provided, SSH private keys from user's environment will be used
      --master stringArray  Hostnames or IPs of all control plane hosts
```
An example invocation for a cluster with three control-plane machines would be:

```sh
openshift-install gather bootstrap --bootstrap ${BOOTSTRAP_HOST_IP} --master ${CONTROL_PLANE_1_HOST_IP} --master ${CONTROL_PLANE_2_HOST_IP} --master ${CONTROL_PLANE_3_HOST_IP}
```
When explicitly using the `gather bootstrap` subcommand, users can either rely on the installer's discovery mechanism detailed [above](#authenticating-with-bootstrap-host-for-ipi) or provide the keys using the `--key` flag.
An example invocation for a cluster with three control-plane machines would be:

```sh
openshift-install gather bootstrap --key ${KEY_1} --key ${KEY_2} --bootstrap ${BOOTSTRAP_HOST_IP} --master ${CONTROL_PLANE_1_HOST_IP} --master ${CONTROL_PLANE_2_HOST_IP} --master ${CONTROL_PLANE_3_HOST_IP}
```
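On success, the gather command writes a compressed tarball into the current directory (named along the lines of `log-bundle-<timestamp>.tar.gz`; the exact name is printed by the command). Unpacking it with `tar` yields the layout described below; a minimal sketch, using a mocked-up bundle in place of a real one:

```shell
# Mock up a bundle so the unpack step can be shown end to end; the real
# file is produced by `openshift-install gather bootstrap`.
mkdir -p log-bundle-20200424/bootstrap log-bundle-20200424/control-plane
touch log-bundle-20200424/failed-units.txt
tar czf log-bundle-20200424.tar.gz log-bundle-20200424
rm -r log-bundle-20200424

# Unpack the bundle and inspect the top-level layout.
tar xzf log-bundle-20200424.tar.gz
ls log-bundle-20200424
```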
Here's what a log bundle looks like:

```
.
├── bootstrap
├── control-plane
├── failed-units.txt
├── rendered-assets
├── resources
└── unit-status

5 directories, 1 file
```
The `failed-units.txt` file contains a list of all the failed systemd units on the bootstrap host.

The `unit-status` directory contains the detailed status of each failed systemd unit listed in `failed-units.txt`.
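The two can be read together in a single pass over the bundle. A sketch over mocked-up files (the `<unit>.txt` naming inside `unit-status` is an assumption made for illustration):

```shell
# Mock the relevant pieces of a bundle for illustration.
mkdir -p bundle/unit-status
printf 'bootkube.service\n' > bundle/failed-units.txt
printf 'bootkube.service: activating (auto-restart)\n' \
  > bundle/unit-status/bootkube.service.txt

# Print the recorded status of every failed unit.
while read -r unit; do
  echo "== ${unit} =="
  cat "bundle/unit-status/${unit}.txt"
done < bundle/failed-units.txt
```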
The `bootstrap` directory contains all the important logs and files from the bootstrap host. There are three subdirectories for the bootstrap host:

```
bootstrap
├── containers
├── journals
└── pods

3 directories, 0 files
```
The `containers` directory contains the descriptions and logs of all the containers created by the kubelet, using CRI-O, for the static pods. This covers the operators and their operands that run on the bootstrap host in special bootstrap modes, for example the machine-config-server container or the bootstrap-kube-controlplane pods.
For each container the directory has two files:

- `<human readable id>.log`, which contains the log of the container.
- `<human readable id>.inspect`, which contains information about the container, such as the image, volume mounts, arguments, etc.
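Since every container contributes a `.log`/`.inspect` pair, listing the unique container ids is a matter of stripping the extension. A sketch with mocked-up file names:

```shell
# Mock a containers directory with one .log/.inspect pair per container.
mkdir -p bootstrap/containers
touch bootstrap/containers/etcd-member-1a2b3c.log \
      bootstrap/containers/etcd-member-1a2b3c.inspect \
      bootstrap/containers/machine-config-server-4d5e6f.log \
      bootstrap/containers/machine-config-server-4d5e6f.inspect

# One line per container, regardless of how many files each one has.
ls bootstrap/containers | sed 's/\.[^.]*$//' | sort -u
```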
The `journals` directory contains the logs for important systemd units. These units are:

- `release-image.log`: the release-image unit is responsible for pulling the Release Image to the bootstrap host.
- `crio-configure.log` and `crio.log`: these units are responsible for configuring CRI-O on the bootstrap host and for the CRI-O daemon, respectively.
- `kubelet.log`: the kubelet service is responsible for running the kubelet on the bootstrap host. The kubelet on the bootstrap host is responsible for running the static pods for etcd, bootstrap-kube-controlplane, and various other operators in bootstrap mode.
- `approve-csr.log`: the approve-csr unit is responsible for allowing control-plane machines to join the OpenShift cluster. It performs the job of the in-cluster approver while bootstrapping is in progress.
- `bootkube.log`: the bootkube service performs the bootstrapping of OpenShift clusters using all the operators. This service is responsible for running all the required steps to bootstrap the API and then waiting for success.
There might also be other services that are important for some platforms like OpenStack, that will have logs in this directory.
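A quick first pass over the journals is simply to grep all of them for error lines. A sketch with a single mocked-up log line standing in for a real journal:

```shell
# Mock a journals directory; real bundles contain the units listed above.
mkdir -p bootstrap/journals
printf 'Apr 24 17:20:01 bootstrap bootkube.sh[2816]: Error: unable to reach API\n' \
  > bootstrap/journals/bootkube.log

# Case-insensitive scan of every journal for error lines.
grep -ri 'error' bootstrap/journals/
```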
The `pods` directory contains the information and logs from all the `render` commands for the various operators run by the bootkube unit.
For each container the directory has two files:

- `<human readable id>.log`, which contains the log of the container.
- `<human readable id>.inspect`, which contains information about the container, such as the image, volume mounts, arguments, etc.
The `resources` directory contains various Kubernetes objects that are present in the cluster. These resources are pulled using the bootstrap API running on the bootstrap host.

The `rendered-assets` directory contains all the files and directories created by the bootkube unit using the various `render` commands for operators. This directory is a snapshot of the `/opt/openshift` directory on the bootstrap host.
The `control-plane` directory contains logs from each control-plane host. It contains a subdirectory for each control-plane host, usually named after the IP address of the host.
```
control-plane
├── 10.0.128.114
│   ├── containers
│   ├── failed-units.txt
│   ├── journals
│   └── unit-status
├── 10.0.142.138
│   ├── containers
│   ├── failed-units.txt
│   ├── journals
│   └── unit-status
└── 10.0.148.48
    ├── containers
    ├── failed-units.txt
    ├── journals
    └── unit-status

12 directories, 3 files
```
The `containers` directory contains the descriptions and logs of all the containers created by the kubelet using CRI-O on the control-plane host. The files are the same as in the `containers` directory on the bootstrap host.

The `journals` directory contains the logs of important units on the control-plane hosts. These units are:
- `crio.log`
- `kubelet.log`
- `machine-config-daemon-host.log` and `pivot.log`: these files have logs for RHCOS pivot related actions on the control-plane host.
Here are some common failures that users can troubleshoot using the bootstrap failure log bundle.
- `Attempted to gather debug logs after installation failure: failed to create SSH client: failed to initialize the SSH agent: no keys found for SSH agent`

  The installer tried to create a new SSH agent, but no keys were found in the user's home directory, usually `~/.ssh` on Linux. The user can use the `--key` flag to provide a private key for SSH to gather the bootstrap failure logs.

- `failed to create SSH client: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain`

  The keys provided to the installer from the SSH agent, or the keys loaded from the user's home directory, do not have permission to SSH to the bootstrap host. The user can use the `--key` flag to provide a private key for SSH to gather the bootstrap failure logs.
When the pull secret provided to the installer does not have the correct permissions to pull the Release Image, `bootstrap/journals/release-image.log` should contain the debugging logs. For example:
```
-- Logs begin at Fri 2020-04-24 17:08:15 UTC, end at Fri 2020-04-24 17:33:16 UTC. --
Apr 24 17:08:46 ci-op-2cbvx-bootstrap.c.openshift-gce-devel-ci.internal systemd[1]: Starting Download the OpenShift Release Image...
Apr 24 17:08:46 ci-op-2cbvx-bootstrap.c.openshift-gce-devel-ci.internal release-image-download.sh[1688]: Pulling registry.svc.ci.openshift.org/ci-op-8dv01g3m/release@sha256:50b07a8b4529d8fd2ac6c23ecc311034a3b86cada41c948baaced8c6a46077bc...
Apr 24 17:08:49 ci-op-2cbvx-bootstrap.c.openshift-gce-devel-ci.internal podman[1698]: 2020-04-24 17:08:49.307961668 +0000 UTC m=+1.119158273 system refresh
Apr 24 17:08:49 ci-op-2cbvx-bootstrap.c.openshift-gce-devel-ci.internal release-image-download.sh[1688]: Error: error pulling image "registry.svc.ci.openshift.org/ci-op-8dv01g3m/release@sha256:50b07a8b4529d8fd2ac6c23ecc311034a3b86cada41c948baaced8c6a46077bc": unable to pull registry.svc.ci.openshift.org/ci-op-8dv01g3m/release@sha256:50b07a8b4529d8fd2ac6c23ecc311034a3b86cada41c948baaced8c6a46077bc: unable to pull image: Error initializing source docker://registry.svc.ci.openshift.org/ci-op-8dv01g3m/release@sha256:50b07a8b4529d8fd2ac6c23ecc311034a3b86cada41c948baaced8c6a46077bc: Error reading manifest sha256:50b07a8b4529d8fd2ac6c23ecc311034a3b86cada41c948baaced8c6a46077bc in registry.svc.ci.openshift.org/ci-op-8dv01g3m/release: unauthorized: authentication required
```
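In this situation it is worth confirming that the pull secret actually carries an auth entry for the registry named in the error. A rough sketch, with a mocked-up pull secret in place of the real file passed to the installer, and the failing registry host taken from the example above:

```shell
# Mock a pull secret; the real one is the file passed to the installer.
cat > pull-secret.json <<'EOF'
{"auths": {"quay.io": {"auth": "c2VjcmV0"}}}
EOF

# The failing registry from the log above; no matching auth entry exists here.
registry='registry.svc.ci.openshift.org'
if grep -q "\"${registry}\"" pull-secret.json; then
  echo "pull secret has an entry for ${registry}"
else
  echo "pull secret has NO entry for ${registry}; pulls will fail with 'authentication required'"
fi
```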
For cases where the bootkube logs in `bootstrap/journals/bootkube.log` are empty, like:

```
-- Logs begin at Fri 2020-04-24 17:08:15 UTC, end at Fri 2020-04-24 17:33:16 UTC. --
-- No entries --
```

there is a high likelihood that the Release Image could not be downloaded, and more details can be found in `release-image.log`.
When the control-plane logs are missing from the log bundle, for example:

```console
$ tree control-plane -L 2
control-plane
├── 10.0.0.4
├── 10.0.0.5
└── 10.0.0.6

3 directories, 0 files
```

troubleshooting requires the logs of the installer gathering the log bundle, which are readily available in `.openshift_install.log`.
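The gather errors land in that file alongside the rest of the installer's output, so filtering for SSH-related lines usually surfaces the cause quickly. A sketch with a mocked-up log (the real file lives in the install directory next to the other installer artifacts):

```shell
# Mock an installer log containing a gather failure.
cat > .openshift_install.log <<'EOF'
time="2020-04-24T17:33:16Z" level=info msg="pulling debug logs from the bootstrap machine"
time="2020-04-24T17:33:20Z" level=error msg="failed to create SSH client: ssh: handshake failed"
EOF

# Filter for the gather/SSH related lines.
grep -i 'ssh' .openshift_install.log
```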