Merge pull request #688 from red-hat-storage/sync_ds--master
Syncing latest changes from master for rook
travisn authored Jul 31, 2024
2 parents 453dc30 + 8388c7c commit 99a50bd
Showing 11 changed files with 182 additions and 45 deletions.
73 changes: 54 additions & 19 deletions Documentation/CRDs/Cluster/external-cluster/external-cluster.md
@@ -121,25 +121,6 @@ The storageclass is used to create a volume in the pool matching the topology wh

For more details, see the [Topology-Based Provisioning](topology-for-external-mode.md)

### Upgrade Example

1. If consumer cluster doesn't have restricted caps, this will upgrade all the default csi-users (non-restricted):

```console
python3 create-external-cluster-resources.py --upgrade
```

2. If the consumer cluster has restricted caps:
Restricted users created with the `--restricted-auth-permission` flag must pass the mandatory flags `--rbd-data-pool-name` (if it is an RBD user), `--k8s-cluster-name`, and `--run-as-user` while upgrading. For CephFS users, if the `--cephfs-filesystem-name` flag was passed while creating the csi-users, it is also mandatory while upgrading. In this example the user would be `client.csi-rbd-node-rookstorage-replicapool` (following the pattern `csi-user-clusterName-poolName`)

```console
python3 create-external-cluster-resources.py --upgrade --rbd-data-pool-name replicapool --k8s-cluster-name rookstorage --run-as-user client.csi-rbd-node-rookstorage-replicapool
```

!!! note
An existing non-restricted user cannot be converted to a restricted user by upgrading.
The upgrade flag should only be used to append new permissions to users. It shouldn't be used to change the permissions already applied to a csi user. For example, you shouldn't change the pool(s) a user has access to.

### Admin privileges

If the cluster needs the admin keyring for configuration, update the admin key in the `rook-ceph-mon` secret with the client.admin keyring
@@ -305,3 +286,57 @@ you can export the settings from this cluster with the following steps.

!!! important
For other clusters to connect to storage in this cluster, Rook must be configured with a networking configuration that is accessible from other clusters. Most commonly this is done by enabling host networking in the CephCluster CR so the Ceph daemons will be addressable by their host IPs.
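For illustration, host networking is most commonly enabled through the `network` section of the CephCluster CR; a minimal sketch (the cluster name and namespace below are assumptions) might look like:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph          # assumed cluster name
  namespace: rook-ceph     # assumed namespace
spec:
  network:
    # Use host networking so the Ceph daemons are addressable by their host IPs
    provider: host
```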

## Upgrades

Upgrading the cluster differs depending on whether the consumer cluster has restricted or non-restricted caps:

1. If the consumer cluster doesn't have restricted caps, this command upgrades all the default CSI users (non-restricted):

```console
python3 create-external-cluster-resources.py --upgrade
```

2. If the consumer cluster has restricted caps:

Restricted users created with the `--restricted-auth-permission` flag must pass the mandatory flags `--rbd-data-pool-name` (if it is an RBD user), `--k8s-cluster-name`, and `--run-as-user` while upgrading. For CephFS users, if the `--cephfs-filesystem-name` flag was passed when creating the CSI users, it is also mandatory when upgrading. In this example the user would be `client.csi-rbd-node-rookstorage-replicapool` (following the pattern `csi-user-clusterName-poolName`)

```console
python3 create-external-cluster-resources.py --upgrade --rbd-data-pool-name replicapool --k8s-cluster-name rookstorage --run-as-user client.csi-rbd-node-rookstorage-replicapool
```

!!! note
1) An existing non-restricted user cannot be converted to a restricted user by upgrading.
2) The upgrade flag should only be used to append new permissions to users. It shouldn't be used to change the permissions already applied to a CSI user. For example, be careful not to change the pool(s) that a user has access to.
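Before or after upgrading, the caps currently applied to a CSI user can be checked with `ceph auth get` on the provider cluster (shown here with the example user from above):

```console
ceph auth get client.csi-rbd-node-rookstorage-replicapool
```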

### Upgrade the cluster to utilize a new feature

Some Rook upgrades may require re-running the import steps, or may introduce new external cluster features that can be most easily enabled by re-running the import steps.

To re-run the import steps with new options, re-run the python script using the same configuration options that were used for past invocations, plus the configurations that are being added or modified.

Starting with Rook v1.15, the script stores the configuration in the `external-cluster-user-command` ConfigMap for easy future reference.

* `args`: The exact arguments that were used when the script was run. Arguments are resolved with the priority: command-line args > config.ini file values > default values.

#### Example `external-cluster-user-command` ConfigMap:

1. Get the last-applied config, if it's available:

```console
$ kubectl get configmap --namespace rook-ceph external-cluster-user-command --output jsonpath='{.data.args}'
```

2. Copy the output to `config.ini`

3. Make any desired modifications and additions to `config.ini` (an illustrative sketch follows this list)

4. Run the python script again using the [config file](#config-file)

5. [Copy the bash output](#2-copy-the-bash-output)

6. Run the steps under [import-the-source-data](#import-the-source-data)
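For illustration only, the retrieved arguments copied into `config.ini` might look like the sketch below. It assumes the script's config file uses a `[Configurations]` section whose keys mirror the CLI flag names; the values shown are placeholders taken from the earlier upgrade example.

```ini
[Configurations]
; keys mirror the create-external-cluster-resources.py CLI flags (assumed)
format = bash
rbd-data-pool-name = replicapool
k8s-cluster-name = rookstorage
restricted-auth-permission = true
```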

!!! warning
If the last-applied config is unavailable, run the current version of the script again using the previously-applied config and CLI flags.
Failure to reuse the same configuration options when re-invoking the python script can result in unexpected changes when re-running the import script.
36 changes: 36 additions & 0 deletions Documentation/Troubleshooting/ceph-common-issues.md
@@ -67,6 +67,9 @@ title: Ceph Common Issues
- [Symptoms](#symptoms-11)
- [Investigation](#investigation-7)
- [Solution](#solution-12)
- [The cluster is in an unhealthy state or fails to configure when LimitNOFILE=infinity in containerd](#the-cluster-is-in-an-unhealthy-state-or-fails-to-configure-when-limitnofileinfinity-in-containerd)
- [Symptoms](#symptoms-12)
- [Solution](#solution-13)


Many of these problem cases are hard to summarize down to a short phrase that adequately describes the problem. Each problem will start with a bulleted list of symptoms. Keep in mind that all symptoms may not apply depending on the configuration of Rook. If the majority of the symptoms are seen there is a fair chance you are experiencing that problem.
@@ -774,3 +777,36 @@ data: {}
```
If the ConfigMap exists, remove any keys that you wish to configure through the environment.
## The cluster is in an unhealthy state or fails to configure when LimitNOFILE=infinity in containerd
### Symptoms
When trying to create a new deployment, Ceph mons keep crashing and the cluster fails to configure or remains in an unhealthy state. The nodes' CPUs are stuck at 100%.
```console
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL FSID
rook-ceph /var/lib/rook 3 4m6s Ready Failed to configure ceph cluster HEALTH_ERR
```

### Solution

Before systemd v240, systemd would leave `fs.nr_open` as-is because it had no mechanism to set a safe upper limit for it. The kernel's hard-coded default for the maximum number of open files is **1048576**. Starting with systemd v240, when `LimitNOFILE=infinity` is specified in the containerd.service configuration, this value is typically set to **~1073741816** (INT_MAX for x86_64 divided by two).

To fix this, set `LimitNOFILE` in the systemd service configuration to **1048576**.

Create an override.conf file with the new LimitNOFILE value:

```console
$ vim /etc/systemd/system/containerd.service.d/override.conf
[Service]
LimitNOFILE=1048576
```

Reload the systemd manager configuration, restart containerd, and restart all monitor deployments:

```console
$ systemctl daemon-reload
$ systemctl restart containerd
$ kubectl rollout restart deployment rook-ceph-mon-a rook-ceph-mon-b rook-ceph-mon-c -n rook-ceph
```
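To verify the new limit took effect (a quick check, not part of the original guide), inspect the unit property and the running containerd process; the output shown is what would be expected if the override was applied:

```console
$ systemctl show containerd --property LimitNOFILE
LimitNOFILE=1048576
$ grep "open files" /proc/$(pidof containerd)/limits
Max open files            1048576              1048576              files
```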
1 change: 1 addition & 0 deletions build/csv/ceph/ceph.rook.io_cephclusters.yaml
@@ -449,6 +449,7 @@ spec:
- ""
- crush-compat
- upmap
- read
- upmap-read
type: string
type: object
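With `read` now accepted by the CRD enum, a CephCluster spec enabling the balancer module could look like the following sketch (field layout assumed from the Rook mgr module settings; per the operator change further below, the `read` and `upmap-read` modes also require Ceph v19):

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  mgr:
    modules:
      - name: balancer
        enabled: true
        settings:
          # New mode added by this change; "upmap", "upmap-read" and "crush-compat" remain valid
          balancerMode: read
```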
1 change: 1 addition & 0 deletions deploy/charts/rook-ceph/templates/resources.yaml
@@ -1601,6 +1601,7 @@ spec:
- ""
- crush-compat
- upmap
- read
- upmap-read
type: string
type: object
1 change: 1 addition & 0 deletions deploy/examples/crds.yaml
@@ -1599,6 +1599,7 @@ spec:
- ""
- crush-compat
- upmap
- read
- upmap-read
type: string
type: object
6 changes: 3 additions & 3 deletions go.mod
@@ -16,7 +16,7 @@ replace (

require (
github.com/IBM/keyprotect-go-client v0.14.3
github.com/aws/aws-sdk-go v1.54.20
github.com/aws/aws-sdk-go v1.55.3
github.com/banzaicloud/k8s-objectmatcher v1.8.0
github.com/ceph/go-ceph v0.28.0
github.com/coreos/pkg v0.0.0-20230601102743-20bbbf26f4d8
@@ -30,8 +30,8 @@ require (
github.com/kube-object-storage/lib-bucket-provisioner v0.0.0-20221122204822-d1a8c34382f1
github.com/libopenstorage/secrets v0.0.0-20240416031220-a17cf7f72c6c
github.com/pkg/errors v0.9.1
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.75.1
github.com/prometheus-operator/prometheus-operator/pkg/client v0.75.1
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.75.2
github.com/prometheus-operator/prometheus-operator/pkg/client v0.75.2
github.com/rook/rook/pkg/apis v0.0.0-20231204200402-5287527732f7
github.com/spf13/cobra v1.8.1
github.com/spf13/pflag v1.0.5
12 changes: 6 additions & 6 deletions go.sum
@@ -144,8 +144,8 @@ github.com/armon/go-socks5 v0.0.0-20160902184237-e75332964ef5/go.mod h1:wHh0iHkY
github.com/asaskevich/govalidator v0.0.0-20180720115003-f9ffefc3facf/go.mod h1:lB+ZfQJz7igIIfQNfa7Ml4HSf2uFQQRzpGGRXenZAgY=
github.com/asaskevich/govalidator v0.0.0-20190424111038-f61b66f89f4a/go.mod h1:lB+ZfQJz7igIIfQNfa7Ml4HSf2uFQQRzpGGRXenZAgY=
github.com/aws/aws-sdk-go v1.44.164/go.mod h1:aVsgQcEevwlmQ7qHE9I3h+dtQgpqhFB+i8Phjh7fkwI=
github.com/aws/aws-sdk-go v1.54.20 h1:FZ2UcXya7bUkvkpf7TaPmiL7EubK0go1nlXGLRwEsoo=
github.com/aws/aws-sdk-go v1.54.20/go.mod h1:eRwEWoyTWFMVYVQzKMNHWP5/RV4xIUGMQfXQHfHkpNU=
github.com/aws/aws-sdk-go v1.55.3 h1:0B5hOX+mIx7I5XPOrjrHlKSDQV/+ypFZpIHOx5LOk3E=
github.com/aws/aws-sdk-go v1.55.3/go.mod h1:eRwEWoyTWFMVYVQzKMNHWP5/RV4xIUGMQfXQHfHkpNU=
github.com/banzaicloud/k8s-objectmatcher v1.8.0 h1:Nugn25elKtPMTA2br+JgHNeSQ04sc05MDPmpJnd1N2A=
github.com/banzaicloud/k8s-objectmatcher v1.8.0/go.mod h1:p2LSNAjlECf07fbhDyebTkPUIYnU05G+WfGgkTmgeMg=
github.com/benbjohnson/clock v1.1.0/go.mod h1:J11/hYXuz8f4ySSvYwY0FKfm+ezbsZBKZxNJlLklBHA=
@@ -770,11 +770,11 @@ github.com/prashantv/gostub v1.1.0 h1:BTyx3RfQjRHnUWaGF9oQos79AlQ5k8WNktv7VGvVH4
github.com/prashantv/gostub v1.1.0/go.mod h1:A5zLQHz7ieHGG7is6LLXLz7I8+3LZzsrV0P1IAHhP5U=
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.44.1/go.mod h1:3WYi4xqXxGGXWDdQIITnLNmuDzO5n6wYva9spVhR4fg=
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.46.0/go.mod h1:3WYi4xqXxGGXWDdQIITnLNmuDzO5n6wYva9spVhR4fg=
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.75.1 h1:+iiljhJV6niK7MuifJs/n3NeLxikd85nrQfn53sLJkU=
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.75.1/go.mod h1:XYrdZw5dW12Cjkt4ndbeNZZTBp4UCHtW0ccR9+sTtPU=
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.75.2 h1:6UsAv+jAevuGO2yZFU/BukV4o9NKnFMOuoouSA4G0ns=
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.75.2/go.mod h1:XYrdZw5dW12Cjkt4ndbeNZZTBp4UCHtW0ccR9+sTtPU=
github.com/prometheus-operator/prometheus-operator/pkg/client v0.46.0/go.mod h1:k4BrWlVQQsvBiTcDnKEMgyh/euRxyxgrHdur/ZX/sdA=
github.com/prometheus-operator/prometheus-operator/pkg/client v0.75.1 h1:s7GlsRYGLWP+L1eQKy6RmLatX+k3v9NQwutUix4l5uM=
github.com/prometheus-operator/prometheus-operator/pkg/client v0.75.1/go.mod h1:qca3qWGdknRpHvPyThepe5a6QYAh38IQ2ml93E6V3NY=
github.com/prometheus-operator/prometheus-operator/pkg/client v0.75.2 h1:71GOmhZFA2/17maXqCcuJEzpJDyqPty8SpEOGZWyVec=
github.com/prometheus-operator/prometheus-operator/pkg/client v0.75.2/go.mod h1:Sv6XsfGGkR9gKnhP92F5dNXEpsSePn0W+7JwYP0NVkc=
github.com/prometheus/client_golang v0.9.0/go.mod h1:7SWBe2y4D6OKWSNQJUaRYU/AaXPKyh/dDVn+NZz0KFw=
github.com/prometheus/client_golang v0.9.1/go.mod h1:7SWBe2y4D6OKWSNQJUaRYU/AaXPKyh/dDVn+NZz0KFw=
github.com/prometheus/client_golang v0.9.3/go.mod h1:/TN21ttK/J9q6uSwhBd54HahCDft0ttaMvbicHlPoso=
2 changes: 1 addition & 1 deletion pkg/apis/ceph.rook.io/v1/types.go
@@ -679,7 +679,7 @@ type Module struct {

type ModuleSettings struct {
// BalancerMode sets the `balancer` module with different modes like `upmap`, `crush-compat`, etc.
// +kubebuilder:validation:Enum="";crush-compat;upmap;upmap-read
// +kubebuilder:validation:Enum="";crush-compat;upmap;read;upmap-read
BalancerMode string `json:"balancerMode,omitempty"`
}

36 changes: 30 additions & 6 deletions pkg/daemon/ceph/client/mgr.go
@@ -22,12 +22,18 @@ import (

"github.com/pkg/errors"
"github.com/rook/rook/pkg/clusterd"
cephver "github.com/rook/rook/pkg/operator/ceph/version"
)

var (
moduleEnableWaitTime = 5 * time.Second
)

const (
readBalancerMode = "read"
upmapReadBalancerMode = "upmap-read"
)

func CephMgrMap(context *clusterd.Context, clusterInfo *ClusterInfo) (*MgrMap, error) {
args := []string{"mgr", "dump"}
buf, err := NewCephCommand(context, clusterInfo, args).Run()
@@ -132,12 +138,12 @@ func setBalancerMode(context *clusterd.Context, clusterInfo *ClusterInfo, mode s
return nil
}

// setMinCompatClientLuminous set the minimum compatibility for clients to Luminous
func setMinCompatClientLuminous(context *clusterd.Context, clusterInfo *ClusterInfo) error {
args := []string{"osd", "set-require-min-compat-client", "luminous", "--yes-i-really-mean-it"}
// setMinCompatClient set the minimum compatibility for clients
func setMinCompatClient(context *clusterd.Context, clusterInfo *ClusterInfo, version string) error {
args := []string{"osd", "set-require-min-compat-client", version, "--yes-i-really-mean-it"}
_, err := NewCephCommand(context, clusterInfo, args).Run()
if err != nil {
return errors.Wrap(err, "failed to set set-require-min-compat-client to luminous")
return errors.Wrapf(err, "failed to set set-require-min-compat-client to %q", version)
}

return nil
@@ -165,8 +171,12 @@ func mgrSetBalancerMode(context *clusterd.Context, clusterInfo *ClusterInfo, bal

// ConfigureBalancerModule configures the balancer module
func ConfigureBalancerModule(context *clusterd.Context, clusterInfo *ClusterInfo, balancerModuleMode string) error {
// Set min compat client to luminous before enabling the balancer mode "upmap"
err := setMinCompatClientLuminous(context, clusterInfo)
minCompatClientVersion, err := desiredMinCompatClientVersion(clusterInfo, balancerModuleMode)
if err != nil {
return errors.Wrap(err, "failed to get minimum compatibility client version")
}

err = setMinCompatClient(context, clusterInfo, minCompatClientVersion)
if err != nil {
return errors.Wrap(err, "failed to set minimum compatibility client")
}
@@ -179,3 +189,17 @@

return nil
}

func desiredMinCompatClientVersion(clusterInfo *ClusterInfo, balancerModuleMode string) (string, error) {
// Set min compat client to luminous before enabling the balancer mode "upmap"
minCompatClientVersion := "luminous"
if balancerModuleMode == readBalancerMode || balancerModuleMode == upmapReadBalancerMode {
if !clusterInfo.CephVersion.IsAtLeast(cephver.CephVersion{Major: 19}) {
return "", errors.New("minimum ceph v19 (Squid) is required for upmap-read or read balancer modes")
}
// Set min compat client to reef before enabling the balancer mode "upmap-read" or "read"
minCompatClientVersion = "reef"
}

return minCompatClientVersion, nil
}
35 changes: 35 additions & 0 deletions pkg/daemon/ceph/client/mgr_test.go
@@ -21,6 +21,7 @@ import (

"github.com/pkg/errors"
"github.com/rook/rook/pkg/clusterd"
cephver "github.com/rook/rook/pkg/operator/ceph/version"
exectest "github.com/rook/rook/pkg/util/exec/test"
"github.com/stretchr/testify/assert"
)
@@ -135,3 +136,37 @@
err := setBalancerMode(&clusterd.Context{Executor: executor}, AdminTestClusterInfo("mycluster"), "upmap")
assert.NoError(t, err)
}

func TestGetMinCompatClientVersion(t *testing.T) {
clusterInfo := AdminTestClusterInfo("mycluster")
t.Run("upmap-read balancer mode with ceph v19", func(t *testing.T) {
clusterInfo.CephVersion = cephver.CephVersion{Major: 19}
result, err := desiredMinCompatClientVersion(clusterInfo, upmapReadBalancerMode)
assert.NoError(t, err)
assert.Equal(t, "reef", result)
})

t.Run("read balancer mode with ceph v19", func(t *testing.T) {
clusterInfo.CephVersion = cephver.CephVersion{Major: 19}
result, err := desiredMinCompatClientVersion(clusterInfo, readBalancerMode)
assert.NoError(t, err)
assert.Equal(t, "reef", result)
})
t.Run("upmap-read balancer mode with ceph below v19 should fail", func(t *testing.T) {
clusterInfo.CephVersion = cephver.CephVersion{Major: 18}
_, err := desiredMinCompatClientVersion(clusterInfo, upmapReadBalancerMode)
assert.Error(t, err)
})
t.Run("read balancer mode with ceph below v19 should fail", func(t *testing.T) {
clusterInfo.CephVersion = cephver.CephVersion{Major: 18}
_, err := desiredMinCompatClientVersion(clusterInfo, readBalancerMode)
assert.Error(t, err)
})

t.Run("upmap balancer set min compat client to luminous", func(t *testing.T) {
clusterInfo.CephVersion = cephver.CephVersion{Major: 19}
result, err := desiredMinCompatClientVersion(clusterInfo, "upmap")
assert.NoError(t, err)
assert.Equal(t, "luminous", result)
})
}