Merge pull request #661 from red-hat-storage/sync_us--master
Syncing latest changes from upstream master for rook
travisn authored May 31, 2024
2 parents 1f93f50 + 9824dec commit bcb7530
Showing 11 changed files with 286 additions and 2 deletions.
3 changes: 3 additions & 0 deletions Documentation/CRDs/Cluster/ceph-cluster-crd.md
@@ -86,6 +86,9 @@ For more details on the mons and when to choose a number other than `3`, see the
* For non-PVCs: `placement.all` and `placement.osd`
* For PVCs: `placement.all` and inside the storageClassDeviceSets from the `placement` or `preparePlacement`
* `flappingRestartIntervalHours`: Defines how long (in hours) an OSD pod will sleep before restarting if it stopped due to flapping. Flapping occurs when OSDs are marked `down` by Ceph more than 5 times in 600 seconds. Flapping OSDs are kept down since they likely have a bad disk or another issue that needs investigation. If the underlying issue is fixed, the OSD pod can be restarted manually. The sleep is disabled if this interval is set to 0.
* `fullRatio`: The ratio at which Ceph should block IO if the OSDs are too full. The default is 0.95.
* `backfillFullRatio`: The ratio at which Ceph should stop backfilling data if the OSDs are too full. The default is 0.90.
* `nearFullRatio`: The ratio at which Ceph should raise a health warning if the cluster is almost full. The default is 0.85.
* `disruptionManagement`: The section for configuring management of daemon disruptions
* `managePodBudgets`: if `true`, the operator will create and manage PodDisruptionBudgets for OSD, Mon, RGW, and MDS daemons. OSD PDBs are managed dynamically via the strategy outlined in the [design](https://github.com/rook/rook/blob/master/design/ceph/ceph-managed-disruptionbudgets.md). The operator will block eviction of OSDs by default and unblock them safely when drains are detected.
* `osdMaintenanceTimeout`: is a duration in minutes that determines how long an entire failureDomain like `region/zone/host` will be held in `noout` (in addition to the default DOWN/OUT interval) when it is draining. The default value is `30` minutes.
36 changes: 36 additions & 0 deletions Documentation/CRDs/specification.md
@@ -12170,6 +12170,42 @@ User needs to manually restart the OSD pod if they manage to fix the underlying
The sleep will be disabled if this interval is set to 0.</p>
</td>
</tr>
<tr>
<td>
<code>fullRatio</code><br/>
<em>
float64
</em>
</td>
<td>
<em>(Optional)</em>
<p>FullRatio is the ratio at which the cluster is considered full and ceph will stop accepting writes. Default is 0.95.</p>
</td>
</tr>
<tr>
<td>
<code>nearFullRatio</code><br/>
<em>
float64
</em>
</td>
<td>
<em>(Optional)</em>
<p>NearFullRatio is the ratio at which the cluster is considered nearly full and will raise a ceph health warning. Default is 0.85.</p>
</td>
</tr>
<tr>
<td>
<code>backfillFullRatio</code><br/>
<em>
float64
</em>
</td>
<td>
<em>(Optional)</em>
<p>BackfillFullRatio is the ratio at which the cluster is too full for backfill. Backfill will be disabled if above this threshold. Default is 0.90.</p>
</td>
</tr>
</tbody>
</table>
<h3 id="ceph.rook.io/v1.StoreType">StoreType
15 changes: 15 additions & 0 deletions build/csv/ceph/ceph.rook.io_cephclusters.yaml
@@ -1502,6 +1502,11 @@ spec:
storage:
nullable: true
properties:
backfillFullRatio:
maximum: 1
minimum: 0
nullable: true
type: number
config:
additionalProperties:
type: string
@@ -1531,6 +1536,16 @@ spec:
x-kubernetes-preserve-unknown-fields: true
flappingRestartIntervalHours:
type: integer
fullRatio:
maximum: 1
minimum: 0
nullable: true
type: number
nearFullRatio:
maximum: 1
minimum: 0
nullable: true
type: number
nodes:
items:
properties:
18 changes: 18 additions & 0 deletions deploy/charts/rook-ceph/templates/resources.yaml
@@ -3152,6 +3152,12 @@ spec:
description: A spec for available storage in the cluster and how it should be used
nullable: true
properties:
backfillFullRatio:
description: BackfillFullRatio is the ratio at which the cluster is too full for backfill. Backfill will be disabled if above this threshold. Default is 0.90.
maximum: 1
minimum: 0
nullable: true
type: number
config:
additionalProperties:
type: string
@@ -3192,6 +3198,18 @@ spec:
User needs to manually restart the OSD pod if they manage to fix the underlying OSD flapping issue before the restart interval.
The sleep will be disabled if this interval is set to 0.
type: integer
fullRatio:
description: FullRatio is the ratio at which the cluster is considered full and ceph will stop accepting writes. Default is 0.95.
maximum: 1
minimum: 0
nullable: true
type: number
nearFullRatio:
description: NearFullRatio is the ratio at which the cluster is considered nearly full and will raise a ceph health warning. Default is 0.85.
maximum: 1
minimum: 0
nullable: true
type: number
nodes:
items:
description: Node is a storage nodes
6 changes: 6 additions & 0 deletions deploy/examples/cluster.yaml
@@ -272,6 +272,12 @@ spec:
onlyApplyOSDPlacement: false
# Time for which an OSD pod will sleep before restarting, if it stopped due to flapping
# flappingRestartIntervalHours: 24
# The ratio at which Ceph should block IO if the OSDs are too full. The default is 0.95.
# fullRatio: 0.95
# The ratio at which Ceph should stop backfilling data if the OSDs are too full. The default is 0.90.
# backfillFullRatio: 0.90
# The ratio at which Ceph should raise a health warning if the OSDs are almost full. The default is 0.85.
# nearFullRatio: 0.85
# The section for configuring management of daemon disruptions during upgrade or fencing.
disruptionManagement:
# If true, the operator will create and manage PodDisruptionBudgets for OSD, Mon, RGW, and MDS daemons. OSD PDBs are managed dynamically
18 changes: 18 additions & 0 deletions deploy/examples/crds.yaml
@@ -3150,6 +3150,12 @@ spec:
description: A spec for available storage in the cluster and how it should be used
nullable: true
properties:
backfillFullRatio:
description: BackfillFullRatio is the ratio at which the cluster is too full for backfill. Backfill will be disabled if above this threshold. Default is 0.90.
maximum: 1
minimum: 0
nullable: true
type: number
config:
additionalProperties:
type: string
@@ -3190,6 +3196,18 @@ spec:
User needs to manually restart the OSD pod if they manage to fix the underlying OSD flapping issue before the restart interval.
The sleep will be disabled if this interval is set to 0.
type: integer
fullRatio:
description: FullRatio is the ratio at which the cluster is considered full and ceph will stop accepting writes. Default is 0.95.
maximum: 1
minimum: 0
nullable: true
type: number
nearFullRatio:
description: NearFullRatio is the ratio at which the cluster is considered nearly full and will raise a ceph health warning. Default is 0.85.
maximum: 1
minimum: 0
nullable: true
type: number
nodes:
items:
description: Node is a storage nodes
18 changes: 18 additions & 0 deletions pkg/apis/ceph.rook.io/v1/types.go
@@ -2839,6 +2839,24 @@ type StorageScopeSpec struct {
// User needs to manually restart the OSD pod if they manage to fix the underlying OSD flapping issue before the restart interval.
// The sleep will be disabled if this interval is set to 0.
FlappingRestartIntervalHours int `json:"flappingRestartIntervalHours"`
// FullRatio is the ratio at which the cluster is considered full and ceph will stop accepting writes. Default is 0.95.
// +kubebuilder:validation:Minimum=0.0
// +kubebuilder:validation:Maximum=1.0
// +optional
// +nullable
FullRatio *float64 `json:"fullRatio,omitempty"`
// NearFullRatio is the ratio at which the cluster is considered nearly full and will raise a ceph health warning. Default is 0.85.
// +kubebuilder:validation:Minimum=0.0
// +kubebuilder:validation:Maximum=1.0
// +optional
// +nullable
NearFullRatio *float64 `json:"nearFullRatio,omitempty"`
// BackfillFullRatio is the ratio at which the cluster is too full for backfill. Backfill will be disabled if above this threshold. Default is 0.90.
// +kubebuilder:validation:Minimum=0.0
// +kubebuilder:validation:Maximum=1.0
// +optional
// +nullable
BackfillFullRatio *float64 `json:"backfillFullRatio,omitempty"`
}

// OSDStore is the backend storage type used for creating the OSDs
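For Go callers of this API, the following is a minimal sketch, not part of this change, showing how the new optional ratios might be set on a `cephv1.ClusterSpec`. The `float64Ptr` helper is hypothetical, and the import path assumes the usual `github.com/rook/rook` module layout; the point is that a nil field leaves the cluster's current ratio untouched, while an explicit value between 0 and 1 is reconciled by the operator.

```go
package main

import (
	"fmt"

	cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// float64Ptr is a hypothetical helper, not part of this change; pointer
// fields let an unset ratio (nil) be distinguished from an explicit value.
func float64Ptr(v float64) *float64 { return &v }

func main() {
	spec := cephv1.ClusterSpec{}
	// Only ratios that are set are pushed to Ceph; nil fields leave the
	// cluster's current values untouched.
	spec.Storage.FullRatio = float64Ptr(0.95)
	spec.Storage.BackfillFullRatio = float64Ptr(0.90)
	spec.Storage.NearFullRatio = float64Ptr(0.85)

	fmt.Printf("fullRatio=%.2f backfillFullRatio=%.2f nearFullRatio=%.2f\n",
		*spec.Storage.FullRatio, *spec.Storage.BackfillFullRatio, *spec.Storage.NearFullRatio)
}
```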
7 changes: 5 additions & 2 deletions pkg/daemon/ceph/client/osd.go
@@ -65,8 +65,11 @@ type OSDDump struct {
Up json.Number `json:"up"`
In json.Number `json:"in"`
} `json:"osds"`
Flags string `json:"flags"`
CrushNodeFlags map[string][]string `json:"crush_node_flags"`
Flags string `json:"flags"`
CrushNodeFlags map[string][]string `json:"crush_node_flags"`
FullRatio float64 `json:"full_ratio"`
BackfillFullRatio float64 `json:"backfillfull_ratio"`
NearFullRatio float64 `json:"nearfull_ratio"`
}

// IsFlagSet checks if an OSD flag is set
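The three ratio fields added to OSDDump are populated from the JSON output of `ceph osd dump`. As a self-contained illustration (using a trimmed local struct rather than Rook's OSDDump type), the sketch below unmarshals a sample of that output with the same JSON keys used in the struct tags above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// osdDumpRatios is a trimmed-down stand-in for Rook's OSDDump; the JSON tags
// match the keys in the `ceph osd dump` JSON output.
type osdDumpRatios struct {
	FullRatio         float64 `json:"full_ratio"`
	BackfillFullRatio float64 `json:"backfillfull_ratio"`
	NearFullRatio     float64 `json:"nearfull_ratio"`
}

func main() {
	// Sample `ceph osd dump` output, trimmed to the ratio keys.
	raw := []byte(`{"full_ratio": 0.95, "backfillfull_ratio": 0.90, "nearfull_ratio": 0.85}`)

	var dump osdDumpRatios
	if err := json.Unmarshal(raw, &dump); err != nil {
		panic(err)
	}
	fmt.Printf("full=%.2f backfillfull=%.2f nearfull=%.2f\n",
		dump.FullRatio, dump.BackfillFullRatio, dump.NearFullRatio)
}
```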
64 changes: 64 additions & 0 deletions pkg/operator/ceph/cluster/cluster.go
@@ -20,6 +20,7 @@ package cluster
import (
"context"
"fmt"
"math"
"os"
"os/exec"
"path"
@@ -474,6 +475,10 @@ func (c *cluster) postMonStartupActions() error {
return errors.Wrap(err, "")
}

if err := c.configureStorageSettings(); err != nil {
return errors.Wrap(err, "failed to configure storage settings")
}

crushRoot := client.GetCrushRootFromSpec(c.Spec)
if crushRoot != "default" {
// Remove the root=default and replicated_rule which are created by
@@ -492,6 +497,65 @@ func (c *cluster) postMonStartupActions() error {
return nil
}

func (c *cluster) configureStorageSettings() error {
if !c.shouldSetClusterFullSettings() {
return nil
}
osdDump, err := client.GetOSDDump(c.context, c.ClusterInfo)
if err != nil {
return errors.Wrap(err, "failed to get osd dump for setting cluster full settings")
}

if err := c.setClusterFullRatio("set-full-ratio", c.Spec.Storage.FullRatio, osdDump.FullRatio); err != nil {
return err
}

if err := c.setClusterFullRatio("set-backfillfull-ratio", c.Spec.Storage.BackfillFullRatio, osdDump.BackfillFullRatio); err != nil {
return err
}

if err := c.setClusterFullRatio("set-nearfull-ratio", c.Spec.Storage.NearFullRatio, osdDump.NearFullRatio); err != nil {
return err
}

return nil
}

func (c *cluster) setClusterFullRatio(ratioCommand string, desiredRatio *float64, actualRatio float64) error {
if !shouldUpdateFloatSetting(desiredRatio, actualRatio) {
if desiredRatio != nil {
logger.Infof("desired value %s=%.2f is already set", ratioCommand, *desiredRatio)
}
return nil
}
desiredStringVal := fmt.Sprintf("%.2f", *desiredRatio)
logger.Infof("updating %s from %.2f to %s", ratioCommand, actualRatio, desiredStringVal)
args := []string{"osd", ratioCommand, desiredStringVal}
cephCmd := client.NewCephCommand(c.context, c.ClusterInfo, args)
output, err := cephCmd.Run()
if err != nil {
return errors.Wrapf(err, "failed to update %s to %q. %s", ratioCommand, desiredStringVal, output)
}
return nil
}

func shouldUpdateFloatSetting(desired *float64, actual float64) bool {
if desired == nil {
return false
}
if *desired == actual {
return false
}
if actual != 0 && math.Abs(*desired-actual)/actual > 0.01 {
return true
}
return false
}

func (c *cluster) shouldSetClusterFullSettings() bool {
return c.Spec.Storage.FullRatio != nil || c.Spec.Storage.BackfillFullRatio != nil || c.Spec.Storage.NearFullRatio != nil
}

func (c *cluster) updateConfigStoreFromCRD() error {
monStore := config.GetMonStore(c.context, c.ClusterInfo)
return monStore.SetAllMultiple(c.Spec.CephConfig)
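One detail worth noting in `shouldUpdateFloatSetting` above: a ratio is only re-applied when the desired value differs from the value reported by `ceph osd dump` by more than roughly 1% (relative), so the operator does not re-issue the set-ratio commands for rounding noise. The standalone sketch below mirrors that check to show which inputs trigger an update; it is an illustration, not the operator's code.

```go
package main

import (
	"fmt"
	"math"
)

// shouldUpdate mirrors the logic of shouldUpdateFloatSetting above: update only
// when a value was requested, it is not equal to the current value, and the
// relative difference exceeds 1%. When the reported value is 0 the relative
// check cannot be evaluated, so no update is issued.
func shouldUpdate(desired *float64, actual float64) bool {
	if desired == nil || *desired == actual {
		return false
	}
	return actual != 0 && math.Abs(*desired-actual)/actual > 0.01
}

func main() {
	actual := 0.95 // e.g. the full_ratio reported by the cluster
	for _, d := range []float64{0.95, 0.951, 0.90} {
		desired := d
		fmt.Printf("desired=%.3f actual=%.2f -> update=%v\n", desired, actual, shouldUpdate(&desired, actual))
	}
	// Prints: 0.950 -> false (equal), 0.951 -> false (within 1%), 0.900 -> true
}
```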
95 changes: 95 additions & 0 deletions pkg/operator/ceph/cluster/cluster_test.go
@@ -335,3 +335,98 @@ func TestTelemetry(t *testing.T) {
c.reportTelemetry()
})
}
func TestClusterFullSettings(t *testing.T) {
actualFullRatio := 0.95
actualBackfillFullRatio := 0.90
actualNearFullRatio := 0.85
setFullRatio := false
setBackfillFullRatio := false
setNearFullRatio := false
clientset := testop.New(t, 1)
context := &clusterd.Context{Clientset: clientset}
c := cluster{
context: context,
ClusterInfo: cephclient.AdminTestClusterInfo("cluster"),
Spec: &cephv1.ClusterSpec{},
}
context.Executor = &exectest.MockExecutor{
MockExecuteCommandWithOutput: func(command string, args ...string) (string, error) {
logger.Infof("Command: %s %v", command, args)
if args[0] == "osd" {
if args[1] == "dump" {
return fmt.Sprintf(
`{ "full_ratio": %.2f,
"backfillfull_ratio": %.2f,
"nearfull_ratio": %.2f}`, actualFullRatio, actualBackfillFullRatio, actualNearFullRatio), nil
}
if args[1] == "set-full-ratio" {
assert.Equal(t, fmt.Sprintf("%.2f", *c.Spec.Storage.FullRatio), args[2])
setFullRatio = true
return "", nil
}
if args[1] == "set-nearfull-ratio" {
assert.Equal(t, fmt.Sprintf("%.2f", *c.Spec.Storage.NearFullRatio), args[2])
setNearFullRatio = true
return "", nil
}
if args[1] == "set-backfillfull-ratio" {
assert.Equal(t, fmt.Sprintf("%.2f", *c.Spec.Storage.BackfillFullRatio), args[2])
setBackfillFullRatio = true
return "", nil
}
}
return "", errors.New("mock error to simulate failure of mon store config")
},
}
t.Run("no settings", func(t *testing.T) {
err := c.configureStorageSettings()
assert.NoError(t, err)
assert.False(t, setFullRatio)
assert.False(t, setNearFullRatio)
assert.False(t, setBackfillFullRatio)
})

val91 := 0.91
val90 := 0.90
val85 := 0.85
val80 := 0.80

t.Run("all settings applied", func(t *testing.T) {
c.Spec.Storage.FullRatio = &val90
c.Spec.Storage.NearFullRatio = &val80
c.Spec.Storage.BackfillFullRatio = &val85
err := c.configureStorageSettings()
assert.NoError(t, err)
assert.True(t, setFullRatio)
assert.True(t, setNearFullRatio)
assert.True(t, setBackfillFullRatio)
})

t.Run("no settings changed", func(t *testing.T) {
setFullRatio = false
setBackfillFullRatio = false
setNearFullRatio = false
c.Spec.Storage.FullRatio = &actualFullRatio
c.Spec.Storage.NearFullRatio = &actualNearFullRatio
c.Spec.Storage.BackfillFullRatio = &actualBackfillFullRatio
err := c.configureStorageSettings()
assert.NoError(t, err)
assert.False(t, setFullRatio)
assert.False(t, setNearFullRatio)
assert.False(t, setBackfillFullRatio)
})

t.Run("one setting applied", func(t *testing.T) {
setFullRatio = false
setBackfillFullRatio = false
setNearFullRatio = false
c.Spec.Storage.FullRatio = &val91
c.Spec.Storage.NearFullRatio = nil
c.Spec.Storage.BackfillFullRatio = nil
err := c.configureStorageSettings()
assert.NoError(t, err)
assert.True(t, setFullRatio)
assert.False(t, setNearFullRatio)
assert.False(t, setBackfillFullRatio)
})
}
8 changes: 8 additions & 0 deletions tests/framework/installer/ceph_manifests.go
@@ -238,6 +238,14 @@ spec:
config:
databaseSizeMB: "1024"
`
// Append the storage settings if it's not an upgrade from 1.13 where the settings do not exist
if m.settings.RookVersion != Version1_13 {
clusterSpec += `
fullRatio: 0.96
backfillFullRatio: 0.91
nearFullRatio: 0.88
`
}
}

if m.settings.ConnectionsEncrypted {
