Bug 2315624: Fix mds liveness probe and mon failover #746

Merged
merged 2 commits into from
Oct 8, 2024

Conversation

parth-gr
Member

@parth-gr parth-gr commented Oct 8, 2024

  1. A timeout of the MDS liveness probe should not fail the probe. If a
    network partition or another issue makes the Ceph mon cluster
    unstable, ceph ... commands can hang until the probe times out.
    In that case the MDS should not be restarted, to avoid causing
    cascading cluster disruption.

  2. If a mon failover is in progress, skip the removal of an
    extra mon deployment, because in that code path the mon
    list contains only the newly started mon. In that case the
    extra-mon removal was erroneously removing a random mon;
    the failover then completed immediately afterwards and
    removed the expected failed mon, potentially causing
    mon quorum loss.
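
The guard described in item 2 can be sketched roughly as below. This is not the operator's actual code: `extraMonToRemove`, its parameters, and the mon names are hypothetical and only illustrate why a one-entry mon list makes every other deployment look "extra" while a failover is running.

```go
package main

// extraMonToRemove returns the name of a mon deployment that is not in the
// expected set, or "" when nothing should be removed.
//
// Hypothetical sketch: the names and signature are illustrative assumptions,
// not taken from the operator's code base.
func extraMonToRemove(deployed []string, expected map[string]bool, failoverInProgress bool) string {
	// During a failover the expected set may hold only the newly started mon,
	// so any healthy mon would look "extra". Skip removal entirely.
	if failoverInProgress {
		return ""
	}
	for _, name := range deployed {
		if !expected[name] {
			return name
		}
	}
	return ""
}
```

With deployments `a`, `b`, `c`, `d`, an expected set containing only `d`, and `failoverInProgress` set to true, the function returns an empty string, so no healthy mon is removed before the failover itself removes the failed one.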

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

BlaineEXE and others added 2 commits October 8, 2024 10:52
A timeout of the MDS liveness probe should not fail the probe. If a
network partition or another issue makes the Ceph mon cluster
unstable, `ceph ...` commands can hang until the probe times out.
In that case the MDS should not be restarted, to avoid causing
cascading cluster disruption.

Signed-off-by: Blaine Gardner <[email protected]>
(cherry picked from commit ad1bae9)

If a mon failover is in progress, skip the removal of an
extra mon deployment, because in that code path the mon
list contains only the newly started mon. In that case the
extra-mon removal was erroneously removing a random mon;
the failover then completed immediately afterwards and
removed the expected failed mon, potentially causing
mon quorum loss.

Signed-off-by: Travis Nielsen <[email protected]>
(cherry picked from commit e2cadab)
@openshift-ci openshift-ci bot added labels Oct 8, 2024:
  • bugzilla/severity-urgent: Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting.
  • bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.

openshift-ci bot commented Oct 8, 2024

@parth-gr: This pull request references Bugzilla bug 2315624, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (ODF 4.17.0) matches configured target release for branch (ODF 4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @nehaberry

In response to this:

Bug 2315624: Fix mds liveness probe and mon failover

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci bot commented Oct 8, 2024

@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: nehaberry.

Note that only red-hat-storage members and repo collaborators can review this PR, and authors cannot review their own PRs.

@agarwal-mudit agarwal-mudit added labels Oct 8, 2024:
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • lgtm: Indicates that a PR is ready to be merged.

openshift-ci bot commented Oct 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: parth-gr, sp98

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sp98 sp98 merged commit a948c8d into red-hat-storage:release-4.17 Oct 8, 2024
50 of 51 checks passed

openshift-ci bot commented Oct 8, 2024

@parth-gr: All pull requests linked via external trackers have merged:

Bugzilla bug 2315624 has been moved to the MODIFIED state.

In response to this:

Bug 2315624: Fix mds liveness probe and mon failover


Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • bugzilla/severity-urgent: Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting.
  • bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
  • lgtm: Indicates that a PR is ready to be merged.

5 participants