Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom taints and toleration node operation #9920

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

vkathole
Copy link
Contributor

@vkathole vkathole commented Jun 7, 2024

No description provided.

@vkathole vkathole requested review from a team as code owners June 7, 2024 10:27
@pull-request-size pull-request-size bot added the size/L PR that changes 100-499 lines label Jun 7, 2024
@vkathole vkathole added team/e2e E2E team related issues/PRs and removed size/L PR that changes 100-499 lines labels Jun 7, 2024
Copy link

openshift-ci bot commented Jun 7, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vkathole

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

toleration_key (str): The toleration key to check
"""

sub_list = ocp.get_all_resource_names_of_a_kind(kind=constants.SUBSCRIPTION)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are getting all the resource names of a particular kind and getting its obj instantiating ocp class. Instead, you can directly get all the subscription obj list as below ,
sub_obj_list = ocp.OCP(namespace=config.ENV_DATA["cluster_namespace"], kind=constants.SUBSCRIPTION).get()

ocs_ci/ocs/resources/pod.py Outdated Show resolved Hide resolved
ocs_ci/ocs/resources/pod.py Outdated Show resolved Hide resolved
ocs_ci/ocs/resources/pod.py Outdated Show resolved Hide resolved
"""

logger.info("Taint all nodes with custom taint")
ocs_nodes = get_worker_nodes()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fun returns all the worker nodes, should we taint only on odf pod running nodes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use get_ocs_nodes() which will return the ocs nodes alone

@pull-request-size pull-request-size bot added the size/M PR that changes 30-99 lines label Sep 19, 2024
Signed-off-by: vkathole <[email protected]>
Signed-off-by: vkathole <[email protected]>
Copy link

@ocs-ci ocs-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR validation

Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

Signed-off-by: vkathole <[email protected]>
@pull-request-size pull-request-size bot added size/L PR that changes 100-499 lines and removed size/M PR that changes 30-99 lines labels Sep 26, 2024
Copy link

@ocs-ci ocs-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR validation on existing cluster

Cluster Name: vkathole-t26
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

Signed-off-by: vkathole <[email protected]>
Copy link

@ocs-ci ocs-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR validation

Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

Signed-off-by: vkathole <[email protected]>
Copy link

@ocs-ci ocs-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR validation

Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

Copy link

@ocs-ci ocs-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR validation on existing cluster

Cluster Name: vkathole-o1
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

Signed-off-by: vkathole <[email protected]>
Copy link

@ocs-ci ocs-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR validation on existing cluster

Cluster Name: vkathole-o1
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

Signed-off-by: vkathole <[email protected]>
Copy link

@ocs-ci ocs-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR validation

Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

@@ -255,3 +299,145 @@ def test_non_ocs_taint_and_tolerations(self):
resource_count=count * replica_count,
), "New OSDs failed to reach running state"
check_ceph_health_after_add_capacity(ceph_rebalance_timeout=2500)

# Reboot one of the nodes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are performing 4 admin operations in the same test. should we consider moving the newly added operations to a separate test function ? The repetitive code for setting the non ocs taints can be moved to a common function and called in the tests.

Please take input from Brown squad as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines +305 to +309
if is_vsphere_ipi_cluster():
nodes.restart_nodes(nodes=node, wait=False)
node_names = [n.name for n in node]
wait_for_nodes_status(node_names, constants.STATUS_READY, timeout=420)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the vSPhere node related API would be same for both IPI and UPI. If that is the case, you really don't need to handle vpshere ipi case here

else:
nodes.restart_nodes_by_stop_and_start(nodes=node)

# Wait some time after rebooting master
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are rebooting worker, please correct it in the comment statment

@@ -255,3 +299,145 @@ def test_non_ocs_taint_and_tolerations(self):
resource_count=count * replica_count,
), "New OSDs failed to reach running state"
check_ceph_health_after_add_capacity(ceph_rebalance_timeout=2500)

# Reboot one of the nodes
node = get_nodes("worker", num_of_nodes=1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to get the ocs_nodes. There are chances that a non ocs node can get picked. Also, I would suggest selecting randomly instead of selecting the same node everytime

Comment on lines +314 to +315
logger.info(f"Waiting {waiting_time} seconds.")
time.sleep(waiting_time)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are waiting for the cluster connectivity below, do we still need to sleep?

"""

logger.info("Taint all nodes with custom taint")
ocs_nodes = get_worker_nodes()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use get_ocs_nodes() which will return the ocs nodes alone

'[{"effect": "NoSchedule", "key": "xyz", "operator": "Equal", '
'"value": "true"}]}}}'
)
# Select one subscription other than odf subscription
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a log info message on the step that the test is doing here

for sub in sub_list:
if sub != constants.ODF_SUBSCRIPTION:
selected_sub = sub
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a message logging the selected subscription other than odf

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what subscription can we expect here? other than odf

check_toleration_on_pods(toleration_key="xyz")
raise AssertionError("Toleration was found, but it should not exist.")
except TolerationNotFoundException:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can add a log message here replacing pass and in line 423

params = '[{"op": "remove", "path": "/spec/placement"},]'
storagecluster_obj.patch(params=params, format_type="json")

logger.info("Remove tolerations to the subscription")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.info("Remove tolerations to the subscription")
logger.info("Remove tolerations from the subscriptions")

time.sleep(180)
assert wait_for_pods_to_be_running(
timeout=900, sleep=15
), "some of the pods didn't came up running"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
), "some of the pods didn't came up running"
), "Few pods failed to reach the desired running state"

if "config" in subscription_data.get("spec", {}):
params = '[{"op": "remove", "path": "/spec/config"}]'
sub_obj.patch(resource_name=sub, params=params, format_type="json")
time.sleep(180)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we not supposed to remove the tolerations from the rook-ceph operator configmap and ocsinitializations.ocs.openshift.io too ??

@@ -255,3 +299,145 @@ def test_non_ocs_taint_and_tolerations(self):
resource_count=count * replica_count,
), "New OSDs failed to reach running state"
check_ceph_health_after_add_capacity(ceph_rebalance_timeout=2500)

# Reboot one of the nodes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

def test_negative_custom_taint(self, nodes):
"""
Test runs the following steps
1. Taint odf nodes with non-ocs taint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1. Taint odf nodes with non-ocs taint
1. Taint odf worker nodes with non-ocs taint


assert not wait_for_pods_to_be_running(
timeout=120, sleep=15
), "Pods are running when they should not be."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we expecting all pods to go in a bad state ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we apply tolerations on storagecluster and subscription other than ODF, are we sure all pods will not be running if the toleration is just not applied properly on sub when we are setting it properly on storagecluster ? Please check the scenario again. if we are setting the toleration properly on storagecluster few pods should be up and running.

), "Pods are running when they should not be."

logger.info(
"Check custom toleration on all newly created pods under openshift-storage"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Check custom toleration on all newly created pods under openshift-storage"
"Validate custom toleration not found on all newly created pods in openshift-storage"

@Shrivaibavi Shrivaibavi requested review from a team and removed request for a team October 10, 2024 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/L PR that changes 100-499 lines team/e2e E2E team related issues/PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants