Custom taints and toleration node operation #9920
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: vkathole. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py (outdated review threads, resolved)
ocs_ci/ocs/resources/pod.py (outdated)
```python
        toleration_key (str): The toleration key to check
    """
    sub_list = ocp.get_all_resource_names_of_a_kind(kind=constants.SUBSCRIPTION)
```
You are getting all the resource names of a particular kind and then instantiating the ocp class to get each object. Instead, you can directly get the list of subscription objects:

```python
sub_obj_list = ocp.OCP(namespace=config.ENV_DATA["cluster_namespace"], kind=constants.SUBSCRIPTION).get()
```
""" | ||
|
||
logger.info("Taint all nodes with custom taint") | ||
ocs_nodes = get_worker_nodes() |
This function returns all the worker nodes; should we taint only the nodes where ODF pods are running?
You can use get_ocs_nodes(), which returns only the OCS nodes.
tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py (outdated review thread, resolved)
Signed-off-by: Vishakha Kathole <[email protected]>
Force-pushed 273c9cf to ee87c8d
Signed-off-by: vkathole <[email protected]>
PR validation
Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master
Job UNSTABLE (some or all tests failed).
Signed-off-by: vkathole <[email protected]>
PR validation on existing cluster
Cluster Name: vkathole-t26
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master
Job UNSTABLE (some or all tests failed).
Signed-off-by: vkathole <[email protected]>
PR validation
Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master
Job UNSTABLE (some or all tests failed).
Signed-off-by: vkathole <[email protected]>
PR validation
Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master
Job UNSTABLE (some or all tests failed).
PR validation on existing cluster
Cluster Name: vkathole-o1
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master
Job UNSTABLE (some or all tests failed).
Signed-off-by: vkathole <[email protected]>
PR validation on existing cluster
Cluster Name: vkathole-o1
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master
Job UNSTABLE (some or all tests failed).
Signed-off-by: vkathole <[email protected]>
PR validation
Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master
Job UNSTABLE (some or all tests failed).
```
@@ -255,3 +299,145 @@ def test_non_ocs_taint_and_tolerations(self):
            resource_count=count * replica_count,
        ), "New OSDs failed to reach running state"
        check_ceph_health_after_add_capacity(ceph_rebalance_timeout=2500)

        # Reboot one of the nodes
```
You are performing 4 admin operations in the same test. Should we consider moving the newly added operations to a separate test function? The repetitive code for setting the non-OCS taints can be moved to a common function and called from the tests.
Please take input from the Brown squad as well.
+1
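The shared helper suggested above could be sketched as a pure function that builds the taint patch body once, so each test calls it instead of repeating the patch string. This is a hypothetical sketch, not ocs-ci API: the helper name and its defaults (mirroring the `xyz`/`NoSchedule` taint used in this PR) are assumptions.

```python
import json


def build_custom_taint_params(taint_key="xyz", effect="NoSchedule", value="true"):
    """Build the node-spec patch body that applies a custom taint.

    Hypothetical helper: the key/effect/value defaults mirror the custom
    taint used in this PR, but this function is not part of ocs-ci.
    """
    taint = {
        "effect": effect,
        "key": taint_key,
        "operator": "Equal",
        "value": value,
    }
    # Nested under spec.taints, matching the patch shape quoted in the diff.
    return json.dumps({"spec": {"taints": [taint]}})
```

Each test would then pass `build_custom_taint_params()` to the node patch call it already uses, and a negative test could override `taint_key`.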
```python
if is_vsphere_ipi_cluster():
    nodes.restart_nodes(nodes=node, wait=False)
    node_names = [n.name for n in node]
    wait_for_nodes_status(node_names, constants.STATUS_READY, timeout=420)
else:
```
IIUC, the vSphere node-related API would be the same for both IPI and UPI. If that is the case, you really don't need to handle the vSphere IPI case here.
```python
else:
    nodes.restart_nodes_by_stop_and_start(nodes=node)

# Wait some time after rebooting master
```
You are rebooting a worker; please correct that in the comment statement.
```
@@ -255,3 +299,145 @@ def test_non_ocs_taint_and_tolerations(self):
            resource_count=count * replica_count,
        ), "New OSDs failed to reach running state"
        check_ceph_health_after_add_capacity(ceph_rebalance_timeout=2500)

        # Reboot one of the nodes
        node = get_nodes("worker", num_of_nodes=1)
```
Better to get the OCS nodes; there is a chance that a non-OCS node gets picked. Also, I would suggest selecting a node randomly instead of selecting the same node every time.
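The random selection asked for above needs only the standard library. In this sketch the list literal merely stands in for the return value of the `get_ocs_nodes()` helper named elsewhere in this review; the node names are illustrative.

```python
import random

# Stand-in for get_ocs_nodes(); in the test this would be the real OCS node list.
ocs_node_names = ["compute-0", "compute-1", "compute-2"]

# Pick one OCS node at random instead of always rebooting the first worker.
selected = random.choice(ocs_node_names)
```

This avoids both concerns at once: only OCS nodes are candidates, and repeated runs exercise different nodes.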
```python
logger.info(f"Waiting {waiting_time} seconds.")
time.sleep(waiting_time)
```
You are waiting for cluster connectivity below; do we still need this sleep?
""" | ||
|
||
logger.info("Taint all nodes with custom taint") | ||
ocs_nodes = get_worker_nodes() |
You can use get_ocs_nodes(), which returns only the OCS nodes.
```python
    '[{"effect": "NoSchedule", "key": "xyz", "operator": "Equal", '
    '"value": "true"}]}}}'
)
# Select one subscription other than odf subscription
```
Please add an info log message describing the step the test is performing here.
```python
for sub in sub_list:
    if sub != constants.ODF_SUBSCRIPTION:
        selected_sub = sub
        break
```
Add a log message recording the selected subscription other than ODF.
What subscription can we expect here, other than ODF?
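Taken together, the two comments above amount to: pick the first subscription whose name differs from the ODF one and log which one was chosen. A minimal sketch with plain strings; the subscription names and the ODF constant value here are illustrative stand-ins, since the real list comes from the OCP query.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

ODF_SUBSCRIPTION = "odf-operator"  # stands in for constants.ODF_SUBSCRIPTION
# Illustrative names; the real list is returned by the subscription query.
sub_list = ["odf-operator", "mcg-operator", "ocs-operator"]

selected_sub = None
for sub in sub_list:
    if sub != ODF_SUBSCRIPTION:
        selected_sub = sub
        break
logger.info(f"Selected subscription other than ODF: {selected_sub}")
```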
```python
    check_toleration_on_pods(toleration_key="xyz")
    raise AssertionError("Toleration was found, but it should not exist.")
except TolerationNotFoundException:
    pass
```
You can add a log message here replacing pass, and likewise at line 423.
```python
params = '[{"op": "remove", "path": "/spec/placement"},]'
storagecluster_obj.patch(params=params, format_type="json")

logger.info("Remove tolerations to the subscription")
```
```suggestion
logger.info("Remove tolerations from the subscriptions")
```
```python
time.sleep(180)
assert wait_for_pods_to_be_running(
    timeout=900, sleep=15
), "some of the pods didn't came up running"
```
```suggestion
), "Few pods failed to reach the desired running state"
```
```python
if "config" in subscription_data.get("spec", {}):
    params = '[{"op": "remove", "path": "/spec/config"}]'
    sub_obj.patch(resource_name=sub, params=params, format_type="json")
time.sleep(180)
```
Are we not supposed to remove the tolerations from the rook-ceph operator ConfigMap and from ocsinitializations.ocs.openshift.io too?
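The removal in the snippet above is a JSON Patch (RFC 6902) `remove` operation. A small standard-library sketch of building and sanity-checking such a patch body; the path is the one from the quoted code, and applying it would still go through the resource's `.patch(..., format_type="json")` call as in the test.

```python
import json

# JSON Patch (RFC 6902) operation that drops the injected config block
# from the subscription spec.
params = json.dumps([{"op": "remove", "path": "/spec/config"}])

# Round-trip to confirm the patch body is well-formed before sending it.
ops = json.loads(params)
```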
```
@@ -255,3 +299,145 @@ def test_non_ocs_taint_and_tolerations(self):
            resource_count=count * replica_count,
        ), "New OSDs failed to reach running state"
        check_ceph_health_after_add_capacity(ceph_rebalance_timeout=2500)

        # Reboot one of the nodes
```
+1
```python
def test_negative_custom_taint(self, nodes):
    """
    Test runs the following steps
    1. Taint odf nodes with non-ocs taint
```
```suggestion
    1. Taint odf worker nodes with non-ocs taint
```
```python
assert not wait_for_pods_to_be_running(
    timeout=120, sleep=15
), "Pods are running when they should not be."
```
Are we expecting all pods to go into a bad state?
I see we apply tolerations on the storagecluster and on a subscription other than ODF. Are we sure all pods will not be running if the toleration is just not applied properly on the subscription while we set it properly on the storagecluster? Please check the scenario again: if we set the toleration properly on the storagecluster, a few pods should be up and running.
```python
), "Pods are running when they should not be."

logger.info(
    "Check custom toleration on all newly created pods under openshift-storage"
```
```suggestion
    "Validate custom toleration not found on all newly created pods in openshift-storage"
```
No description provided.