Ironic rebalance (victoria) #6

Open
wants to merge 63 commits into base: stackhpc/victora
Changes from 5 commits
63 commits
b87ced4
libvirt: make mdev types name attribute be optional
sbauza Sep 23, 2020
cbf44b7
Add functional regression test for bug 1853009
markgoddard Nov 19, 2019
730071f
Clear rebalanced compute nodes from resource tracker
stephenfin Apr 28, 2021
4f1e4b3
Invalidate provider tree when compute node disappears
markgoddard Nov 19, 2019
054b08e
Prevent deletion of a compute node belonging to another host
markgoddard Nov 18, 2019
03fe127
Fix inactive session error in compute node creation
markgoddard Nov 20, 2019
0354d4d
Reproduce bug 1897528
Oct 8, 2020
90ffc55
Ignore PCI devices with 32bit domain
Oct 8, 2020
6b70350
Reject open redirection in the console proxy
melwitt May 13, 2021
0405709
rbd: Get rbd_utils unit tests running again
melwitt May 10, 2021
2af08fb
zuul: Replace grenade and nova-grenade-multinode with grenade-multinode
lyarwood Mar 5, 2021
f20346b
Honor [neutron]http_retries in the manual client
melwitt May 28, 2021
e3085fa
Initialize global data separately and run_once in WSGI app init
4383 Jun 4, 2020
6ede6df
Error anti-affinity violation on migrations
rodrigogansobarbieri Mar 31, 2021
e553082
Merge "Reproduce bug 1897528" into stable/victoria
Jun 17, 2021
78a63c2
Merge "Ignore PCI devices with 32bit domain" into stable/victoria
Jun 17, 2021
6305ae4
Allow X-OpenStack-Nova-API-Version header in CORS
mnaser Jun 14, 2021
b7677ae
Move 'check-cherry-picks' test to gate, n-v check
stephenfin Jun 16, 2021
f7d84db
[neutron] Get only ID and name of the SGs from Neutron
slawqo Mar 26, 2021
4bb2338
Merge "rbd: Get rbd_utils unit tests running again" into stable/victoria
Jun 25, 2021
66fb0ec
Merge "Initialize global data separately and run_once in WSGI app ini…
Jun 25, 2021
bec6dd4
Stop leaking ceph df cmd in RBD utils
tobias-urdin May 3, 2021
cf6288b
Merge "Honor [neutron]http_retries in the manual client" into stable/…
Jul 8, 2021
4135970
Merge "Stop leaking ceph df cmd in RBD utils" into stable/victoria
Jul 9, 2021
b2fd01f
Merge "Reject open redirection in the console proxy" into stable/vict…
Jul 26, 2021
9efdd0b
Reproducer unit test for bug 1860312
notartom Jul 28, 2021
e238cc9
Allow deletion of compute service with no compute nodes
notartom Jul 19, 2021
94e265f
Reduce mocking in test_reject_open_redirect for compat
melwitt Jul 31, 2021
4ce01d6
Merge "Move 'check-cherry-picks' test to gate, n-v check" into stable…
Aug 3, 2021
7dbceec
Fix request path to query a resource provider by uuid
kajinamit Jul 15, 2021
1eceeeb
Avoid modifying the Mock class in test
Aug 23, 2021
aaa5624
Fix 1vcpu error with multiqueue and vif_type=tap
rodrigogansobarbieri Aug 11, 2021
14277ac
Merge "Fix 1vcpu error with multiqueue and vif_type=tap" into stable/…
Aug 31, 2021
66ccea7
Merge "Reproducer unit test for bug 1860312" into stable/victoria
Sep 1, 2021
85c3c22
Merge "Reduce mocking in test_reject_open_redirect for compat" into s…
Sep 1, 2021
f06c1aa
Merge "Fix request path to query a resource provider by uuid" into st…
Sep 3, 2021
816c3e9
Merge "Allow X-OpenStack-Nova-API-Version header in CORS" into stable…
Sep 4, 2021
e9b6077
Merge "Allow deletion of compute service with no compute nodes" into …
Sep 5, 2021
9588cdb
address open redirect with 3 forward slashes
SeanMooney Aug 23, 2021
7f00f7b
Merge "address open redirect with 3 forward slashes" into stable/vict…
Sep 16, 2021
0b1fa9b
Reproduce bug 1944759
Sep 23, 2021
34e0c02
Store old_flavor already on source host during resize
Sep 24, 2021
c531fdc
Add a WA flag waiting for vif-plugged event during reboot
Oct 11, 2021
28d0059
Ensure MAC addresses characters are in the same case
Sep 30, 2021
e549fec
Reproduce bug 1953359
Dec 6, 2021
8d44874
Extend the reproducer for 1953359 and 1952915
Dec 6, 2021
d54bd31
[rt] Apply migration context for incoming migrations
Dec 6, 2021
920b3b3
Merge "Reproduce bug 1953359" into stable/victoria
Feb 11, 2022
69fafb9
Merge "Extend the reproducer for 1953359 and 1952915" into stable/vic…
Feb 11, 2022
9b3d69c
Merge "Add a WA flag waiting for vif-plugged event during reboot" int…
Feb 11, 2022
b5b57c4
Merge "[rt] Apply migration context for incoming migrations" into sta…
Mar 2, 2022
8ff36f1
[stable-only] Drop lower-constraints job
Apr 14, 2022
bc92f05
[CI] Install dependencies for docs target
Apr 28, 2022
d218250
Define new functional test tox env for placement gate to run
gmannos Oct 12, 2021
bc5fc2b
[ironic] Minimize window for a resource provider to be lost
juliakreger Jul 2, 2021
8097c2b
Merge "libvirt: make mdev types name attribute be optional" into stab…
Jul 18, 2022
4b9eba6
Merge "[ironic] Minimize window for a resource provider to be lost" i…
Aug 17, 2022
35fb52f
Ignore plug_vifs on the ironic driver
juliakreger Oct 8, 2021
b6c8773
add regression test case for bug 1978983
Jul 6, 2022
3224ceb
For evacuation, ignore if task_state is not None
Jul 6, 2022
31fd81c
Merge remote-tracking branch 'origin/stable/victoria' into stackhpc/v…
jovial Oct 28, 2022
c0ade88
Merge remote-tracking branch 'origin/stable/victoria' into stackhpc/v…
jovial Oct 28, 2022
b014d67
libvirt: Add a workaround to skip compareCPU() on destination
kashyapc Jan 28, 2021
35 changes: 25 additions & 10 deletions nova/compute/manager.py
@@ -9898,6 +9898,8 @@ def update_available_resource(self, context, startup=False):
use_slave=True,
startup=startup)

self.rt.clean_compute_node_cache(compute_nodes_in_db)

# Delete orphan compute node not reported by driver but still in db
for cn in compute_nodes_in_db:
if cn.hypervisor_hostname not in nodenames:
@@ -9906,17 +9908,30 @@ def update_available_resource(self, context, startup=False):
"nodes are %(nodes)s",
{'id': cn.id, 'hh': cn.hypervisor_hostname,
'nodes': nodenames})
cn.destroy()
self.rt.remove_node(cn.hypervisor_hostname)
# Delete the corresponding resource provider in placement,
# along with any associated allocations.
try:
self.reportclient.delete_resource_provider(context, cn,
cascade=True)
except keystone_exception.ClientException as e:
LOG.error(
"Failed to delete compute node resource provider "
"for compute node %s: %s", cn.uuid, six.text_type(e))
cn.destroy()
except exception.ComputeHostNotFound:
# NOTE(mgoddard): it's possible that another compute
# service took ownership of this compute node since we
# queried it due to a rebalance, and this will cause the
# deletion to fail. Ignore the error in that case.
LOG.info("Ignoring failure to delete orphan compute node "
"%(id)s on hypervisor host %(hh)s due to "
"possible node rebalance",
{'id': cn.id, 'hh': cn.hypervisor_hostname})
self.rt.remove_node(cn.hypervisor_hostname)
self.reportclient.invalidate_resource_provider(cn.uuid)
else:
self.rt.remove_node(cn.hypervisor_hostname)
# Delete the corresponding resource provider in placement,
# along with any associated allocations.
try:
self.reportclient.delete_resource_provider(
context, cn, cascade=True)
except keystone_exception.ClientException as e:
LOG.error(
"Failed to delete compute node resource provider "
"for compute node %s: %s", cn.uuid, str(e))

for nodename in nodenames:
self._update_available_resource_for_node(context, nodename,
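The hunk above interleaves the old and new versions of the orphan-deletion loop, so the resulting control flow can be hard to read. A rough sketch of the behaviour after this change, reusing the names from the hunk (a simplified illustration, not a verbatim copy of the file):

    for cn in compute_nodes_in_db:
        if cn.hypervisor_hostname in nodenames:
            continue
        try:
            # compute_node_delete() now filters on host, so this raises
            # ComputeHostNotFound if another service owns the node.
            cn.destroy()
        except exception.ComputeHostNotFound:
            # The node was rebalanced to another host after we queried the
            # DB; drop only our local caches and leave placement alone.
            self.rt.remove_node(cn.hypervisor_hostname)
            self.reportclient.invalidate_resource_provider(cn.uuid)
        else:
            # We really owned the orphan: clean the cache and delete the
            # resource provider (and its allocations) in placement.
            self.rt.remove_node(cn.hypervisor_hostname)
            try:
                self.reportclient.delete_resource_provider(
                    context, cn, cascade=True)
            except keystone_exception.ClientException as e:
                LOG.error(
                    "Failed to delete compute node resource provider "
                    "for compute node %s: %s", cn.uuid, str(e))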
18 changes: 18 additions & 0 deletions nova/compute/resource_tracker.py
@@ -1988,3 +1988,21 @@ def finish_evacuation(self, instance, node, migration):
if migration:
migration.status = 'done'
migration.save()

@utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE, fair=True)
def clean_compute_node_cache(self, compute_nodes_in_db):
"""Clean the compute node cache of any nodes that no longer exist.

:param compute_nodes_in_db: list of ComputeNode objects from the DB.
"""
compute_nodes_in_db_nodenames = {cn.hypervisor_hostname
for cn in compute_nodes_in_db}
stale_cns = set(self.compute_nodes) - compute_nodes_in_db_nodenames

for stale_cn in stale_cns:
# NOTE(mgoddard): we have found a node in the cache that has no
# compute node in the DB. This could be due to a node rebalance
# where another compute service took ownership of the node. Clean
# up the cache.
self.remove_node(stale_cn)
self.reportclient.invalidate_resource_provider(stale_cn)
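clean_compute_node_cache is called at the start of the update_available_resource periodic task, before the orphan-deletion loop, as shown in the manager.py hunk above. A minimal sketch of that call order (names as in the hunks, simplified):

    # Query the compute nodes this host owns, then evict any cached
    # ComputeNode whose DB row has gone (e.g. claimed by another host
    # during a rebalance) before deciding which nodes are orphans.
    compute_nodes_in_db = self._get_compute_nodes_in_db(
        context, use_slave=True, startup=startup)
    self.rt.clean_compute_node_cache(compute_nodes_in_db)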
5 changes: 3 additions & 2 deletions nova/db/api.py
@@ -344,15 +344,16 @@ def compute_node_update(context, compute_id, values):
return IMPL.compute_node_update(context, compute_id, values)


def compute_node_delete(context, compute_id):
def compute_node_delete(context, compute_id, compute_host):
"""Delete a compute node from the database.

:param context: The security context
:param compute_id: ID of the compute node
:param compute_host: Hostname of the compute service

Raises ComputeHostNotFound if compute node with the given ID doesn't exist.
"""
return IMPL.compute_node_delete(context, compute_id)
return IMPL.compute_node_delete(context, compute_id, compute_host)


def compute_node_statistics(context):
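With the extra compute_host argument, a caller that has lost a rebalance race gets ComputeHostNotFound instead of silently deleting a row now owned by another service. A small usage sketch (the ID and hostname are illustrative, not from the source):

    try:
        db.compute_node_delete(context, compute_id=42, compute_host='host_a')
    except exception.ComputeHostNotFound:
        # The node with ID 42 no longer belongs to host_a (or does not
        # exist), so nothing was deleted.
        pass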
21 changes: 17 additions & 4 deletions nova/db/sqlalchemy/api.py
@@ -690,16 +690,29 @@ def compute_node_search_by_hypervisor(context, hypervisor_match):


@pick_context_manager_writer
def compute_node_create(context, values):
def _compute_node_create(context, values):
"""Creates a new ComputeNode and populates the capacity fields
with the most recent data.
"""
convert_objects_related_datetimes(values)

compute_node_ref = models.ComputeNode()
compute_node_ref.update(values)
compute_node_ref.save(context.session)
return compute_node_ref


# NOTE(mgoddard): We avoid decorating this with @pick_context_manager_writer,
# so that we get a separate transaction in the exception handler. This avoids
# an error message about inactive DB sessions during a transaction rollback.
# See https://bugs.launchpad.net/nova/+bug/1853159.
def compute_node_create(context, values):
"""Creates a new ComputeNode and populates the capacity fields
with the most recent data. Will restore a soft deleted compute node if a
UUID has been explicitly requested.
"""
try:
compute_node_ref.save(context.session)
compute_node_ref = _compute_node_create(context, values)
except db_exc.DBDuplicateEntry:
with excutils.save_and_reraise_exception(logger=LOG) as err_ctx:
# Check to see if we have a (soft) deleted ComputeNode with the
@@ -763,10 +776,10 @@ def compute_node_update(context, compute_id, values):


@pick_context_manager_writer
def compute_node_delete(context, compute_id):
def compute_node_delete(context, compute_id, compute_host):
"""Delete a ComputeNode record."""
result = model_query(context, models.ComputeNode).\
filter_by(id=compute_id).\
filter_by(id=compute_id, host=compute_host).\
soft_delete(synchronize_session=False)

if not result:
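The NOTE above describes a general pattern: keep the write transaction inside an inner helper decorated with pick_context_manager_writer, and handle DBDuplicateEntry in an undecorated outer function, so any recovery work runs after the failed transaction has been rolled back rather than on its now-inactive session. A generic sketch of that shape (illustrative names; the soft-undelete helper is hypothetical):

    @pick_context_manager_writer
    def _create_row(context, values):
        # First transaction: a plain insert.
        row = models.ComputeNode()
        row.update(values)
        row.save(context.session)
        return row

    def create_row(context, values):
        try:
            return _create_row(context, values)
        except db_exc.DBDuplicateEntry:
            # By the time we get here the first transaction has already
            # been rolled back, so follow-up work (e.g. restoring a
            # soft-deleted row) opens its own fresh transaction instead of
            # reusing the inactive session (see bug 1853159).
            return _restore_soft_deleted_row(context, values)  # hypothetical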
2 changes: 1 addition & 1 deletion nova/objects/compute_node.py
@@ -360,7 +360,7 @@ def save(self, prune_stats=False):

@base.remotable
def destroy(self):
db.compute_node_delete(self._context, self.id)
db.compute_node_delete(self._context, self.id, self.host)

def update_from_virt_driver(self, resources):
# NOTE(pmurray): the virt driver provides a dict of values that
17 changes: 12 additions & 5 deletions nova/scheduler/client/report.py
@@ -678,11 +678,7 @@ def _delete_provider(self, rp_uuid, global_request_id=None):
if resp:
LOG.info("Deleted resource provider %s", rp_uuid)
# clean the caches
try:
self._provider_tree.remove(rp_uuid)
except ValueError:
pass
self._association_refresh_time.pop(rp_uuid, None)
self.invalidate_resource_provider(rp_uuid)
return

msg = ("[%(placement_req_id)s] Failed to delete resource provider "
@@ -2181,6 +2177,17 @@ def delete_resource_provider(self, context, compute_node, cascade=False):
# left a no-op for backward compatibility.
pass

def invalidate_resource_provider(self, name_or_uuid):
"""Invalidate the cache for a resource provider.

:param name_or_uuid: Name or UUID of the resource provider to look up.
"""
try:
self._provider_tree.remove(name_or_uuid)
except ValueError:
pass
self._association_refresh_time.pop(name_or_uuid, None)

def get_provider_by_name(self, context, name):
"""Queries the placement API for resource provider information matching
a supplied name.
171 changes: 171 additions & 0 deletions nova/tests/functional/regressions/test_bug_1853009.py
@@ -0,0 +1,171 @@
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.

import mock

from nova import context
from nova import objects
from nova.tests.functional import integrated_helpers


class NodeRebalanceDeletedComputeNodeRaceTestCase(
integrated_helpers.ProviderUsageBaseTestCase,
):
"""Regression test for bug 1853009 observed in Rocky & later.

When an ironic node re-balances from one host to another, there can be a
race where the old host deletes the orphan compute node after the new host
has taken ownership of it which results in the new host failing to create
the compute node and resource provider because the ResourceTracker does not
detect a change.
"""
# Make sure we're using the fake driver that has predictable uuids
# for each node.
compute_driver = 'fake.PredictableNodeUUIDDriver'

def _assert_hypervisor_api(self, nodename, expected_host):
# We should have one compute node shown by the API.
hypervisors = self.api.api_get(
'/os-hypervisors/detail'
).body['hypervisors']
self.assertEqual(1, len(hypervisors), hypervisors)
hypervisor = hypervisors[0]
self.assertEqual(nodename, hypervisor['hypervisor_hostname'])
self.assertEqual(expected_host, hypervisor['service']['host'])

def _start_compute(self, host):
host = self.start_service('compute', host)
# Ironic compute driver has rebalances_nodes = True.
host.manager.driver.rebalances_nodes = True
return host

def setUp(self):
super(NodeRebalanceDeletedComputeNodeRaceTestCase, self).setUp()

self.nodename = 'fake-node'
self.ctxt = context.get_admin_context()

def test_node_rebalance_deleted_compute_node_race(self):
# Simulate a service running and then stopping. host_a runs, creates
# fake-node, then is stopped. The fake-node compute node is destroyed.
# This leaves a soft-deleted node in the DB.
host_a = self._start_compute('host_a')
host_a.manager.driver._set_nodes([self.nodename])
host_a.manager.update_available_resource(self.ctxt)
host_a.stop()
cn = objects.ComputeNode.get_by_host_and_nodename(
self.ctxt, 'host_a', self.nodename,
)
cn.destroy()

self.assertEqual(0, len(objects.ComputeNodeList.get_all(self.ctxt)))

# Now we create a new compute service to manage our node.
host_b = self._start_compute('host_b')
host_b.manager.driver._set_nodes([self.nodename])

# When start_service runs, it will create a host_b ComputeNode. We want
# to delete that and inject our fake node into the driver which will
# be re-balanced to another host later. First assert this actually
# exists.
self._assert_hypervisor_api('host_b', expected_host='host_b')

# Now run the update_available_resource periodic to register fake-node
# and have it managed by host_b. This will also detect the "host_b"
# node as orphaned and delete it along with its resource provider.

# host_b[1]: Finds no compute record in RT. Tries to create one
# (_init_compute_node).
host_b.manager.update_available_resource(self.ctxt)
self._assert_hypervisor_api(self.nodename, expected_host='host_b')
# There should only be one resource provider (fake-node).
original_rps = self._get_all_providers()
self.assertEqual(1, len(original_rps), original_rps)
self.assertEqual(self.nodename, original_rps[0]['name'])

# Simulate a re-balance by restarting host_a and make it manage
# fake-node. At this point both host_b and host_a think they own
# fake-node.
host_a = self._start_compute('host_a')
host_a.manager.driver._set_nodes([self.nodename])

# host_a[1]: Finds no compute record in RT, 'moves' existing node from
# host_b
host_a.manager.update_available_resource(self.ctxt)
# Assert that fake-node was re-balanced from host_b to host_a.
self.assertIn(
'ComputeNode fake-node moving from host_b to host_a',
self.stdlog.logger.output)
self._assert_hypervisor_api(self.nodename, expected_host='host_a')

# host_a[2]: Begins periodic update, queries compute nodes for this
# host, finds the fake-node.
cn = objects.ComputeNode.get_by_host_and_nodename(
self.ctxt, 'host_a', self.nodename,
)

# host_b[2]: Finds no compute record in RT, 'moves' existing node from
# host_a
host_b.manager.update_available_resource(self.ctxt)
# Assert that fake-node was re-balanced from host_a to host_b.
self.assertIn(
'ComputeNode fake-node moving from host_a to host_b',
self.stdlog.logger.output)
self._assert_hypervisor_api(self.nodename, expected_host='host_b')

# Complete rebalance, as host_a realises it does not own fake-node.
host_a.manager.driver._set_nodes([])

# host_a[2]: Deletes orphan compute node.
# Mock out the compute node query to simulate a race condition where
# the list includes an orphan compute node that is newly owned by
# host_b by the time host_a attempts to delete it.
with mock.patch(
'nova.compute.manager.ComputeManager._get_compute_nodes_in_db'
) as mock_get:
mock_get.return_value = [cn]
host_a.manager.update_available_resource(self.ctxt)

# Verify that the node was almost deleted, but was saved by the host
# check
self.assertIn(
'Deleting orphan compute node %s hypervisor host '
'is fake-node, nodes are' % cn.id,
self.stdlog.logger.output)
self.assertIn(
'Ignoring failure to delete orphan compute node %s on '
'hypervisor host fake-node' % cn.id,
self.stdlog.logger.output)
self._assert_hypervisor_api(self.nodename, 'host_b')
rps = self._get_all_providers()
self.assertEqual(1, len(rps), rps)
self.assertEqual(self.nodename, rps[0]['name'])

# Simulate deletion of an orphan by host_a. It shouldn't happen
# anymore, but let's assume it already did.
cn = objects.ComputeNode.get_by_host_and_nodename(
self.ctxt, 'host_b', self.nodename)
cn.destroy()
host_a.manager.rt.remove_node(cn.hypervisor_hostname)
host_a.manager.reportclient.delete_resource_provider(
self.ctxt, cn, cascade=True)

# host_b[3]: Should recreate compute node and resource provider.
host_b.manager.update_available_resource(self.ctxt)

# Verify that the node was recreated.
self._assert_hypervisor_api(self.nodename, 'host_b')

# The resource provider has now been created.
rps = self._get_all_providers()
self.assertEqual(1, len(rps), rps)
self.assertEqual(self.nodename, rps[0]['name'])
4 changes: 3 additions & 1 deletion nova/tests/unit/compute/test_compute.py
@@ -202,8 +202,10 @@ def fake_get_compute_nodes_in_db(self, context, *args, **kwargs):
context, objects.ComputeNode(), cn)
for cn in fake_compute_nodes]

def fake_compute_node_delete(context, compute_node_id):
def fake_compute_node_delete(context, compute_node_id,
compute_node_host):
self.assertEqual(2, compute_node_id)
self.assertEqual('fake_phyp1', compute_node_host)

self.stub_out(
'nova.compute.manager.ComputeManager._get_compute_nodes_in_db',