Currently, `_on_start` is a method called when the charm starts for the very first time, or whenever the underlying machine reboots.

I will focus on the first-time start only. The method has grown to contain a lot of logic and dependencies that stretch far beyond the actual start routine. The current `_on_start` only really finishes once TLS is available and configured, and the peer cluster manager details are present. These two pieces of information have their own routines and could live somewhere else.
Below is my review of the method's code, with comments for each section right underneath it.
```python
def _on_start(self, event: StartEvent):  # noqa C901  ## <<<<<<< this MUST go :(
    """Triggered when on start. Set the right node role."""

    def cleanup():
        if self.peers_data.get(Scope.APP, "security_index_initialised"):
            # in the case where it was on WaitingToStart status, event got deferred
            # and the service started in between, put status back to active
            self.status.clear(WaitingToStart)
            self.status.clear(PClusterNoDataNode)
```
So, the first part checks whether `security_index_initialised` has been set and executes two status clean-ups. This logic can be moved to the point where we set this parameter in the peer databag.
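A minimal sketch of what "moving it to the setter" could look like, assuming a helper that writes the flag and clears the stale statuses in one place. The `PeersData`/`Status` classes here are simplified stand-ins for the real charm types, not the actual implementation:

```python
# Hypothetical sketch: clear the start-related statuses at the moment
# "security_index_initialised" is written to the peer databag, instead of
# re-checking the flag inside _on_start. All types below are stand-ins.

WaitingToStart = "WaitingToStart"
PClusterNoDataNode = "PClusterNoDataNode"


class PeersData:
    """Stand-in for the app-scoped peer databag."""

    def __init__(self):
        self._app = {}

    def put(self, key, value):
        self._app[key] = value

    def get(self, key, default=None):
        return self._app.get(key, default)


class Status:
    """Stand-in that records which statuses were cleared."""

    def __init__(self):
        self.cleared = []

    def clear(self, status):
        self.cleared.append(status)


def mark_security_index_initialised(peers_data, status):
    """Set the flag and immediately clear the now-stale start statuses."""
    peers_data.put("security_index_initialised", True)
    # in the case where the event got deferred and the service started
    # in between, put the status back to active right here
    status.clear(WaitingToStart)
    status.clear(PClusterNoDataNode)
```

With this, `_on_start` no longer needs the nested `cleanup()` closure at all.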
The next part:
```python
    # cleanup bootstrap conf in the node if existing
    if self.peers_data.get(Scope.UNIT, "bootstrap_contributor"):
        self._cleanup_bootstrap_conf_if_applies()
```
Although `bootstrap_contributor` is part of the initialization logic, and hence makes sense to be handled by the start, I'd recommend a separate event dedicated to this config clean-up. The reason is that we could then have a simpler event handler that either cleans up the config and restarts, or defers itself.
```python
    if self.opensearch.is_node_up():
        cleanup()
        return
```
This should not exist in the start, as we can only start the service post-TLS setup. The `is_node_up()` logic should be removed entirely: (1) the status cleanup moves to the peer events; and (2) the bootstrap config clean-up becomes a separate event that only handles that specific task.
```python
    elif (
        self.peers_data.get(Scope.UNIT, "started")
        and "cluster_manager" in self.opensearch.roles
        and not self.opensearch.is_service_started()
    ):
        # Routine executed only once at node reboot
        ...
```
There is no defer here, and this branch is only reached after a reboot.
```python
    # apply the directives computed and emitted by the peer cluster manager
    if not self._apply_peer_cm_directives_and_check_if_can_start():
        event.defer()
        return
```
This logic can only be executed once the deployment description is available. Hence, we can only run it after peer-cluster or peer events.
```python
    if not self.is_admin_user_configured() or not self.tls.is_fully_configured():
        if not self.model.get_relation("certificates"):
            status = BlockedStatus(TLSRelationMissing)
        else:
            status = MaintenanceStatus(
                TLSNotFullyConfigured
                if self.is_admin_user_configured()
                else AdminUserNotConfigured
            )
        self.status.set(status)
        event.defer()
        return
```
I think we should keep this logic, but simplify it: (1) if `not self.is_admin_user_configured()`, set its status; and (2) if check 1 passes, check `not self.tls.is_fully_configured()` and set the appropriate status.

These clean-ups should be executed by their own routines (i.e. whoever sets the admin user, configures TLS, and checks for the TLS relation, respectively). `_on_start` only runs an if/elif/else check and sets the most pressing status at that moment.
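The suggested if/elif/else ordering could be sketched like this. The function name and the string constants standing in for the real status messages are hypothetical:

```python
# Hypothetical sketch of the "most pressing status" check suggested above:
# admin user first, then the TLS relation, then full TLS configuration.
# The constants below stand in for the real status message constants.

AdminUserNotConfigured = "admin user not configured"
TLSRelationMissing = "missing relation: certificates"
TLSNotFullyConfigured = "TLS not fully configured"


def most_pressing_status(admin_configured, tls_relation, tls_configured):
    """Return the single status _on_start should set, or None if healthy."""
    if not admin_configured:
        return AdminUserNotConfigured
    if not tls_relation:
        return TLSRelationMissing
    if not tls_configured:
        return TLSNotFullyConfigured
    return None
```

Each condition short-circuits, so only the most pressing problem is ever surfaced, and the nested-ternary construction from the original snippet disappears.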
```python
    # Since system users are initialized, we should take them to local internal_users.yml
    # Leader should be done already
    if not self.unit.is_leader():
        self._purge_users()
        for user in OpenSearchSystemUsers:
            self._put_or_update_internal_user_unit(user)

    # configure clients auth
    self.opensearch_config.set_client_auth()
```
This should be executed by the routines responsible for user handling, not here.
```python
    deployment_desc = self.opensearch_peer_cm.deployment_desc()

    # only start the main orchestrator if a data node is available
    # this allows for "cluster-manager-only" nodes in large deployments
    # workflow documentation:
    # no "data" role in deployment desc -> start gets deferred
    # when "data" node joins -> start cluster-manager via _on_peer_cluster_relation_changed
    # cluster-manager notifies "data" node via refresh of peer cluster relation data
    # "data" node starts and initializes security index
    if (
        deployment_desc.typ == DeploymentType.MAIN_ORCHESTRATOR
        and not deployment_desc.start == StartMode.WITH_GENERATED_ROLES
        and "data" not in deployment_desc.config.roles
    ):
        self.status.set(BlockedStatus(PClusterNoDataNode))
        event.defer()
        return

    # request the start of OpenSearch
    self.status.set(WaitingStatus(RequestUnitServiceOps.format("start")))

    # if this is the first data node to join, start without getting the lock
    ignore_lock = (
        "data" in deployment_desc.config.roles
        and self.unit.is_leader()
        and deployment_desc.typ == DeploymentType.OTHER
        and not self.peers_data.get(Scope.APP, "security_index_initialised", False)
    )
    self._start_opensearch_event.emit(ignore_lock=ignore_lock)
```
The entire logic above decides whether we will restart with or without the lock. We should move this logic to the points where we call the restart itself, and remove this restart call here, since it only makes sense once TLS is already available.
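Extracted to the restart call site, the lock decision could look like the sketch below. The function name, the `DeploymentDesc`/`Config` dataclasses, and the `"OTHER"` literal (in place of `DeploymentType.OTHER`) are simplified stand-ins:

```python
# Hypothetical sketch: compute ignore_lock next to the restart call rather
# than inside _on_start. The dataclasses below mirror only the fields the
# original snippet reads; they are not the real charm types.

from dataclasses import dataclass


@dataclass
class Config:
    roles: list


@dataclass
class DeploymentDesc:
    typ: str  # stand-in for DeploymentType
    config: Config


def should_ignore_lock(desc, is_leader, security_index_initialised):
    """The first data node of a non-main cluster may start without the lock."""
    return (
        "data" in desc.config.roles
        and is_leader
        and desc.typ == "OTHER"
        and not security_index_initialised
    )
```

The restart routine would then call `should_ignore_lock(...)` right before emitting the start event, keeping `_on_start` free of this decision.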