Add troubleshooting of node maintenance mode #619

Open
wants to merge 3 commits into
base: main
from

Conversation

w13915984028
Member

github-actions bot commented Aug 6, 2024

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 865f2e6 |
| 😎 Deploy Preview | https://66cf179e817ba852b932a84d--harvester-preview.netlify.app |

Contributor

@ejweber ejweber left a comment

Looks good from the Longhorn side!

docs/host/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
Contributor

@jillian-maroket jillian-maroket left a comment

Initial review done

docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
@w13915984028 w13915984028 force-pushed the doc6264 branch 2 times, most recently from 3515efd to 642b3ab Compare August 12, 2024 14:32
@w13915984028
Member Author

@jillian-maroket Thanks.

I have addressed the comments and also updated storageclass.md and create-vm.md with cross-references to give users a hint about the potential impact. Please take another look.

docs/host/host.md (outdated, resolved)
@ibrokethecloud
Contributor

@w13915984028 apart from the minor rephrasing the doc looks good to me.

Member

@votdev votdev left a comment

I am also a little confused about what is correct. In some places we use 'Node Maintenance'. Shouldn't the term 'Maintenance Mode' also be used here?

@jillian-maroket As already mentioned in individual comments, we should clarify whether words such as “StorageClass” or “Maintenance Mode” are ALWAYS specified in inline code.

docs/advanced/storageclass.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/volume/create-volume.md (outdated, resolved)
| Harvester version | Embedded Longhorn version | Default value |
| --- | --- | --- |
| v1.3.1 | v1.6.0 | `true` |
| v1.4.0 | v1.7.0 | `false` |
Member

Does it make sense to mention 1.4.0 in the versioned documentation of 1.3.x? It makes sense for 1.4 (dev), but 1.3 (stable)?

Member Author

I have no strong opinion on keeping or removing it. This is a later update to the v1.3 document. Now that we have a detailed plan for v1.4, adding the v1.4 details seems harmless.

But this depends more on the suggestion from @jillian-maroket. Thanks.

versioned_docs/version-v1.3/troubleshooting/host.md (outdated, resolved)
versioned_docs/version-v1.3/troubleshooting/host.md (outdated, resolved)
versioned_docs/version-v1.3/troubleshooting/host.md (outdated, resolved)
@jillian-maroket
Contributor

jillian-maroket commented Aug 15, 2024

@votdev Many of the text blocks that you flagged were recently added. I have yet to figure out which parts are new and which suggestions were applied/rejected. That's why the wording is so inconsistent.

StorageClass is a regular K8s concept so I never format it as inline code. We do not have a convention for official feature/function names yet. Using title case is enough in most situations. We can bold or italicize them for emphasis. Backticks are usually reserved for commands, file paths, and whatever can be classified as code phrases/blocks.

@w13915984028
Member Author

w13915984028 commented Aug 16, 2024

Thanks for the review from @votdev and the explanation from @jillian-maroket.

I just updated the document per Volker's suggestion. Please take another look.

Btw, we have:

  • Node Maintenance as a section title
  • Enable Maintenance Mode as the menu text and action
  • Maintenance Mode (node), currently used as both a noun and an adjective (?) to describe the node
  • Maintenance, which the UI shows in the State column

It takes some effort to choose the right term for each context.
Any ideas on how to unify this? Thanks.

Contributor

@jillian-maroket jillian-maroket left a comment

@w13915984028 You misunderstood. @votdev made wrong assumptions about the markup that we use, and you were NOT supposed to implement his suggestions.

Apply the suggestions in this review so we can finally merge the updates. If you continue to make changes that introduce language issues, I will not approve this PR.

cc: @bk201

@@ -44,6 +44,12 @@ The number of replicas created for each volume in Longhorn. Defaults to `3`.

![](/img/v1.2/storageclass/create_storageclasses_replicas.png)

:::info important

When the value is `1`, the created volume from this StorageClass has only one replica, it may block the [Node Maintenance](../host/host.md#node-maintenance), check the section [Single-Replica Volumes](../troubleshooting/host.md#single-replica-volumes) and set a proper global option.
Contributor

Suggested change
When the value is `1`, the created volume from this StorageClass has only one replica, it may block the [Node Maintenance](../host/host.md#node-maintenance), check the section [Single-Replica Volumes](../troubleshooting/host.md#single-replica-volumes) and set a proper global option.
Configuring Longhorn to create only one replica for each volume (**Number of Replicas**: `1`) may cause [node maintenance issues](../troubleshooting/host.md#single-replica-volumes). **Number of Replicas** is a global setting, so specify a value that makes sense for your implementation.

Member Author

@w13915984028 w13915984028 Aug 26, 2024

"Number of Replicas is a global setting"

Calling it a global setting is not accurate. The replica count only affects the volumes created from this StorageClass.

I think your comment rests on a wrong assumption.

Users can create many StorageClasses. When creating a volume, a user can select which StorageClass to base it on; if none is selected, the default StorageClass is used. In this context, only StorageClasses with a replica count of 1 matter.

Contributor

The first part of your draft mentions a replica count value of 1. The last part tells the user to "set a proper global option".

So the global option that you mentioned was the StorageClass itself and not the replica count? You want users to specify which StorageClass will be used to create volumes by default? If this is correct, include these details in the note.

Member Author

@w13915984028 w13915984028 Aug 27, 2024

It seems some concepts are being mixed up.

(1) A StorageClass is like a template; users can create as many as they want.
(2) When creating a volume, a user can refer to any of their StorageClasses; the parameters from the StorageClass are converted to parameters of the volume.

(3) When Longhorn (LH) sees the volume, it performs many tasks, including creating the expected number of replicas for this volume.
(4) Replicas are scheduled to a group of target nodes.
(5) When node maintenance happens, the replicas on a node may affect whether that node can be drained successfully.
(6) To give users more flexibility, LH has a global option, Node Drain Policy, to control the node drain strategy.

(1) and (2) concern the StorageClass; (3) to (6) concern LH's internal mechanisms. They are connected through the global setting Node Drain Policy.


Back to this note: a reader with some Kubernetes background who reads this block of text and follows the link to the troubleshooting section will understand what we are trying to describe.
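
As a rough illustration of points (3) to (6), here is a minimal Python sketch of why a single-replica volume can block a node drain. It is a toy model, not Longhorn's implementation: `Replica`, `volumes_with_last_replica_on`, `can_drain`, and the `block_on_last_replica` flag are hypothetical names, and the flag is only a stand-in for the Node Drain Policy setting, not one of its actual option values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    volume: str  # name of the volume this replica belongs to
    node: str    # name of the node the replica is scheduled on


def volumes_with_last_replica_on(replicas, node):
    """Return, sorted, the volumes whose replicas all live on `node`."""
    nodes_by_volume = {}
    for replica in replicas:
        nodes_by_volume.setdefault(replica.volume, set()).add(replica.node)
    return sorted(v for v, nodes in nodes_by_volume.items() if nodes == {node})


def can_drain(replicas, node, block_on_last_replica=True):
    """Toy drain check: refuse when the node holds a volume's only replica.

    `block_on_last_replica` is a hypothetical stand-in for the drain policy.
    """
    if not block_on_last_replica:
        return True
    return not volumes_with_last_replica_on(replicas, node)


# One volume from a StorageClass with replica count 1, one with replica count 3.
replicas = [
    Replica("vol-single", "node1"),
    Replica("vol-triple", "node1"),
    Replica("vol-triple", "node2"),
    Replica("vol-triple", "node3"),
]
```

With this sample data, `can_drain(replicas, "node1")` is refused because `vol-single` has its only replica on `node1`, while `node2` can be drained since every volume with a replica there still has a replica on another node.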

docs/host/host.md (outdated, resolved)
docs/host/host.md (outdated, resolved)
docs/host/host.md (outdated, resolved)
@@ -44,6 +44,12 @@ The number of replicas created for each volume in Longhorn. Defaults to `3`.

![](/img/v1.2/storageclass/create_storageclasses_replicas.png)

:::info important

When the value is `1`, the created volume from this `StorageClass` has only one replica, it may block the [Node Maintenance](../host/host.md#node-maintenance), check the section [Single-Replica Volumes](../troubleshooting/host.md#single-replica-volumes) and set a proper global option.
Contributor

Suggested change
When the value is `1`, the created volume from this `StorageClass` has only one replica, it may block the [Node Maintenance](../host/host.md#node-maintenance), check the section [Single-Replica Volumes](../troubleshooting/host.md#single-replica-volumes) and set a proper global option.
Configuring Longhorn to create only one replica for each volume (**Number of Replicas**: **1**) may prevent you from enabling [Maintenance Mode](../host/host.md#maintenance-mode). The replica count is a global setting, so specify a value that makes sense for your implementation. For troubleshooting information, see [Single-Replica Volumes](../troubleshooting/host.md#single-replica-volumes).

docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/troubleshooting/host.md (outdated, resolved)
docs/volume/create-volume.md (outdated, resolved)
@w13915984028
Member Author

@jillian-maroket @votdev

I will follow @jillian-maroket's suggestions to rework the document. Thanks.

@votdev When you have different ideas about the Markdown format, please discuss them with @jillian-maroket and leave suggestions on the review instead of requesting changes. Changing the same thing back and forth makes the review longer and more difficult. Thanks.

Contributor

@jillian-maroket jillian-maroket left a comment

@w13915984028 Please implement the pending changes so we can finally merge. If I find any more wording issues, I'll just fix them later.

cc: @bk201

@w13915984028
Member Author

@w13915984028 Please implement the pending changes so we can finally merge. If I find any more wording issues, I'll just fix them later.

@jillian-maroket

For the remaining comments, please refer to this reply: #619 (comment).

Because this documentation PR involves many different concepts and objects, it is normal for it to be a bit hard to understand. Please read more of the existing documentation on those parts to get the full picture. Thanks.

@jillian-maroket
Contributor

@w13915984028 I have reviewed this PR several times. In my last comment, I asked you to implement whatever changes are necessary to make the document correct. I even said that I'm willing to fix wording issues later just so we can finally merge this PR. Instead of just fixing the incorrect parts, you chose to explain at length what I misunderstood.

@bk201 I have reached my limit. As you know, his drafts are always difficult to understand and review. The PR is massive and it changed a lot over time because of reviewer feedback and because he decided to add more content in different places. Please evaluate if what we currently have is good enough to merge. I am supposed to be working on the doc conversion and Longhorn v1.7.1 deliverables this week.

@w13915984028
Member Author

I added a third commit that removes the phrase "set a proper global option". It was meant to refer to a Longhorn setting, but it was a bit confusing.

5 participants