
routing: shutdown chanrouter correctly. #8497

Merged
merged 6 commits into from
Aug 1, 2024

Conversation

ziggie1984
Collaborator

@ziggie1984 ziggie1984 commented Feb 22, 2024

Fixes #8489
EDIT: Fixes #8721

In the issue linked above, the ChanRouter could not sync the channel graph correctly:

2024-02-20 11:18:39.217 [INF] CRTR: Syncing channel graph from height=830127 (hash=00000000000000000003a7ed3b7a5f5fd5571a658972e9db0af2a650f6ade198) to height=831246 (hash=00000000000000000002946973960d53538a7d93333ff7d4653a37a577ba4b58)

...

2024-02-20 11:19:11.325 [WRN] LNWL: Query(34) from peer 142.132.193.144:8333 failed, rescheduling: did not get response before timeout
2024-02-20 11:19:11.326 [DBG] BTCN: Sending getdata (witness block 000000000000000000028352e09a42f6d26d0514a3d483f7f1fb56b2c2954361) to 142.132.193.144:8333 (outbound)
2024-02-20 11:19:15.327 [WRN] LNWL: Query(34) from peer 142.132.193.144:8333 failed and reached maximum number of retries, not rescheduling: did not get response before timeout
2024-02-20 11:19:15.327 [DBG] LNWL: Canceled batch 34
2024-02-20 11:19:15.328 [INF] DISC: Authenticated gossiper shutting down
2024-02-20 11:19:15.328 [INF] DISC: Authenticated Gossiper is stopping

So query 34 failed, and therefore the startup of the chanrouter failed as well.

We fail here and never call the Stop function of the channel router.
https://github.com/lightningnetwork/lnd/blob/master/routing/router.go#L628

However, when cleaning up all the other subsystems we get stuck:

goroutine 1652 [select]:
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).UpdateEdge(0xc0002190e0, 0xc00028fea0, {0x0, 0x0, 0x0})
        github.com/lightningnetwork/lnd/routing/router.go:2605 +0x155
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).updateChannel(0xc0004a2790, 0xc0004f0580, 0xc00028fea0)
        github.com/lightningnetwork/lnd/discovery/gossiper.go:2182 +0x1f1
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).retransmitStaleAnns(0xc0004a2790, {0x0?, 0x100c004d10870?, 0x31f6c60?})
        github.com/lightningnetwork/lnd/discovery/gossiper.go:1643 +0x272
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).networkHandler(0xc0004a2790)
        github.com/lightningnetwork/lnd/discovery/gossiper.go:1342 +0x19d
created by github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).start in goroutine 1
        github.com/lightningnetwork/lnd/discovery/gossiper.go:599 +0x145

This happens because we never close the quit channel of the channel router, so the Authenticated Gossiper cannot stop either. The cleanup process is stuck, holding up the shutdown of all subsystems and causing some side effects because other subsystems are still running.

2024-02-20 11:19:15.328 [INF] DISC: Authenticated gossiper shutting down
2024-02-20 11:19:15.328 [INF] DISC: Authenticated Gossiper is stopping

Goroutine Dump:

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc0039a30e0?)
        runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc0008ff7a0?)
        sync/waitgroup.go:116 +0x48
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).stop(0xc0004a2790)
        github.com/lightningnetwork/lnd/discovery/gossiper.go:746 +0x115
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).Stop.func1()
        github.com/lightningnetwork/lnd/discovery/gossiper.go:732 +0x69
sync.(*Once).doSlow(0x3?, 0xc00030e6a0?)
        sync/once.go:74 +0xbf
sync.(*Once).Do(...)
        sync/once.go:65
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).Stop(0xc0039a32d8?)
        github.com/lightningnetwork/lnd/discovery/gossiper.go:730 +0x3c
github.com/lightningnetwork/lnd.cleaner.run({0xc001c1bc00, 0x1e209aa?, 0x4?})
        github.com/lightningnetwork/lnd/server.go:1858 +0x42
github.com/lightningnetwork/lnd.(*server).Start(0xcb01c?)
        github.com/lightningnetwork/lnd/server.go:2248 +0x8e
github.com/lightningnetwork/lnd.Main(0xc0001d0100, {{0x0?, 0x7f4703ac7c40?, 0x101c0000b2000?}}, 0xc000104f60, {0xc0000b3e60, 0xc000222180, 0xc0002221e0, 0xc000222240, {0x0}})
        github.com/lightningnetwork/lnd/lnd.go:684 +0x3be5
main.main()
        github.com/lightningnetwork/lnd/cmd/lnd/main.go:38 +0x1ee

So we need to think about how to prevent these situations, because I think that for almost all subsystems we don't close the quit channel when the start fails.
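The blocking behaviour described above can be reproduced in isolation. The following is a minimal sketch, not lnd's code: `router`, `UpdateEdge`, and `callerUnblocks` are hypothetical stand-ins that assume the caller waits in a `select` on either the router's handler or its quit channel, as the goroutine dump suggests.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errRouterExiting is returned to callers once quit has been closed.
var errRouterExiting = errors.New("router exiting")

// router is a hypothetical stand-in for the ChannelRouter.
type router struct {
	updates  chan struct{} // only drained by a handler that a successful Start would launch
	quit     chan struct{}
	stopOnce sync.Once
}

func newRouter() *router {
	return &router{
		updates: make(chan struct{}),
		quit:    make(chan struct{}),
	}
}

// UpdateEdge blocks until the handler accepts the update or quit is closed.
func (r *router) UpdateEdge() error {
	select {
	case r.updates <- struct{}{}:
		return nil
	case <-r.quit:
		return errRouterExiting
	}
}

// Stop closes quit exactly once, releasing every blocked caller.
func (r *router) Stop() {
	r.stopOnce.Do(func() { close(r.quit) })
}

// callerUnblocks simulates a router whose Start failed (no handler is
// running) and reports whether a caller blocked in UpdateEdge returns
// within 100ms.
func callerUnblocks(stopRouter bool) bool {
	r := newRouter()
	done := make(chan error, 1)
	go func() { done <- r.UpdateEdge() }()

	if stopRouter {
		r.Stop()
	}

	select {
	case <-done:
		return true
	case <-time.After(100 * time.Millisecond):
		return false
	}
}

func main() {
	fmt.Println("caller returns without Stop:", callerUnblocks(false))
	fmt.Println("caller returns with Stop:", callerUnblocks(true))
}
```

In this sketch the caller only returns when `Stop` has closed `quit`, which mirrors why the gossiper's `WaitGroup` never completes when the router's `Stop` is skipped.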

Contributor

coderabbitai bot commented Feb 22, 2024

Important

Review skipped

Auto reviews are limited to specific labels.

Labels to auto review (1)
  • llm-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The recent changes introduce robust error handling and state management in various components of the Lightning Network Daemon (LND). Key enhancements include ensuring that certain methods are only executed when their corresponding components have been initialized, preventing nil pointer dereferences and managing lifecycle states with atomic boolean flags. These improvements enhance the stability and reliability of the system during startup and shutdown processes.

Changes

| Files | Change Summary |
| --- | --- |
| chainntnfs/.../bitcoind.go | Added nil checks for txNotifier in Stop method to prevent nil pointer dereference. |
| chainntnfs/.../neutrino.go | Similar nil checks for txNotifier added in Stop method. |
| chanfitness/.../chaneventstore.go | Introduced started and stopped atomic boolean fields; modified Start and Stop methods for state management and error handling. |
| discovery/.../gossiper.go | Added nil check for blockEpochs in stop method to prevent panics. |
| docs/release-notes/.../release-notes-0.18.3.md | Fixed bugs related to fee rates during batch channel openings and improved shutdown handling, enhancing overall stability. |
| graph/.../builder.go | Improved logging in Start and Stop methods for better monitoring and debugging. |
| htlcswitch/.../interceptable_switch.go | Added state tracking with started and stopped flags in InterceptableSwitch; enhanced method checks to prevent multiple invocations. |
| invoices/.../invoiceregistry.go | Similar state tracking enhancements for InvoiceRegistry methods to ensure proper lifecycle management. |
| lnd.go | Modified server startup to be asynchronous, allowing better error handling and graceful shutdowns. |
| server.go | Enhanced cleanup logic in Start and improved error handling in Stop for better lifecycle management. |
| sweep/.../fee_bumper.go | Added state management with atomic booleans in TxPublisher, changing Stop method to return errors for better control flow. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Server
    participant Component

    User->>Server: Start
    Server->>Component: Initialize
    Component-->>Server: Initialized
    Server->>User: Success

    User->>Server: Stop
    Server->>Component: Cleanup
    Component-->>Server: Cleaned
    Server->>User: Success

Assessment against linked issues

Objective Addressed Explanation
Ensure node remains synced and channels reconnect after restart (#8489) Changes don't directly address the sync issue.
Allow ChannelRouter to be shutdown while syncGraphWithChain runs (#8721) Introduced checks to handle shutdown during operations.

Possibly related issues

🐇 In the meadow, we leap and play,
Fixing bugs, keeping chaos at bay.
With atomic states we hop and bound,
In our code, no errors are found.
So let's celebrate this fine endeavor,
Together we'll make LND clever! 🌼


@ziggie1984
Collaborator Author

But this is only part of the fix for why the other node is not able to sync the graph to the chain. We definitely need to retry the block fetch and not fail immediately if we cannot get the block from the first peer. This is already tracked in:

btcsuite/btcwallet#904

@ziggie1984
Collaborator Author

cc @yyforyongyu @Roasbeef

@saubyk saubyk added this to the v0.18.0 milestone Feb 25, 2024
@ziggie1984 ziggie1984 marked this pull request as ready for review April 4, 2024 15:37
@ziggie1984
Collaborator Author

Swapped the order when we add the cleanup stop function to the garbage-collector. Let's see if tests pass.
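To illustrate the reordering, here is a minimal sketch under stated assumptions: `cleaner`, `subsystem`, and `startAll` are hypothetical simplifications, not the code in server.go. The idea is that each subsystem's `Stop` is registered before its `Start` is called, so a failing `Start` still gets its quit channel closed during cleanup.

```go
package main

import (
	"errors"
	"fmt"
)

// cleaner collects stop functions and runs them in reverse order.
type cleaner []func() error

func (c cleaner) add(f func() error) cleaner { return append(c, f) }

func (c cleaner) run() {
	for i := len(c) - 1; i >= 0; i-- {
		if err := c[i](); err != nil {
			fmt.Println("cleanup error:", err)
		}
	}
}

// subsystem is a hypothetical service with a quit channel.
type subsystem struct {
	name      string
	failStart bool
	quit      chan struct{}
}

func newSubsystem(name string, failStart bool) *subsystem {
	return &subsystem{name: name, failStart: failStart, quit: make(chan struct{})}
}

func (s *subsystem) Start() error {
	if s.failStart {
		return errors.New(s.name + ": start failed")
	}
	return nil
}

// Stop must be safe to call even if Start failed or never ran.
func (s *subsystem) Stop() error {
	close(s.quit)
	return nil
}

func (s *subsystem) isStopped() bool {
	select {
	case <-s.quit:
		return true
	default:
		return false
	}
}

// startAll registers Stop BEFORE calling Start, so the subsystem whose
// Start fails is still stopped by the cleanup run.
func startAll(subs []*subsystem) error {
	cleanup := cleaner{}
	for _, s := range subs {
		cleanup = cleanup.add(s.Stop)
		if err := s.Start(); err != nil {
			cleanup.run()
			return err
		}
	}
	return nil
}

func main() {
	a, b := newSubsystem("a", false), newSubsystem("b", true)
	fmt.Println("startAll error:", startAll([]*subsystem{a, b}))
	fmt.Println("a stopped:", a.isStopped(), "b stopped:", b.isStopped())
}
```

The trade-off is that every `Stop` now has to tolerate being called on a subsystem that never finished starting.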

@ziggie1984
Collaborator Author

@yyforyongyu while adding the interruptibility to the startup of the server, I figured out that we need to make sure that each stop call is atomic (only happens once). Otherwise we first call it in the cleanup method, and when we return an error it's also called in the server.Stop() function. While testing I hit such a panic with the invoice registry and also with the txpublisher.

But I think that if the tests pass, switching the cleanup order should have no side effects, and it can prevent some cases where subsystems depend on each other and therefore cannot shut down correctly in case one of them does not close its quit channel.
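A minimal sketch of the "only happens once" guard, assuming the `atomic.Bool` approach with `Swap(true)` that the review below mentions; `service` and the error values are hypothetical names used for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var (
	errAlreadyStarted = errors.New("service already started")
	errAlreadyStopped = errors.New("service already stopped")
)

// service is a hypothetical subsystem guarded by two atomic flags.
type service struct {
	started atomic.Bool
	stopped atomic.Bool
	quit    chan struct{}
}

func newService() *service {
	return &service{quit: make(chan struct{})}
}

// Start returns an error on every call after the first.
func (s *service) Start() error {
	if s.started.Swap(true) {
		return errAlreadyStarted
	}
	return nil
}

// Stop closes quit on the first call only; a second call returns an
// error instead of panicking on a double close.
func (s *service) Stop() error {
	if s.stopped.Swap(true) {
		return errAlreadyStopped
	}
	close(s.quit)
	return nil
}

func main() {
	s := newService()
	fmt.Println("first Stop:", s.Stop())
	fmt.Println("second Stop:", s.Stop())
}
```

`Swap` returns the previous value, so exactly one caller observes `false` and performs the close, even if the cleanup path and `server.Stop()` race.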

@saubyk saubyk added the P1 MUST be fixed or reviewed label Jun 25, 2024
@lightninglabs-deploy

@ziggie1984, remember to re-request review from reviewers when ready

Member

@yyforyongyu yyforyongyu left a comment

Left some comments, will check the itest logs to understand more about the new behavior.

@ziggie1984 ziggie1984 force-pushed the shutdown-bugfix branch 2 times, most recently from 0ac6836 to 5508ca3 Compare July 11, 2024 23:25
@ziggie1984
Collaborator Author

Let's see whether all the itests pass after the change to error out when a start/stop is called twice.

Member

@yyforyongyu yyforyongyu left a comment

Looking good, just a few nits and needs a rebase - think there's a new subserver added, we may need to change that here too.

@ziggie1984 ziggie1984 force-pushed the shutdown-bugfix branch 3 times, most recently from 928ef1a to 73c6a3b Compare July 23, 2024 09:55
@ellemouton
Collaborator

ellemouton commented Jul 25, 2024

just gonna make a note of the ones I've run into:

  1. the authgossiper one mentioned in my review
  2. Ran into this panic
  3. The InterceptableSwitch s.blockEpochStream.Cancel() in Stop panics.
  4. nil pointer dereference of close(n.quit) in (n *TxNotifier) TearDown(), which is caused by the TxNotifier constructors being called in the various notifier Start methods (and not in the notifier constructors). So TearDown is called on a nil txNotifier.

Found these by basically commenting out all the Start calls & thus only calling Stop calls

@yyforyongyu
Member

I back the change. However, I think it will end up revealing some panics in the cases where Stop methods depend on certain pointer members being set which are only set in Start methods. But in those cases, we should anyway either always set variables in the service constructors or do nil checks in the Stop methods where appropriate.

Good observation - I think it means if we want to safely move Stop before Start, we need to check all the Start methods and see if there are any struct initializations that should happen in New instead.

@ziggie1984
Collaborator Author

Thank you for this important analysis. I did not think about this; I will try to analyze all the cases and provide a proper solution.

@ziggie1984 ziggie1984 force-pushed the shutdown-bugfix branch 2 times, most recently from a488da4 to 5209f80 Compare July 27, 2024 12:43
@ziggie1984
Collaborator Author

the authgossiper one mentioned in my review
Ran into this panic
The InterceptableSwitch s.blockEpochStream.Cancel() in Stop panics.
nil pointer dereference of close(n.quit) in (n *TxNotifier) TearDown(), which is caused by the TxNotifier constructors being called in the various notifier Start methods (and not in the notifier constructors). So TearDown is called on a nil txNotifier.

Went through the list of stop/start methods and mostly added nil pointer checks in the stop() methods, because the design more or less requires the variables to be initialized in the start method (from my technical understanding).

Also went through your list of examples above and addressed them. The only exception was the panic you referred to in point 2, which was caused by the chainnotifier not running. However, we already start the chainnotifier before the SubSwapper, which is then able to subscribe to the channel events.

I don't think I covered every case in the code base where the stop method is called before the start method, but I focused on the subsystems changed by this PR.

@ziggie1984 ziggie1984 requested a review from ellemouton July 27, 2024 12:52
Collaborator

@ellemouton ellemouton left a comment

Thanks for the updates 🙏

Comment on lines 264 to 271
if i.expiryWatcher == nil {
return fmt.Errorf("InvoiceRegistry expiryWatcher not " +
"initialized")
}

Collaborator

i think we still want the rest of the function to run though. iiuc, the whole reason we want to be able to call Stop before Start is so that quit channels can be closed & hence sync processes in Start methods can be stopped

Collaborator Author

agree changed it to:

	if i.expiryWatcher != nil {
		i.expiryWatcher.Stop()
	}

Collaborator Author

we could also change the constructor of the invoice registry to catch the case where we have a nil pointer for the expiryWatcher. Went with the above for now, but happy to change it.
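The difference between returning early and guarding only the nil member can be sketched as follows. This is an illustrative simplification, not the InvoiceRegistry code: `registry`, `expiryWatcher`, and the assumption that the watcher is only set in `Start` are hypothetical.

```go
package main

import "fmt"

// expiryWatcher is a hypothetical dependency with its own Stop method.
type expiryWatcher struct{ running bool }

func (w *expiryWatcher) Stop() { w.running = false }

// registry is a hypothetical service whose watcher is assumed to be
// set only in Start, so it may still be nil when Stop is called.
type registry struct {
	watcher *expiryWatcher
	quit    chan struct{}
}

func newRegistry() *registry {
	return &registry{quit: make(chan struct{})}
}

func (r *registry) Start() {
	r.watcher = &expiryWatcher{running: true}
}

// Stop guards only the nil member and keeps going, so quit is closed
// even when Start never ran.
func (r *registry) Stop() {
	if r.watcher != nil {
		r.watcher.Stop()
	}
	close(r.quit)
}

func (r *registry) quitClosed() bool {
	select {
	case <-r.quit:
		return true
	default:
		return false
	}
}

func main() {
	r := newRegistry()
	r.Stop() // Stop before Start: no panic, quit still closed.
	fmt.Println("quit closed without Start:", r.quitClosed())
}
```

Returning an error at the nil check instead would skip `close(r.quit)`, which is exactly the step that lets goroutines waiting on the registry exit.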

@ziggie1984
Collaborator Author

@coderabbitai review

Contributor

coderabbitai bot commented Jul 29, 2024

Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Outside diff range, codebase verification and nitpick comments (2)
sweep/fee_bumper.go (1)

703-703: Ensure correct initialization of the new fields in the NewTxPublisher function.

The NewTxPublisher function should correctly initialize the new atomic boolean fields started and stopped. Currently, these fields are not being initialized.

  • File: sweep/fee_bumper.go
  • Function: NewTxPublisher

Please add the initialization for the started and stopped fields.

Analysis chain

Ensure correct initialization of the new fields in the NewTxPublisher function.

The NewTxPublisher function should correctly initialize the new atomic boolean fields started and stopped.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the `NewTxPublisher` function initializes the new fields correctly.

# Test: Search for the `NewTxPublisher` function implementation. Expect: Correct initialization of the `started` and `stopped` fields.
rg --type go -A 10 $'func NewTxPublisher(cfg TxPublisherConfig) *TxPublisher {'

Length of output: 258


Script:

#!/bin/bash
# Description: Verify the `NewTxPublisher` function initializes the new fields correctly.

# Test: Search for the `NewTxPublisher` function implementation. Expect: Correct initialization of the `started` and `stopped` fields.
rg --type go -A 20 'func NewTxPublisher\(cfg TxPublisherConfig\) \*TxPublisher {'

Length of output: 267


Script:

#!/bin/bash
# Description: Verify the `NewTxPublisher` function initializes the new fields correctly.

# Test: Search for the `NewTxPublisher` function implementation. Expect: Correct initialization of the `started` and `stopped` fields.
rg --type go -A 20 'func NewTxPublisher(cfg TxPublisherConfig) *TxPublisher {'

Length of output: 258


Script:

#!/bin/bash
# Description: Verify the `NewTxPublisher` function initializes the new fields correctly.

# Test: Search for the `NewTxPublisher` function implementation. Expect: Correct initialization of the `started` and `stopped` fields.
rg --type go -A 20 'func NewTxPublisher'

Length of output: 1276

invoices/invoiceregistry.go (1)

273-273: Ensure correct initialization of the new fields in the NewRegistry function.

The NewRegistry function does not initialize the new atomic boolean fields started and stopped. Please add the initialization for these fields.

  • invoices/invoiceregistry.go: NewRegistry function
Analysis chain

Ensure correct initialization of the new fields in the NewRegistry function.

The NewRegistry function should correctly initialize the new atomic boolean fields started and stopped.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the `NewRegistry` function initializes the new fields correctly.

# Test: Search for the `NewRegistry` function implementation. Expect: Correct initialization of the `started` and `stopped` fields.
rg --type go -A 10 $'func NewRegistry(idb InvoiceDB, expiryWatcher *InvoiceExpiryWatcher, cfg *RegistryConfig) *InvoiceRegistry {'

Length of output: 411


Script:

#!/bin/bash
# Description: Verify the `NewRegistry` function initializes the new fields correctly.

# Test: Search for the `NewRegistry` function implementation. Expect: Correct initialization of the `started` and `stopped` fields.
rg --type go -A 20 'func NewRegistry' 

Length of output: 1439

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b7c59b3 and 8c3abec.

Files selected for processing (11)
  • chainntnfs/bitcoindnotify/bitcoind.go (1 hunks)
  • chainntnfs/neutrinonotify/neutrino.go (1 hunks)
  • chanfitness/chaneventstore.go (4 hunks)
  • discovery/gossiper.go (1 hunks)
  • docs/release-notes/release-notes-0.18.3.md (1 hunks)
  • graph/builder.go (3 hunks)
  • htlcswitch/interceptable_switch.go (4 hunks)
  • invoices/invoiceregistry.go (2 hunks)
  • lnd.go (1 hunks)
  • server.go (6 hunks)
  • sweep/fee_bumper.go (3 hunks)
Additional context used
LanguageTool
docs/release-notes/release-notes-0.18.3.md

[grammar] ~40-~40: The word “shutdown” is a noun. The verb is spelled with a space.
Context: ...pull/8497) a case where LND would not shutdown properly when interrupted via e.g. SIGT...

(NOUN_VERB_CONFUSION)


[uncategorized] ~40-~40: A comma may be missing after the conjunctive/linking adverb ‘Moreover’.
Context: ...erly when interrupted via e.g. SIGTERM. Moreover LND now shutsdown correctly in case o...

(SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA)

Markdownlint
docs/release-notes/release-notes-0.18.3.md

39-39: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)

Additional comments not posted (57)
chanfitness/chaneventstore.go (4)

15-17: LGTM: Import of sync/atomic.

The import of sync/atomic is necessary for the atomic boolean flags used in the file.


53-55: LGTM: Addition of started and stopped fields.

The addition of started and stopped fields of type atomic.Bool enhances the lifecycle management of the ChannelEventStore.


150-154: LGTM: Prevent multiple invocations of Start.

The check using c.started.Swap(true) ensures that the Start method cannot be invoked more than once, preventing potential issues.


216-238: LGTM: Prevent multiple invocations of Stop and return an error.

The check using c.stopped.Swap(true) ensures that the Stop method cannot be invoked more than once, and returning an error adds robustness to the shutdown process.

htlcswitch/interceptable_switch.go (4)

7-7: LGTM: Import of sync/atomic.

The import of sync/atomic is necessary for the atomic boolean flags used in the file.


37-39: LGTM: Addition of started and stopped fields.

The addition of started and stopped fields of type atomic.Bool enhances the lifecycle management of the InterceptableSwitch.


208-213: LGTM: Prevent multiple invocations of Start.

The check using s.started.Swap(true) ensures that the Start method cannot be invoked more than once, preventing potential issues.


236-251: LGTM: Prevent multiple invocations of Stop and return an error.

The check using s.stopped.Swap(true) ensures that the Stop method cannot be invoked more than once, and returning an error adds robustness to the shutdown process.

lnd.go (4)

677-684: Good use of asynchronous server startup.

Starting the server within a goroutine allows for better control flow and graceful shutdown capabilities.


685-691: Good use of deferred function for graceful shutdown.

The deferred function ensures that the server is stopped gracefully, logging any errors that occur during the shutdown process.


693-702: Improved control flow with select statement.

The select statement allows for better error handling and graceful shutdown capabilities by listening for either an error from errChan or a shutdown signal from interceptor.ShutdownChannel().


700-702: Proper use of break statement in select block.

The break statement ensures that the function proceeds normally if the server starts successfully.
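The control flow described for lnd.go can be sketched roughly as below. This is a simplified illustration under assumptions, not the actual change: `runServer` and the demo helpers are hypothetical, the `shutdown` channel stands in for interceptor.ShutdownChannel(), and the sketch returns right after the select instead of continuing with the rest of startup.

```go
package main

import (
	"errors"
	"fmt"
)

var errStartFailed = errors.New("start failed")

// runServer launches start in its own goroutine and then waits for
// either its result or a shutdown signal. stop always runs on return.
func runServer(start func() error, stop func(), shutdown <-chan struct{}) error {
	// Buffered so the goroutine can finish even if nobody reads the result.
	errChan := make(chan error, 1)
	go func() { errChan <- start() }()

	defer stop()

	select {
	case err := <-errChan:
		return err
	case <-shutdown:
		return nil
	}
}

// demoFailedStart: start fails right away and no shutdown is requested.
func demoFailedStart() error {
	shutdown := make(chan struct{}) // never closed
	return runServer(func() error { return errStartFailed }, func() {}, shutdown)
}

// demoInterruptedStart: start blocks until quit is closed, and the
// shutdown signal arrives while it is still blocked.
func demoInterruptedStart() (stopped bool, err error) {
	quit := make(chan struct{})
	shutdown := make(chan struct{})
	close(shutdown)

	start := func() error {
		<-quit
		return errors.New("start interrupted")
	}
	stop := func() {
		stopped = true
		close(quit) // releases the blocked start
	}

	err = runServer(start, stop, shutdown)
	return stopped, err
}

func main() {
	fmt.Println("failed start:", demoFailedStart())
	stopped, err := demoInterruptedStart()
	fmt.Println("interrupted start: stopped =", stopped, "err =", err)
}
```

In the second demo, the shutdown branch returns while `start` is still blocked, and the deferred `stop` is what releases it by closing `quit`.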

chainntnfs/bitcoindnotify/bitcoind.go (1)

154-159: Robustness improvement with nil check for txNotifier.

The conditional statement ensures that txNotifier is not nil before attempting to call TearDown, preventing runtime errors associated with dereferencing a nil pointer.

chainntnfs/neutrinonotify/neutrino.go (1)

156-160: Robustness Improvement: Conditional Check for txNotifier.

The added conditional check ensures that n.txNotifier is not nil before calling its TearDown method. This prevents potential nil pointer dereference errors and enhances the robustness of the Stop method.

sweep/fee_bumper.go (4)

264-265: LGTM! The addition of atomic boolean fields improves state management.

The started and stopped fields ensure that the TxPublisher can only be started or stopped once, enhancing robustness.


Line range hint 672-686:
LGTM! The Start method now ensures single invocation.

The use of t.started.Swap(true) prevents multiple starts, and the logging statement aligns better with the logical flow.


691-703: LGTM! The Stop method now ensures single invocation and returns an error.

The use of t.stopped.Swap(true) prevents multiple stops, and the method signature update aligns with Go best practices.


703-703: Ensure proper handling of the quit channel in the monitor method.

The monitor method should correctly handle the quit channel, which is closed in the Stop method, to ensure a graceful shutdown.

Verification successful

The monitor method correctly handles the quit channel.

The monitor method includes a select statement that properly handles the quit channel, ensuring a graceful shutdown when the quit signal is received.

  • The quit channel is checked within the select statement, and the method returns when a message is received on the quit channel.

Code Snippet:

for {
	select {
	case epoch, ok := <-blockEvent.Epochs:
		if !ok {
			log.Error("Block epoch channel closed, exit monitor")
			return
		}
		log.Debugf("TxPublisher received new block: %v", epoch.Height)
		t.currentHeight.Store(epoch.Height)
		t.processRecords()
	case <-t.quit:
		log.Debug("Fee bumper stopped, exit monitor")
		return
	}
}
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the `monitor` method handles the `quit` channel correctly.

# Test: Search for the `quit` channel usage in the `monitor` method. Expect: Proper handling of the `quit` channel.
rg --type go -A 10 $'func (t *TxPublisher) monitor(blockEvent *chainntnfs.BlockEpochEvent) {'

Length of output: 300


Script:

#!/bin/bash
# Description: Verify the `monitor` method handles the `quit` channel correctly.

# Test: Search for the `monitor` method and include more lines to capture the full method implementation.
rg --type go -A 50 'func (t *TxPublisher) monitor(blockEvent *chainntnfs.BlockEpochEvent) {'

Length of output: 300


Script:

#!/bin/bash
# Description: Verify the `monitor` method handles the `quit` channel correctly.

# Test: Search for the `monitor` method and include more lines to capture the full method implementation.
rg --type go -A 50 'func (t \*TxPublisher\) monitor(blockEvent \*chainntnfs.BlockEpochEvent) {'

Length of output: 309


Script:

#!/bin/bash
# Description: Verify the `monitor` method handles the `quit` channel correctly.

# Test: Search for the `monitor` method and include more lines to capture the full method implementation.
rg --type go -A 50 'func (t *TxPublisher) monitor(blockEvent *chainntnfs.BlockEpochEvent) {'

Length of output: 300


Script:

#!/bin/bash
# Description: Verify the `monitor` method handles the `quit` channel correctly.

# Test: Use `ast-grep` to search for the `monitor` method and include more lines to capture the full method implementation.
ast-grep --lang go --pattern $'func (t *TxPublisher) monitor(blockEvent *chainntnfs.BlockEpochEvent) {
  $$$
}'

Length of output: 1650

invoices/invoiceregistry.go (4)

104-105: LGTM! The addition of atomic boolean fields improves state management.

The started and stopped fields ensure that the InvoiceRegistry can only be started or stopped once, enhancing robustness.


219-250: LGTM! The Start method now ensures single invocation.

The use of i.started.Swap(true) prevents multiple starts, and the logging statement provides clear information about the state transitions.


255-273: LGTM! The Stop method now ensures single invocation and includes logging.

The use of i.stopped.Swap(true) prevents multiple stops, and the logging statements enhance visibility into the state transitions.


273-273: Ensure proper handling of the quit channel in the invoiceEventLoop method.

The invoiceEventLoop method should correctly handle the quit channel, which is closed in the Stop method, to ensure a graceful shutdown.

Verification successful

Let's correct the regex pattern and rerun the script to verify the handling of the quit channel in the invoiceEventLoop method.


Let's simplify the regex pattern and rerun the script to verify the handling of the quit channel in the invoiceEventLoop method.


Let's simplify the regex pattern even further and rerun the script to verify the handling of the quit channel in the invoiceEventLoop method.


Let's use ast-grep to search for the invoiceEventLoop method and verify the handling of the quit channel.


Ensure proper handling of the quit channel in the invoiceEventLoop method.

The invoiceEventLoop method correctly handles the quit channel by returning from the loop when the quit channel is closed, ensuring a graceful shutdown.

  • invoices/invoiceregistry.go:348-349: The select statement listens on the quit channel and returns from the method once that channel is closed.
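
The quit-channel pattern referred to here can be sketched roughly as below. The `worker` type and its fields are hypothetical and only illustrate the shape of such an event loop; the real invoiceEventLoop handles different events.

```go
package main

import (
	"fmt"
	"sync"
)

// worker is a hypothetical subsystem with a single event loop goroutine.
type worker struct {
	events chan int
	quit   chan struct{}
	wg     sync.WaitGroup
	total  int // only touched by the event loop until stop returns
}

func newWorker() *worker {
	return &worker{
		events: make(chan int),
		quit:   make(chan struct{}),
	}
}

// start launches the event loop and tracks it in the wait group.
func (w *worker) start() {
	w.wg.Add(1)
	go w.eventLoop()
}

// eventLoop processes events until the quit channel is closed. A receive
// on a closed channel never blocks, so the quit case fires on shutdown.
func (w *worker) eventLoop() {
	defer w.wg.Done()

	for {
		select {
		case ev := <-w.events:
			w.total += ev

		case <-w.quit:
			return
		}
	}
}

// stop signals the loop to exit and waits until it has returned.
func (w *worker) stop() {
	close(w.quit)
	w.wg.Wait()
}

func main() {
	w := newWorker()
	w.start()
	w.events <- 2
	w.events <- 3
	w.stop()
	fmt.Println("total:", w.total)
}
```

Waiting on the wait group in `stop` is what makes the shutdown graceful: the caller only proceeds once the goroutine has actually exited.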
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the `invoiceEventLoop` method handles the `quit` channel correctly.

# Test: Search for the `quit` channel usage in the `invoiceEventLoop` method. Expect: Proper handling of the `quit` channel.
rg --type go -A 10 $'func (i *InvoiceRegistry) invoiceEventLoop() {'

Length of output: 225


Script:

#!/bin/bash
# Description: Verify the `invoiceEventLoop` method handles the `quit` channel correctly.

# Test: Search for the `quit` channel usage in the `invoiceEventLoop` method. Expect: Proper handling of the `quit` channel.
rg --type go -A 20 'func (i \*InvoiceRegistry\) invoiceEventLoop\(\) {'

Length of output: 237


Script:

#!/bin/bash
# Description: Verify the `invoiceEventLoop` method handles the `quit` channel correctly.

# Test: Search for the `quit` channel usage in the `invoiceEventLoop` method. Expect: Proper handling of the `quit` channel.
rg --type go -A 20 'func \(i \*InvoiceRegistry\) invoiceEventLoop\(\) {'

Length of output: 240


Script:

#!/bin/bash
# Description: Verify the `invoiceEventLoop` method handles the `quit` channel correctly.

# Test: Search for the `quit` channel usage in the `invoiceEventLoop` method. Expect: Proper handling of the `quit` channel.
rg --type go -A 20 'func .*invoiceEventLoop.*\(\) {'

Length of output: 180


Script:

#!/bin/bash
# Description: Verify the `invoiceEventLoop` method handles the `quit` channel correctly.

# Test: Search for the `invoiceEventLoop` method and its handling of the `quit` channel.
ast-grep --lang go --pattern $'func (i *InvoiceRegistry) invoiceEventLoop() {\n  $$$\n}'

Length of output: 3609

graph/builder.go (2)

303-303: Approved: Log statement addition in Start function.

The debug log statement log.Debug("Builder started") enhances visibility into the startup process of the builder.


329-329: Approved: Log statement modification in Stop function.

The debug log statement log.Debug("Builder shutdown complete") is now placed correctly to log immediately after the shutdown sequence is concluded.

discovery/gossiper.go (1)

756-761: Robustness Improvement: Added nil check for d.blockEpochs

The addition of the nil check before calling d.blockEpochs.Cancel() prevents potential panics if d.blockEpochs is not initialized. This enhances the robustness of the Stop method and ensures a safer shutdown process.
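
A rough sketch of such a guard is shown below. The `epochEvent` and `gossiper` types are hypothetical simplifications; the assumption is only that the subscription is a pointer that Start assigns, so it is still nil if Start never completed.

```go
package main

import "fmt"

// epochEvent is a hypothetical stand-in for a block epoch subscription
// that exposes a Cancel function.
type epochEvent struct {
	Cancel func()
}

// gossiper is a hypothetical subsystem whose blockEpochs field is only
// set once Start has registered for block notifications.
type gossiper struct {
	blockEpochs *epochEvent
}

// Stop may run even though Start failed early, so it must not
// dereference blockEpochs unconditionally.
func (g *gossiper) Stop() {
	// Without this check, g.blockEpochs.Cancel() would panic with a
	// nil pointer dereference when Start never assigned the field.
	if g.blockEpochs != nil {
		g.blockEpochs.Cancel()
	}
}

func main() {
	// Start never ran: Stop is a no-op instead of a panic.
	notStarted := &gossiper{}
	notStarted.Stop()

	// Start ran: Stop cancels the subscription.
	started := &gossiper{
		blockEpochs: &epochEvent{
			Cancel: func() { fmt.Println("subscription cancelled") },
		},
	}
	started.Stop()
}
```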

server.go (32)

Line range hint 1883-1893:
Initialize cleanup with the first subsystem.

The cleanup variable is initialized and the first subsystem (customMessageServer) is added to the cleanup list. This ensures that if any subsequent subsystem fails to start, the already started subsystems will be stopped in reverse order.
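
The reverse-order cleanup idea can be sketched as follows. The `cleaner` helper, the `subsystem` type, and `startAll` are written for illustration and are assumptions about the general shape, not a copy of server.go. The sketch registers each Stop before calling Start, on the assumption that a subsystem that fails halfway through Start should also be stopped.

```go
package main

import "fmt"

// cleaner collects stop functions of subsystems in start order.
type cleaner []func() error

// add returns a new cleaner with the given stop function appended.
func (c cleaner) add(stop func() error) cleaner {
	return append(c, stop)
}

// run calls the registered stop functions in reverse order, so the
// subsystem started last is stopped first.
func (c cleaner) run() {
	for i := len(c) - 1; i >= 0; i-- {
		if err := c[i](); err != nil {
			fmt.Printf("cleanup failed: %v\n", err)
		}
	}
}

// subsystem is a hypothetical component that records when it is stopped.
type subsystem struct {
	name      string
	failStart bool
	stopped   *[]string
}

func (s *subsystem) Start() error {
	if s.failStart {
		return fmt.Errorf("%s failed to start", s.name)
	}
	return nil
}

func (s *subsystem) Stop() error {
	*s.stopped = append(*s.stopped, s.name)
	return nil
}

// startAll starts the subsystems in order. On the first failure it stops
// everything registered so far, in reverse order, and returns the error.
func startAll(subs ...*subsystem) error {
	cleanup := cleaner{}
	for _, s := range subs {
		cleanup = cleanup.add(s.Stop)
		if err := s.Start(); err != nil {
			cleanup.run()
			return err
		}
	}
	return nil
}

func main() {
	var stopped []string
	err := startAll(
		&subsystem{name: "a", stopped: &stopped},
		&subsystem{name: "b", stopped: &stopped},
		&subsystem{name: "c", failStart: true, stopped: &stopped},
	)
	fmt.Println("start error:", err)
	fmt.Println("stop order:", stopped)
}
```

Because Stop can now run for a subsystem whose Start did not finish, each Stop has to tolerate partially initialized state.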


1900-1900: Add host announcer to cleanup list.

The hostAnn subsystem is conditionally added to the cleanup list, ensuring it is stopped if the startup process fails.


1908-1908: Add liveness monitor to cleanup list.

The livenessMonitor subsystem is conditionally added to the cleanup list, ensuring it is stopped if the startup process fails.


1920-1920: Add signature pool to cleanup list.

The sigPool subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


1926-1926: Add write pool to cleanup list.

The writePool subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


1932-1932: Add read pool to cleanup list.

The readPool subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


1938-1938: Add chain notifier to cleanup list.

The cc.ChainNotifier subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


1944-1944: Add best block tracker to cleanup list.

The cc.BestBlockTracker subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


1950-1950: Add channel notifier to cleanup list.

The channelNotifier subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


1956-1958: Add peer notifier to cleanup list.

The peerNotifier subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


1964-1964: Add HTLC notifier to cleanup list.

The htlcNotifier subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


1971-1971: Add tower client manager to cleanup list.

The towerClientMgr subsystem is conditionally added to the cleanup list, ensuring it is stopped if the startup process fails.


1978-1978: Add transaction publisher to cleanup list.

The txPublisher subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


1984-1984: Add UTXO sweeper to cleanup list.

The sweeper subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


1990-1990: Add UTXO nursery to cleanup list.

The utxoNursery subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


1996-1996: Add breach arbitrator to cleanup list.

The breachArbitrator subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


2002-2002: Add funding manager to cleanup list.

The fundingMgr subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


2011-2011: Add HTLC switch to cleanup list.

The htlcSwitch subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


2017-2017: Add interceptable switch to cleanup list.

The interceptableSwitch subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


2023-2023: Add chain arbitrator to cleanup list.

The chainArb subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


2029-2030: Add graph builder to cleanup list.

The graphBuilder subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


2035-2036: Add channel router to cleanup list.

The chanRouter subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


2042-2043: Add authenticated gossiper to cleanup list.

The authGossiper subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


2048-2048: Add invoices registry to cleanup list.

The invoices subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


2054-2054: Add sphinx to cleanup list.

The sphinx subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


2060-2060: Add channel status manager to cleanup list.

The chanStatusMgr subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


2066-2066: Add channel event store to cleanup list.

The chanEventStore subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


2113-2113: Add channel sub swapper to cleanup list.

The chanSubSwapper subsystem is added to the cleanup list, ensuring it is stopped if the startup process fails.


2120-2120: Add Tor controller to cleanup list.

The torController subsystem is conditionally added to the cleanup list, ensuring it is stopped if the startup process fails.


2137-2137: Start connection manager last.

The connMgr is started last to prevent connections before initialization is complete. This ensures that all necessary subsystems are up and running before accepting connections.


2324-2326: Add error handling for txPublisher.Stop.

The txPublisher.Stop method now includes error handling to log any issues encountered during the stop process.


2346-2349: Add channel event store to stop process.

The chanEventStore.Stop method is now included in the stop process, ensuring it is properly stopped and any errors are logged.

Member

@yyforyongyu yyforyongyu left a comment

Think we are missing a few nil checks:

diff --git a/htlcswitch/link.go b/htlcswitch/link.go
index f39a12b2b..7b5b60295 100644
--- a/htlcswitch/link.go
+++ b/htlcswitch/link.go
@@ -533,6 +533,7 @@ func (l *channelLink) Start() error {
 		}()
 	}
 
+	// Needs to check this.
 	l.updateFeeTimer = time.NewTimer(l.randomFeeUpdateTimeout())
 
 	l.wg.Add(1)
diff --git a/lnwallet/chainfee/estimator.go b/lnwallet/chainfee/estimator.go
index d9a402964..0f291b724 100644
--- a/lnwallet/chainfee/estimator.go
+++ b/lnwallet/chainfee/estimator.go
@@ -860,6 +860,7 @@ func (w *WebAPIEstimator) Start() error {
 	log.Infof("Web API fee estimator using update timeout of %v",
 		feeUpdateTimeout)
 
+	// Needs to check this.
 	w.updateFeeTicker = time.NewTicker(feeUpdateTimeout)
 
 	w.wg.Add(1)
diff --git a/tor/controller.go b/tor/controller.go
index 47ea6e129..9c5eb13d6 100644
--- a/tor/controller.go
+++ b/tor/controller.go
@@ -164,6 +164,7 @@ func (c *Controller) Start() error {
 		return fmt.Errorf("unable to connect to Tor server: %w", err)
 	}
 
+	// Need check this.
 	c.conn = conn
 
 	return c.authenticate()

Make sure that each subsystem only starts and stops once. This makes
sure we don't close e.g. quit channels twice.
This commit does two things. It starts up the server in a way that
it can be interrupted and shut down gracefully.
Moreover it makes sure that subsystems clean themselves up when
they fail to start. This makes sure that dependent subsystems can
shut down gracefully as well and the shutdown process is not stuck.
With this PR we might call the stop method even when the start
method of a subsystem did not successfully finish. Therefore we
need to guard the stop methods against potential panics
if some variables are not initialized in the constructors of the
subsystems.

Member

@yyforyongyu yyforyongyu left a comment

LGTM🙏 Would love to see some unit tests but it cannot be done atm. Have some rough ideas about how to implement #8958. I tried my best to check all the possible nil-panic cases, but it's only a pair of human eyes. Think we should proceed quickly to #8958 and add tests there.

Collaborator

@ellemouton ellemouton left a comment

🙏

Comment on lines 231 to 239
err = fmt.Errorf("ChannelEventStore FlapCountTicker not " +
"initialized")
} else {
c.cfg.FlapCountTicker.Stop()
}

log.Debugf("ChannelEventStore shutdown complete")

- return nil
+ return err

Collaborator

Non-blocking: I'd say this is an error worth logging but not returning. The Stop function itself did not error here; it was just that Start never ran or completed. Returning it makes it seem like "error stopping chanEventStore" even though there wasn't really an error stopping it. But definitely not a big deal.
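
The alternative suggested in this comment could look roughly like the sketch below. The `store` type and its `ticker` field are hypothetical, and the standard library `time.Ticker` and `log` package are used only to keep the example self-contained.

```go
package main

import (
	"log"
	"time"
)

// store is a hypothetical subsystem whose ticker is only set by Start.
type store struct {
	ticker *time.Ticker
}

// Start initializes the ticker.
func (s *store) Start() error {
	s.ticker = time.NewTicker(time.Minute)
	return nil
}

// Stop logs a missing ticker instead of returning an error, because a
// nil ticker means Start never completed, not that stopping failed.
func (s *store) Stop() error {
	if s.ticker == nil {
		log.Println("store ticker not initialized, nothing to stop")
		return nil
	}

	s.ticker.Stop()
	return nil
}

func main() {
	// Stop without Start: a log line, but no error for the caller.
	if err := (&store{}).Stop(); err != nil {
		log.Fatalf("unexpected error: %v", err)
	}

	// Normal start and stop.
	s := &store{}
	if err := s.Start(); err != nil {
		log.Fatalf("unexpected error: %v", err)
	}
	if err := s.Stop(); err != nil {
		log.Fatalf("unexpected error: %v", err)
	}
}
```

The trade-off is that the caller's shutdown log no longer reports an error for a subsystem that was never running, while the condition is still recorded in the subsystem's own log.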

Collaborator

same comment for a few other spots in this commit

@Roasbeef Roasbeef merged commit 4a3c4e4 into lightningnetwork:master Aug 1, 2024
31 of 34 checks passed
Labels
P1 MUST be fixed or reviewed
Projects
None yet
6 participants