Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protect misconfiguring catchupModeBlockGap < catchupModePageSize and fix catchup #199

Merged
merged 2 commits into from
Feb 14, 2022

Conversation

peterbroadhurst
Copy link
Contributor

@peterbroadhurst peterbroadhurst commented Feb 4, 2022

See #198 for the 🤕 behavior you can get if we don't protect against this.

The logic here has an implicit assumption of this rule, so that when we read a page it's never going to extend past the blockNumber of the header of the chain at the point we query.

Otherwise, we think:

  • Gap to head 1234 is > 250 - so we're in catchup
  • Read 5000 (page size) - cool we've read 5000 blocks ... but actually there were only 1234 to read!!!
  • ok, now we're not in catchup any more - start reading from where we left off - but that's now skipped a whole bunch of blocks.

func (s *subscription) restartFilter(ctx context.Context, checkpoint *big.Int) error {
ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()
since := checkpoint
if s.catchupBlock != nil {
// If we're already in catchup mode, we need to look at the current catchupBlock,
// not the checkpoint.
since = s.catchupBlock
}
blockNumber := ethbinding.HexBigInt{}
err := s.rpc.CallContext(ctx, &blockNumber, "eth_blockNumber")
if err != nil {
return errors.Errorf(errors.RPCCallReturnedError, "eth_blockNumber", err)
}
blockGap := new(big.Int).Sub(blockNumber.ToInt(), since).Int64()
log.Debugf("%s: restarting. Head=%s Position=%s Gap=%d (catchup threshold: %d)", s.logName, blockNumber.ToInt().String(), since.String(), blockGap, s.catchupModeBlockGap)
if s.catchupModeBlockGap > 0 && blockGap > s.catchupModeBlockGap {
s.catchupBlock = since // note if we were already in catchup, this does not change anything
return nil
}
return s.createFilter(ctx, since)
}

@codecov-commenter
Copy link

codecov-commenter commented Feb 4, 2022

Codecov Report

Merging #199 (35dbd32) into main (f6c7fe5) will increase coverage by 0.01%.
The diff coverage is n/a.

❗ Current head 35dbd32 differs from pull request most recent head 55d5515. Consider uploading reports for the commit 55d5515 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #199      +/-   ##
==========================================
+ Coverage   97.21%   97.23%   +0.01%     
==========================================
  Files          58       58              
  Lines        6897     6900       +3     
==========================================
+ Hits         6705     6709       +4     
+ Misses        147      146       -1     
  Partials       45       45              
Impacted Files Coverage Δ
ethconnect/internal/events/eventstream.go 97.16% <0.00%> (-0.01%) ⬇️
ethconnect/internal/events/submanager.go 100.00% <0.00%> (ø)
ethconnect/internal/events/logprocessor.go 98.00% <0.00%> (+1.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6c7fe5...55d5515. Read the comment docs.

@peterbroadhurst peterbroadhurst changed the title Protect users for misconfiguring catchupModeBlockGap < catchupModePageSize Protect misconfiguring catchupModeBlockGap < catchupModePageSize and fix catchup Feb 4, 2022
@peterbroadhurst
Copy link
Contributor Author

I found we still were not writing checkpoint HWM updates while replaying in catchup mode.
There had been an attempted fix in #157, but it didn't get to the root cause - so I've backed that out.

There was a check here that architecturally was meant to address this when #130 was first implemented, but it has a special case of failing when highestDispatched is 0:

if lp.highestDispatched.Cmp(&lp.blockHWM) < 0 {

So I've just made highestDispatched init to -1.

@@ -428,14 +427,12 @@ func (a *eventStream) eventPoller() {
blockHeight, exists := checkpoint[sub.info.ID]
if !exists || blockHeight.Cmp(big.NewInt(0)) <= 0 {
blockHeight, err = sub.setInitialBlockHeight(ctx)
} else {
} else if !sub.inCatchupMode() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were constantly restoring back to the checkpoint (and logging that fact) in catchup mode. No need

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this took me a bit of time to understand, so adding a note here for my future self. The checkpoint is updated in catchup mode as part of batchComplete() call: https://github.com/kaleido-io/firefly-ethconnect/blob/511d55a563aa7311c310ecadd4aa3040760abead/internal/events/logprocessor.go#L75-L76

changed = changed || blockUpdatedFilterStale || i1 == nil || i1.Cmp(&i2) != 0
subChanged := i1 == nil || i1.Cmp(&i2) != 0
if subChanged {
log.Debugf("%s: New checkpoint HWM: %s", a.spec.ID, i2.String())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced other less helpful logging, to just show the one subscription that updated its checkpoint

subID: subID,
event: event,
stream: stream,
}
lp.highestDispatched.SetInt64(-1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key fix - start with highestDispatched at -1 so compare to 0 works

lp.blockHWM.Set(blockNumber)
log.Debugf("%s: HWM: %s", lp.subID, lp.blockHWM.String())
// Nothing in-flight, its safe to update the HWM - to one after the block we're up to
lp.blockHWM.Set(new(big.Int).Add(blockNumber, big.NewInt(1)))
Copy link
Contributor Author

@peterbroadhurst peterbroadhurst Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this, each restart we were going back one block.

Note very important we do new(big.Int) here, so we are not adding to the blockNumber passed in - which is a pointer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, definitely missed this the last time I worked on fixing the restart!

@@ -122,6 +122,10 @@ func NewSubscriptionManager(conf *SubscriptionManagerConf, rpc eth.RPCClient, cr
if conf.CatchupModePageSize <= 0 {
conf.CatchupModePageSize = defaultCatchupModePageSize
}
if conf.CatchupModeBlockGap < conf.CatchupModePageSize {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key fix - stop users from setting invalid combination

@@ -410,7 +414,7 @@ func (s *subscriptionMGR) loadCheckpoint(streamID string) (map[string]*big.Int,
func (s *subscriptionMGR) storeCheckpoint(streamID string, checkpoint map[string]*big.Int) error {
cpID := checkpointIDPrefix + streamID
b, _ := json.MarshalIndent(&checkpoint, "", " ")
log.Debugf("Storing checkpoint %s: %s", cpID, string(b))
log.Tracef("Storing checkpoint %s: %s", cpID, string(b))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to trace as we now log at Debug the individual subs above, only when they change

Copy link
Contributor

@vdamle vdamle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with reverting the previous iteration of attempted fix, @peterbroadhurst . Changes LGTM

@vdamle vdamle merged commit f67b840 into hyperledger:main Feb 14, 2022
@vdamle vdamle deleted the block-gap branch February 14, 2022 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants