Stream state resets if the Jetstream store directory goes into read only mode [v2.10.23] #6211

Open
sudojha opened this issue Dec 4, 2024 · 3 comments · May be fixed by #6292
Labels
defect Suspected defect such as a bug or regression

Comments

@sudojha

sudojha commented Dec 4, 2024

Observed behavior

When the JetStream storage directory becomes read-only, the in-memory stream state is reset.

[Screenshot: 2024-12-04, 11:36:17 AM]

It takes a restart to restore the stream state.

Code inspection revealed the following:

For a stream with TTL enabled, expireMsgs is called periodically.

It internally calls removeMsgBlock. Once the last message block is removed, a new tombstone block is created and must be assigned to lmb. Because the store directory is read-only, the error is returned before the new block is assigned to lmb.

This causes subsequent publishes to fail until the file system recovers and the server is restarted.
Why does subsequent publish fail?
When a message block is about to be removed from the file system, dirtyCloseWithRemove is called. This method clears the block's state, in particular mfn, which stores the fully qualified file name of the message block on disk.
The in-memory representation of each message block is stored in the filestore under blks, and the filestore's lmb shares the same reference as the last entry. When the last block (which is also lmb) is removed, its state is cleared, and because the file system is read-only, no new block replaces it. From then on, publishes fail with the following errors, even after the file system recovers:

[Screenshot: 2024-12-04, 12:02:20 PM]

Note that the mfn of lmb is empty, so the message block file cannot be opened.
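
The sequence described above can be modeled with a small, self-contained Go sketch. This is not nats-server code: the types `msgBlock` and `fileStore`, the `readOnly` flag, and the helpers `newMsgBlock`, `removeLastBlock`, and `write` are hypothetical simplifications that only borrow the field names `blks`, `lmb`, and `mfn` from the description. It assumes the ordering reported in this issue: the old block's state is cleared first, and the replacement block is created afterwards.

```go
package main

import (
	"errors"
	"fmt"
)

// msgBlock is a heavily simplified, hypothetical stand-in for a message block.
type msgBlock struct {
	index uint32
	mfn   string // file name of the block on disk; empty once cleared
}

// fileStore is a hypothetical stand-in: blks holds all blocks, lmb the last one.
type fileStore struct {
	blks     []*msgBlock
	lmb      *msgBlock
	readOnly bool // simulates a read-only store directory
}

var errReadOnly = errors.New("read-only file system")

// newMsgBlock simulates creating a block file; it fails while read-only
// and only assigns lmb on success.
func (fs *fileStore) newMsgBlock(index uint32) (*msgBlock, error) {
	if fs.readOnly {
		return nil, errReadOnly
	}
	mb := &msgBlock{index: index, mfn: fmt.Sprintf("%d.blk", index)}
	fs.blks = append(fs.blks, mb)
	fs.lmb = mb
	return mb, nil
}

// removeLastBlock clears the last block first, then tries to create its
// replacement. If creation fails, lmb still points at the cleared block.
func (fs *fileStore) removeLastBlock() error {
	mb := fs.lmb
	if mb == nil {
		return errors.New("no block to remove")
	}
	mb.mfn = "" // state cleared, as described for dirtyCloseWithRemove
	fs.blks = fs.blks[:len(fs.blks)-1]
	_, err := fs.newMsgBlock(mb.index + 1)
	return err
}

// write simulates a publish: it needs a last block with a usable file name.
func (fs *fileStore) write() error {
	if fs.lmb == nil || fs.lmb.mfn == "" {
		return errors.New("cannot open message block: empty file name")
	}
	return nil
}

func main() {
	fs := &fileStore{}
	if _, err := fs.newMsgBlock(1); err != nil {
		fmt.Println("setup failed:", err)
		return
	}

	fs.readOnly = true // store directory becomes read-only
	fmt.Println("expire while read-only:", fs.removeLastBlock())

	fs.readOnly = false // file system recovers
	fmt.Println("publish after recovery:", fs.write())
}
```

In this model the publish still fails after `readOnly` is cleared, because nothing ever replaces the cleared `lmb`. That matches the symptom reported here, where only a restart rebuilds the state.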

Expected behavior

Ideally, nats-server should recover automatically once the file system recovers, without needing a restart.
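
One way to picture this expectation is a write path that recreates the last block when it finds it missing or cleared. The Go sketch below is an illustration under stated assumptions only: it is neither nats-server code nor the approach taken in any linked pull request, and the names `block`, `store`, `newBlock`, `ensureLastBlock`, and `write` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// block and store are hypothetical, simplified stand-ins for a message
// block and a file store.
type block struct {
	index uint32
	mfn   string // file name on disk; empty means the block was cleared
}

type store struct {
	blks     []*block
	lmb      *block
	readOnly bool // simulates a read-only store directory
}

var errStoreReadOnly = errors.New("read-only file system")

// newBlock simulates creating a block file and assigns lmb only on success.
func (s *store) newBlock(index uint32) (*block, error) {
	if s.readOnly {
		return nil, errStoreReadOnly
	}
	b := &block{index: index, mfn: fmt.Sprintf("%d.blk", index)}
	s.blks = append(s.blks, b)
	s.lmb = b
	return b, nil
}

// ensureLastBlock recreates the last block if it is missing or cleared.
func (s *store) ensureLastBlock() error {
	if s.lmb != nil && s.lmb.mfn != "" {
		return nil
	}
	next := uint32(1)
	if s.lmb != nil {
		next = s.lmb.index + 1
	}
	_, err := s.newBlock(next)
	return err
}

// write simulates a publish that repairs the last block before using it.
func (s *store) write() error {
	if err := s.ensureLastBlock(); err != nil {
		return fmt.Errorf("write failed: %w", err)
	}
	return nil
}

func main() {
	s := &store{}
	if err := s.write(); err != nil {
		fmt.Println("initial write failed:", err)
		return
	}

	// Simulate expiry clearing the last block while the directory is read-only.
	s.readOnly = true
	s.lmb.mfn = ""
	s.blks = nil
	fmt.Println("while read-only:", s.write())

	// Once the file system recovers, the next write recreates the block.
	s.readOnly = false
	fmt.Println("after recovery:", s.write())
	fmt.Println("last block file:", s.lmb.mfn)
}
```

The design choice here is to treat a cleared last block as a recoverable condition checked on every write, rather than as state that only a restart can rebuild.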

Server and client version

Server version: 2.10.11, running in standalone mode with JetStream enabled.
Also reproducible in 2.10.23-RC.7.

Host environment

macOS 14.4 (23E214) (should be reproducible on any system)

Steps to reproduce

Create a stream with max age configured, e.g.:
{ "name": "sko", "subjects": [ "sko" ], "retention": "workqueue", "max_consumers": -1, "max_msgs_per_subject": -1, "max_msgs": -1, "max_bytes": -1, "max_age": 10000000000, "max_msg_size": -1, "storage": "file", "discard": "new", "num_replicas": 1, "duplicate_window": 5000000000, "sealed": false, "deny_delete": false, "deny_purge": false, "allow_rollup_hdrs": false, "allow_direct": true, "mirror_direct": false, "consumer_limits": {} }

  1. Publish messages to the stream.
  2. Remove write access from the JetStream store directory (chmod -R u-w *).
  3. Stop publishing messages.
  4. Wait for ageCheck to kick in.

This causes the stream's first and last sequences to reset to 0.
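
When following these steps, it can help to confirm that the store directory really is unwritable for the server process before waiting for expiry. The Go sketch below is a hypothetical helper, not part of nats-server: `dirWritable` probes a directory by trying to create and then remove a temporary file with the standard library's `os.CreateTemp`. The default directory and the probe file pattern are arbitrary choices.

```go
package main

import (
	"fmt"
	"os"
)

// dirWritable reports whether a file can currently be created in dir.
// It creates a temporary probe file and removes it again on success.
func dirWritable(dir string) bool {
	f, err := os.CreateTemp(dir, ".probe-*")
	if err != nil {
		return false
	}
	name := f.Name()
	f.Close()
	os.Remove(name)
	return true
}

func main() {
	// Pass the JetStream store directory as the first argument;
	// the system temp directory is used only as a fallback.
	dir := os.TempDir()
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	fmt.Printf("%s writable: %v\n", dir, dirWritable(dir))
}
```

Note that permission bits do not restrict the root user, so if the server runs as root, the chmod in step 2 will not make the directory read-only for it.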

Side Note

We are also observing a situation where the stream state has been reset but publishing still works (similar to #6159).
The consumer sequence ID here is far ahead of the stream sequence ID (by millions). We have not been able to reproduce this exact scenario. Any help in reproducing it would be appreciated.

@sudojha added the defect label on Dec 4, 2024
@wallyqs
Member

wallyqs commented Dec 4, 2024

@sudojha About the side note issue: do you have a setup using leafnodes, or with streams being deleted and recreated?

@sudojha
Author

sudojha commented Dec 4, 2024

@wallyqs I tried this locally. I don't have a setup with leaf nodes or with deletion and recreation of streams.

@pranavmehta94

@wallyqs We have raised a draft PR for initial review and comments on the approach we propose for handling the stream state resets observed when the filesystem enters a read-only state.
Happy to build on or modify the PR based on suggestions!

@wallyqs changed the title from "Stream state resets if the Jetstream store directory goes into read only mode" to "Stream state resets if the Jetstream store directory goes into read only mode [v2.10.23]" on Dec 11, 2024
@souravagrawal linked a pull request on Dec 22, 2024 that will close this issue