@wallyqs We have raised a draft PR for initial review and comments on the approach we propose to handle issues around stream state reset observed when filesystem enters read-only state.
Happy to build/modify the PR based on suggestions!
wallyqs changed the title from "Stream state resets if the Jetstream store directory goes into read only mode" to "Stream state resets if the Jetstream store directory goes into read only mode [v2.10.23]" on Dec 11, 2024
Observed behavior
When the JetStream storage directory goes into read-only mode, the in-memory stream state gets reset.
A server restart is needed to restore the stream state.
Code inspection revealed the following:
For a stream with a TTL (max age) configured, expireMsgs is called periodically.
This internally calls removeMsgBlock. Once the last message block is removed, a new tombstone block is created and needs to be assigned to lmb. Since the store directory is in read-only mode, the error is returned before the block gets assigned to lmb.
This causes future publishes to fail until the file system recovers and the server is restarted.
Why does subsequent publish fail?
When a message block is to be removed from the file system, dirtyCloseWithRemove is called. This method clears the state of the message block (in particular mfn, which stores the qualified file name of the message block on disk).
The in-memory representation of each message block is stored in the filestore under blks, and the filestore's lmb points to the same object as the last entry. When the last block, which is also the lmb, is removed, its state is cleared, and because the file system is read-only no replacement block is assigned to lmb. Once this happens, future publishes fail with these errors, even after the file system recovers:
Note that the mfn of lmb is empty, hence the msg block file could not be opened.
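To make the failure mode easier to follow, here is a minimal, self-contained Go sketch of the sequence described above. It is a toy model, not the nats-server code: only the names mfn, blks, lmb and dirtyCloseWithRemove come from this report, while the readOnly flag, newMsgBlock, removeLastBlock and publish are hypothetical helpers invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// msgBlock is a simplified, hypothetical stand-in for a message block.
type msgBlock struct {
	index uint32
	mfn   string // qualified file name of the block on disk
}

// fileStore is a simplified stand-in: blks holds all blocks and lmb
// points at the same object as the last entry in blks.
type fileStore struct {
	blks     []*msgBlock
	lmb      *msgBlock
	readOnly bool // hypothetical flag simulating a read-only store directory
}

var errReadOnly = errors.New("read-only file system")

// dirtyCloseWithRemove mimics the clearing of block state on removal.
func (mb *msgBlock) dirtyCloseWithRemove() { mb.mfn = "" }

// newMsgBlock (hypothetical) creates a block and assigns it to lmb.
// It fails before touching lmb when the directory is read-only.
func (fs *fileStore) newMsgBlock() (*msgBlock, error) {
	if fs.readOnly {
		return nil, errReadOnly
	}
	idx := uint32(1)
	if fs.lmb != nil {
		idx = fs.lmb.index + 1
	}
	mb := &msgBlock{index: idx, mfn: fmt.Sprintf("%d.blk", idx)}
	fs.blks = append(fs.blks, mb)
	fs.lmb = mb
	return mb, nil
}

// removeLastBlock (hypothetical) models expiry removing the last block.
// The block state is cleared first; if creating the replacement fails,
// the error is returned and lmb still points at the cleared block.
func (fs *fileStore) removeLastBlock() error {
	fs.lmb.dirtyCloseWithRemove()
	fs.blks = fs.blks[:len(fs.blks)-1]
	_, err := fs.newMsgBlock()
	return err
}

// publish (hypothetical) needs a usable file name on lmb.
func (fs *fileStore) publish() error {
	if fs.lmb == nil || fs.lmb.mfn == "" {
		return errors.New("cannot open msg block file: empty file name")
	}
	return nil
}

func main() {
	fs := &fileStore{}
	if _, err := fs.newMsgBlock(); err != nil {
		panic(err)
	}
	fs.readOnly = true // store directory becomes read-only
	fmt.Println("remove:", fs.removeLastBlock())
	fs.readOnly = false // file system recovers
	fmt.Println("publish after recovery:", fs.publish())
}
```

In this model the publish keeps failing after recovery because nothing ever replaces the stale lmb, which matches the behavior reported here.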
Expected behavior
Ideally, the NATS server should recover automatically once the file system recovers, without needing a restart.
Server and client version
Server version: 2.10.11, running in standalone mode with JetStream enabled.
Also reproducible in 2.10.23-RC.7.
Host environment
macOS 14.4 (23E214) (should be reproducible on any system)
Steps to reproduce
Create a stream with max age configured, e.g.:
{ "name": "sko", "subjects": [ "sko" ], "retention": "workqueue", "max_consumers": -1, "max_msgs_per_subject": -1, "max_msgs": -1, "max_bytes": -1, "max_age": 10000000000, "max_msg_size": -1, "storage": "file", "discard": "new", "num_replicas": 1, "duplicate_window": 5000000000, "sealed": false, "deny_delete": false, "deny_purge": false, "allow_rollup_hdrs": false, "allow_direct": true, "mirror_direct": false, "consumer_limits": {} }
Remove write access from the JetStream store directory (chmod -R u-w *)
Stop publishing msgs
Wait for ageCheck to kick in.
This will cause the stream's first and last sequence to reset to 0.
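The duration fields in the stream config above are given in nanoseconds, which is easy to misread. As a small sketch (the streamLimits type and parseLimits helper are hypothetical and not part of any NATS library), the two values can be decoded into Go's time.Duration to see how long to wait before the age check expires messages:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// streamLimits (hypothetical) picks out only the two duration fields
// from the stream config; both are nanosecond counts in the JSON.
type streamLimits struct {
	MaxAge          time.Duration `json:"max_age"`
	DuplicateWindow time.Duration `json:"duplicate_window"`
}

// parseLimits (hypothetical) decodes the duration fields, ignoring the rest.
func parseLimits(cfg string) (streamLimits, error) {
	var l streamLimits
	err := json.Unmarshal([]byte(cfg), &l)
	return l, err
}

func main() {
	l, err := parseLimits(`{"max_age": 10000000000, "duplicate_window": 5000000000}`)
	if err != nil {
		panic(err)
	}
	fmt.Println("max_age:", l.MaxAge)                   // 10s
	fmt.Println("duplicate_window:", l.DuplicateWindow) // 5s
}
```

With this config, messages become eligible for expiry 10 seconds after they are published, so the reset should show up shortly after publishing stops.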
Side Note
We are also observing a situation where the stream state has been reset but publishing still works (similar to #6159).
The consumer sequence ID here is way ahead of the stream sequence ID (by millions). We have not been able to reproduce this exact scenario. Any help in reproducing it would be appreciated.