Got an interesting case with one of the production read-only endpoints. Walreceiver errored out and died:
2024-06-18 14:09:16.961 {"app":"NeonVM","endpoint_id":"ep-winter-rice-59233042","pod":"compute-lingering-forest-a2yogi5o-6kzkf","_entry":"PG:2024-06-18 14:09:16.598 GMT ttid=a27b300c2ff46c602a1635ab92d236f3/03baf9167a86378faa6375d8273d0f6d sqlstate=53100 [493] FATAL: could not write to file \"pg_wal/xlogtemp.493\": No space left on device"}
2024-06-18 08:22:29.288 {"app":"NeonVM","endpoint_id":"ep-winter-rice-59233042","pod":"compute-lingering-forest-a2yogi5o-6kzkf","_entry":"PG:2024-06-18 08:22:29.127 GMT ttid=a27b300c2ff46c602a1635ab92d236f3/03baf9167a86378faa6375d8273d0f6d sqlstate=00000 [493] LOG: skipping missing configuration file \"/var/db/postgres/compute/pgdata/compute_ctl_temp_override.conf\""}
2024-06-18 08:22:24.446 {"app":"NeonVM","pod":"compute-lingering-forest-a2yogi5o-6kzkf","_entry":"PG:2024-06-18 08:22:24.347 GMT ttid=a27b300c2ff46c602a1635ab92d236f3/03baf9167a86378faa6375d8273d0f6d sqlstate=00000 [493] LOG: started streaming WAL from primary at 3/49000000 on timeline 1"}
Heikki suggested trying to reproduce this manually by adding elog(FATAL, "crashme") in the walreceiver.
I did that, but the problem does not reproduce: the walreceiver is simply restarted.
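For the record, the injection was literally of this shape. The exact placement inside the walreceiver main loop below is my own guess at one of the spots tried, not a confirmed location:

```c
/* Fragment to drop into WalReceiverMain() in
 * src/backend/replication/walreceiver.c (not a standalone program):
 * kills the walreceiver process with a FATAL error, much like a
 * disk-full WAL write failure would. */
elog(FATAL, "crashme");
```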
Also note that in the case of a "No space left on device" error on a WAL write, Postgres panics:
I wonder if there is any proof that the walreceiver actually died and was not restarted.
As far as I understand, the symptoms are the following: we have an active but lagging replica.
Is the walreceiver process actually absent, or is there some other evidence that it failed to restart?
Maybe it is just locked, or waiting for something (from the safekeeper, for example)?
I looked through the postmaster code but did not find any obvious explanation for what could prevent a crashed walreceiver from being restarted.
I failed to reproduce the problem by throwing a FATAL error in the walreceiver (I tried different places and frequencies).
Maybe it is somehow related to the out-of-disk-space condition, which makes it impossible to spawn a new process (for example, it tries to allocate some file, fails, and the process is never spawned)? Frankly speaking, I do not believe this hypothesis, because I would expect some error to be reported in the Postgres log in that case.
To restate the original report: the walreceiver errored out and died, but then it did not start again.
https://neondb.slack.com/archives/C04DGM6SMTM/p1719394592373479
https://console.neon.tech/admin/regions/aws-eu-central-1/computes/compute-lingering-forest-a2yogi5o