Bug: walreceiver did not restart after erroring out #8172

Open · Tracked by #6211
kelvich opened this issue Jun 26, 2024 · 4 comments
Labels: c/compute (Component: compute, excluding postgres itself), t/bug (Issue Type: Bug)
kelvich (Contributor) commented Jun 26, 2024

Got an interesting case with one of the production read-only endpoints. Walreceiver errored out and died:

2024-06-18 14:09:16.961	 {"app":"NeonVM","endpoint_id":"ep-winter-rice-59233042","pod":"compute-lingering-forest-a2yogi5o-6kzkf","_entry":"PG:2024-06-18 14:09:16.598 GMT ttid=a27b300c2ff46c602a1635ab92d236f3/03baf9167a86378faa6375d8273d0f6d sqlstate=53100 [493] FATAL:  could not write to file \"pg_wal/xlogtemp.493\": No space left on device"}
2024-06-18 08:22:29.288	{"app":"NeonVM","endpoint_id":"ep-winter-rice-59233042","pod":"compute-lingering-forest-a2yogi5o-6kzkf","_entry":"PG:2024-06-18 08:22:29.127 GMT ttid=a27b300c2ff46c602a1635ab92d236f3/03baf9167a86378faa6375d8273d0f6d sqlstate=00000 [493] LOG:  skipping missing configuration file \"/var/db/postgres/compute/pgdata/compute_ctl_temp_override.conf\""}
2024-06-18 08:22:24.446	{"app":"NeonVM","pod":"compute-lingering-forest-a2yogi5o-6kzkf","_entry":"PG:2024-06-18 08:22:24.347 GMT ttid=a27b300c2ff46c602a1635ab92d236f3/03baf9167a86378faa6375d8273d0f6d sqlstate=00000 [493] LOG:  started streaming WAL from primary at 3/49000000 on timeline 1"}

but then it did not start again.

https://neondb.slack.com/archives/C04DGM6SMTM/p1719394592373479
https://console.neon.tech/admin/regions/aws-eu-central-1/computes/compute-lingering-forest-a2yogi5o

Heikki suggested trying to reproduce this manually by adding elog(FATAL, "crashme") in the walreceiver.
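
For reference, a minimal sketch of that kind of fault injection (the exact placement inside the streaming loop of WalReceiverMain() in walreceiver.c is an assumption, and the environment-variable guard is made up so a test build can be toggled without recompiling):

	/*
	 * Fault-injection sketch: force the walreceiver to exit with FATAL so
	 * that the postmaster's restart behaviour can be observed.  Guarded by
	 * a (hypothetical) environment variable so normal runs are unaffected.
	 */
	if (getenv("NEON_TEST_CRASH_WALRECEIVER") != NULL)
		elog(FATAL, "crashme");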

@kelvich kelvich added t/bug Issue Type: Bug c/compute Component: compute, excluding postgres itself labels Jun 26, 2024
@kelvich kelvich changed the title Bug: walreceiver did not restart after erroring our Bug: walreceiver did not restart after erroring out Jul 1, 2024
knizhnik (Contributor) commented Jul 3, 2024

Heikki suggested trying to reproduce this manually by adding elog(FATAL, "crashme") in the walreceiver.

I did that, but the problem does not reproduce: the walreceiver is restarted.
Also please notice that in the case of a "No space left on device" error while writing WAL, Postgres panics:

			ereport(PANIC,
					(errcode_for_file_access(),
					 errmsg("could not write to WAL segment %s "
							"at offset %u, length %lu: %m",
							xlogfname, startoff, (unsigned long) segbytes)));

which should cause termination of the whole VM (not sure whether k8s will restart it).

knizhnik (Contributor) commented Jul 4, 2024

Is there any proof that the walreceiver actually died and was not restarted?
As far as I understand, the symptoms are the following: we have an active but lagging replica.
Is the walreceiver process absent, or is there some other proof that it failed to restart?
Maybe it is just locked or waiting for something (from a safekeeper, for example)?

I looked through the postmaster code but didn't find any obvious explanation of what could prevent a crashed walreceiver from being restarted.

kelvich (Contributor, Author) commented Jul 4, 2024

I manually checked that there was no walreceiver running on the replica; here is the ps output: https://neondb.slack.com/archives/C04DGM6SMTM/p1719401779142989?thread_ts=1719394592.373479&cid=C04DGM6SMTM

knizhnik (Contributor) commented Jul 4, 2024

I failed to reproduce the problem by throwing a FATAL exception in the walreceiver (I tried different places and frequencies).
Maybe it is somehow related to running out of disk space, which makes it impossible to spawn a new process (for example, it tries to allocate some file, fails, and the process is never spawned)? Frankly speaking, I do not believe this hypothesis, because I would expect some error to be reported in the Postgres log in that case.
