specified neither primary_conninfo nor restore_command on restore from scratch #803

Open
Hannsre opened this issue Nov 11, 2024 · 5 comments

Hannsre commented Nov 11, 2024

Hey everyone,

sorry to be back yet again. This is more of a report, though, as I already got it working. But since it's unpredictable behavior, I'd like to let you know.

I destroyed my previous cluster to rebuild it from scratch using a backup. This eventually worked, but it took a total of four runs to get it fully restored and started, without changing any settings between runs.

First I got this on node 3, then on another try on node 2; both are replicas:

2024-11-11 13:34:06.748 UTC [38313] LOG:  restored log file "000000020000000000000021" from archive
2024-11-11 13:34:06.751 UTC [38313] LOG:  consistent recovery state reached at 0/1E000138
2024-11-11 13:34:06.751 UTC [38313] LOG:  recovery stopping after reaching consistency
2024-11-11 13:34:06.752 UTC [38310] LOG:  shutdown at recovery target
2024-11-11 13:34:06.753 UTC [38311] LOG:  shutting down
2024-11-11 13:34:06.755 UTC [38310] LOG:  database system is shut down
 stopped waiting
pg_ctl: could not start server
Examine the log output.

This was during the task named "Wait for the PostgreSQL start command to complete".

I could mitigate it by running sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl start -D /var/lib/postgresql/16/main -o '-c hot_standby=off'. After that I got "server started", but the playbook had already failed, so I had to start over.

There was nothing else in the logs, so there's nothing more to post. Ansible failed because the exit code was not 0.

On my third run this worked fine, but another issue on the leader came up:

2024-11-11 14:12:09.731 UTC [58328] LOG:  listening on IPv6 address "::1", port 5432
2024-11-11 14:12:09.731 UTC [58328] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2024-11-11 14:12:09.731 UTC [58328] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-11-11 14:12:09.734 UTC [58331] LOG:  database system was shut down at 2024-11-11 13:52:02 UTC
2024-11-11 14:12:09.734 UTC [58331] WARNING:  specified neither primary_conninfo nor restore_command
2024-11-11 14:12:09.734 UTC [58331] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2024-11-11 14:12:09.734 UTC [58331] LOG:  entering standby mode
2024-11-11 14:12:09.734 UTC [58331] LOG:  invalid resource manager ID in checkpoint record
2024-11-11 14:12:09.734 UTC [58331] PANIC:  could not locate a valid checkpoint record
2024-11-11 14:12:09.735 UTC [58328] LOG:  startup process (PID 58331) was terminated by signal 6: Aborted
2024-11-11 14:12:09.735 UTC [58328] LOG:  aborting startup due to startup process failure
2024-11-11 14:12:09.735 UTC [58328] LOG:  database system is shut down
 stopped waiting
pg_ctl: could not start server
Examine the log output.

I guess the config wasn't properly populated/passed to the restore command? Ansible was stuck waiting for port 8008 to become available.
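For anyone hitting the same hang: the port check only confirms that the Patroni REST API is listening, not that PostgreSQL actually started behind it. Querying the API directly shows Patroni's view of the node (a quick check, assuming the default port 8008 used here):

curl -s http://127.0.0.1:8008/patroni
# Returns JSON with the node's "state" and "role"; a state like "starting" or
# "start failed" here would match the pg_ctl failure above.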

My "fix" was to add

recovery_target_action = 'shutdown'
restore_command = '/usr/bin/pgbackrest --stanza=test-stanza archive-get %f "%p"'
recovery_target = 'immediate'
recovery_target_timeline = 'current'

at the end of /var/lib/postgresql/16/main/postgresql.conf, then run sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl start -D /var/lib/postgresql/16/main -o '-c hot_standby=off' -w -t 180.
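(For reference, and not something I ran at the time: those settings can be verified before starting, since postgres -C prints the effective value of a parameter and exits, using the same binary and data directory as above.)

sudo -u postgres /usr/lib/postgresql/16/bin/postgres -D /var/lib/postgresql/16/main -C restore_command
sudo -u postgres /usr/lib/postgresql/16/bin/postgres -D /var/lib/postgresql/16/main -C recovery_target_action
# Each command prints the value PostgreSQL would use, so an empty restore_command
# would be visible before the next failed start attempt.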

Running pg_ctl start then led to

2024-11-11 14:13:55.272 UTC [59310] LOG:  restored log file "00000003000000000000001E" from archive
2024-11-11 14:13:55.276 UTC [59310] LOG:  invalid resource manager ID in checkpoint record
2024-11-11 14:13:55.276 UTC [59310] PANIC:  could not locate a valid checkpoint record
2024-11-11 14:13:55.277 UTC [59307] LOG:  startup process (PID 59310) was terminated by signal 6: Aborted
2024-11-11 14:13:55.277 UTC [59307] LOG:  aborting startup due to startup process failure
2024-11-11 14:13:55.277 UTC [59307] LOG:  database system is shut down
 stopped waiting
pg_ctl: could not start server
Examine the log output.

So at least some progress? At this point I decided to run the playbook again because the steps above that worked on the replicas did not work here.

That 4th run just finished without issues and the cluster was back up; our app is also working and not complaining.

Config is basically the same as in #770; I only changed the ansible command to ansible-playbook deploy_pgcluster.yml -e '@pgcn.yml' --ask-vault-pass -v

Here's what I set for pgbackrest in vars/main.yml:

# https://patroni.readthedocs.io/en/latest/replica_bootstrap.html#bootstrap
patroni_cluster_bootstrap_method: "pgbackrest"  # or "wal-g", "pgbackrest", "pg_probackup"

# https://patroni.readthedocs.io/en/latest/replica_bootstrap.html#building-replicas
patroni_create_replica_methods:
  - pgbackrest
#  - wal_g
#  - pg_probackup
  - basebackup

pgbackrest:
  - { option: "command", value: "{{ pgbackrest_patroni_cluster_restore_command }}" }
  - { option: "keep_data", value: "True" }
  - { option: "no_params", value: "True" }
wal_g:
  - { option: "command", value: "{{ wal_g_patroni_cluster_bootstrap_command }}" }
  - { option: "no_params", value: "True" }
basebackup:
  - { option: "max-rate", value: "1000M" }
  - { option: "checkpoint", value: "fast" }
#  - { option: "waldir", value: "{{ postgresql_wal_dir }}" }
pg_probackup:
  - { option: "command", value: "{{ pg_probackup_restore_command }}" }
  - { option: "no_params", value: "true" }

# "restore_command" written to recovery.conf when configuring follower (create replica)
# postgresql_restore_command: ""
# postgresql_restore_command: "{{ wal_g_path }} wal-fetch %f %p"  # restore WAL-s using WAL-G
postgresql_restore_command: "/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} archive-get %f %p"  # restore WAL-s using pgbackrest
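To check whether that value actually ends up in the configuration generated on a node, a simple grep over the generated PostgreSQL config and the Patroni config works (file locations assumed, matching the paths used elsewhere in this setup):

sudo grep -H restore_command /var/lib/postgresql/16/main/postgresql.conf /etc/patroni/patroni.yml
# Prints matching lines with their file names; no output would mean the restore
# command was never written, which fits the "specified neither primary_conninfo
# nor restore_command" warning.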

Let me know if you need any more information to debug this.

vitabaks (Owner) commented:

@Hannsre Thanks.

Please attach /tmp/pg_recovery_<date>.log files.

You can also try running the playbook in debug mode (-v or -vvv) to get more details.


Hannsre commented Nov 12, 2024

Hey,

sure, here's the log; it's the same on both replicas, apart from the timestamps:

waiting for server to start....2024-11-11 13:34:06.086 UTC [38310] LOG:  starting PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-11-11 13:34:06.086 UTC [38310] LOG:  listening on IPv6 address "::1", port 5432
2024-11-11 13:34:06.086 UTC [38310] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2024-11-11 13:34:06.087 UTC [38310] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-11-11 13:34:06.089 UTC [38313] LOG:  database system was shut down in recovery at 2024-11-11 13:31:24 UTC
2024-11-11 13:34:06.089 UTC [38313] LOG:  starting point-in-time recovery to earliest consistent point
2024-11-11 13:34:06.094 P00   INFO: archive-get command begin 2.54.0: [00000002.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=38315-51b721fd --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=test --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=test-stanza
2024-11-11 13:34:06.112 P00   INFO: found 00000002.history in the repo1: 16-1 archive
2024-11-11 13:34:06.112 P00   INFO: archive-get command end: completed successfully (19ms)
2024-11-11 13:34:06.113 UTC [38313] LOG:  restored log file "00000002.history" from archive
2024-11-11 13:34:06.119 P00   INFO: archive-get command begin 2.54.0: [00000002000000000000001E, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=38317-f44b2a19 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=test --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=test-stanza
2024-11-11 13:34:06.722 P00   INFO: found 00000002000000000000001E in the archive asynchronously
2024-11-11 13:34:06.722 P00   INFO: archive-get command end: completed successfully (604ms)
2024-11-11 13:34:06.723 UTC [38313] LOG:  restored log file "00000002000000000000001E" from archive
2024-11-11 13:34:06.727 UTC [38313] LOG:  redo starts at 0/1E000028
2024-11-11 13:34:06.732 P00   INFO: archive-get command begin 2.54.0: [00000002000000000000001F, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=38326-c8402c23 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=test --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=test-stanza
2024-11-11 13:34:06.732 P00   INFO: found 00000002000000000000001F in the archive asynchronously
2024-11-11 13:34:06.733 P00   INFO: archive-get command end: completed successfully (2ms)
2024-11-11 13:34:06.733 UTC [38313] LOG:  restored log file "00000002000000000000001F" from archive
2024-11-11 13:34:06.740 P00   INFO: archive-get command begin 2.54.0: [000000020000000000000020, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=38330-f8e33a0d --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=test --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=test-stanza
2024-11-11 13:34:06.740 P00   INFO: found 000000020000000000000020 in the archive asynchronously
2024-11-11 13:34:06.740 P00   INFO: archive-get command end: completed successfully (1ms)
2024-11-11 13:34:06.741 UTC [38313] LOG:  restored log file "000000020000000000000020" from archive
2024-11-11 13:34:06.748 P00   INFO: archive-get command begin 2.54.0: [000000020000000000000021, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=38332-382c40c4 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=test --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=test-stanza
2024-11-11 13:34:06.748 P00   INFO: found 000000020000000000000021 in the archive asynchronously
2024-11-11 13:34:06.748 P00   INFO: archive-get command end: completed successfully (2ms)
2024-11-11 13:34:06.748 UTC [38313] LOG:  restored log file "000000020000000000000021" from archive
2024-11-11 13:34:06.751 UTC [38313] LOG:  consistent recovery state reached at 0/1E000138
2024-11-11 13:34:06.751 UTC [38313] LOG:  recovery stopping after reaching consistency
2024-11-11 13:34:06.752 UTC [38310] LOG:  shutdown at recovery target
2024-11-11 13:34:06.753 UTC [38311] LOG:  shutting down
2024-11-11 13:34:06.755 UTC [38310] LOG:  database system is shut down
 stopped waiting
pg_ctl: could not start server
Examine the log output.

Note: I replaced the S3 endpoint and stanza name.

And on the Leader:

2024-11-11 14:12:09.731 UTC [58328] LOG:  listening on IPv6 address "::1", port 5432
2024-11-11 14:12:09.731 UTC [58328] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2024-11-11 14:12:09.731 UTC [58328] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-11-11 14:12:09.734 UTC [58331] LOG:  database system was shut down at 2024-11-11 13:52:02 UTC
2024-11-11 14:12:09.734 UTC [58331] WARNING:  specified neither primary_conninfo nor restore_command
2024-11-11 14:12:09.734 UTC [58331] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2024-11-11 14:12:09.734 UTC [58331] LOG:  entering standby mode
2024-11-11 14:12:09.734 UTC [58331] LOG:  invalid resource manager ID in checkpoint record
2024-11-11 14:12:09.734 UTC [58331] PANIC:  could not locate a valid checkpoint record
2024-11-11 14:12:09.735 UTC [58328] LOG:  startup process (PID 58331) was terminated by signal 6: Aborted
2024-11-11 14:12:09.735 UTC [58328] LOG:  aborting startup due to startup process failure
2024-11-11 14:12:09.735 UTC [58328] LOG:  database system is shut down
 stopped waiting
pg_ctl: could not start server

As well as

waiting for server to start....2024-11-11 14:13:40.993 UTC [59165] LOG:  starting PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-11-11 14:13:40.993 UTC [59165] LOG:  listening on IPv6 address "::1", port 5432
2024-11-11 14:13:40.993 UTC [59165] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2024-11-11 14:13:40.994 UTC [59165] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-11-11 14:13:40.996 UTC [59168] LOG:  database system was shut down at 2024-11-11 13:52:02 UTC
2024-11-11 14:13:40.996 UTC [59168] LOG:  entering standby mode
2024-11-11 14:13:41.001 P00   INFO: archive-get command begin 2.54.0: [00000003.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=59170-489ccfba --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=bucket --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-11 14:13:41.021 P00   INFO: unable to find 00000003.history in the archive
2024-11-11 14:13:41.021 P00   INFO: archive-get command end: completed successfully (22ms)
2024-11-11 14:13:41.026 P00   INFO: archive-get command begin 2.54.0: [00000003000000000000001E, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=59172-b7f3b91a --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=bucket --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-11 14:13:41.628 P00   INFO: found 00000003000000000000001E in the archive asynchronously
2024-11-11 14:13:41.628 P00   INFO: archive-get command end: completed successfully (603ms)
2024-11-11 14:13:41.629 UTC [59168] LOG:  restored log file "00000003000000000000001E" from archive
2024-11-11 14:13:41.634 UTC [59168] LOG:  invalid resource manager ID in checkpoint record
2024-11-11 14:13:41.634 UTC [59168] PANIC:  could not locate a valid checkpoint record
2024-11-11 14:13:41.634 UTC [59165] LOG:  startup process (PID 59168) was terminated by signal 6: Aborted
2024-11-11 14:13:41.634 UTC [59165] LOG:  aborting startup due to startup process failure
2024-11-11 14:13:41.635 UTC [59165] LOG:  database system is shut down
 stopped waiting
pg_ctl: could not start server

I ran Ansible with single verbosity (-v) but unfortunately didn't save the output. It basically just said it stopped because the previous command had exited with a non-zero exit code; the corresponding log is the pg_recovery_2024-11-11.log I posted above.

Edit/Addition:
While doing some failover testing I also noticed this in the PostgreSQL log; the former leader node did not come back online after a forced failure (abruptly killing the VM in this case).

2024-11-12 12:54:06 UTC [8003-1]  FATAL:  could not start WAL streaming: ERROR:  requested timeline 5 is not in this server's history
2024-11-12 12:54:06.705 P00   INFO: archive-get command begin 2.54.0: [00000008.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=8013-fc40b444 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=test --repo1-s3-endpoint=minio-backup.scootershop.dev/test --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-12 12:54:06.740 P00   INFO: unable to find 00000008.history in the archive
2024-11-12 12:54:06.740 P00   INFO: archive-get command end: completed successfully (37ms)
2024-11-12 12:54:06 UTC [5182-40]  LOG:  waiting for WAL to become available at 0/31000040
2024-11-12 12:54:06 UTC [8015-1] [local] postgres@postgres FATAL:  the database system is starting up
2024-11-12 12:54:06 UTC [8017-1] [local] postgres@postgres FATAL:  the database system is starting up

Both initial replicas are fine.
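For the timeline error above, the timeline the failed node actually ended up on can be read from its control file (same data directory as before):

sudo -u postgres /usr/lib/postgresql/16/bin/pg_controldata /var/lib/postgresql/16/main | grep -i timeline
# Shows "Latest checkpoint's TimeLineID", which can be compared with the
# timeline 5 requested in the WAL streaming error.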

Another Update:

On another restore run it worked fine until it tried to bring the leader back up after restoring. This is from postgresql-Tue.log:

2024-11-12 13:51:14.738 P00   INFO: archive-get command begin 2.54.0: [00000004.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=41867-3ac485e4 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=bucket --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-12 13:51:14.758 P00   INFO: found 00000004.history in the repo1: 16-1 archive
2024-11-12 13:51:14.758 P00   INFO: archive-get command end: completed successfully (22ms)
2024-11-12 13:51:14 UTC [41865-2]  LOG:  restored log file "00000004.history" from archive
2024-11-12 13:51:14.764 P00   INFO: archive-get command begin 2.54.0: [00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=41869-f5d57f3c --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=bucket --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-12 13:51:14.783 P00   INFO: unable to find 00000005.history in the archive
2024-11-12 13:51:14.783 P00   INFO: archive-get command end: completed successfully (20ms)
2024-11-12 13:51:14 UTC [41865-3]  LOG:  entering standby mode
2024-11-12 13:51:14.789 P00   INFO: archive-get command begin 2.54.0: [00000004.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=41871-d25a7a5e --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=bucket --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-12 13:51:14.809 P00   INFO: found 00000004.history in the repo1: 16-1 archive
2024-11-12 13:51:14.809 P00   INFO: archive-get command end: completed successfully (21ms)
2024-11-12 13:51:14 UTC [41865-4]  LOG:  restored log file "00000004.history" from archive
2024-11-12 13:51:14.816 P00   INFO: archive-get command begin 2.54.0: [00000003000000000000001E, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=41873-59555ee7 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=bucket --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-12 13:51:14 UTC [41880-1] 192.168.88.221(38122) dbu1_scp_dev@db_scp_dev FATAL:  the database system is starting up
2024-11-12 13:51:15.418 P00   INFO: found 00000003000000000000001E in the archive asynchronously
2024-11-12 13:51:15.418 P00   INFO: archive-get command end: completed successfully (603ms)
2024-11-12 13:51:15 UTC [41865-5]  LOG:  restored log file "00000003000000000000001E" from archive
2024-11-12 13:51:15 UTC [41865-6]  LOG:  invalid resource manager ID in checkpoint record
2024-11-12 13:51:15 UTC [41865-7]  PANIC:  could not locate a valid checkpoint record
2024-11-12 13:51:15 UTC [41855-6]  LOG:  startup process (PID 41865) was terminated by signal 6: Aborted
2024-11-12 13:51:15 UTC [41855-7]  LOG:  aborting startup due to startup process failure
2024-11-12 13:51:15 UTC [41855-8]  LOG:  database system is shut down

And in Ansible:

TASK [patroni : Wait for port 8008 to become open on the host] ***************************************************************************************************************************************************************************************************
ok: [192.168.88.101] => {"changed": false, "elapsed": 10, "match_groupdict": {}, "match_groups": [], "path": null, "port": 8008, "search_regex": null, "state": "started"}
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (1000 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (999 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (998 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (997 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (996 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (995 retries left).

and it keeps counting down like that.
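While the task retries, the same readiness check can be done by hand on the node (assuming the client tools from the same installation):

/usr/lib/postgresql/16/bin/pg_isready -h 127.0.0.1 -p 5432
# Prints "no response", "rejecting connections" or "accepting connections",
# which shows whether the retries have any chance of succeeding.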

vitabaks self-assigned this Nov 13, 2024
vitabaks (Owner) commented:

Hi @Hannsre, here is the fix: #820. Please test it.


Hannsre commented Nov 28, 2024

Thanks!

I just tried a PITR and it worked perfectly fine! No errors; all replicas and the leader came back up without any issues.

I also tried restoring from a backup after destroying the cluster and it still failed.
I tried to manually start the leader by stopping PostgreSQL and running sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl start -D /var/lib/postgresql/16/main -o '-c hot_standby=off' again, but it led to the same error.

Here are the logs regarding the restore:

Ansible status:

TASK [patroni : Wait for port 8008 to become open on the host] ***************************************************************************************************************************************************************************************************
ok: [192.168.88.101]
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (1000 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (999 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (998 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (997 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (996 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (995 retries left).

/var/log/postgresql/postgresql-Thu.log:

waiting for server to start....2024-11-28 11:14:05.628 UTC [44150] LOG:  starting PostgreSQL 16.6 (Debian 16.6-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-11-28 11:14:05.628 UTC [44150] LOG:  listening on IPv6 address "::1", port 5432
2024-11-28 11:14:05.628 UTC [44150] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2024-11-28 11:14:05.628 UTC [44150] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-11-28 11:14:05.630 UTC [44153] LOG:  database system was shut down at 2024-11-28 11:06:21 UTC
2024-11-28 11:14:05.631 UTC [44153] WARNING:  specified neither primary_conninfo nor restore_command
2024-11-28 11:14:05.631 UTC [44153] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2024-11-28 11:14:05.631 UTC [44153] LOG:  entering standby mode
2024-11-28 11:14:05.631 UTC [44153] LOG:  invalid resource manager ID in checkpoint record
2024-11-28 11:14:05.631 UTC [44153] PANIC:  could not locate a valid checkpoint record
2024-11-28 11:14:05.631 UTC [44150] LOG:  startup process (PID 44153) was terminated by signal 6: Aborted
2024-11-28 11:14:05.631 UTC [44150] LOG:  aborting startup due to startup process failure
2024-11-28 11:14:05.631 UTC [44150] LOG:  database system is shut down
 stopped waiting
pg_ctl: could not start server
Examine the log output.

/tmp/pg_recovery_2024-11-28.log:

2024-11-28 11:06:18.470 P00   INFO: unable to find 00000003.history in the archive
2024-11-28 11:06:18.470 P00   INFO: archive-get command end: completed successfully (21ms)
2024-11-28 11:06:18.471 UTC [39430] LOG:  selected new timeline ID: 3
2024-11-28 11:06:18.502 P00   INFO: archive-get command begin 2.54.0: [00000002.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=39470-1659a27e --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=backup-bucket --repo1-s3-endpoint=minio.example.com/backup-bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=test-stanza
2024-11-28 11:06:18.524 P00   INFO: found 00000002.history in the repo1: 16-1 archive
2024-11-28 11:06:18.524 P00   INFO: archive-get command end: completed successfully (23ms)
2024-11-28 11:06:18.525 UTC [39430] LOG:  restored log file "00000002.history" from archive
2024-11-28 11:06:18.527 UTC [39430] LOG:  archive recovery complete
2024-11-28 11:06:18.528 UTC [39428] LOG:  checkpoint starting: end-of-recovery immediate wait
2024-11-28 11:06:18.532 UTC [39428] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.001 s, total=0.005 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB; lsn=0/1E000138, redo lsn=0/1E000138
2024-11-28 11:06:18.537 UTC [39427] LOG:  database system is ready to accept connections
2024-11-28 11:06:21.673 UTC [39427] LOG:  received fast shutdown request
2024-11-28 11:06:21.674 UTC [39427] LOG:  aborting any active transactions
2024-11-28 11:06:21.675 UTC [39427] LOG:  background worker "logical replication launcher" (PID 39473) exited with exit code 1
2024-11-28 11:06:21.676 UTC [39428] LOG:  shutting down
2024-11-28 11:06:21.676 UTC [39428] LOG:  checkpoint starting: shutdown immediate
2024-11-28 11:06:21.677 UTC [39428] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB; lsn=0/1E0001E8, redo lsn=0/1E0001E8
2024-11-28 11:06:21.679 UTC [39427] LOG:  database system is shut down

Patroni was in a failed state, or at least reported an error. Manually restarting Patroni did not change this state:

Nov 28 11:09:50 pgcn01 patroni[41648]: 2024-11-28 11:09:50,544 INFO: establishing a new patroni heartbeat connection to postgres
Nov 28 11:09:50 pgcn01 patroni[41648]: 2024-11-28 11:09:50,612 INFO: establishing a new patroni heartbeat connection to postgres
Nov 28 11:09:50 pgcn01 patroni[41648]: 2024-11-28 11:09:50,673 INFO: postmaster pid=41668
Nov 28 11:09:50 pgcn01 patroni[41669]: /var/run/postgresql:5432 - no response
Nov 28 11:09:50 pgcn01 patroni[41668]: 2024-11-28 11:09:50 UTC [41668-1]  LOG:  redirecting log output to logging collector process
Nov 28 11:09:50 pgcn01 patroni[41668]: 2024-11-28 11:09:50 UTC [41668-2]  HINT:  Future log output will appear in directory "/var/log/postgresql".
Nov 28 11:09:50 pgcn01 patroni[41648]: 2024-11-28 11:09:50,793 INFO: establishing a new patroni heartbeat connection to postgres
Nov 28 11:09:50 pgcn01 patroni[41648]: 2024-11-28 11:09:50,805 INFO: establishing a new patroni heartbeat connection to postgres
Nov 28 11:09:50 pgcn01 patroni[41648]: 2024-11-28 11:09:50,875 INFO: establishing a new patroni heartbeat connection to postgres
Nov 28 11:09:51 pgcn01 patroni[41648]: 2024-11-28 11:09:51,680 ERROR: postmaster is not running
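Patroni's own view of the cluster at that point can also be checked with patronictl (config path assumed):

sudo patronictl -c /etc/patroni/patroni.yml list
# Lists each member with its Role, State and timeline; a member shown as
# "start failed" or "crashed" matches the "postmaster is not running" error above.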

Forgot to add: I cancelled the restore process and then ran a PITR on that now-failed cluster, and it completed successfully. It is still not fully functional though: I can only connect to the leader directly; every replica only gives me an error about being read-only, which is expected. Using the failover IP (.104 in my case) shows the same read-only error.

So it's not routing requests to the correct leader, if I'm not mistaken. I can't see any errors when checking pgbouncer, Postgres, Patroni, or HAProxy, though.

It also doesn't matter whether I connect to PostgreSQL (5432) directly or via PgBouncer (6432). It worked fine when starting from scratch, though. So I'll also start fresh and retry the restore process, to make sure no old config from before is messing with the setup.
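A quick way to see which node a connection actually lands on is to ask the server whether it is in recovery, both directly and through pgbouncer (failover IP as a placeholder, ports as above):

psql -h <failover_ip> -p 5432 -U postgres -c "SELECT pg_is_in_recovery();"
psql -h <failover_ip> -p 6432 -U postgres -c "SELECT pg_is_in_recovery();"
# "f" means the connection reached the leader, "t" means it ended up on a
# read-only replica, which would explain the read-only errors.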

vitabaks (Owner) commented:

cc @SDV109
