Units stuck in reinitialising replica and awaiting for cluster to start #684

Open
kelkawi-a opened this issue Sep 4, 2024 · 5 comments
Labels
bug Something isn't working

Comments

kelkawi-a commented Sep 4, 2024

Steps to reproduce

  1. Deploy 3 units of postgresql-k8s charm, channel 14/stable revision 281

Expected behavior

The units remain in an active state

Actual behavior

After running fine for a while (i.e. all three units were active and functional), two of the three units became stuck in a waiting/maintenance state with the following status:

postgresql-k8s/0                     active       idle              Primary
postgresql-k8s/1                     waiting      idle             awaiting for cluster to start
postgresql-k8s/2*                    maintenance  idle             reinitialising replica

Versions

Operating system: Ubuntu 22.04.4 LTS

Juju CLI: 3.5.3-ubuntu-amd64

Juju agent: 3.5.3

Charm revision: 281, channel 14/stable

kubectl:
Client Version: v1.30.4
Server Version: v1.26.15

Log output

Juju debug log:

Output of juju debug-log --include postgresql-k8s/<unit_number>:

postgresql-1.log
postgresql-2.log

Output of juju show-status-log of unit 1:

Time                   Type       Status       Message
03 Sep 2024 14:52:21Z  workload   active       Primary
03 Sep 2024 15:18:10Z  juju-unit  error        hook failed: "update-status"
03 Sep 2024 15:20:05Z  workload   maintenance  stopping charm software
03 Sep 2024 15:20:05Z  juju-unit  executing    running stop hook
03 Sep 2024 15:20:12Z  workload   maintenance  
03 Sep 2024 15:20:12Z  juju-unit  executing    running start hook
03 Sep 2024 15:20:16Z  juju-unit  executing    running leader-settings-changed hook
03 Sep 2024 15:20:17Z  juju-unit  executing    running postgresql-pebble-ready hook
03 Sep 2024 15:20:19Z  workload   maintenance  stopping charm software
03 Sep 2024 15:20:19Z  juju-unit  executing    running stop hook
03 Sep 2024 15:20:21Z  workload   maintenance  
03 Sep 2024 15:27:45Z  juju-unit  executing    running upgrade-charm hook
03 Sep 2024 15:27:59Z  juju-unit  executing    running config-changed hook
03 Sep 2024 15:28:01Z  juju-unit  executing    running start hook
03 Sep 2024 15:28:04Z  juju-unit  executing    running leader-settings-changed hook
03 Sep 2024 15:28:06Z  juju-unit  executing    running postgresql-pebble-ready hook
03 Sep 2024 22:22:38Z  juju-unit  idle         
03 Sep 2024 22:23:42Z  juju-unit  error        hook failed: "update-status"
04 Sep 2024 09:23:17Z  juju-unit  idle         
04 Sep 2024 09:23:17Z  workload   waiting      awaiting for cluster to start

Output of juju show-status-log of unit 2:

Time                   Type       Status       Message
03 Sep 2024 11:41:52Z  workload   maintenance  stopping charm software
03 Sep 2024 11:41:52Z  juju-unit  executing    running stop hook
03 Sep 2024 11:42:02Z  workload   maintenance  
03 Sep 2024 11:42:03Z  juju-unit  executing    running start hook
03 Sep 2024 11:49:32Z  juju-unit  error        hook failed: "start"
03 Sep 2024 11:49:38Z  juju-unit  executing    running start hook
03 Sep 2024 11:49:50Z  juju-unit  executing    running leader-settings-changed hook
03 Sep 2024 11:49:51Z  juju-unit  executing    running postgresql-pebble-ready hook
03 Sep 2024 11:50:45Z  workload   waiting      awaiting for cluster to start
03 Sep 2024 11:50:53Z  workload   waiting      Updating extensions
03 Sep 2024 11:50:53Z  workload   waiting      awaiting for cluster to start
03 Sep 2024 11:50:53Z  workload   active       
03 Sep 2024 14:52:10Z  juju-unit  idle         
03 Sep 2024 14:59:11Z  juju-unit  executing    running leader-elected hook
03 Sep 2024 15:31:18Z  workload   maintenance  reinitialising replica
03 Sep 2024 15:32:05Z  workload   active       
03 Sep 2024 22:23:01Z  juju-unit  idle         
03 Sep 2024 22:23:21Z  juju-unit  error        hook failed: "update-status"
04 Sep 2024 09:24:11Z  juju-unit  idle         
04 Sep 2024 09:24:11Z  workload   maintenance  reinitialising replica
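
To line up the hook failures across both units, the status logs above can be filtered for error rows. This is a minimal sketch, assuming the column layout shown above (timestamp, type, status, message). `failed_hooks` and the two regexes are illustrative names, not part of Juju.

```python
import re
from datetime import datetime

# One row of `juju show-status-log`: timestamp, type, status, optional message.
ROW = re.compile(r"^(\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2})Z\s+(\S+)\s+(\S+)\s*(.*)$")
HOOK_FAILED = re.compile(r'hook failed: "([^"]+)"')


def failed_hooks(status_log: str):
    """Return (timestamp, hook name) for every 'hook failed' error row."""
    failures = []
    for line in status_log.splitlines():
        row = ROW.match(line.strip())
        if not row:
            continue  # header line or blank line
        stamp, _kind, status, message = row.groups()
        hook = HOOK_FAILED.search(message)
        if status == "error" and hook:
            when = datetime.strptime(stamp, "%d %b %Y %H:%M:%S")
            failures.append((when, hook.group(1)))
    return failures


# A few rows copied from the Unit 1 log above.
SAMPLE = """\
Time                   Type       Status       Message
03 Sep 2024 14:52:21Z  workload   active       Primary
03 Sep 2024 15:18:10Z  juju-unit  error        hook failed: "update-status"
03 Sep 2024 15:20:05Z  juju-unit  executing    running stop hook
03 Sep 2024 22:23:42Z  juju-unit  error        hook failed: "update-status"
"""

for when, hook in failed_hooks(SAMPLE):
    print(when.isoformat(), hook)
```

Running it over the full output of both units makes it easier to see that the `update-status` failures on Unit 1 (22:23:42Z) and Unit 2 (22:23:21Z) fall within the same minute.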

Patroni logs:

Unit 1:

2024-09-04 09:43:37 UTC [16]: WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role 
2024-09-04 09:43:37 UTC [16]: INFO: no action. I am (postgresql-k8s-1), a secondary, and following a leader (postgresql-k8s-0) 
2024-09-04 09:43:36 UTC [16]: INFO: Lock owner: postgresql-k8s-0; I am postgresql-k8s-1 
2024-09-04 09:43:36 UTC [16]: INFO: Still starting up as a standby. 
2024-09-04 09:43:36 UTC [16]: INFO: Lock owner: postgresql-k8s-0; I am postgresql-k8s-1 
2024-09-04 09:43:36 UTC [16]: INFO: establishing a new patroni connection to the postgres cluster 
2024-09-04 09:43:37 UTC [16]: INFO: establishing a new patroni connection to the postgres cluster 
2024-09-04 09:43:37 UTC [16]: WARNING: Retry got exception: connection problems 
2024-09-04 09:43:26 UTC [16]: INFO: establishing a new patroni connection to the postgres cluster 
2024-09-04 09:43:26 UTC [16]: INFO: establishing a new patroni connection to the postgres cluster 
2024-09-04 09:43:26 UTC [16]: WARNING: Retry got exception: connection problems 
2024-09-04 09:43:26 UTC [16]: WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role 
2024-09-04 09:43:26 UTC [16]: INFO: no action. I am (postgresql-k8s-1), a secondary, and following a leader (postgresql-k8s-0) 

Unit 2:

2024-09-04 09:43:56 UTC [14833]: INFO: Lock owner: postgresql-k8s-0; I am postgresql-k8s-2 
2024-09-04 09:43:56 UTC [14833]: INFO: reinitialize in progress 
2024-09-04 09:44:06 UTC [14833]: INFO: Lock owner: postgresql-k8s-0; I am postgresql-k8s-2 
2024-09-04 09:44:06 UTC [14833]: INFO: reinitialize in progress 
2024-09-04 09:44:06 UTC [14833]: INFO: Lock owner: postgresql-k8s-0; I am postgresql-k8s-2 
2024-09-04 09:44:06 UTC [14833]: INFO: reinitialize in progress 
2024-09-04 09:43:55 UTC [14833]: ERROR: Could not rename data directory /var/lib/postgresql/data/pgdata 
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 1314, in remove_data_directory
    shutil.rmtree(self._data_dir)
  File "/usr/lib/python3.10/shutil.py", line 731, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 729, in rmtree
    os.rmdir(path)
PermissionError: [Errno 13] Permission denied: '/var/lib/postgresql/data/pgdata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 1287, in move_data_directory
    os.rename(self._data_dir, new_name)
PermissionError: [Errno 13] Permission denied: '/var/lib/postgresql/data/pgdata' -> '/var/lib/postgresql/data/pgdata.failed'
2024-09-04 09:43:55 UTC [14833]: INFO: renaming data directory to /var/lib/postgresql/data/pgdata.failed 
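
The Patroni lines above are not in chronological order as pasted. A small sketch for re-sorting them, assuming every record starts with a `YYYY-MM-DD HH:MM:SS UTC` timestamp and that lines without one (such as traceback lines) belong to the record before them. `sort_patroni_log` is a hypothetical helper name.

```python
import re
from datetime import datetime

STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC")


def sort_patroni_log(text: str) -> list[str]:
    """Sort log records by timestamp, keeping continuation lines attached."""
    records = []  # each item: (timestamp, [lines of that record])
    for line in text.splitlines():
        match = STAMP.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            records.append((when, [line]))
        elif records:
            records[-1][1].append(line)  # e.g. a traceback line
    records.sort(key=lambda record: record[0])  # stable for equal timestamps
    return [line for _, lines in records for line in lines]


SAMPLE = """\
2024-09-04 09:43:37 UTC [16]: WARNING: Retry got exception: connection problems
2024-09-04 09:43:26 UTC [16]: INFO: establishing a new patroni connection to the postgres cluster
2024-09-04 09:43:36 UTC [16]: INFO: Still starting up as a standby.
"""

print("\n".join(sort_patroni_log(SAMPLE)))
```

The timestamps only have one-second resolution, so records logged within the same second stay in the order they were pasted, which may not be the order in which they were written.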
@kelkawi-a kelkawi-a added the bug Something isn't working label Sep 4, 2024

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5335.

This message was autogenerated

@marceloneppel
Member

Hi, @kelkawi-a!

Do you know if the cluster was restarted or upgraded in some way? I see the following hook being fired on Unit 1 in the logs that you shared:

03 Sep 2024 15:27:45Z  juju-unit  executing    running upgrade-charm hook

Could you share some logs from Unit 1 so we can understand what's happening?

juju show-unit postgresql-k8s/1

juju ssh --container postgresql postgresql-k8s/1 pebble services
juju ssh --container charm postgresql-k8s/1 curl localhost:8008/cluster
juju ssh --container charm postgresql-k8s/0 curl localhost:8008/cluster
juju ssh --container charm postgresql-k8s/0 curl localhost:8008/history

juju ssh --container postgresql postgresql-k8s/1 cat /var/log/postgresql/patroni.log /var/log/postgresql/patroni.log.1 /var/log/postgresql/patroni.log.2

juju ssh --container postgresql postgresql-k8s/1 "find /var/log/postgresql/ -name postgresql*.log -not -empty -exec ls {} \; -exec cat {} \;"

If you're using TLS, you should use curl -k https://localhost:8008/xxx in the above commands.
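
The HTTP/HTTPS switch described above can also be scripted. This is a sketch only, assuming Python 3 is available wherever `localhost:8008` is reachable (for example, inside the charm container). `patroni_url` and `fetch` are hypothetical helper names. The port and the `cluster` and `history` paths come from the commands above.

```python
import json
import ssl
import urllib.request

PATRONI_PORT = 8008  # port used by the curl commands above


def patroni_url(path: str, tls: bool = False, host: str = "localhost") -> str:
    """Build the REST URL, switching scheme when TLS is enabled."""
    scheme = "https" if tls else "http"
    return f"{scheme}://{host}:{PATRONI_PORT}/{path.lstrip('/')}"


def fetch(path: str, tls: bool = False):
    """GET a Patroni endpoint and decode the JSON body."""
    context = None
    if tls:
        # Equivalent of `curl -k`: skip certificate verification.
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    url = patroni_url(path, tls)
    with urllib.request.urlopen(url, context=context, timeout=10) as response:
        return json.load(response)


print(patroni_url("cluster"))
print(patroni_url("history", tls=True))
```

Calling `fetch("cluster")` or `fetch("history", tls=True)` would then return the same data as the corresponding `curl` command, already decoded.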

The following error on Unit 2 has been fixed in revisions 332 and 333 from the 14/edge channel (#580) and will be part of the next revision on the 14/stable channel.

PermissionError: [Errno 13] Permission denied: '/var/lib/postgresql/data/pgdata'

Right now, to fix Unit 2, you can run the following command:

juju ssh --container postgresql postgresql-k8s/2 chown postgres:postgres /var/lib/postgresql/data
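
Both failing calls in the Unit 2 traceback (`os.rmdir` and `os.rename` on `.../pgdata`) modify an entry in the parent directory. As a general POSIX rule they therefore need write and search permission on `/var/lib/postgresql/data` itself, which is consistent with the `chown` above targeting the parent. The sketch below is a diagnostic under that assumption, not the fix from #580, and `rename_precheck` is a hypothetical name.

```python
import os
import pwd
from pathlib import Path


def rename_precheck(data_dir: str) -> dict:
    """Report who owns the parent of data_dir and whether we may modify it."""
    parent = Path(data_dir).parent
    info = parent.stat()
    try:
        owner = pwd.getpwuid(info.st_uid).pw_name
    except KeyError:
        owner = str(info.st_uid)  # uid without a passwd entry
    return {
        "parent": str(parent),
        "owner": owner,
        "mode": oct(info.st_mode & 0o777),
        # Renaming or removing data_dir needs write + search on the parent.
        "can_modify_entries": os.access(parent, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    print(rename_precheck("/var/lib/postgresql/data/pgdata"))
```

The check is only meaningful when run as the user that Patroni runs as. `os.access` answers for the calling user, and root typically passes regardless of ownership.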

@kelkawi-a
Author

@marceloneppel thanks for investigating. The cluster is not managed by our team so I don't have visibility on whether or not the cluster was restarted.

Below are the requested logs:

juju ssh --container postgresql postgresql-k8s/1 pebble services:

Service            Startup   Current   Since
metrics_server     enabled   active    2 days ago, at 15:27 UTC
pgbackrest server  disabled  inactive  -
postgresql         enabled   active    2 days ago, at 15:27 UTC

juju ssh --container charm postgresql-k8s/1 curl localhost:8008/cluster:

{"members": [{"name": "postgresql-k8s-0", "role": "leader", "state": "running", "api_url": "http://postgresql-k8s-0.postgresql-k8s-endpoints:8008/patroni", "host": "postgresql-k8s-0.postgresql-k8s-endpoints", "port": 5432, "timeline": 213}, {"name": "postgresql-k8s-1", "role": "replica", "state": "starting", "api_url": "http://postgresql-k8s-1.postgresql-k8s-endpoints:8008/patroni", "host": "postgresql-k8s-1.postgresql-k8s-endpoints", "port": 5432, "lag": "unknown"}]}

juju ssh --container charm postgresql-k8s/0 curl localhost:8008/cluster:

{"members": [{"name": "postgresql-k8s-0", "role": "leader", "state": "running", "api_url": "http://postgresql-k8s-0.postgresql-k8s-endpoints:8008/patroni", "host": "postgresql-k8s-0.postgresql-k8s-endpoints", "port": 5432, "timeline": 213}, {"name": "postgresql-k8s-1", "role": "replica", "state": "starting", "api_url": "http://postgresql-k8s-1.postgresql-k8s-endpoints:8008/patroni", "host": "postgresql-k8s-1.postgresql-k8s-endpoints", "port": 5432, "lag": "unknown"}]}
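
The two `/cluster` responses above can be summarised mechanically. This is a minimal sketch, assuming the response shape shown above (`members` with `name` and `state` fields) and assuming that the three expected member names follow the unit names. `summarize_cluster` is an illustrative helper, not a Patroni API.

```python
import json


def summarize_cluster(cluster_json: str, expected: list[str]):
    """Return ({member: state} for non-running members, [missing members])."""
    members = {m["name"]: m for m in json.loads(cluster_json)["members"]}
    not_running = {
        name: member["state"]
        for name, member in members.items()
        if member["state"] != "running"
    }
    missing = [name for name in expected if name not in members]
    return not_running, missing


# Abridged from the output above (api_url, host and port omitted).
SAMPLE = """{"members": [
  {"name": "postgresql-k8s-0", "role": "leader", "state": "running", "timeline": 213},
  {"name": "postgresql-k8s-1", "role": "replica", "state": "starting", "lag": "unknown"}
]}"""

EXPECTED = ["postgresql-k8s-0", "postgresql-k8s-1", "postgresql-k8s-2"]
not_running, missing = summarize_cluster(SAMPLE, EXPECTED)
print("not running:", not_running)
print("missing:", missing)
```

On this data the sketch reports `postgresql-k8s-1` as `starting` and `postgresql-k8s-2` as absent from the member list.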

juju ssh --container charm postgresql-k8s/0 curl localhost:8008/history:

[[1, 181796616, "no recovery target specified", "2024-08-20T15:39:40.956551+00:00", "postgresql-k8s-1"], [2, 251658400, "no recovery target specified", "2024-08-20T16:34:22.832061+00:00", "postgresql-k8s-1"], [3, 268435616, "no recovery target specified", "2024-08-20T16:34:32.856157+00:00", "postgresql-k8s-1"], [4, 402653344, "no recovery target specified", "2024-08-20T18:56:24.021318+00:00", "postgresql-k8s-1"], [5, 514753096, "no recovery target specified", "2024-08-20T22:37:19.281941+00:00", "postgresql-k8s-2"], [6, 514877944, "no recovery target specified", "2024-08-20T22:38:08.654025+00:00", "postgresql-k8s-2"], [7, 721420448, "no recovery target specified", "2024-08-21T06:51:58.300517+00:00", "postgresql-k8s-2"], [8, 730339744, "no recovery target specified", "2024-08-21T07:10:08.821816+00:00", "postgresql-k8s-1"], [9, 762367408, "no recovery target specified", "2024-08-21T08:41:35.722519+00:00", "postgresql-k8s-1"], [10, 771752096, "no recovery target specified", "2024-08-21T08:59:49.737028+00:00", "postgresql-k8s-1"], [11, 788529312, "no recovery target specified", "2024-08-21T09:15:15.401925+00:00", "postgresql-k8s-1"], [12, 833224328, "no recovery target specified", "2024-08-21T09:36:03.989966+00:00", "postgresql-k8s-1"], [13, 905969824, "no recovery target specified", "2024-08-21T09:39:09.965176+00:00", "postgresql-k8s-0"], [14, 939524256, "no recovery target specified", "2024-08-21T10:40:31.448781+00:00", "postgresql-k8s-0"], [15, 956301472, "no recovery target specified", "2024-08-21T10:41:52.053647+00:00", "postgresql-k8s-0"], [16, 958223832, "no recovery target specified", "2024-08-21T10:45:13.033993+00:00", "postgresql-k8s-1"], [17, 1140850848, "no recovery target specified", "2024-08-21T15:34:37.070832+00:00", "postgresql-k8s-1"], [18, 1241514144, "no recovery target specified", "2024-08-21T16:18:22.976333+00:00", "postgresql-k8s-1"], [19, 1258291360, "no recovery target specified", "2024-08-21T16:25:41.418802+00:00", "postgresql-k8s-2"], [20, 
1292018432, "no recovery target specified", "2024-08-21T16:26:19.684027+00:00", "postgresql-k8s-0"], [21, 1516027008, "no recovery target specified", "2024-08-21T22:39:15.192982+00:00", "postgresql-k8s-0"], [22, 1543504032, "no recovery target specified", "2024-08-21T23:26:38.412046+00:00", "postgresql-k8s-1"], [23, 1593835680, "no recovery target specified", "2024-08-21T23:29:26.665688+00:00", "postgresql-k8s-1"], [24, 1610612896, "no recovery target specified", "2024-08-21T23:46:20.285545+00:00", "postgresql-k8s-0"], [25, 1660944544, "no recovery target specified", "2024-08-21T23:48:30.511019+00:00", "postgresql-k8s-0"], [26, 1711276192, "no recovery target specified", "2024-08-22T00:12:43.699291+00:00", "postgresql-k8s-0"], [27, 1728053408, "no recovery target specified"], [28, 1744988592, "no recovery target specified", "2024-08-22T00:14:57.120767+00:00", "postgresql-k8s-0"], [29, 1795162272, "no recovery target specified", "2024-08-22T00:16:12.049596+00:00", "postgresql-k8s-0"], [30, 1812354264, "no recovery target specified", "2024-08-22T00:20:34.588501+00:00", "postgresql-k8s-0"], [31, 1813536080, "no recovery target specified", "2024-08-22T00:22:15.579347+00:00", "postgresql-k8s-2"], [32, 1828716704, "no recovery target specified", "2024-08-22T00:37:41.109522+00:00", "postgresql-k8s-1"], [33, 1879048352, "no recovery target specified", "2024-08-22T00:48:01.151554+00:00", "postgresql-k8s-1"], [34, 1895825568, "no recovery target specified", "2024-08-22T00:50:00.100711+00:00", "postgresql-k8s-1"], [35, 1912602784, "no recovery target specified", "2024-08-22T01:08:31.767597+00:00", "postgresql-k8s-2"], [36, 1996488864, "no recovery target specified", "2024-08-22T01:16:46.836412+00:00", "postgresql-k8s-2"], [37, 2046820512, "no recovery target specified", "2024-08-22T01:22:14.300951+00:00", "postgresql-k8s-2"], [38, 2063597728, "no recovery target specified", "2024-08-22T01:37:09.822397+00:00", "postgresql-k8s-2"], [39, 2080374944, "no recovery target 
specified"], [40, 2097152160, "no recovery target specified"], [41, 2097913152, "no recovery target specified", "2024-08-22T01:39:17.450603+00:00", "postgresql-k8s-2"], [42, 2099390784, "no recovery target specified", "2024-08-22T01:40:38.110730+00:00", "postgresql-k8s-2"], [43, 2103169392, "no recovery target specified", "2024-08-22T01:48:05.111421+00:00", "postgresql-k8s-2"], [44, 2113929376, "no recovery target specified", "2024-08-22T01:49:39.543567+00:00", "postgresql-k8s-2"], [45, 2244206896, "no recovery target specified", "2024-08-22T06:53:08.623227+00:00", "postgresql-k8s-2"], [46, 2264924320, "no recovery target specified", "2024-08-22T07:21:39.970059+00:00", "postgresql-k8s-2"], [47, 2265634360, "no recovery target specified", "2024-08-22T07:22:18.270123+00:00", "postgresql-k8s-2"], [48, 2365587616, "no recovery target specified", "2024-08-22T11:12:21.166659+00:00", "postgresql-k8s-2"], [49, 2449473696, "no recovery target specified", "2024-08-22T13:05:11.466674+00:00", "postgresql-k8s-2"], [50, 2536272576, "no recovery target specified", "2024-08-22T14:49:17.826013+00:00", "postgresql-k8s-1"], [51, 2566914208, "no recovery target specified", "2024-08-22T15:25:50.047575+00:00", "postgresql-k8s-1"], [52, 2667577504, "no recovery target specified", "2024-08-22T18:47:57.516309+00:00", "postgresql-k8s-1"], [53, 2684354720, "no recovery target specified", "2024-08-22T18:48:34.816683+00:00", "postgresql-k8s-1"], [54, 2762614136, "no recovery target specified", "2024-08-22T20:36:30.098472+00:00", "postgresql-k8s-1"], [55, 2885681312, "no recovery target specified", "2024-08-22T23:18:51.124283+00:00", "postgresql-k8s-1"], [56, 3004380064, "no recovery target specified", "2024-08-23T02:00:48.446620+00:00", "postgresql-k8s-1"], [57, 3170893984, "no recovery target specified", "2024-08-23T08:12:11.806884+00:00", "postgresql-k8s-1"], [58, 3221225632, "no recovery target specified", "2024-08-23T08:20:27.883665+00:00", "postgresql-k8s-2"], [59, 3405775008, "no 
recovery target specified", "2024-08-23T14:03:52.177395+00:00", "postgresql-k8s-1"], [60, 3416972144, "no recovery target specified", "2024-08-23T14:32:54.018751+00:00", "postgresql-k8s-1"], [61, 3489661088, "no recovery target specified", "2024-08-23T16:57:24.741255+00:00", "postgresql-k8s-2"], [62, 3556769952, "no recovery target specified", "2024-08-23T17:57:45.602897+00:00", "postgresql-k8s-2"], [63, 3558569864, "no recovery target specified"], [64, 3574118728, "no recovery target specified", "2024-08-23T18:02:31.729314+00:00", "postgresql-k8s-0"], [65, 3623878816, "no recovery target specified", "2024-08-23T18:05:14.519149+00:00", "postgresql-k8s-0"], [66, 3640656032, "no recovery target specified", "2024-08-23T18:06:06.143730+00:00", "postgresql-k8s-0"], [67, 3657433248, "no recovery target specified", "2024-08-23T18:06:51.312237+00:00", "postgresql-k8s-0"], [68, 3674210464, "no recovery target specified", "2024-08-23T18:09:37.034095+00:00", "postgresql-k8s-0"], [69, 3675247600, "no recovery target specified", "2024-08-23T18:10:27.686663+00:00", "postgresql-k8s-0"], [70, 3676823952, "no recovery target specified"], [71, 3690987680, "no recovery target specified", "2024-08-23T18:11:35.688415+00:00", "postgresql-k8s-1"], [72, 3707764896, "no recovery target specified", "2024-08-23T18:12:43.540377+00:00", "postgresql-k8s-2"], [73, 3707765384, "no recovery target specified", "2024-08-23T18:14:16.000530+00:00", "postgresql-k8s-2"], [74, 3724542112, "no recovery target specified", "2024-08-23T18:15:19.460634+00:00", "postgresql-k8s-2"], [75, 3774873760, "no recovery target specified"], [76, 3808428192, "no recovery target specified", "2024-08-23T18:16:58.280797+00:00", "postgresql-k8s-2"], [77, 3858759840, "no recovery target specified", "2024-08-23T18:18:34.760956+00:00", "postgresql-k8s-2"], [78, 3875537056, "no recovery target specified", "2024-08-23T18:19:50.217081+00:00", "postgresql-k8s-2"], [79, 3892314272, "no recovery target specified", 
"2024-08-23T18:21:29.141244+00:00", "postgresql-k8s-2"], [80, 3909091488, "no recovery target specified", "2024-08-23T18:23:30.985445+00:00", "postgresql-k8s-2"], [81, 3910140768, "no recovery target specified", "2024-08-23T18:24:15.971929+00:00", "postgresql-k8s-0"], [82, 3910141376, "no recovery target specified", "2024-08-23T18:25:24.100697+00:00", "postgresql-k8s-0"], [83, 3925868704, "no recovery target specified", "2024-08-23T18:28:32.594821+00:00", "postgresql-k8s-0"], [84, 4042524456, "no recovery target specified", "2024-08-23T21:46:42.630083+00:00", "postgresql-k8s-0"], [85, 4143972512, "no recovery target specified"], [86, 4160749728, "no recovery target specified"], [87, 4177526944, "no recovery target specified"], [88, 4194304160, "no recovery target specified"], [89, 4195391232, "no recovery target specified"], [90, 4211081376, "no recovery target specified", "2024-08-24T00:23:03.148739+00:00", "postgresql-k8s-0"], [91, 4227858592, "no recovery target specified"], [92, 4244635808, "no recovery target specified", "2024-08-24T00:26:36.435160+00:00", "postgresql-k8s-0"], [93, 4261413024, "no recovery target specified", "2024-08-24T00:27:41.419898+00:00", "postgresql-k8s-0"], [94, 4580314344, "no recovery target specified", "2024-08-24T13:09:14.935904+00:00", "postgresql-k8s-1"], [95, 4825246152, "no recovery target specified", "2024-08-24T19:35:33.410112+00:00", "postgresql-k8s-1"], [96, 4826007536, "no recovery target specified", "2024-08-24T19:35:50.966294+00:00", "postgresql-k8s-1"], [97, 4949278880, "no recovery target specified", "2024-08-25T00:04:38.864026+00:00", "postgresql-k8s-1"], [98, 4952207120, "no recovery target specified", "2024-08-25T00:06:19.022716+00:00", "postgresql-k8s-0"], [99, 4966056096, "no recovery target specified"], [100, 4966277072, "no recovery target specified", "2024-08-25T00:07:21.725701+00:00", "postgresql-k8s-1"], [101, 5251268768, "no recovery target specified", "2024-08-25T07:46:38.725755+00:00", "postgresql-k8s-2"], 
[102, 5452595360, "no recovery target specified", "2024-08-25T14:37:56.383376+00:00", "postgresql-k8s-0"], [103, 5620367520, "no recovery target specified", "2024-08-25T18:30:26.815016+00:00", "postgresql-k8s-2"], [104, 5638915488, "no recovery target specified", "2024-08-25T19:22:39.372200+00:00", "postgresql-k8s-2"], [105, 5670884344, "no recovery target specified", "2024-08-25T19:23:33.546097+00:00", "postgresql-k8s-1"], [106, 5706456888, "no recovery target specified", "2024-08-25T20:49:17.647071+00:00", "postgresql-k8s-1"], [107, 5855248544, "no recovery target specified", "2024-08-26T02:10:18.555503+00:00", "postgresql-k8s-2"], [108, 5905580192, "no recovery target specified", "2024-08-26T02:17:25.228485+00:00", "postgresql-k8s-1"], [109, 5922357408, "no recovery target specified", "2024-08-26T02:19:28.717547+00:00", "postgresql-k8s-1"], [110, 5939134624, "no recovery target specified", "2024-08-26T02:29:27.486859+00:00", "postgresql-k8s-1"], [111, 5972689056, "no recovery target specified", "2024-08-26T02:42:15.938163+00:00", "postgresql-k8s-1"], [112, 6023020704, "no recovery target specified", "2024-08-26T02:53:43.801624+00:00", "postgresql-k8s-0"], [113, 6325010592, "no recovery target specified", "2024-08-26T14:07:39.477169+00:00", "postgresql-k8s-1"], [114, 6354223144, "no recovery target specified", "2024-08-26T15:32:59.513967+00:00", "postgresql-k8s-1"], [115, 6578829536, "no recovery target specified", "2024-08-27T00:43:10.203360+00:00", "postgresql-k8s-1"], [116, 6777995424, "no recovery target specified", "2024-08-27T08:07:52.321300+00:00", "postgresql-k8s-2"], [117, 6861881504, "no recovery target specified", "2024-08-27T08:21:58.494618+00:00", "postgresql-k8s-2"], [118, 7079985312, "no recovery target specified", "2024-08-27T16:53:25.903452+00:00", "postgresql-k8s-2"], [119, 7080546232, "no recovery target specified", "2024-08-27T16:53:58.291504+00:00", "postgresql-k8s-2"], [120, 7331643552, "no recovery target specified"], [121, 7348420768, "no 
recovery target specified", "2024-08-28T01:06:18.202914+00:00", "postgresql-k8s-2"], [122, 7734296736, "no recovery target specified", "2024-08-28T16:05:21.315844+00:00", "postgresql-k8s-2"], [123, 7902068896, "no recovery target specified", "2024-08-28T22:19:46.197479+00:00", "postgresql-k8s-2"], [124, 7913359576, "no recovery target specified", "2024-08-28T22:46:48.677040+00:00", "postgresql-k8s-2"], [125, 8120172704, "no recovery target specified"], [126, 8136949920, "no recovery target specified", "2024-08-29T06:49:26.705207+00:00", "postgresql-k8s-2"], [127, 8271167648, "no recovery target specified", "2024-08-29T12:03:47.894811+00:00", "postgresql-k8s-0"], [128, 8316160584, "no recovery target specified", "2024-08-29T12:33:56.290399+00:00", "postgresql-k8s-0"], [129, 8317565192, "no recovery target specified", "2024-08-29T12:35:38.187504+00:00", "postgresql-k8s-0"], [130, 8397712664, "no recovery target specified", "2024-08-29T13:48:47.623300+00:00", "postgresql-k8s-0"], [131, 8438939808, "no recovery target specified", "2024-08-29T13:53:47.226569+00:00", "postgresql-k8s-0"], [132, 8472494240, "no recovery target specified", "2024-08-29T14:55:09.374852+00:00", "postgresql-k8s-0"], [133, 8489271456, "no recovery target specified", "2024-08-29T14:56:59.536059+00:00", "postgresql-k8s-0"], [134, 8522825888, "no recovery target specified"], [135, 8539603104, "no recovery target specified"], [136, 8556380320, "no recovery target specified", "2024-08-29T15:03:11.053952+00:00", "postgresql-k8s-2"], [137, 8573157536, "no recovery target specified"], [138, 8589934752, "no recovery target specified", "2024-08-29T15:07:16.670443+00:00", "postgresql-k8s-2"], [139, 8606711968, "no recovery target specified"], [140, 8623489184, "no recovery target specified", "2024-08-29T15:09:12.662681+00:00", "postgresql-k8s-2"], [141, 8640266400, "no recovery target specified"], [142, 8657043616, "no recovery target specified"], [143, 8673820832, "no recovery target specified"], [144, 
8690598048, "no recovery target specified", "2024-08-29T15:14:27.635523+00:00", "postgresql-k8s-0"], [145, 8692610208, "no recovery target specified", "2024-08-29T15:15:44.792364+00:00", "postgresql-k8s-0"], [146, 8693488168, "no recovery target specified", "2024-08-29T15:17:55.189785+00:00", "postgresql-k8s-0"], [147, 8707375264, "no recovery target specified", "2024-08-29T15:20:22.048054+00:00", "postgresql-k8s-2"], [148, 8724152320, "no recovery target specified", "2024-08-29T15:21:21.728036+00:00", "postgresql-k8s-0"], [149, 8774484128, "no recovery target specified"], [150, 8791261344, "no recovery target specified", "2024-08-29T15:24:09.805868+00:00", "postgresql-k8s-0"], [151, 8791765688, "no recovery target specified", "2024-08-29T15:25:36.534813+00:00", "postgresql-k8s-0"], [152, 8793930720, "no recovery target specified", "2024-08-29T15:27:39.748156+00:00", "postgresql-k8s-0"], [153, 8808038560, "no recovery target specified", "2024-08-29T15:29:39.029456+00:00", "postgresql-k8s-0"], [154, 8858370208, "no recovery target specified", "2024-08-29T15:33:07.519728+00:00", "postgresql-k8s-0"], [155, 8875147424, "no recovery target specified", "2024-08-29T15:34:40.624743+00:00", "postgresql-k8s-0"], [156, 8891924640, "no recovery target specified", "2024-08-29T15:35:54.169544+00:00", "postgresql-k8s-0"], [157, 8908701856, "no recovery target specified"], [158, 8925479072, "no recovery target specified", "2024-08-29T15:37:36.578660+00:00", "postgresql-k8s-0"], [159, 8942256288, "no recovery target specified", "2024-08-29T15:38:58.839557+00:00", "postgresql-k8s-0"], [160, 8942652016, "no recovery target specified"], [161, 8959033504, "no recovery target specified"], [162, 8975810720, "no recovery target specified"], [163, 8992587936, "no recovery target specified", "2024-08-29T15:46:43.588630+00:00", "postgresql-k8s-0"], [164, 9009365152, "no recovery target specified", "2024-08-29T15:47:54.457406+00:00", "postgresql-k8s-0"], [165, 9010215304, "no recovery target 
specified"], [166, 9026142368, "no recovery target specified"], [167, 9042919584, "no recovery target specified"], [168, 9059696800, "no recovery target specified", "2024-08-29T15:51:58.764079+00:00", "postgresql-k8s-0"], [169, 9076474016, "no recovery target specified", "2024-08-29T15:53:25.853032+00:00", "postgresql-k8s-0"], [170, 9093251232, "no recovery target specified"], [171, 9093458248, "no recovery target specified", "2024-08-29T15:58:07.552114+00:00", "postgresql-k8s-0"], [172, 9110028448, "no recovery target specified", "2024-08-29T15:58:51.464837+00:00", "postgresql-k8s-2"], [173, 9160360096, "no recovery target specified", "2024-08-29T16:02:38.447078+00:00", "postgresql-k8s-0"], [174, 9244246176, "no recovery target specified", "2024-08-29T16:22:30.692116+00:00", "postgresql-k8s-0"], [175, 9261023392, "no recovery target specified", "2024-08-29T16:23:22.120677+00:00", "postgresql-k8s-0"], [176, 9288409336, "no recovery target specified", "2024-08-29T17:32:28.241821+00:00", "postgresql-k8s-1"], [177, 9344909472, "no recovery target specified", "2024-08-29T18:29:28.763162+00:00", "postgresql-k8s-1"], [178, 9663676576, "no recovery target specified", "2024-08-30T04:58:20.083174+00:00", "postgresql-k8s-1"], [179, 9680453792, "no recovery target specified", "2024-08-30T05:12:05.623284+00:00", "postgresql-k8s-1"], [180, 9730785440, "no recovery target specified", "2024-08-30T07:08:15.004925+00:00", "postgresql-k8s-1"], [181, 9810237136, "no recovery target specified", "2024-08-30T10:24:07.234241+00:00", "postgresql-k8s-0"], [182, 9982443680, "no recovery target specified", "2024-08-30T15:10:50.299920+00:00", "postgresql-k8s-0"], [183, 10051283608, "no recovery target specified", "2024-08-30T18:01:18.251849+00:00", "postgresql-k8s-0"], [184, 10133438624, "no recovery target specified", "2024-08-30T21:16:43.488580+00:00", "postgresql-k8s-0"], [185, 10905190560, "no recovery target specified"], [186, 10921967776, "no recovery target specified"], [187, 
10938744992, "no recovery target specified", "2024-09-01T00:05:18.863917+00:00", "postgresql-k8s-0"], [188, 10940765624, "no recovery target specified", "2024-09-01T00:08:19.321877+00:00", "postgresql-k8s-0"], [189, 10945085512, "no recovery target specified", "2024-09-01T00:15:10.611045+00:00", "postgresql-k8s-1"], [190, 11173626016, "no recovery target specified", "2024-09-01T08:51:05.830425+00:00", "postgresql-k8s-1"], [191, 11426276984, "no recovery target specified", "2024-09-01T15:38:50.611334+00:00", "postgresql-k8s-1"], [192, 11492393120, "no recovery target specified", "2024-09-01T16:37:05.956891+00:00", "postgresql-k8s-1"], [193, 11595880664, "no recovery target specified", "2024-09-01T20:50:27.115296+00:00", "postgresql-k8s-2"], [194, 11609833632, "no recovery target specified", "2024-09-01T21:00:43.492497+00:00", "postgresql-k8s-2"], [195, 11631935744, "no recovery target specified", "2024-09-01T21:55:23.494862+00:00", "postgresql-k8s-2"], [196, 11663238936, "no recovery target specified", "2024-09-01T23:11:59.513540+00:00", "postgresql-k8s-0"], [197, 11710496928, "no recovery target specified", "2024-09-01T23:53:18.776410+00:00", "postgresql-k8s-0"], [198, 11794383008, "no recovery target specified", "2024-09-02T03:13:46.615274+00:00", "postgresql-k8s-0"], [199, 12733907104, "no recovery target specified", "2024-09-03T10:49:54.016630+00:00", "postgresql-k8s-0"], [200, 12750684320, "no recovery target specified", "2024-09-03T11:07:28.008951+00:00", "postgresql-k8s-0"], [201, 12752886712, "no recovery target specified", "2024-09-03T11:12:43.150985+00:00", "postgresql-k8s-1"], [202, 19042931640, "no recovery target specified", "2024-09-03T14:52:37.649814+00:00", "postgresql-k8s-0"], [203, 19058917536, "no recovery target specified", "2024-09-03T14:59:22.994050+00:00", "postgresql-k8s-0"], [204, 22414412504, "no recovery target specified", "2024-09-03T16:44:15.523685+00:00", "postgresql-k8s-0"], [205, 34191966368, "no recovery target specified", 
"2024-09-03T22:23:48.272506+00:00", "postgresql-k8s-0"], [206, 49845108896, "no recovery target specified", "2024-09-04T05:41:55.227049+00:00", "postgresql-k8s-0"], [207, 67478041992, "no recovery target specified", "2024-09-04T14:01:36.483424+00:00", "postgresql-k8s-0"], [208, 85816067000, "no recovery target specified"], [209, 85816499352, "no recovery target specified", "2024-09-04T22:47:02.128342+00:00", "postgresql-k8s-0"], [210, 85849360888, "no recovery target specified", "2024-09-04T22:47:57.168929+00:00", "postgresql-k8s-0"], [211, 85899619016, "no recovery target specified", "2024-09-04T22:49:58.218220+00:00", "postgresql-k8s-0"], [212, 85967585104, "no recovery target specified", "2024-09-04T22:52:49.972051+00:00", "postgresql-k8s-0"]]

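The history above records 212 timeline switches between 2024-08-20 and 2024-09-04, which is hard to read as raw JSON. Here is a sketch for counting them per day and per member, assuming each entry is `[timeline, LSN, reason, timestamp, new leader]` and that the last two fields may be absent, as in several entries above. `summarize_history` is a hypothetical name.

```python
import json
from collections import Counter
from datetime import datetime


def summarize_history(history_json: str):
    """Count timeline switches per calendar day and per new leader."""
    per_day, per_leader = Counter(), Counter()
    for entry in json.loads(history_json):
        stamp = entry[3] if len(entry) > 3 else None
        leader = entry[4] if len(entry) > 4 else "unknown"
        day = datetime.fromisoformat(stamp).date().isoformat() if stamp else "unknown"
        per_day[day] += 1
        per_leader[leader] += 1
    return per_day, per_leader


# Four entries copied from the output above.
SAMPLE = """[
  [202, 19042931640, "no recovery target specified", "2024-09-03T14:52:37.649814+00:00", "postgresql-k8s-0"],
  [203, 19058917536, "no recovery target specified", "2024-09-03T14:59:22.994050+00:00", "postgresql-k8s-0"],
  [208, 85816067000, "no recovery target specified"],
  [209, 85816499352, "no recovery target specified", "2024-09-04T22:47:02.128342+00:00", "postgresql-k8s-0"]
]"""

per_day, per_leader = summarize_history(SAMPLE)
for day, count in sorted(per_day.items()):
    print(day, count)
```

Days with an unusually high count are the ones worth correlating with the hook failures in the status logs.
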
juju ssh --container postgresql postgresql-k8s/1 "find /var/log/postgresql/ -name postgresql*.log -not -empty -exec ls {} \; -exec cat {} \;":

2024-09-05 01:39:42 UTC [94097]: user=operator,db=postgres,app=[unknown],client=127.0.0.1,line=1 FATAL:  the database system is starting up
2024-09-05 01:39:49 UTC [94099]: user=operator,db=postgres,app=[unknown],client=127.0.0.1,line=1 FATAL:  the database system is starting up
2024-09-05 01:39:51 UTC [94101]: user=operator,db=postgres,app=[unknown],client=127.0.0.1,line=1 FATAL:  the database system is starting up
2024-09-05 01:39:51 UTC [94102]: user=operator,db=postgres,app=[unknown],client=127.0.0.1,line=1 FATAL:  the database system is starting up
2024-09-05 01:39:52 UTC [94103]: user=operator,db=postgres,app=[unknown],client=127.0.0.1,line=1 FATAL:  the database system is starting up
2024-09-05 01:39:59 UTC [94105]: user=operator,db=postgres,app=[unknown],client=127.0.0.1,line=1 FATAL:  the database system is starting up

@marceloneppel
Member

Thanks for the details, @kelkawi-a!

Do you still have the Juju debug logs that show something (like the stack trace) from the errors shown in the Unit 1 status log? I mean, the errors in the start and update-status hooks. Those will be useful to understand what happened before the unit reached its current state.

Do you know if there are a lot of clients connecting to the database, especially through the read-only endpoints (replicas)?

If so, we can try to stop the PostgreSQL service in the replica by issuing the following command.

juju ssh --container postgresql postgresql-k8s/1 pebble stop postgresql

Then, after some seconds, we can start it again to see if it starts correctly.

juju ssh --container postgresql postgresql-k8s/1 pebble start postgresql
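
The stop, wait, start sequence above can be wrapped in one helper so the pause is explicit. This is a sketch, assuming `juju` is on the PATH of the machine running it. The 10-second default pause is an assumption (the thread only says "some seconds"), and `pebble_cmd` and `restart_postgresql` are hypothetical names.

```python
import subprocess
import time


def pebble_cmd(unit: str, action: str) -> list[str]:
    """Build the `juju ssh ... pebble <action> postgresql` command line."""
    return ["juju", "ssh", "--container", "postgresql", unit,
            "pebble", action, "postgresql"]


def restart_postgresql(unit: str, pause_seconds: float = 10.0,
                       run=subprocess.run) -> None:
    """Stop the postgresql Pebble service, wait, then start it again."""
    run(pebble_cmd(unit, "stop"), check=True)
    time.sleep(pause_seconds)  # assumed pause; adjust as needed
    run(pebble_cmd(unit, "start"), check=True)


if __name__ == "__main__":
    restart_postgresql("postgresql-k8s/1")
```

The `run` parameter exists only so the command sequence can be inspected without touching a live unit. With the default, a failing `stop` raises `subprocess.CalledProcessError` and `start` is not attempted.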

Also, did the chown command fix Unit 2?

@kelkawi-a
Author

Unfortunately, I don't have visibility on the logs that far back. Since I reported this bug, the units have reconfigured themselves as follows:

postgresql-k8s/0                     maintenance  idle             reinitialising replica
postgresql-k8s/1                     waiting      idle             awaiting for member to start
postgresql-k8s/2*                    active       idle             Primary

Note: the Primary unit intermittently goes into a maintenance status with the message reconfiguring cluster.

I can confirm that there are 5 applications (3 units each) connecting to the postgresql-k8s application, each of them occupying a number of connection slots.

I've sent you an invite to try and debug this live on the environment if possible.
