[bitnami/postgresql-ha] ERROR: cannot execute INSERT in a read-only transaction #23219

Open
rohitkgupta opened this issue Feb 7, 2024 · 45 comments

@rohitkgupta

rohitkgupta commented Feb 7, 2024

Name and Version

bitnami/postgresql-ha 6.5.5

What architecture are you using?

None

What steps will reproduce the bug?

  1. Deploy PostgreSQL with the HA architecture (using Replication Manager (repmgr) and Pgpool) via the Bitnami Helm charts,
    with PostgreSQL replicas=2 and Pgpool replicas=2.
  2. A C++ application connects to the DB via Pgpool using the libpq library and executes SELECT, INSERT and UPDATE queries.
  3. Occasionally, when the master PostgreSQL node switches to slave and a slave node is promoted to master, due to a restart or any other reason, Pgpool may continue to send all queries to the previous master, which is now acting as a slave. This results in the following error: ERROR: cannot execute INSERT in a read-only transaction
    This error occurs because Pgpool attempts to execute INSERT and UPDATE queries on a PostgreSQL instance that is in a read-only state; a quick way to confirm this is shown below.
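
A quick way to check where Pgpool is routing writes when this happens is pgpool's show pool_nodes command (the host below is a placeholder for the chart's Pgpool service; adjust the host and user to your deployment):

echo "show pool_nodes;" | psql -U postgres -h <release>-postgresql-ha-pgpool -p 5432

The role and status columns show which backend Pgpool currently considers primary; when the error appears, the node reported as primary is the old master that is now read-only.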

Are you using any custom parameters or values?

----------------------------
pgPool-II configuration file
----------------------------

backend_clustering_mode = 'streaming_replication'

listen_addresses = '*'
port = '5432'
socket_dir = '/opt/build/pgpool/tmp'
reserved_connections = 0



pcp_listen_addresses = '*'
pcp_port = 9898
pcp_socket_dir = '/opt/build/pgpool/tmp'
listen_backlog_multiplier = 2
serialize_accept = off


backend_hostname0 = 'host1'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_application_name0 = 'server0'


enable_pool_hba = 'on'
authentication_timeout = '30'

allow_clear_text_frontend_auth = 'off'


ssl = 'on'
ssl_key = '/opt/build/pgpool/certs/tls.key'
ssl_cert = '/opt/build/pgpool/certs/tls.crt'

ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'
ssl_prefer_server_ciphers = off
ssl_ecdh_curve = 'prime256v1'
ssl_dh_params_file = ''



num_init_children = '200'
max_pool = '15'


child_life_time = 5min
child_max_connections = 0
connection_life_time = 0
client_idle_limit = 0




log_destination = 'stderr'



log_connections = 'off'
log_disconnections = off
log_hostname = 'on'
log_statement = off
log_per_node_statement = 'off'
log_client_messages = off
log_standby_delay = 'if_over_threshold'


syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'



pid_file_name = '/opt/build/pgpool/tmp/pgpool.pid'
logdir = '/opt/build/pgpool/logs'



connection_cache = on

reset_query_list = 'ABORT; DISCARD ALL'



replicate_select = off

insert_lock = off
lobj_lock_table = ''


replication_stop_on_mismatch = off

failover_if_affected_tuples_mismatch = off



load_balance_mode = 'off'
ignore_leading_white_space = on
read_only_function_list = ''
write_function_list = ''

primary_routing_query_pattern_list = ''

database_redirect_preference_list = ''

app_name_redirect_preference_list = ''
allow_sql_comments = off

disable_load_balance_on_write = 'transaction'

dml_adaptive_object_relationship_list= ''

statement_level_load_balance = 'off'



sr_check_period = '30'
sr_check_user = 'repmgr'

sr_check_database = 'postgres'
delay_threshold = 10000000


follow_primary_command = ''


health_check_period = '30'
health_check_timeout = '10'
health_check_user = 'repmgr'

health_check_database = ''
health_check_max_retries = '5'
health_check_retry_delay = '5'
connect_timeout = 10000



failover_command = 'echo ">>> Failover - that will initialize new primary node search!"'
failback_command = ''

failover_on_backend_error = 'off'

detach_false_primary = off

search_primary_node_timeout = '0'


recovery_user = 'nobody'
recovery_password = ''

recovery_1st_stage_command = ''
recovery_2nd_stage_command = ''
recovery_timeout = 90
client_idle_limit_in_recovery = 0

auto_failback = off
auto_failback_interval = 1min



use_watchdog = off


trusted_servers = ''
ping_path = '/bin'


hostname0 = ''
wd_port0 = 9000
pgpool_port0 = 9999



wd_priority = 1

wd_authkey = ''

wd_ipc_socket_dir = '/tmp'



delegate_IP = ''
if_cmd_path = '/sbin'
if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev eth0 label eth0:0'
if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev eth0'
arping_path = '/usr/sbin'
arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I eth0'


clear_memqcache_on_escalation = on
wd_escalation_command = ''
wd_de_escalation_command = ''


failover_when_quorum_exists = on

failover_require_consensus = on

allow_multiple_failover_requests_from_node = off


enable_consensus_with_half_votes = off




wd_lifecheck_method = 'heartbeat'
wd_interval = 10


heartbeat_hostname0 = ''
heartbeat_port0 = 9694
heartbeat_device0 = ''


wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30


wd_life_point = 3
wd_lifecheck_query = 'SELECT 1'
wd_lifecheck_dbname = 'template1'
wd_lifecheck_user = 'nobody'
wd_lifecheck_password = ''

relcache_expire = 0
relcache_size = 256

check_temp_table = catalog

check_unlogged_table = on
enable_shared_relcache = on

memory_cache_enabled = off
memqcache_method = 'shmem'
memqcache_memcached_host = 'localhost'
memqcache_memcached_port = 11211
memqcache_total_size = 64MB
memqcache_max_num_cache = 1000000
memqcache_expire = 0
memqcache_auto_cache_invalidation = on
memqcache_maxcache = 400kB
memqcache_cache_block_size = 1MB
memqcache_oiddir = '/var/log/pgpool/oiddir'
cache_safe_memqcache_table_list = ''
cache_unsafe_memqcache_table_list = ''
backend_hostname0 = 'postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.ns1.svc.cluster.local'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/opt/build/pgpool/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = 'postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.ns1.svc.cluster.local'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/opt/build/pgpool/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'

What is the expected behavior?

Pgpool should consistently direct queries to the master node and promptly switch to the new master node whenever a change occurs.

What do you see instead?

Pgpool doesn't switch to the new master node until we restart the Pgpool deployment.

@rohitkgupta rohitkgupta added the tech-issues The user has a technical issue about an application label Feb 7, 2024
@github-actions github-actions bot added the triage Triage is needed label Feb 7, 2024
@rohitkgupta
Author

rohitkgupta commented Feb 7, 2024


-----------------------------

PostgreSQL configuration file

-----------------------------

listen_addresses = '*'
port = '5432'
max_connections = '512'

ssl = 'on'
ssl_cert_file = '/certs/tls.crt'
ssl_key_file = '/certs/tls.key'

wal_level = 'hot_standby'
fsync = 'on'

max_wal_size = '400MB'

archive_mode = 'on'
archive_command = '/bin/true'

max_wal_senders = '16'
wal_keep_size = '128MB'

max_replication_slots = '10'

primary_conninfo = 'host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.ns1.svc.cluster.local port=5432 user=repmgr
promote_trigger_file = '/tmp/postgresql.trigger.5432'
hot_standby = 'on'

logging_collector = 'on'

log_directory = '/opt/build/postgresql/logs'
log_filename = 'postgresql.log'

log_connections = 'false'
log_disconnections = 'false'
log_hostname = 'true'

client_min_messages = 'error'

shared_preload_libraries = 'repmgr, pgaudit, repmgr'

include_dir = 'conf.d'

pgaudit.log_catalog = 'off'

@github-actions github-actions bot removed the triage Triage is needed label Feb 7, 2024
@github-actions github-actions bot assigned migruiz4 and unassigned carrodher Feb 7, 2024

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions github-actions bot added the stale 15 days without activity label Feb 23, 2024
@rohitkgupta
Author

ping

@github-actions github-actions bot removed the stale 15 days without activity label Feb 26, 2024
@migruiz4
Member

Hi @rohitkgupta,

I'm sorry for the late response.

I may need more information to help you with this issue. Does your issue persist over time when the master/slave switch happens? Or is this issue transient?

Please note that short downtimes may happen when the master is switched, because in case of failure the chart behaves as follows:

  • Repmgr will notice that the master node is not available and select a new one.
  • Pgpool-II's health_check will detect that the master node is not available and stop accepting new requests.
  • Because new requests are being refused, the readiness probe will fail first and pgpool-II will no longer receive traffic.
  • Eventually, pgpool-II will be restarted by the liveness probe failure and will connect to the new master node.

If the error ERROR: cannot execute INSERT in a read-only transaction is indeed transient, I would recommend fine-tuning both the pgpool-II health check settings and the chart's liveness/readiness probe settings for quicker recovery from failure.
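
For reference, this is roughly what such tuning could look like through the chart values. This is only a minimal sketch: it assumes the standard Bitnami probe structure under the pgpool section, and the numbers are illustrative rather than recommended defaults:

pgpool:
  configuration: |-
    health_check_period = 10
    health_check_timeout = 5
    health_check_max_retries = 3
    health_check_retry_delay = 2
  livenessProbe:
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 5
    failureThreshold: 3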

@rohitkgupta
Author

@migruiz4
Thanks for your response.
It is not a transient issue. Pgpool didn't restart and there were no probe failures for many hours; we had to restart Pgpool manually in this case.


This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions github-actions bot added the stale 15 days without activity label Mar 15, 2024
@BK-STAR77

BK-STAR77 commented Mar 17, 2024

Same issue for me with pgpool:4.5.1-debian-12-r0; downgrading to 4.5.0-debian-12-r10 fixed it.

@github-actions github-actions bot removed the stale 15 days without activity label Mar 18, 2024
@migruiz4
Member

Thank you very much for your feedback @BK-STAR77!

The container logic has not received any change from 4.5.0 to 4.5.1 that could cause this behavior. Checking pgpool release notes, it appears that it could be related to the change of the default value for failover_on_backend_shutdown.

Could you please confirm if setting failover_on_backend_shutdown=on fixes the issue? If that is the case, we should set it as the default value because that may be desired for Kubernetes.

@apoca

apoca commented Mar 27, 2024

I have the same issue here, and manually reverting the deployment to 4.5.0-debian-12-r10 is working now... but this could be very dangerous for production...

@BK-STAR77

Thank you very much for your feedback @BK-STAR77!

The container logic has not received any change from 4.5.0 to 4.5.1 that could cause this behavior. Checking pgpool release notes, it appears that it could be related to the change of the default value for failover_on_backend_shutdown.

Could you please confirm if setting failover_on_backend_shutdown=on fixes the issue? If that is the case, we should set it as the default value because that may be desired for Kubernetes.

Is it a variable in the values.yaml ?

@apoca

apoca commented Mar 28, 2024

I don't see this variable in values.yaml :/

@Stevenpc3
Contributor

Stevenpc3 commented Apr 8, 2024

We are experiencing very bad failover issues also.

We updated to the newer charts from a very old version, and the latest charts appear to not fail over correctly at all. We are using Kubernetes.

We deploy the chart, then deploy our other pods to connect, and all is fine.
Next we delete the primary node pod using kubectl delete pod.
Everything "appears" fine, but then we delete our other connected pods; they start a db-migration using Liquibase and we get this error.

[screenshot: Liquibase migration error]

with a corresponding error in pgpool of the following

[screenshot: corresponding pgpool error]

There have been other errors too stating that it is in read-only mode:

Caused by: org.postgresql.util.PSQLException: ERROR: cannot execute UPDATE in a read-only transaction

We have attempted to downgrade the pgpool image to 4.5.0 while still using the current charts, but pgpool fails to start with the following error:

cp: cannot stat '/opt/bitnami/pgpool/etc/pgpool.conf.sample': No such file or directory

So currently, it appears that the PostgreSQL 16 charts (version 14+) do not fail over properly, which makes HA useless and dangerous.

We were hoping to update since previous issues with pgpool and the chart caused pgpool to restart because of liveness probes; updates to the probe script were added for that, and we hoped all was fixed. There was also a previous issue where the DB would go into split brain pretty often, which was super dangerous because it wouldn't fail outright: you would have one outdated DB and one current one, and reads would hit both... so we had to turn off load balancing.

Edit:

After restarting PGPool they all EVENTUALLY came back up. Will test again to see if it works eventually WITHOUT restart and see if it replicates again.

I am also poring through the other issues to see if any are related.

@apoca

apoca commented Apr 8, 2024

@Stevenpc3 Exactly the same thing is happening to me, but as I only started using postgresql-ha about a month ago I thought it was a problem with some bad configuration on my part. I had this problem for weeks, and only by deleting the extra primary did it start to work, because that recreated everything; it does not recover automatically. In my case I ended up with different random data and noticed that I had two primaries.

@Stevenpc3
Contributor

@apoca the issue you are describing is split-brain (#20998); there is more about it on the internet too. The issue I mentioned at the bottom of my comment is more like replication stopping with no errors or issues detected, which is super dangerous and makes failover pointless.

Neither of those issues is related to the current issue in this thread, though. This thread appears to be an issue with Pgpool and/or the Bitnami chart configs that use it.

@Stevenpc3
Contributor

Stevenpc3 commented Apr 9, 2024

You can add the flag

postgresql-ha:
  pgpool:
    configuration: |
      connection_cache = false
      failover_on_backend_error = on
      failover_on_backend_shutdown = on
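
If you want to verify that the overrides actually reached pgpool, the show pool_status command lists the effective values (host and credentials are placeholders; adjust them to your release):

echo "show pool_status;" | psql -U postgres -h <release>-postgresql-ha-pgpool | grep -E 'connection_cache|failover_on_backend'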

@RaddishIoW

You can add the flag

postgresql-ha:
  pgpool:
    configuration: |
      connection_cache = false
      failover_on_backend_error = on
      failover_on_backend_shutdown = on

Is the "postgresql-ha:" name correct? My config file has the name as "postgresql:" - also, the pgpool: section is a top-level item in the values.yaml... Sorry if I'm missing something here - I've just started using this.

Is it not meant to be:

postgresql:
...
pgpool:
  configuration: |-
    connection_cache = false
    failover_on_backend_error = on
    failover_on_backend_shutdown = on

?

@Stevenpc3
Contributor

My apologies, your way is correct for using the chart directly. I am using a wrapper chart. We have a chart that uses this chart as a dependency so we can template a few things.

@apoca

apoca commented Apr 20, 2024

Since you mention this, I have a question here... I have software as a service and I create a schema/database and a user/password, as well as the necessary permissions, in real time... Sometimes I have a problem with Kind saying that the database does not exist (when it does exist). My question is whether setting this cache to "false" solves this type of problem in pgpool?

@Stevenpc3
Contributor

Stevenpc3 commented Jun 27, 2024

We had someone do a bit of testing...

We tried to reproduce the steps described in a comment by user "rafariossaa" (#23219 (comment)).

Summary:

  • The psql command from rafariossaa generally does not reproduce the pgpool issue.
  • We wrote a different psql command that generally does cause the pgpool issue (against the current beast configuration).
  • On recommendation, we tested against the current "vanilla" bitnami postgresql-ha (14.2.7) and were able to reproduce the problem again.
  • Setting some pgpool config items (connection_cache = false, failover_on_backend_error = on, failover_on_backend_shutdown = on) against the 14.2.7 version seems to fix it.

The longer version:

The comment from rafariossaa sets up the database, adds a separate pod to run psql from (which is a good trick), and then runs a command to connect to pgpool and continuously insert into a test table. The watch is inside psql, so there's one db connection that is kept open.

echo "insert into test values ('1234'); \watch" | psql -U postgres -h  mypg-postgresql-ha-pgpool

Generally, failing the primary pod here causes the psql connection to break and you can restart the command and things work seem fine.

You can change the psql command so that it creates a db connection for each command (here, an insert and a select).

while true; do echo "insert into test values ('123'); select count(*) from test;" | psql -t -U postgres -h mypg-postgresql-ha-pgpool ; sleep 3; done

Generally, this reproduces the pgpool issue. Failing the primary pod causes a few failed connections (connection refused while things fail over) and then an error about the read-only transaction ("ERROR: cannot execute INSERT in a read-only transaction"), where pgpool has apparently failed to point to the correct postgres instance.

On suggestion, we installed the "vanilla" bitnami charts. In the as-is configuration, this will reproduce the read-only transaction issue.

After adding the settings from the Github thread, I've been unable to reproduce the issue so far:

pgpool:
  configuration: |-
   connection_cache = false
   failover_on_backend_error = on
   failover_on_backend_shutdown = on

Will do more testing.

@rafariossaa
Contributor

Hi,
Thanks for your feedback.

@rafariossaa
Contributor

Not sure if this issue could be closed after that feedback.
@rohitkgupta WDYT?

@gtudan
Contributor

gtudan commented Jul 19, 2024

We are still experiencing this issue. Disabling the connection cache feels like a bad move, as it pretty much defeats the purpose of using a pooler in the first place.

@rafariossaa
Contributor

rafariossaa commented Jul 24, 2024

@gtudan, consider that if you lose the master, you are going to have some connections cut and re-established. Could you elaborate a bit more on your case or open a new issue? In that case, feel free to link to this one if you consider it appropriate.

@gtudan
Contributor

gtudan commented Jul 24, 2024

@rafariossaa Sure, if the master is gone, the connection cache for that node has to be flushed.

In our scenario we observed writes on the slave even when both nodes were up - I assume that happened after a helm upgrade when a failover/failback was not properly performed, but I'm not sure.

I totally understand that disabling the connection cache prevents those issues, as the connections are reestablished every time. But I don't quite see why the cache should have to be disabled in normal operations to prevent write operations ending up on the slave.

Unfortunately, we had to roll back to the non-HA chart because we could not figure out this issue, so I'm currently not able to reproduce it. We plan to retry in a few weeks; hopefully I'll be able to provide further insights then.

@rafariossaa
Contributor

I don't like the idea of disabling the cache either.
Thanks, and don't hesitate to share your findings.


This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions github-actions bot added the stale 15 days without activity label Aug 17, 2024

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

@bitnami-bot bitnami-bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 22, 2024
@yukha-dw
Contributor

yukha-dw commented Nov 7, 2024

You can add the flag

postgresql-ha:
  pgpool:
    configuration: |
      connection_cache = false
      failover_on_backend_error = on
      failover_on_backend_shutdown = on

This config does not work for me. One of the 3 Pgpool nodes keeps setting the previous replica node as the primary node when doing a helm upgrade.

@juan131 juan131 reopened this Nov 26, 2024
@github-actions github-actions bot added triage Triage is needed and removed solved labels Nov 26, 2024
@juan131 juan131 added on-hold Issues or Pull Requests with this label will never be considered stale and removed triage Triage is needed stale 15 days without activity labels Nov 26, 2024
@juan131 juan131 assigned juan131 and unassigned carrodher, fmulero and migruiz4 Nov 26, 2024