View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000568 | Pgpool-II | Bug | public | 2019-12-16 17:21 | 2020-03-24 17:21 |
| Reporter | spaskalev | Assigned To | hoshiai | ||
| Priority | normal | Severity | major | Reproducibility | always |
| Status | feedback | Resolution | open | ||
| Platform | Linux | OS | VMware Photon OS | OS Version | 3.0 |
| Product Version | 4.0.7 | ||||
| Summary | 0000568: Reattaching detached primary doesn't bring it up | ||||
**Description**

I'm running PostgreSQL with two async standbys, an external failover agent (repmgrd), and pgpool as a proxy that has all PostgreSQL nodes added and can figure out which node is the primary and send traffic to it. I'm running pgpool with health checks enabled so that failed nodes are automatically detached. When the primary is detached (either by a failed health check or manually through pcp_detach_node) and then attached back with pcp_attach_node, pgpool continues to show its status as 'down' and will not send traffic to it.

**Steps To Reproduce**

```shell
$ seq 0 2 | xargs -n 1 pcp_node_info -w
postgres-0 5432 2 -nan up primary 0 2019-12-16 08:01:14
postgres-1 5432 2 -nan up standby 0 2019-12-16 08:01:14
postgres-2 5432 2 -nan up standby 0 2019-12-16 08:01:14
$ pcp_detach_node -w 0
pcp_detach_node -- Command Successful
$ seq 0 2 | xargs -n 1 pcp_node_info -w
postgres-0 5432 3 -nan down primary 0 2019-12-16 08:02:44
postgres-1 5432 2 -nan up standby 0 2019-12-16 08:01:14
postgres-2 5432 2 -nan up standby 0 2019-12-16 08:01:14
$ pcp_attach_node -w 0
pcp_attach_node -- Command Successful
$ seq 0 2 | xargs -n 1 pcp_node_info -w
postgres-0 5432 3 -nan down primary 0 2019-12-16 08:02:44
postgres-1 5432 2 -nan up standby 0 2019-12-16 08:01:14
postgres-2 5432 2 -nan up standby 0 2019-12-16 08:01:14
```

**Additional Information**

Here is my pgpool config:

```
sr_check_user = '...'
sr_check_password = '...'
sr_check_database = '...'
sr_check_period = 1
connect_timeout = 3000
health_check_timeout = 5
health_check_period = 1
health_check_max_retries = 0
health_check_user = '...'
health_check_password = '...'
health_check_database = '...'
search_primary_node_timeout = 0
detach_false_primary = on
failover_on_backend_error = on
failover_command = '/scripts/pgpool_failover.sh %h'  # custom script for events
listen_addresses = '*'
port = 5432
socket_dir = '/var/run/postgresql'
listen_backlog_multiplier = 1
serialize_accept = on
pcp_listen_addresses = ''  # Disable PCP over TCP
pcp_socket_dir = '/tmp'
enable_pool_hba = on
# Note: this is a file name, not a password
pool_passwd = 'pool_passwd'
allow_clear_text_frontend_auth = off
# - Concurrent session and pool size -
num_init_children = 1200
max_pool = 1
# - Life time -
serialize_accept = on
child_life_time = 0
child_max_connections = 1
connection_life_time = 3900
client_idle_limit = 720
log_destination = 'syslog'
syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'
pid_file_name = '/var/run/postgresql/pgpool.pid'
logdir = '/var/log/postgresql'
connection_cache = off
load_balancing = off
master_slave_mode = on
master_slave_sub_mode = 'stream'
backend_hostname0 = 'postgres-0'
backend_port0 = 5432
backend_weight0 = 0
backend_data_directory0 = '/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = 'postgres-1'
backend_port1 = 5432
backend_weight1 = 0
backend_data_directory1 = '/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_hostname2 = 'postgres-2'
backend_port2 = 5432
backend_weight2 = 0
backend_data_directory2 = '/data'
backend_flag2 = 'ALLOW_TO_FAILOVER'
```

**Tags:** No tags attached.
|
|
The issue seems to be caused by the infinite search for a new primary node. When I detach the primary, pgpool starts looking for a new primary:

```
2019-12-16 13:06:44: pid 16222: LOG: find_primary_node: standby node is 1
2019-12-16 13:06:44: pid 16222: LOG: find_primary_node: standby node is 2
2019-12-16 13:06:45: pid 16222: LOG: find_primary_node: standby node is 1
2019-12-16 13:06:45: pid 16222: LOG: find_primary_node: standby node is 2
... (logs repeat)
```

I've tried limiting search_primary_node_timeout, and re-attaching the existing primary after pgpool has given up on finding a new primary then correctly attaches it in an 'up' state.
|
|
> The issue seems to be caused by the infinite search for a new primary node - when I detach the primary pgpool starts looking for a new primary

You're right. In this case, pgpool searches for a new primary forever, because it only looks for the new primary among the active standby nodes.

> I've tried limiting the search_primary_node_timeout and re-attaching the existing primary after pgpool has given up on finding a new primary then correctly attaches it in an up state

Yes, your handling is fine.
|
|
True, but I don't want to limit the primary search interval, in case a real failover happens; pgpool would then have the wrong notion of which node is the primary. Alternatively, I need a way to trigger the primary search on some interval, so that a failover/switchover can be detected without any nodes going down.
|
|
pgpool can execute only one failover process at a time (failover, failback, attach node, and detach node are all handled internally as failover processes). The search for a new primary is part of the failover process, so the next failover process (pcp_attach_node) is not executed until a new primary has been found.

Currently, if pgpool detects that the primary node is down, it runs on the premise that another standby node will be promoted to the new primary. This behavior exists because pgpool is not designed to be used together with an external failover system in streaming replication mode.
|
|
I agree, but I think this is a valid use case for an external failover agent. In my setup I use multiple pgpool instances for different clients, to provide high availability and load balancing over pgpool itself. The pgpool instances don't know about each other and don't use the watchdog/virtual-IP mechanism; this way, multiple pgpool instances running on different machines can be used concurrently. If one of the pgpool instances temporarily loses connectivity to the PostgreSQL primary, that doesn't mean it should elect a new primary, only that it lost connectivity. Then, after a while, say the primary comes back (I currently re-attach it manually via a cron job, but I see this is now available as a feature in pgpool 4.1.0), and I would expect pgpool to simply start proxying traffic to it again, all without triggering failover on the actual PostgreSQL node. This way the behavior of my setup is decoupled, and I can modify different parts without changing the rest.
|
|
I understand your thinking. However, I think the watchdog feature would also satisfy your requirements. If you use the watchdog, it is not a problem for one pgpool node to lose the primary temporarily, and multiple pgpool instances running on different machines can still be used concurrently (without a VIP).

In general, it is a very serious incident when the primary node is detected as down; pgpool cannot continue until this problem is resolved.

We are currently starting proposals and development for the next pgpool version. If you need a new feature, please suggest it on the mailing list.
|
|
And if the node is only temporarily down, you can resolve this by making the failover conditions more lenient (for example, by increasing health_check_max_retries or health_check_timeout).
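For example, relative to the configuration in this report (health_check_max_retries = 0, health_check_timeout = 5, health_check_period = 1), a more tolerant pgpool.conf fragment might look like the following; the values are purely illustrative:

```
health_check_period = 10
health_check_timeout = 10        # allow slow responses from a loaded node
health_check_max_retries = 3     # retry before declaring the node down
health_check_retry_delay = 5     # seconds between retries
```

With these settings, a node is only detached after roughly 3 failed retries, rather than on the first failed check.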
|
|
I agree, but we are rather committed to our current architecture, so switching to the watchdog would need to be implemented, properly tested, and so on; I have to find a way around this for now without an architecture change.

Currently I have patched the health check processes to skip the check if `BACKEND_INFO(*node_id).role != ROLE_PRIMARY`. This change appears to work fine for our case:

- if the primary fails, failover_on_backend_error immediately detects this (as the primary is constantly in use) and pgpool starts to look for a new primary;
- if a standby fails, the health check will disconnect it from the pool.

Let me know what you think of this; I can send a patch where this is behind an option.

Regards, Stanislav
|
|
My bad, I meant: "to do the health check only if BACKEND_INFO(*node_id).role != ROLE_PRIMARY".
|
|
I think that's fine, as long as the external failover agent reliably promotes a new primary from the active standby nodes when failover is triggered by failover_on_backend_error in pgpool. In other words, it is a problem if the primary isn't switched when failover_on_backend_error fires.
|
|
In addition, pgpool triggers failover at times other than the health check and failover_on_backend_error: pgpool also performs failover when PostgreSQL is shut down while a PostgreSQL session is still connected. Please be careful about this case.
|
|
The shutdown case, yes, this is good. I wonder about the `BACKEND_INFO(*node_id).role != ROLE_PRIMARY` check: after a few hours of testing I got a failure there, and pgpool ran a health check for the primary. I have now changed my patch to:

```c
/*
 * Skip healthcheck for primary nodes
 */
if ((*node_id) == REAL_PRIMARY_NODE_ID)
{
	sleep(30);
	continue;
}
```

in the main health check loop. Is this the proper way to get the current primary node id?

Regards
|
|
This could be a valid feature, applicable to other setups I guess: different, dynamic health check parameters depending on the node's role. That way, replicas that aren't loaded can fail over faster, while primaries that are heavily loaded can have relaxed health check settings. I know I can configure this per node, but it's not dynamic.
|
|
> in the main health check loop. Is this the proper way to get the current primary node id ?

Yes, you're right. REAL_PRIMARY_NODE_ID is better than ROLE_PRIMARY, because BACKEND_INFO.role changes status momentarily during failover().
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2019-12-16 17:21 | spaskalev | New Issue | |
| 2019-12-16 22:19 | spaskalev | Note Added: 0003023 | |
| 2019-12-24 15:04 | hoshiai | Assigned To | => hoshiai |
| 2019-12-24 15:04 | hoshiai | Status | new => assigned |
| 2020-01-06 09:34 | hoshiai | Status | assigned => feedback |
| 2020-01-06 09:34 | hoshiai | Note Added: 0003038 | |
| 2020-01-06 20:39 | spaskalev | Note Added: 0003039 | |
| 2020-01-06 20:39 | spaskalev | Status | feedback => assigned |
| 2020-01-08 10:52 | hoshiai | Status | assigned => feedback |
| 2020-01-08 10:52 | hoshiai | Note Added: 0003046 | |
| 2020-01-08 20:18 | spaskalev | Note Added: 0003052 | |
| 2020-01-08 20:18 | spaskalev | Status | feedback => assigned |
| 2020-01-09 11:24 | hoshiai | Status | assigned => feedback |
| 2020-01-09 11:24 | hoshiai | Note Added: 0003054 | |
| 2020-01-09 11:32 | hoshiai | Note Added: 0003055 | |
| 2020-03-22 16:00 | spaskalev | Note Added: 0003275 | |
| 2020-03-22 16:00 | spaskalev | Status | feedback => assigned |
| 2020-03-22 16:01 | spaskalev | Note Added: 0003276 | |
| 2020-03-23 16:37 | hoshiai | Note Added: 0003277 | |
| 2020-03-23 16:49 | hoshiai | Status | assigned => feedback |
| 2020-03-23 16:49 | hoshiai | Note Added: 0003278 | |
| 2020-03-23 17:44 | spaskalev | Note Added: 0003279 | |
| 2020-03-23 17:44 | spaskalev | Status | feedback => assigned |
| 2020-03-24 15:24 | spaskalev | Note Added: 0003280 | |
| 2020-03-24 17:21 | hoshiai | Status | assigned => feedback |
| 2020-03-24 17:21 | hoshiai | Note Added: 0003281 |