View Issue Details

ID: 0000568
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2020-01-09 11:32
Reporter: spaskalev
Assigned To: hoshiai
Priority: normal
Severity: major
Reproducibility: always
Status: feedback
Resolution: open
Platform: Linux
OS: VMware Photon OS
OS Version: 3.0
Product Version: 4.0.7
Target Version:
Fixed in Version:
Summary: 0000568: Reattaching detached primary doesn't bring it up
Description

I'm running postgres with two async standbys, an external failover agent (repmgrd), and pgpool as a proxy that has all postgres nodes added and can figure out which node is the primary and send traffic to it.

I'm running pgpool with health check enabled so that failed nodes are automatically detached.

When the primary is detached (either by a failed health check or manually through pcp_detach_node) and then attached back with pcp_attach_node, pgpool continues to show its status as 'down' and will not send traffic to it.
Steps To Reproduce

$ seq 0 2 | xargs -n 1 pcp_node_info -w
postgres-0 5432 2 -nan up primary 0 2019-12-16 08:01:14
postgres-1 5432 2 -nan up standby 0 2019-12-16 08:01:14
postgres-2 5432 2 -nan up standby 0 2019-12-16 08:01:14

$ pcp_detach_node -w 0
pcp_detach_node -- Command Successful

$ seq 0 2 | xargs -n 1 pcp_node_info -w
postgres-0 5432 3 -nan down primary 0 2019-12-16 08:02:44
postgres-1 5432 2 -nan up standby 0 2019-12-16 08:01:14
postgres-2 5432 2 -nan up standby 0 2019-12-16 08:01:14

$ pcp_attach_node -w 0
pcp_attach_node -- Command Successful

$ seq 0 2 | xargs -n 1 pcp_node_info -w
postgres-0 5432 3 -nan down primary 0 2019-12-16 08:02:44
postgres-1 5432 2 -nan up standby 0 2019-12-16 08:01:14
postgres-2 5432 2 -nan up standby 0 2019-12-16 08:01:14
Additional Information

Here is my pgpool config:

sr_check_user = '...'
sr_check_password = '...'
sr_check_database = '...'
sr_check_period = 1

connect_timeout = 3000
health_check_timeout = 5
health_check_period = 1
health_check_max_retries = 0
health_check_user = '...'
health_check_password = '...'
health_check_database = '...'

search_primary_node_timeout = 0
detach_false_primary = on
failover_on_backend_error = on
failover_command = '/scripts/pgpool_failover.sh %h' # custom script for events

listen_addresses = '*'
port = 5432
socket_dir = '/var/run/postgresql'
listen_backlog_multiplier = 1
serialize_accept = on
pcp_listen_addresses = '' # Disable PCP over TCP
pcp_socket_dir = '/tmp'
enable_pool_hba = on
# Note: this is a file name, not a password
pool_passwd = 'pool_passwd'
allow_clear_text_frontend_auth = off

# - Concurrent session and pool size -
num_init_children = 1200
max_pool = 1

# - Life time -
serialize_accept = on
child_life_time = 0
child_max_connections = 1
connection_life_time = 3900
client_idle_limit = 720

log_destination = 'syslog'
syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'
pid_file_name = '/var/run/postgresql/pgpool.pid'
logdir = '/var/log/postgresql'

connection_cache = off
load_balancing = off

master_slave_mode = on
master_slave_sub_mode = 'stream'

backend_hostname0 = 'postgres-0'
backend_port0 = 5432
backend_weight0 = 0
backend_data_directory0 = '/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'

backend_hostname1 = 'postgres-1'
backend_port1 = 5432
backend_weight1 = 0
backend_data_directory1 = '/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'

backend_hostname2 = 'postgres-2'
backend_port2 = 5432
backend_weight2 = 0
backend_data_directory2 = '/data'
backend_flag2 = 'ALLOW_TO_FAILOVER'
Tags: No tags attached.

Activities

spaskalev

2019-12-16 22:19

reporter   ~0003023

The issue seems to be caused by the infinite search for a new primary node - when I detach the primary, pgpool starts looking for a new primary:

2019-12-16 13:06:44: pid 16222: LOG: find_primary_node: standby node is 1
2019-12-16 13:06:44: pid 16222: LOG: find_primary_node: standby node is 2
2019-12-16 13:06:45: pid 16222: LOG: find_primary_node: standby node is 1
2019-12-16 13:06:45: pid 16222: LOG: find_primary_node: standby node is 2
... (these log lines repeat)

I've tried limiting search_primary_node_timeout; after pgpool has given up on finding a new primary, re-attaching the existing primary correctly brings it back in an 'up' state.
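
A minimal sketch of this workaround (the 10-second timeout is an illustrative value, not the one from my setup):

# pgpool.conf: bound the primary search instead of searching forever
search_primary_node_timeout = 10

# once the search has timed out, re-attach the old primary; it comes back 'up':
$ pcp_attach_node -w 0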

hoshiai

2020-01-06 09:34

developer   ~0003038

> The issue seems to be caused by the infinite search for a new primary node - when I detach the primary pgpool starts looking for a new primary

You're right. In this case pgpool searches for a new primary forever, because it looks for the new primary only among the active standby nodes.

> I've tried limiting the search_primary_node_timeout and re-attaching the existing primary after pgpool has given up on finding a new primary then correctly attaches it in an up state

Yes, your workaround is fine.

spaskalev

2020-01-06 20:39

reporter   ~0003039

True, but I don't want to limit the primary search interval, because if a real failover happens pgpool would then have the wrong notion of which node is the primary.

Alternatively, I need a way to trigger the primary search at some interval, to detect a failover/switchover without any nodes going down.

hoshiai

2020-01-08 10:52

developer   ~0003046

pgpool can execute only one failover process at a time (internally this covers failover, failback, attaching a node, and detaching a node).
The search for a new primary is part of the failover process, so the next failover request (the pcp_attach_node) is not executed until a new primary is found.

Currently, if pgpool detects that the primary node is down, it runs on the premise that another standby node will be promoted to the new primary.
This behavior exists because pgpool is not designed to be used together with an external failover system in SR (streaming replication) mode.

spaskalev

2020-01-08 20:18

reporter   ~0003052

I agree, but I think that it is a valid use case for an external failover. In my setup I use multiple pgpool instances for different clients to provide high availability and load balancing over pgpool itself. The multiple instances of pgpool don't know about each other and don't use the watchdog/virtual ip mechanism. This way multiple pgpool instances running on different machines can be used concurrently.

If one instance of the pgpools looses connectivity temporary to the postgres primary that doesn't mean that it should elect a new primary - only that it lost connectivity. Then after a while say the primary comes back (I currently do this manually via a cronjob, but I see that it is now available as a feature in pgpool 4.1.0) - and I would expect that pgpool would just start proxying traffic to it.
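
A minimal sketch of that cronjob approach (node id 0 and the one-minute interval are illustrative assumptions):

# crontab: every minute, try to re-attach node 0 if pgpool reports it as down
* * * * * pcp_node_info -w 0 | grep -q ' down ' && pcp_attach_node -w 0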

All of that without triggering failover on the actual postgres node. This way the behavior of my setup is decoupled and I can modify different parts without changing the rest.

hoshiai

2020-01-09 11:24

developer   ~0003054

I understand your thinking.
However, I think the watchdog feature can also satisfy this use case. If you use the watchdog, it is not a problem for one pgpool node to temporarily lose the primary node, and multiple pgpool instances running on different machines can still be used concurrently (without using a VIP).
In general, detecting that the primary node is down is a very serious incident; pgpool cannot continue until that problem is resolved.
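
As an illustration, a minimal watchdog sketch for two pgpool instances without a virtual IP (the hostnames and ports are hypothetical):

# pgpool.conf on the first instance
use_watchdog = on
wd_hostname = 'pgpool-a'             # this pgpool host (hypothetical name)
wd_port = 9000
delegate_IP = ''                     # empty: run the watchdog without a virtual IP
other_pgpool_hostname0 = 'pgpool-b'  # the peer pgpool instance (hypothetical name)
other_pgpool_port0 = 5432
other_wd_port0 = 9000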

Currently, we are starting the proposal and development of the next pgpool version. If you need a new feature, please suggest it on the mailing list (ML).

hoshiai

2020-01-09 11:32

developer   ~0003055

Also, if the node is only temporarily down, you can avoid this by making the failover condition less strict (for example, by increasing health_check_max_retries or health_check_timeout).
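
For example, a sketch of relaxed health-check settings along these lines (the values are illustrative, not tested recommendations):

# pgpool.conf: tolerate short outages before detaching the node
health_check_timeout = 30       # wait longer for a health-check response
health_check_max_retries = 3    # retry a few times instead of failing immediately
health_check_retry_delay = 5    # seconds between retries (illustrative value)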

Issue History

Date Modified Username Field Change
2019-12-16 17:21 spaskalev New Issue
2019-12-16 22:19 spaskalev Note Added: 0003023
2019-12-24 15:04 hoshiai Assigned To => hoshiai
2019-12-24 15:04 hoshiai Status new => assigned
2020-01-06 09:34 hoshiai Status assigned => feedback
2020-01-06 09:34 hoshiai Note Added: 0003038
2020-01-06 20:39 spaskalev Note Added: 0003039
2020-01-06 20:39 spaskalev Status feedback => assigned
2020-01-08 10:52 hoshiai Status assigned => feedback
2020-01-08 10:52 hoshiai Note Added: 0003046
2020-01-08 20:18 spaskalev Note Added: 0003052
2020-01-08 20:18 spaskalev Status feedback => assigned
2020-01-09 11:24 hoshiai Status assigned => feedback
2020-01-09 11:24 hoshiai Note Added: 0003054
2020-01-09 11:32 hoshiai Note Added: 0003055