View Issue Details

ID: 0000568
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2020-03-24 17:21
Reporter: spaskalev
Assigned To: hoshiai
Priority: normal
Severity: major
Reproducibility: always
Status: feedback
Resolution: open
Platform: Linux
OS: VMware Photon OS
OS Version: 3.0
Product Version: 4.0.7
Target Version:
Fixed in Version:
Summary: 0000568: Reattaching detached primary doesn't bring it up
Description: I'm running postgres with two async standbys, an external failover agent (repmgrd), and pgpool as a proxy that has all postgres nodes added and can figure out which node is the primary and send traffic to it.

I'm running pgpool with health check enabled so that failed nodes are automatically detached.

When the primary is detached (either by a failed health check or manually through pcp_detach_node) and then attached back with pcp_attach_node, pgpool continues to show its status as 'down' and will not send traffic to it.
Steps To Reproduce:
$ seq 0 2 | xargs -n 1 pcp_node_info -w
postgres-0 5432 2 -nan up primary 0 2019-12-16 08:01:14
postgres-1 5432 2 -nan up standby 0 2019-12-16 08:01:14
postgres-2 5432 2 -nan up standby 0 2019-12-16 08:01:14

$ pcp_detach_node -w 0
pcp_detach_node -- Command Successful

$ seq 0 2 | xargs -n 1 pcp_node_info -w
postgres-0 5432 3 -nan down primary 0 2019-12-16 08:02:44
postgres-1 5432 2 -nan up standby 0 2019-12-16 08:01:14
postgres-2 5432 2 -nan up standby 0 2019-12-16 08:01:14

$ pcp_attach_node -w 0
pcp_attach_node -- Command Successful

$ seq 0 2 | xargs -n 1 pcp_node_info -w
postgres-0 5432 3 -nan down primary 0 2019-12-16 08:02:44
postgres-1 5432 2 -nan up standby 0 2019-12-16 08:01:14
postgres-2 5432 2 -nan up standby 0 2019-12-16 08:01:14
Additional Information: Here is my pgpool config:

sr_check_user = '...'
sr_check_password = '...'
sr_check_database = '...'
sr_check_period = 1

connect_timeout = 3000
health_check_timeout = 5
health_check_period = 1
health_check_max_retries = 0
health_check_user = '...'
health_check_password = '...'
health_check_database = '...'

search_primary_node_timeout = 0
detach_false_primary = on
failover_on_backend_error = on
failover_command = '/scripts/pgpool_failover.sh %h' # custom script for events

listen_addresses = '*'
port = 5432
socket_dir = '/var/run/postgresql'
listen_backlog_multiplier = 1
serialize_accept = on
pcp_listen_addresses = '' # Disable PCP over TCP
pcp_socket_dir = '/tmp'
enable_pool_hba = on
# Note: this is a file name, not a password
pool_passwd = 'pool_passwd'
allow_clear_text_frontend_auth = off

# - Concurrent session and pool size -
num_init_children = 1200
max_pool = 1

# - Life time -
serialize_accept = on
child_life_time = 0
child_max_connections = 1
connection_life_time = 3900
client_idle_limit = 720

log_destination = 'syslog'
syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'
pid_file_name = '/var/run/postgresql/pgpool.pid'
logdir = '/var/log/postgresql'

connection_cache = off
load_balancing = off

master_slave_mode = on
master_slave_sub_mode = 'stream'

backend_hostname0 = 'postgres-0'
backend_port0 = 5432
backend_weight0 = 0
backend_data_directory0 = '/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'

backend_hostname1 = 'postgres-1'
backend_port1 = 5432
backend_weight1 = 0
backend_data_directory1 = '/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'

backend_hostname2 = 'postgres-2'
backend_port2 = 5432
backend_weight2 = 0
backend_data_directory2 = '/data'
backend_flag2 = 'ALLOW_TO_FAILOVER'
Tags: No tags attached.

Activities

spaskalev

2019-12-16 22:19

reporter   ~0003023

The issue seems to be caused by the infinite search for a new primary node - when I detach the primary pgpool starts looking for a new primary

2019-12-16 13:06:44: pid 16222: LOG: find_primary_node: standby node is 1
2019-12-16 13:06:44: pid 16222: LOG: find_primary_node: standby node is 2
2019-12-16 13:06:45: pid 16222: LOG: find_primary_node: standby node is 1
2019-12-16 13:06:45: pid 16222: LOG: find_primary_node: standby node is 2
... - logs repeat

I've tried limiting search_primary_node_timeout and re-attaching the existing primary after pgpool has given up on finding a new one; pgpool then correctly attaches it in an 'up' state.
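For reference, a minimal sketch of that workaround, with an illustrative 60-second limit (the config above sets search_primary_node_timeout = 0, i.e. keep searching forever, which matches the repeating log lines):

search_primary_node_timeout = 60   # pgpool.conf: give up searching for a new primary after 60 seconds

$ pcp_attach_node -w 0             # after the search has given up, re-attach the old primary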

hoshiai

2020-01-06 09:34

developer   ~0003038

> The issue seems to be caused by the infinite search for a new primary node - when I detach the primary pgpool starts looking for a new primary

You're right. In this case, pgpool searches for a new primary forever, because it only looks for the new primary among the active standby nodes.

> I've tried limiting search_primary_node_timeout and re-attaching the existing primary after pgpool has given up on finding a new one; pgpool then correctly attaches it in an 'up' state.

Yes, there is no problem with your handling.

spaskalev

2020-01-06 20:39

reporter   ~0003039

True, but I don't want to limit the primary search interval in case a real failover happens - pgpool would then have the wrong notion of which node is the primary.

Alternatively, I need a way to trigger the primary search on some interval to detect a failover/switchover without any nodes going down.

hoshiai

2020-01-08 10:52

developer   ~0003046

pgpool can only execute one failover process at a time (internally this covers failover, failback, attaching a node and detaching a node).
The search for a new primary is part of the failover process, so the next failover process (the pcp_attach_node) is not executed until a new primary has been found.

Currently, if pgpool detects that the primary node is down, it runs on the premise that one of the other standby nodes will be promoted to the new primary.
This behavior exists because pgpool is not designed to be used together with an external failover system in SR mode.

spaskalev

2020-01-08 20:18

reporter   ~0003052

I agree, but I think that an external failover is a valid use case. In my setup I use multiple pgpool instances for different clients to provide high availability and load balancing over pgpool itself. The pgpool instances don't know about each other and don't use the watchdog/virtual IP mechanism. This way multiple pgpool instances running on different machines can be used concurrently.

If one of the pgpool instances temporarily loses connectivity to the postgres primary, that doesn't mean it should elect a new primary - only that it lost connectivity. Then, after a while, say the primary comes back (I currently do this manually via a cronjob, but I see that it is now available as a feature in pgpool 4.1.0) - I would expect pgpool to simply start proxying traffic to it again.

All of that without triggering a failover on the actual postgres node. This way the behavior of my setup is decoupled and I can modify different parts without changing the rest.

hoshiai

2020-01-09 11:24

developer   ~0003054

I understand your thinking.
However, I think the watchdog feature would be satisfactory too. If you use the watchdog in pgpool, it is not a problem for one pgpool node to lose the primary temporarily, and multiple pgpool instances running on different machines can still be used concurrently (without using the VIP).
In general, a detected failure of the primary node is a very serious incident; pgpool cannot continue until this problem is resolved.

Currently, we are starting proposals and development for the next pgpool version. If you need a new feature, please suggest it on the ML.

hoshiai

2020-01-09 11:32

developer   ~0003055

Also, if the primary is only temporarily down, you can avoid the failover by making the failover conditions more lenient (for example, by increasing health_check_max_retries or health_check_timeout).
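A minimal sketch of that relaxation, with illustrative values (the config above uses health_check_period = 1, health_check_timeout = 5 and health_check_max_retries = 0, so a single failed check marks the node down):

# pgpool.conf: tolerate brief connectivity blips before declaring a backend down
health_check_period = 10
health_check_timeout = 10
health_check_max_retries = 3
health_check_retry_delay = 5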

spaskalev

2020-03-22 16:00

reporter   ~0003275

I agree, but we are rather committed to our current architecture, so switching to the watchdog would need to be implemented, properly tested and so on; I have to find a way around this for now without an architecture change.

Currently I have patched the health check processes to skip the check if BACKEND_INFO(*node_id).role != ROLE_PRIMARY.

This change appears to work fine for our case:
- if the primary fails, 'failover_on_backend_error' detects this immediately (as the primary is constantly in use) and starts looking for a new primary.
- if a standby fails, the health check will disconnect it from the pool.

Let me know what you think of this - I can send a patch where this is behind an option.

Regards,
Stanislav

spaskalev

2020-03-22 16:01

reporter   ~0003276

My bad, I meant: "to do the health check only if BACKEND_INFO(*node_id).role != ROLE_PRIMARY"

hoshiai

2020-03-23 16:37

developer   ~0003277

I think that's fine, as long as the external failover agent reliably starts up a new primary from the active standby nodes when a failover is triggered by failover_on_backend_error in pgpool.

In other words, it is a problem if the primary isn't switched when failover_on_backend_error happens.

hoshiai

2020-03-23 16:49

developer   ~0003278

In addition, pgpool has failover triggers other than the health check and failover_on_backend_error: pgpool also performs a failover when PostgreSQL is shut down while a PostgreSQL session is still connected. Please be careful about this case.

spaskalev

2020-03-23 17:44

reporter   ~0003279

The shutdown case, yes - that is good. I do wonder about the 'BACKEND_INFO(*node_id).role != ROLE_PRIMARY' check, though - after a few hours of testing it failed and pgpool ran a health check for the primary.

I have now changed my patch to

      /*
       * Skip healthcheck for primary nodes
       */
      if ((*node_id) == REAL_PRIMARY_NODE_ID) {
         sleep(30);
         continue;
      }

in the main health check loop. Is this the proper way to get the current primary node id ?
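For context, a minimal sketch of how such a guard might sit in a per-node health check loop; apart from the fragment above (BACKEND_INFO/REAL_PRIMARY_NODE_ID, the sleep and the continue), the surrounding loop is an illustrative placeholder rather than pgpool's actual source:

      /* illustrative per-node health check loop; only the guard itself
         is taken from the patch above, the rest is a placeholder */
      for (;;)
      {
         /*
          * Never health-check the node pgpool currently considers the
          * primary, so a transient loss of connectivity to it cannot
          * trigger a failover from here; an actual primary failure is
          * still caught via failover_on_backend_error.
          */
         if ((*node_id) == REAL_PRIMARY_NODE_ID) {
            sleep(30);   /* re-check interval used in the patch above */
            continue;
         }

         /* ... run the normal health check against backend *node_id ... */
      }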

Regards

spaskalev

2020-03-24 15:24

reporter   ~0003280

This could be a valid feature applicable to other setups too, I guess - different, dynamic health check parameters depending on the node's role, so that replicas that aren't loaded can fail over faster, while primaries that are heavily loaded can have increased health check settings. I know I can configure this per node, but it's not dynamic.
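For reference, the static per-node form mentioned here might look like the following sketch (assuming Pgpool-II's per-node health check parameters, with node 0 being the primary in the setup above; values are illustrative):

# heavily loaded primary (node 0): relaxed health check
health_check_period0 = 30
health_check_max_retries0 = 5

# lightly loaded standbys: detect failure faster
health_check_period1 = 5
health_check_period2 = 5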

hoshiai

2020-03-24 17:21

developer   ~0003281

> in the main health check loop. Is this the proper way to get the current primary node id ?

Yes, you're right. REAL_PRIMARY_NODE_ID is better than ROLE_PRIMARY; BACKEND_INFO.role can change status momentarily during failover().

Issue History

Date Modified Username Field Change
2019-12-16 17:21 spaskalev New Issue
2019-12-16 22:19 spaskalev Note Added: 0003023
2019-12-24 15:04 hoshiai Assigned To => hoshiai
2019-12-24 15:04 hoshiai Status new => assigned
2020-01-06 09:34 hoshiai Status assigned => feedback
2020-01-06 09:34 hoshiai Note Added: 0003038
2020-01-06 20:39 spaskalev Note Added: 0003039
2020-01-06 20:39 spaskalev Status feedback => assigned
2020-01-08 10:52 hoshiai Status assigned => feedback
2020-01-08 10:52 hoshiai Note Added: 0003046
2020-01-08 20:18 spaskalev Note Added: 0003052
2020-01-08 20:18 spaskalev Status feedback => assigned
2020-01-09 11:24 hoshiai Status assigned => feedback
2020-01-09 11:24 hoshiai Note Added: 0003054
2020-01-09 11:32 hoshiai Note Added: 0003055
2020-03-22 16:00 spaskalev Note Added: 0003275
2020-03-22 16:00 spaskalev Status feedback => assigned
2020-03-22 16:01 spaskalev Note Added: 0003276
2020-03-23 16:37 hoshiai Note Added: 0003277
2020-03-23 16:49 hoshiai Status assigned => feedback
2020-03-23 16:49 hoshiai Note Added: 0003278
2020-03-23 17:44 spaskalev Note Added: 0003279
2020-03-23 17:44 spaskalev Status feedback => assigned
2020-03-24 15:24 spaskalev Note Added: 0003280
2020-03-24 17:21 hoshiai Status assigned => feedback
2020-03-24 17:21 hoshiai Note Added: 0003281