View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000568 | Pgpool-II | Bug | public | 2019-12-16 17:21 | 2020-03-24 17:21 |
| Reporter | spaskalev | Assigned To | hoshiai | ||
| Priority | normal | Severity | major | Reproducibility | always |
| Status | feedback | Resolution | open | ||
| Platform | Linux | OS | VMware Photon OS | OS Version | 3.0 |
| Product Version | 4.0.7 | ||||
| Summary | 0000568: Reattaching detached primary doesn't bring it up | ||||
**Description**

I'm running PostgreSQL with two async standbys, an external failover agent (repmgrd), and pgpool as a proxy that has all PostgreSQL nodes added and can figure out which node is the primary and send traffic to it. I'm running pgpool with health checks enabled so that failed nodes are automatically detached. When the primary is detached (either by a failed health check or manually through pcp_detach_node) and then attached back with pcp_attach_node, pgpool continues to show its status as 'down' and will not send traffic to it.

**Steps To Reproduce**

```shell
$ seq 0 2 | xargs -n 1 pcp_node_info -w
postgres-0 5432 2 -nan up primary 0 2019-12-16 08:01:14
postgres-1 5432 2 -nan up standby 0 2019-12-16 08:01:14
postgres-2 5432 2 -nan up standby 0 2019-12-16 08:01:14
$ pcp_detach_node -w 0
pcp_detach_node -- Command Successful
$ seq 0 2 | xargs -n 1 pcp_node_info -w
postgres-0 5432 3 -nan down primary 0 2019-12-16 08:02:44
postgres-1 5432 2 -nan up standby 0 2019-12-16 08:01:14
postgres-2 5432 2 -nan up standby 0 2019-12-16 08:01:14
$ pcp_attach_node -w 0
pcp_attach_node -- Command Successful
$ seq 0 2 | xargs -n 1 pcp_node_info -w
postgres-0 5432 3 -nan down primary 0 2019-12-16 08:02:44
postgres-1 5432 2 -nan up standby 0 2019-12-16 08:01:14
postgres-2 5432 2 -nan up standby 0 2019-12-16 08:01:14
```

**Additional Information**

Here is my pgpool config:

```
sr_check_user = '...'
sr_check_password = '...'
sr_check_database = '...'
sr_check_period = 1
connect_timeout = 3000
health_check_timeout = 5
health_check_period = 1
health_check_max_retries = 0
health_check_user = '...'
health_check_password = '...'
health_check_database = '...'
search_primary_node_timeout = 0
detach_false_primary = on
failover_on_backend_error = on
failover_command = '/scripts/pgpool_failover.sh %h'  # custom script for events
listen_addresses = '*'
port = 5432
socket_dir = '/var/run/postgresql'
listen_backlog_multiplier = 1
serialize_accept = on
pcp_listen_addresses = ''  # Disable PCP over TCP
pcp_socket_dir = '/tmp'
enable_pool_hba = on
# Note: this is a file name, not a password
pool_passwd = 'pool_passwd'
allow_clear_text_frontend_auth = off
# - Concurrent session and pool size -
num_init_children = 1200
max_pool = 1
# - Life time -
serialize_accept = on
child_life_time = 0
child_max_connections = 1
connection_life_time = 3900
client_idle_limit = 720
log_destination = 'syslog'
syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'
pid_file_name = '/var/run/postgresql/pgpool.pid'
logdir = '/var/log/postgresql'
connection_cache = off
load_balancing = off
master_slave_mode = on
master_slave_sub_mode = 'stream'
backend_hostname0 = 'postgres-0'
backend_port0 = 5432
backend_weight0 = 0
backend_data_directory0 = '/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = 'postgres-1'
backend_port1 = 5432
backend_weight1 = 0
backend_data_directory1 = '/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_hostname2 = 'postgres-2'
backend_port2 = 5432
backend_weight2 = 0
backend_data_directory2 = '/data'
backend_flag2 = 'ALLOW_TO_FAILOVER'
```

**Tags:** No tags attached.
|
|
The issue seems to be caused by the infinite search for a new primary node. When I detach the primary, pgpool starts looking for a new primary:

```
2019-12-16 13:06:44: pid 16222: LOG: find_primary_node: standby node is 1
2019-12-16 13:06:44: pid 16222: LOG: find_primary_node: standby node is 2
2019-12-16 13:06:45: pid 16222: LOG: find_primary_node: standby node is 1
2019-12-16 13:06:45: pid 16222: LOG: find_primary_node: standby node is 2
... (logs repeat)
```

I've tried limiting search_primary_node_timeout, and re-attaching the existing primary after pgpool has given up on finding a new primary then correctly attaches it in an 'up' state.
|
|
> The issue seems to be caused by the infinite search for a new primary node - when I detach the primary pgpool starts looking for a new primary

You're right. In this case, pgpool searches for a new primary forever, because it only looks for the new primary among the active standby nodes.

> I've tried limiting the search_primary_node_timeout and re-attaching the existing primary after pgpool has given up on finding a new primary then correctly attaches it in an up state

Yes, your handling is fine.
|
|
True, but I don't want to limit the primary search interval, in case a real failover happens; pgpool would then have the wrong notion of which node is the primary. Alternatively, I need a way to trigger the primary search on some interval, so that a failover/switchover can be detected without any nodes going down.
|
|
pgpool can execute only one failover process at a time (failover, failback, attach node, and detach node are all handled internally as failover processes). The search for a new primary is part of the failover process, so the next failover process (pcp_attach_node) is not executed until a new primary has been found.

Currently, if pgpool detects that the primary node is down, it runs on the premise that another standby node will be promoted to the new primary. This behavior exists because pgpool is not designed to be used together with an external failover system in streaming replication mode.
|
|
I agree, but I think this is a valid use case for an external failover agent. In my setup I use multiple pgpool instances for different clients, to provide high availability and load balancing over pgpool itself. The pgpool instances don't know about each other and don't use the watchdog/virtual-IP mechanism; this way, multiple pgpool instances running on different machines can be used concurrently. If one of the pgpool instances temporarily loses connectivity to the PostgreSQL primary, that doesn't mean it should elect a new primary, only that it lost connectivity. Then, after a while, say the primary comes back (I currently re-attach it manually via a cron job, but I see this is now available as a feature in pgpool 4.1.0), and I would expect pgpool to simply start proxying traffic to it again, all without triggering failover on the actual PostgreSQL node. This way the behavior of my setup is decoupled, and I can modify different parts without changing the rest.
|
|
I understand your thinking. However, I think the watchdog feature would also satisfy your requirements. If you use the watchdog, it is not a problem for one pgpool node to lose the primary temporarily, and multiple pgpool instances running on different machines can still be used concurrently (without a VIP).

In general, it is a very serious incident when the primary node is detected as down; pgpool cannot continue until this problem is resolved.

We are currently starting proposals and development for the next pgpool version. If you need a new feature, please suggest it on the mailing list.
|
|
And if the node is only temporarily down, you can resolve this by making the failover conditions more lenient (for example, by increasing health_check_max_retries or health_check_timeout).
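For example, relative to the configuration in this report (health_check_max_retries = 0, health_check_timeout = 5, health_check_period = 1), a more tolerant pgpool.conf fragment might look like the following; the values are purely illustrative:

```
health_check_period = 10
health_check_timeout = 10        # allow slow responses from a loaded node
health_check_max_retries = 3     # retry before declaring the node down
health_check_retry_delay = 5     # seconds between retries
```

With these settings, a node is only detached after roughly 3 failed retries, rather than on the first failed check.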
|
|
I agree, but we are rather committed to our current architecture, so switching to the watchdog would need to be implemented, properly tested, and so on; I have to find a way around this for now without an architecture change.

Currently I have patched the health check processes to skip the check if `BACKEND_INFO(*node_id).role != ROLE_PRIMARY`. This change appears to work fine for our case:

- if the primary fails, failover_on_backend_error immediately detects this (as the primary is constantly in use) and pgpool starts to look for a new primary;
- if a standby fails, the health check will disconnect it from the pool.

Let me know what you think of this; I can send a patch where this is behind an option.

Regards, Stanislav
|
|
My bad, I meant: "to do the health check only if BACKEND_INFO(*node_id).role != ROLE_PRIMARY".
|
|
I think that's fine, as long as the external failover agent reliably promotes a new primary from the active standby nodes when failover is triggered by failover_on_backend_error in pgpool. In other words, it is a problem if the primary isn't switched when failover_on_backend_error fires.
|
|
In addition, pgpool triggers failover at times other than the health check and failover_on_backend_error: pgpool also performs failover when PostgreSQL is shut down while a PostgreSQL session is still connected. Please be careful about this case.
|
|
The shutdown case, yes, this is good. I wonder about the `BACKEND_INFO(*node_id).role != ROLE_PRIMARY` check: after a few hours of testing I got a failure there, and pgpool ran a health check for the primary. I have now changed my patch to:

```c
/*
 * Skip healthcheck for primary nodes
 */
if ((*node_id) == REAL_PRIMARY_NODE_ID)
{
	sleep(30);
	continue;
}
```

in the main health check loop. Is this the proper way to get the current primary node id?

Regards
|
|
This could be a valid feature, applicable to other setups I guess: different, dynamic health check parameters depending on the node's role. That way, replicas that aren't loaded can fail over faster, while primaries that are heavily loaded can have relaxed health check settings. I know I can configure this per node, but it's not dynamic.
|
|
> in the main health check loop. Is this the proper way to get the current primary node id ?

Yes, you're right. REAL_PRIMARY_NODE_ID is better than ROLE_PRIMARY, because BACKEND_INFO.role changes status momentarily during failover().
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2019-12-16 17:21 | spaskalev | New Issue | |
| 2019-12-16 22:19 | spaskalev | Note Added: 0003023 | |
| 2019-12-24 15:04 | hoshiai | Assigned To | => hoshiai |
| 2019-12-24 15:04 | hoshiai | Status | new => assigned |
| 2020-01-06 09:34 | hoshiai | Status | assigned => feedback |
| 2020-01-06 09:34 | hoshiai | Note Added: 0003038 | |
| 2020-01-06 20:39 | spaskalev | Note Added: 0003039 | |
| 2020-01-06 20:39 | spaskalev | Status | feedback => assigned |
| 2020-01-08 10:52 | hoshiai | Status | assigned => feedback |
| 2020-01-08 10:52 | hoshiai | Note Added: 0003046 | |
| 2020-01-08 20:18 | spaskalev | Note Added: 0003052 | |
| 2020-01-08 20:18 | spaskalev | Status | feedback => assigned |
| 2020-01-09 11:24 | hoshiai | Status | assigned => feedback |
| 2020-01-09 11:24 | hoshiai | Note Added: 0003054 | |
| 2020-01-09 11:32 | hoshiai | Note Added: 0003055 | |
| 2020-03-22 16:00 | spaskalev | Note Added: 0003275 | |
| 2020-03-22 16:00 | spaskalev | Status | feedback => assigned |
| 2020-03-22 16:01 | spaskalev | Note Added: 0003276 | |
| 2020-03-23 16:37 | hoshiai | Note Added: 0003277 | |
| 2020-03-23 16:49 | hoshiai | Status | assigned => feedback |
| 2020-03-23 16:49 | hoshiai | Note Added: 0003278 | |
| 2020-03-23 17:44 | spaskalev | Note Added: 0003279 | |
| 2020-03-23 17:44 | spaskalev | Status | feedback => assigned |
| 2020-03-24 15:24 | spaskalev | Note Added: 0003280 | |
| 2020-03-24 17:21 | hoshiai | Status | assigned => feedback |
| 2020-03-24 17:21 | hoshiai | Note Added: 0003281 |