View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000729 | Pgpool-II | Bug | public | 2021-08-12 16:30 | 2022-01-20 17:36 |
| Reporter | eldad | Assigned To | pengbo | | |
| Priority | high | Severity | major | Reproducibility | have not tried |
| Status | assigned | Resolution | open | | |
| Product Version | 4.1.2 | | | | |
**Summary:** 0000729: Failover was not triggered

**Description:**

Hi,

We had a crash on the master host and it went down. lrefraprdpdb04 was the master and lrefraprdpdb03 was the slave. pgpool failover had worked well before, but this time it didn't trigger PostgreSQL failover after 3 failed health checks. Each host runs PostgreSQL 12 and pgpool 4.1.2 with HA.

This is the log from the slave once failover started:

```
Aug 11 06:00:08 lrefraprdpdb03 pgpool[29116]: [24-1] 2021-08-11 06:00:08: pid 29116: LOG: remote node "lrefraprdpdb04:5400 Linux lrefraprdpdb04.infra.mms" is shutting down
Aug 11 06:00:08 lrefraprdpdb03 pgpool[29116]: [25-1] 2021-08-11 06:00:08: pid 29116: LOG: watchdog cluster has lost the coordinator node
Aug 11 06:00:08 lrefraprdpdb03 pgpool[29116]: [26-1] 2021-08-11 06:00:08: pid 29116: LOG: removing the remote node "lrefraprdpdb04:5400 Linux lrefraprdpdb04.infra.mms" from watchdog cluster master
Aug 11 06:00:08 lrefraprdpdb03 pgpool[29116]: [27-1] 2021-08-11 06:00:08: pid 29116: LOG: We have lost the cluster master node "lrefraprdpdb04:5400 Linux lrefraprdpdb04.infra.mms"
Aug 11 06:00:08 lrefraprdpdb03 pgpool[29116]: [28-1] 2021-08-11 06:00:08: pid 29116: LOG: watchdog node state changed from [STANDBY] to [JOINING]
Aug 11 06:00:12 lrefraprdpdb03 pgpool[29116]: [29-1] 2021-08-11 06:00:12: pid 29116: LOG: watchdog node state changed from [JOINING] to [INITIALIZING]
Aug 11 06:00:13 lrefraprdpdb03 pgpool[29116]: [30-1] 2021-08-11 06:00:13: pid 29116: LOG: I am the only alive node in the watchdog cluster
Aug 11 06:00:13 lrefraprdpdb03 pgpool[29116]: [30-2] 2021-08-11 06:00:13: pid 29116: HINT: skipping stand for coordinator state
Aug 11 06:00:13 lrefraprdpdb03 pgpool[29116]: [31-1] 2021-08-11 06:00:13: pid 29116: LOG: watchdog node state changed from [INITIALIZING] to [MASTER]
Aug 11 06:00:13 lrefraprdpdb03 pgpool[29116]: [32-1] 2021-08-11 06:00:13: pid 29116: LOG: I am announcing my self as master/coordinator watchdog node
Aug 11 06:00:17 lrefraprdpdb03 pgpool[29116]: [33-1] 2021-08-11 06:00:17: pid 29116: LOG: I am the cluster leader node
Aug 11 06:00:17 lrefraprdpdb03 pgpool[29116]: [33-2] 2021-08-11 06:00:17: pid 29116: DETAIL: our declare coordinator message is accepted by all nodes
Aug 11 06:00:17 lrefraprdpdb03 pgpool[29116]: [34-1] 2021-08-11 06:00:17: pid 29116: LOG: setting the local node "lrefraprdpdb03:5400 Linux lrefraprdpdb03.infra.mms" as watchdog cluster master
Aug 11 06:00:17 lrefraprdpdb03 pgpool[29116]: [35-1] 2021-08-11 06:00:17: pid 29116: LOG: I am the cluster leader node. Starting escalation process
Aug 11 06:00:17 lrefraprdpdb03 pgpool[29116]: [36-1] 2021-08-11 06:00:17: pid 29116: LOG: escalation process started with PID:811
Aug 11 06:00:17 lrefraprdpdb03 pgpool[29116]: [37-1] 2021-08-11 06:00:17: pid 29116: LOG: watchdog escalation process with pid: 811 exit with SUCCESS.
```

PostgreSQL failover should be triggered by pgpool after 3 failed health checks, but it didn't trigger the failover due to this error, which I hadn't seen before:

```
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [64-1] 2021-08-11 06:04:10: pid 29438: ERROR: failed to make persistent db connection
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [64-2] 2021-08-11 06:04:10: pid 29438: DETAIL: connection to host:"lrefraprdpdb04:5432" failed
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [65-1] 2021-08-11 06:04:10: pid 29438: LOG: health check failed on node 1 (timeout:1)
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [66-1] 2021-08-11 06:04:10: pid 29438: LOG: received degenerate backend request for node_id: 1 from pid [29438]
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [67-1] 2021-08-11 06:04:10: pid 29438: LOG: failed to connect to watchdog command server socket
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [67-2] 2021-08-11 06:04:10: pid 29438: DETAIL: connect on "/tmp/.s.PGPOOLWD_CMD.9000" failed with reason: "No such file or directory"
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [68-1] 2021-08-11 06:04:10: pid 29438: WARNING: failover command on watchdog failed
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [68-2] 2021-08-11 06:04:10: pid 29438: DETAIL: issue command to watchdog returned NULL
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [69-1] 2021-08-11 06:04:10: pid 29438: LOG: degenerate backend request for 1 node(s) from pid [29438] is canceled by other pgpool
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29336]: [43-1] 2021-08-11 06:04:10: pid 29336: LOG: trying connecting to PostgreSQL server on "lrefraprdpdb04:5432" by INET socket
```

I still don't see "/tmp/.s.PGPOOLWD_CMD.9000" on 03, but it does exist on 04. failover_on_backend_error is off, as we rely on the health check to trigger the failover.

Regards,
Eldad

**Steps To Reproduce:** see description

**Additional Information:** CentOS Linux release 7.8

**Tags:** failover
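For reference, the health-check-driven failover behaviour described in this report is controlled by a small set of pgpool-II parameters. The sketch below uses real pgpool-II 4.1 parameter names, but the values and the failover script path are illustrative assumptions, not the reporter's actual configuration:

```ini
# pgpool.conf sketch (values are assumptions, not the reporter's config)
health_check_period = 10          # seconds between health checks
health_check_timeout = 1          # matches "timeout:1" in the log
health_check_max_retries = 3      # degenerate the node after 3 failed checks
health_check_retry_delay = 1
failover_on_backend_error = off   # rely on health checks, as the reporter does
failover_command = '/etc/pgpool-II/failover.sh %d %h %P %m %H %R'  # hypothetical script path
```

With these settings, three consecutive health-check failures should issue a degenerate backend request and run failover_command; in this report that request is issued but then fails at the watchdog socket.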
---

**Note 0003928 (pengbo, 2021-08-17)**

> We had a crash on the master host and it went down.
> lrefraprdpdb04 is master and lrefraprdpdb03 was slave.

Is pgpool running on lrefraprdpdb04 and lrefraprdpdb03, or on other servers? "We had a crash on the master host and it went down" means that both PostgreSQL and pgpool went down. If you are using two pgpool nodes, you need to enable "enable_consensus_with_half_votes" to trigger failover when one PostgreSQL and pgpool pair goes down.

https://www.pgpool.net/docs/41/en/html/runtime-watchdog-config.html#CONFIG-WATCHDOG-FAILOVER-BEHAVIOR

> I still don't see "/tmp/.s.PGPOOLWD_CMD.9000" on 03 but it does exist on 04.

Before you stopped PostgreSQL, was the pgpool watchdog running? You can use the pcp_watchdog_info command to check the state of the watchdog:

https://www.pgpool.net/docs/41/ja/html/pcp-watchdog-info.html
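In a two-pgpool-node cluster, losing one node leaves exactly half the votes, so quorum-based failover stalls unless half votes are allowed to count as consensus. A minimal sketch of the watchdog settings involved (real pgpool-II 4.1 parameter names; the combination shown illustrates the suggestion above and is not a verified configuration):

```ini
# pgpool.conf watchdog sketch for a 2-node cluster (illustrative)
use_watchdog = on
failover_when_quorum_exists = on        # act only when the cluster holds quorum
failover_require_consensus = on         # require majority agreement on node failure
enable_consensus_with_half_votes = on   # let 1 of 2 votes count as quorum/consensus
```

Without enable_consensus_with_half_votes, a 2-node watchdog cluster that loses one member can never reach the majority needed to approve a failover.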
---

**Note 0003933 (eldad, 2021-08-22)**

> Are pgpool running on lrefraprdpdb04 and lrefraprdpdb03 or other servers?

Both pgpool and PostgreSQL are running on lrefraprdpdb04 and lrefraprdpdb03. enable_consensus_with_half_votes is enabled, and failover was tested successfully before.

> Before you stop PostgreSQL, is pgpool watchdog running?

I didn't stop PostgreSQL; the host itself stopped unexpectedly.

pcp_watchdog_info on 04:

```
2 NO lrefraprdpdb03:5400 Linux lrefraprdpdb03.infra.mms lrefraprdpdb03
lrefraprdpdb04:5400 Linux lrefraprdpdb04.infra.mms lrefraprdpdb04 5400 9000 7 STANDBY
lrefraprdpdb03:5400 Linux lrefraprdpdb03.infra.mms lrefraprdpdb03 5400 9000 4 MASTER
```

on 03:

```
ERROR: PCP: informing watchdog info failed
DETAIL: invalid watchdog index
```

but I can see the watchdog process exists on 03:

```
postgres 29116 29114 0 Jan27 ? 00:08:01 pgpool: watchdog
```
---

**Note 0003934 (eldad, 2021-08-22)**

on 03:

```
-bash-4.2$ pcp_proc_count -h lrefraprdpdb03 -U *****
Password:
5200 8512 8497 8876 9446 7322 8038 8687 8574 8083 7380 793 888 28214 8581 7257 8389 7592 8016 5351 9013 8306 8663 6298 8610 8511 7319 27657 4509 8019 24064 8667 14850 8580 7224 9387 8766 8577 7991 7420 4643 7312 7419 3492 8036 9445 7112 7425 8760 8787 3382 8735 6739 8745 8544 7928 7028 28884 30381 5560 8670 5962 8268 8758 2073 7421 9206 9280 9036 8374 6064 8743 8474 4228 8203 5876 28647 9351 9323 9440 8333 7564 19666 5461 7774 3569 7395 893 7304 2973 32230 6773 1864 7410 30761 2187 7414 1964 9320 8010 7091 21211 8683 7223 9035 15369 8871 9278 9005 1772 8679 8369 8539 9437 8277 28111 8197 28654 4444 27126 8688 5184 5468 8779 5526 6248 6767 8482 4644 7242 5454 8502 8082 8394 7582 8017 8757 9030 4054 7308 7422 940 7211 8481 4066 3004 27630 7008 22610 5874 28203 8298 7475 8283 9384 7241 28762 8327 7476 8873 8317 1449 6120 8328 8393 4133 6834 8491 8127 8305 7044 8029 7479 8121 8382 8595 30399 7034 32375 8324 4708 8031 8780 7571 7195 8308 7498 7558 9322 7758 8013 1640 8128 5875 8307 7199 8660 5967 5693 8489 9277 9294 8104 9034 8387 9266 9026 27852 1850 7327 8392 7761 7016 8904 29178 9006 7296 3550 7426 7595 8706 8657 1594 8646 8884 2247 2786 19303 5286 7015 9315 8015 844 5678 28205 4449 7236 29096 21518 7718 8509 8560 8484 8302 4428 7027 29119 8206 8686 4783 1865 9045 28745 22488 6297 8883 5657 7423 8510 8513 7035 9037 8062 8768 2748 9032 8659 30946 7392 8114 24931 7723 26687 4636 4610 29111 8907 32211 9033 22968 9355 9044 8060 32756 7590 28768 954 8309 8367 3232 8059 8664 6778 8323 7030 29895 786 8508 7239 7289
-bash-4.2$ ls -la /tmp/.s*
srwxrwxrwx 1 postgres postgres  0 Aug 22 08:08 /tmp/.s.PGSQL.5432
-rw------- 1 postgres postgres 40 Aug 22 08:08 /tmp/.s.PGSQL.5432.lock
```

on 04:

```
-bash-4.2$ ls -la /tmp/.s*
srwxr-xr-x 1 postgres postgres  0 Aug 11 07:03 /tmp/.s.PGPOOLWD_CMD.9000
srwxrwxrwx 1 postgres postgres  0 Aug 11 07:03 /tmp/.s.PGSQL.5400
srwxrwxrwx 1 postgres postgres  0 Aug 22 07:50 /tmp/.s.PGSQL.5432
-rw------- 1 postgres postgres 40 Aug 22 07:50 /tmp/.s.PGSQL.5432.lock
srwxrwxrwx 1 postgres postgres  0 Aug 11 07:03 /tmp/.s.PGSQL.9898
-bash-4.2$ pcp_proc_count -h lrefraprdpdb04 -U *****
Password:
3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 4142 3763 12578 3765 3766 3767 17672 3769 12580 3771 3772 3773 3774 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 5732
```
---

**Note 0003936 (pengbo, 2021-09-01)**

Sorry for the late response.

> > Before you stop PostgreSQL, is pgpool watchdog running?
>
> I didn't stop PG, the host itself stopped unexpectedly

Did PostgreSQL and pgpool terminate unexpectedly? Could you share the pgpool log on lrefraprdpdb03?
---

**Note 0003943 (eldad, 2021-10-18)**

Hi,

> Did PostgreSQL and pgpool terminate unexpectedly?

Yes, the primary machine crashed. I shared the relevant part of the log in the ticket description; please review it.

Thanks,
Eldad
---

**Note 0003946 (eldad, 2021-10-25)**

Hi,

Any update? Please note that I can connect to pgpool on lreorgprdpdb04, but I still can't see the hidden socket files in /tmp/:

```
psql -p 5400 -h lreorgprdpdb04 -c 'show pool_nodes'
 node_id |    hostname    | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
       0 | lreorgprdpdb03 | 5432 | up     | 1.000000  | primary | 88356      | true              | 0                 |                   |                        | 2021-01-18 22:29:19
       1 | lreorgprdpdb04 | 5432 | up     | 0.000000  | standby | 0          | false             | 0                 |                   |                        | 2021-01-18 22:29:19
```
---

**Note 0003949 (pengbo, 2021-10-27)**

Sorry for the late response. Let me confirm your scenario. You are using two servers:

- server lrefraprdpdb03: PostgreSQL primary, Pgpool-II
- server lrefraprdpdb04: PostgreSQL standby, Pgpool-II

You shut down the server "lrefraprdpdb03". PostgreSQL on server lrefraprdpdb04 should be promoted, but Pgpool-II on lrefraprdpdb04 didn't trigger the failover.

Are you setting the following parameters?

```
failover_when_quorum_exists = on
failover_require_consensus = on
```

If you could share the full pgpool logs, it may help us identify the cause.
---

**Note 0003953 (eldad, 2021-11-08)**

Hi,

The scenario is not correct: 04 was master and 03 was slave. 04 was shut down, and failover to 03 failed due to this error:

```
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [66-1] 2021-08-11 06:04:10: pid 29438: LOG: received degenerate backend request for node_id: 1 from pid [29438]
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [67-1] 2021-08-11 06:04:10: pid 29438: LOG: failed to connect to watchdog command server socket
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [67-2] 2021-08-11 06:04:10: pid 29438: DETAIL: connect on "/tmp/.s.PGPOOLWD_CMD.9000" failed with reason: "No such file or directory"
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [68-1] 2021-08-11 06:04:10: pid 29438: WARNING: failover command on watchdog failed
Aug 11 06:04:10 lrefraprdpdb03 pgpool[29438]: [68-2] 2021-08-11 06:04:10: pid 29438: DETAIL: issue command to watchdog returned NULL
```

Our settings:

```
failover_when_quorum_exists = off
failover_require_consensus = off
enable_consensus_with_half_votes = on
pcp_port = 9898
pcp_socket_dir = '/tmp'
```

Currently 03 is master and 04 is slave (we did a manual failover). As I wrote, failover was tested and worked before on this cluster.

The problem is with the watchdog status on lrefraprdpdb03. pgpool is up and running, but when I run pcp_watchdog_info I get an error:

```
-bash-4.2$ pcp_watchdog_info -U pguser
Password:
ERROR: connection to socket "/tmp/.s.PGSQL.9898" failed with error "No such file or directory"
-bash-4.2$ ls -la /tmp/.s*
srwxrwxrwx 1 postgres postgres  0 Nov  8 14:30 /tmp/.s.PGSQL.5432
-rw------- 1 postgres postgres 40 Nov  8 14:30 /tmp/.s.PGSQL.5432.lock
```

As you can see, .s.PGSQL.9898 doesn't exist on this server, but I do see it on 04. The question is why the files are not there. I assume that if I restart pgpool it will be created. Is it something I need to monitor?

I don't have the old pgpool log from the time of the failed failover; the relevant info is in the ticket.

Regards,
Eldad
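On the reporter's question "Is it something I need to monitor?": the disappearance of the watchdog and PCP sockets could be caught by a simple periodic check. Below is a hypothetical monitoring sketch; the script name is invented, and the socket names assume pcp_socket_dir = '/tmp', wd_port 9000, and pcp_port 9898, as in this report.

```shell
#!/bin/sh
# check_pgpool_sockets.sh (hypothetical): report whether the pgpool
# watchdog command socket and PCP socket exist in the given directory.
# Socket names assume wd_port = 9000 and pcp_port = 9898 (this report's
# ports); adjust for your configuration.
check_sockets() {
    dir="${1:-/tmp}"
    missing=0
    for sock in ".s.PGPOOLWD_CMD.9000" ".s.PGSQL.9898"; do
        if [ -e "$dir/$sock" ]; then
            echo "OK: $dir/$sock present"
        else
            echo "CRITICAL: $dir/$sock missing"
            missing=$((missing + 1))
        fi
    done
    # Exit status = number of missing sockets, so callers can alert on != 0.
    return "$missing"
}
```

Run it from cron or a monitoring agent, e.g. `check_sockets /tmp`, and alert on a nonzero exit status; this would have flagged the missing `/tmp/.s.PGPOOLWD_CMD.9000` before the failed failover.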
---

**Note 0003985 (eldad, 2022-01-20)**

Hi,

The same issue happened again on a different cluster, with the same problem: the pgpool watchdog socket files didn't exist.

```
DETAIL: connect on "/tmp/.s.PGPOOLWD_CMD.9000" failed with reason: "No such file or directory"
```

I'm attaching the logs from both servers for you to check.

Regards,
Eldad

---
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2021-08-12 16:30 | eldad | New Issue | |
| 2021-08-12 16:30 | eldad | Tag Attached: failover | |
| 2021-08-13 09:18 | pengbo | Assigned To | => pengbo |
| 2021-08-13 09:18 | pengbo | Status | new => assigned |
| 2021-08-17 21:17 | pengbo | Note Added: 0003928 | |
| 2021-08-17 21:18 | pengbo | Status | assigned => feedback |
| 2021-08-22 15:17 | eldad | Note Added: 0003933 | |
| 2021-08-22 15:17 | eldad | Status | feedback => assigned |
| 2021-08-22 15:31 | eldad | Note Added: 0003934 | |
| 2021-09-01 14:39 | pengbo | Note Added: 0003936 | |
| 2021-09-01 14:40 | pengbo | Status | assigned => feedback |
| 2021-10-18 21:50 | eldad | Note Added: 0003943 | |
| 2021-10-18 21:50 | eldad | Status | feedback => assigned |
| 2021-10-25 17:21 | eldad | Note Added: 0003946 | |
| 2021-10-27 10:21 | pengbo | Note Added: 0003949 | |
| 2021-10-27 10:21 | pengbo | Status | assigned => feedback |
| 2021-11-08 23:22 | eldad | Note Added: 0003953 | |
| 2021-11-08 23:22 | eldad | Status | feedback => assigned |
| 2022-01-20 17:36 | eldad | Note Added: 0003985 | |
| 2022-01-20 17:36 | eldad | File Added: pgpool-slave.log.gz | |
| 2022-01-20 17:36 | eldad | File Added: pgpool-master.log.gz |