[pgpool-general: 7534] Re: Strange behavior on switchover with detach_false_primary enabled

Tatsuo Ishii ishii at sraoss.co.jp
Fri Apr 30 15:36:31 JST 2021


> Yeah, we need to protect follow_child from detach_false_primary
> (actually executed in separate process: pgpool_worker). For this
> purpose I think we could use another shared memory variable
> (Req_info->switching). The variable is set to true while failover
> procedure is running. detach_false_primary will not be executed if
> Req_info->switching is true. I will implement this in the next patch
> set.

Attached is a v2 patch for this. I confirmed that in a three-PostgreSQL-node
system (no watchdog), detach_false_primary works in the following
scenario:

$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port  | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | /tmp     | 11002 | up     | up        | 0.333333  | primary | primary | 0          | false             | 0                 |                   |                        | 2021-04-30 15:04:44
 1       | /tmp     | 11003 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 | streaming         | async                  | 2021-04-30 15:04:44
 2       | /tmp     | 11004 | up     | up        | 0.333333  | standby | standby | 0          | true              | 0                 | streaming         | async                  | 2021-04-30 15:04:44
(3 rows)

$ pcp_detach_node -p 11001 -w 0

$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port  | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | /tmp     | 11002 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 | streaming         | async                  | 2021-04-30 15:05:54
 1       | /tmp     | 11003 | up     | up        | 0.333333  | primary | primary | 0          | false             | 0                 |                   |                        | 2021-04-30 15:05:31
 2       | /tmp     | 11004 | up     | up        | 0.333333  | standby | standby | 0          | true              | 0                 | streaming         | async                  | 2021-04-30 15:05:54
(3 rows)

$ pcp_detach_node -p 11001 -w 1
pcp_detach_node -- Command Successful

$ psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port  | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | /tmp     | 11002 | up     | up        | 0.333333  | primary | primary | 0          | false             | 0                 |                   |                        | 2021-04-30 15:06:47
 1       | /tmp     | 11003 | up     | up        | 0.333333  | standby | standby | 0          | true              | 0                 | streaming         | async                  | 2021-04-30 15:07:53
 2       | /tmp     | 11004 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 | streaming         | async                  | 2021-04-30 15:07:53
(3 rows)

>> Looking at logs again, I'm starting to think the original problem may
>> be more complicated. In the attached logging I do not see the node
>> being marked invalid. Instead, I see this:
[snip]
>> This makes me think the sequence of events involves other pgpool nodes:
>> * The instruction to detach primary node 0 is performed on node 0 and
>> forwarded to node 1.
>> * pgpool node 1 starts the failover, promoting backend node 1
>> * pgpool node 2 wrongfully detects a false primary and requests to
>> detach backend node 1
>> * pgpool node 1 accepts this request and starts to detach node 1,
>> while it is in the middle of instructing node 0 and 2 to follow this
>> node as primary
> 
> Thanks for the info. I will look into this.

I have started to think that detach_false_primary should not be active
on any node other than the leader watchdog node, because a standby
watchdog node can be interrupted by another watchdog node, causing an
unexpected failover. I will investigate further.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: detach_false_primary_v2.diff
Type: text/x-patch
Size: 9005 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20210430/78a305da/attachment-0001.bin>
