[pgpool-general: 6986] Re: Inconsistency between DB nodes with native replication

Tatsuo Ishii ishii at sraoss.co.jp
Thu Apr 16 23:52:07 JST 2020


Hi Vladimír,

Since your application uses FOR UPDATE, the cursor is "sensitive",
which means it might fetch a new row just committed by another
session. If so, Pgpool-II can detect kind mismatch errors, because the
number of rows visible in the table might differ between servers until
the other session commits its transaction.

I don't know why ORDER BY would improve the situation. My wild guess
is that the newly inserted row lands in an earlier position in the
table, and the cursor has already passed that position.
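
For example, the cursor declaration could pin the row order to the
primary key so that both nodes return identical FETCH results (the
table and column names below are placeholders, not your actual schema):

```sql
-- Without ORDER BY each node may return rows in a different physical
-- order, so "FETCH 100 FROM c1" can yield different packets per node.
DECLARE c1 CURSOR FOR
    SELECT * FROM queue_table
    WHERE state = 'DONE'            -- hypothetical "fully processed" filter
    ORDER BY primary_key_column     -- deterministic order on every node
    FOR UPDATE;
```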

From: Láznička Vladimír <Vladimir.Laznicka at cca.cz>
Subject: RE: [pgpool-general: 6896] Re: Inconsistency between DB nodes with native replication
Date: Thu, 16 Apr 2020 06:53:49 +0000
Message-ID: <2d375392b07441f7af445fa3b2d40cdb at mailer-2.cca.cz>

> Hi Tatsuo,
> 
> It is technically possible that some session might modify the table (INSERTs, UPDATEs...), but it should not modify the record set which is being FETCHed (unless there is an oversight in the application). The table has an attribute which contains the "state" of the record; once the record is fully processed by the application (i.e. it is in the last state), it can be FETCHed by the cursor in the procedure (the SELECT in the cursor has a specific WHERE clause for this), but from then on it also should not be touched by any other session.
> 
> Vladimír Láznička
> 
> -----Original Message-----
> From: Tatsuo Ishii <ishii at sraoss.co.jp> 
> Sent: Thursday, April 16, 2020 4:56 AM
> To: Láznička Vladimír <Vladimir.Laznicka at cca.cz>
> Cc: pengbo at sraoss.co.jp; pgpool-general at pgpool.net
> Subject: Re: [pgpool-general: 6896] Re: Inconsistency between DB nodes with native replication
> 
> Hi Vladimír,
> 
> Is it possible that any other session (besides the FETCHing session) modifies the table while it is being FETCHed?
> 
>> Hello,
>> 
>> Sorry for my late reply...
>> 
>> I have checked whether the number of records in the table differed between the MASTER and SLAVE instances when this error occurred, but it was the same; even the entire content of the tables was identical (byte for byte). However, I noticed that when I try the SELECT on each node separately, they sometimes return the records in a different order, since I forgot to include an ORDER BY clause. After I included it (ORDER BY primary_key_column), the problem seemed to be solved. Is it possible that the CURSOR requires the record set to be returned in the same order from both nodes, and otherwise fails as if the number of records was different?
>> 
>> After this change it seems that pgpool doesn't find any inconsistencies in the result of the SELECT and therefore no failover occurs, so I guess it's probably solved.
>> 
>> Vladimír Láznička
>> 
>> -----Original Message-----
>> From: Tatsuo Ishii <ishii at sraoss.co.jp>
>> Sent: Sunday, April 5, 2020 7:50 AM
>> To: Láznička Vladimír <Vladimir.Laznicka at cca.cz>
>> Cc: pengbo at sraoss.co.jp; pgpool-general at pgpool.net
>> Subject: Re: [pgpool-general: 6896] Re: Inconsistency between DB nodes 
>> with native replication
>> 
>> Hi Vladimir,
>> 
>>> Hello,
>>> 
>>> Hope you are all doing fine. During the last few weeks we have not encountered any occurrence of the pg_terminate_backend() command during or before failover situations. There has been another problem, however, which might not be related to this particular issue but is caused by a form of inconsistency between the two DB nodes.
>>> 
>>> I will try to explain the situation:
>>> 
>>> In the database, which runs on the two replicated nodes, there is a table containing records which are supposed to be moved to another database (a single instance) every night and then deleted, depending on the values in specific columns.
>>> 
>>> This is handled via a FOREIGN DATA WRAPPER on the single instance database. On this database there is a FOREIGN TABLE that allows access (through pgpool) to the aforementioned table on the replicated database, via a specific user with the needed privileges.
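
For illustration, such a postgres_fdw setup typically looks like the
following (the server, user, and table names here are invented
placeholders, not the actual objects):

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- The foreign server points at pgpool, not at an individual backend.
CREATE SERVER pgpool_srv FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'pgpool.example.com', port '9999', dbname 'appdb');

CREATE USER MAPPING FOR CURRENT_USER SERVER pgpool_srv
    OPTIONS (user 'aisg_synch', password 'secret');

-- Local alias for the table on the replicated database.
CREATE FOREIGN TABLE queue_table_ft (
    id    bigint,
    state text,
    data  text
) SERVER pgpool_srv
  OPTIONS (schema_name 'public', table_name 'queue_table');
```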
>>> 
>>> The process of moving the records is handled by a procedure called from cron on the single instance database server. The procedure steps are as follows:
>> 
>> Let me confirm your situation.
>> 
>> Host A: which pgpool is running on.
>> Host B: there's a master postgresql managed by the pgpool running on A.
>> Host C: there's a slave postgresql managed by the pgpool running on A.
>> Host D: there's a single postgresql instance on which there's foreign table (let it be "fr1") whose foreign server is A.
>> 
>> While doing the steps below, no other transaction touches fr1.
>> 
>>> 1) Opens a CURSOR with SELECT FOR UPDATE on the FOREIGN TABLE (the table on the replicated database).
>> 
>> DECLARE CURSOR is submitted by a frontend connected to D.
>> 
>>> 2) Iterates over the CURSOR results with a FOR LOOP and fetches the records in the RECORD variable.
>> 
>> FETCH is submitted by a frontend connected to D.
>> 
>>> 3) INSERT the values from the record into a table on the single instance database's table.
>> 
>> INSERT is submitted by a frontend connected to D.
>> 
>>> 4) DELETE the inserted record from the FOREIGN TABLE.
>> 
>> DELETE is submitted by a frontend connected to D.
>> 
>>> 5) Repeat until all records from the CURSOR have been processed.
>>> 6) CLOSE the cursor.
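
A minimal PL/pgSQL sketch of steps 1-6 (all table, column, and state
names are placeholders; error handling omitted):

```sql
CREATE OR REPLACE PROCEDURE move_processed_records()
LANGUAGE plpgsql AS $$
DECLARE
    r RECORD;
    c CURSOR FOR
        SELECT id, state, data
        FROM queue_table_ft             -- foreign table on the replicated DB
        WHERE state = 'DONE'
        FOR UPDATE;                     -- step 1
BEGIN
    FOR r IN c LOOP                     -- steps 2 and 5: FETCH until exhausted
        INSERT INTO archive_table (id, state, data)
        VALUES (r.id, r.state, r.data); -- step 3: copy into the local table
        DELETE FROM queue_table_ft
        WHERE id = r.id;                -- step 4 (WHERE CURRENT OF is not
                                        -- supported on foreign tables)
    END LOOP;
END;                                    -- step 6: the cursor closes on exit
$$;
```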
>> 
>> 
>>> Now - the problem with this procedure is that it sometimes causes a failover on the pgpool side with these messages:
>> 
>>> Apr  1 03:01:31 AISGPGP01 pgpool[6548]: [30480-2] 2020-04-01 03:01:31
>>> ::: APP - postgres_fdw :A: DB - aisgdvyv :D: USER - aisg_synch :U: 
>>> |DETAIL:  kind mismatch among backends. Possible last query was: 
>>> "FETCH 100 FROM c1" kind details are: 0[D] 1[C]
>> 
>> The error message suggests that the slave has fewer records in fr1 than the master: a "D" packet carries row data, while a "C" packet indicates the completion of the FETCH command. Is it possible that the number of rows on the slave was already smaller than on the master before the job started, or that some transaction which modifies the table is running?
>> 
>> [snip]
>> 
>>> This packet kind mismatch seems to happen relatively randomly, so I can't reliably test it, but it happens often enough to be a serious problem. From what I saw in the F.A.Q. section on your web pages, SELECT queries from CURSORs are always performed on both DB nodes in replication mode, because the SELECT might contain a FOR UPDATE clause (which is actually our case). From what the main programmer of the application told me, the FOR UPDATE clause is not necessary here, since the affected rows should not be touched by anything else when they are considered for the migration to the other database. 
>> 
>> Yes, FOR UPDATE is not necessary if there are no other transactions which touch fr1.
>> 
>>> So my question is - can I prevent this packet kind mismatch when I don't use the CURSOR and instead I use simple SELECT in the FOR LOOP, like this - https://stackoverflow.com/questions/32668201/postgresql-iterate-over-results-of-query? 
>> 
>> Yes, a simple SELECT without FOR UPDATE (or SHARE) runs on either the master or the slave. (Please make sure that the SELECT does not include functions which are regarded as "black functions" by Pgpool-II.)
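
Adapted to this case, the loop from that Stack Overflow link would look
roughly like this (table and column names are placeholders):

```sql
-- FOR ... IN SELECT iterates without an explicit DECLARE/FETCH, and
-- without FOR UPDATE the remote SELECT can be load-balanced to one node.
FOR r IN
    SELECT id, state, data FROM queue_table_ft
    WHERE state = 'DONE'
    ORDER BY id
LOOP
    INSERT INTO archive_table (id, state, data)
    VALUES (r.id, r.state, r.data);
    DELETE FROM queue_table_ft WHERE id = r.id;
END LOOP;
```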
>> 
>>> If not, I guess I should disable the replication_stop_on_mismatch parameter in the pgpool.conf file if I want to prevent the backend failover?
>> 
>> If replication_stop_on_mismatch is turned off, you can avoid the failover.
>> However, the session will be disconnected with an error
>> message: "failed to read kind from backend.. check data consistency among db nodes".
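
For reference, the setting in pgpool.conf and its trade-off:

```
# on  : a packet kind mismatch degenerates the disagreeing backend
#       (i.e. triggers failover), keeping later sessions consistent.
# off : no failover, but the current session is aborted with
#       "failed to read kind from backend.. check data consistency
#       among db nodes".
replication_stop_on_mismatch = off
```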
>> 
>> 
>>> Thank you for any advice on this issue and if you need any additional information let me know.
>>> 
>>> Vladimír Láznička
>>> 
>>> -----Original Message-----
>>> From: Tatsuo Ishii <ishii at sraoss.co.jp>
>>> Sent: Tuesday, March 17, 2020 3:15 AM
>>> To: Láznička Vladimír <Vladimir.Laznicka at cca.cz>
>>> Cc: pengbo at sraoss.co.jp; pgpool-general at pgpool.net
>>> Subject: Re: [pgpool-general: 6896] Re: Inconsistency between DB 
>>> nodes with native replication
>>> 
>>> Hi Vladimír,
>>> 
>>> Thank you for the feedback.
>>> 
>>>> However, I don't remember ever seeing the line:
>>>> 
>>>> DETAIL:  postmaster on DB node 1 was shutdown by administrative 
>>>> command
>>>> 
>>>> In the log, when my test script failed. I will try to watch the log 
>>>> specifically for this in the coming days and let you know what I 
>>>> find.
>>> 
>>> I'm looking forward to hearing from you.
>>> 
>>> In the meantime I have pushed a fix for this (see attached email): Pgpool-II will now do a failover if a shutdown event is detected. This should prevent the data inconsistency caused by this situation.
>>> 
>>>> Hello,
>>>> 
>>>> I tried to replicate the procedure and got pretty much the same results:
>>>> 
>>>> 1) Connect to the database via pgpool:
>>>> 
>>>> psql -h aisgpgpvip.cca.cz -p 9999 -d aisgdvyv -U aisg
>>>> 
>>>> 2) Run show pool_nodes command, result was as follows:
>>>> 
>>>> aisgdvyv=> show pool_nodes;
>>>>  node_id |     hostname     | port | status | lb_weight |  role  | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
>>>> ---------+------------------+------+--------+-----------+--------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
>>>>  0       | aisgdbd01.cca.cz | 5432 | up     | 0.500000  | master | 1562942    | true              | 0                 |                   |                        | 2020-03-11 14:19:06
>>>>  1       | aisgdbd02.cca.cz | 5432 | up     | 0.500000  | slave  | 567641     | false             | 0                 |                   |                        | 2020-03-16 12:19:08
>>>> 
>>>> 3) Create the test table t2:
>>>> 
>>>> aisgdvyv=> create table t2(i int);
>>>> CREATE TABLE
>>>> 
>>>> Mar 16 12:26:07 AISGPGP01 pgpool[21402]: [3103-1] 2020-03-16 
>>>> 12:26:07
>>>> ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg :U: |LOG:  DB node
>>>> id: 0 backend pid: 17021 statement: BEGIN Mar 16 12:26:07 AISGPGP01
>>>> pgpool[21402]: [3104-1] 2020-03-16 12:26:07 ::: APP - psql :A: DB - 
>>>> aisgdvyv :D: USER - aisg :U: |LOG:  DB node id: 1 backend pid: 19389
>>>> statement: BEGIN Mar 16 12:26:07 AISGPGP01 pgpool[21402]: [3105-1]
>>>> 2020-03-16 12:26:07 ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg
>>>> :U: |LOG:  DB node id: 0 backend pid: 17021 statement: create table 
>>>> t2(i int); Mar 16 12:26:07 AISGPGP01 pgpool[21402]: [3106-1]
>>>> 2020-03-16 12:26:07 ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg
>>>> :U: |LOG:  DB node id: 1 backend pid: 19389 statement: create table 
>>>> t2(i int); Mar 16 12:26:07 AISGPGP01 pgpool[21402]: [3107-1]
>>>> 2020-03-16 12:26:07 ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg
>>>> :U: |LOG:  DB node id: 1 backend pid: 19389 statement: COMMIT Mar 16
>>>> 12:26:07 AISGPGP01 pgpool[21402]: [3108-1] 2020-03-16 12:26:07 ::: 
>>>> APP
>>>> - psql :A: DB - aisgdvyv :D: USER - aisg :U: |LOG:  DB node id: 0 
>>>> backend pid: 17021 statement: COMMIT
>>>> 
>>>> 4) Terminate the session with backend PID 19389 from a different session:
>>>> 
>>>> aisgdvyv=> SELECT pg_terminate_backend(19389);  pg_terminate_backend
>>>> ----------------------
>>>>  t
>>>> (1 row)
>>>> 
>>>> Mar 16 12:30:38 AISGPGP01 pgpool[21414]: [3108-1] 2020-03-16 
>>>> 12:30:38
>>>> ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg :U: |LOG:  found 
>>>> the pg_terminate_backend request for backend pid:19389 on backend 
>>>> node:1 Mar 16 12:30:38 AISGPGP01 pgpool[21414]: [3108-2] 2020-03-16 
>>>> 12:30:38
>>>> ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg :U: |DETAIL:  
>>>> setting the connection flag Mar 16 12:30:38 AISGPGP01 pgpool[21414]: 
>>>> [3109-1]
>>>> 2020-03-16 12:30:38 ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg
>>>> :U: |LOG:  DB node id: 1 backend pid: 19715 statement: SELECT 
>>>> pg_terminate_backend(19389); Mar 16 12:30:38 AISGPGP01 pgpool[21402]:
>>>> [3109-1] 2020-03-16 12:30:38 ::: APP - psql :A: DB - aisgdvyv :D: 
>>>> USER
>>>> - aisg :U: |LOG:  reading and processing packets Mar 16 12:30:38
>>>> AISGPGP01 pgpool[21402]: [3109-2] 2020-03-16 12:30:38 ::: APP - psql
>>>> :A: DB - aisgdvyv :D: USER - aisg :U: |DETAIL:  postmaster on DB 
>>>> node
>>>> 1 was shutdown by administrative command
>>>> 
>>>> 5) INSERT a row into the table t2:
>>>> 
>>>> aisgdvyv=> insert into t2 values(1); INSERT 0 1
>>>> 
>>>> Mar 16 12:33:42 AISGPGP01 pgpool[21402]: [3110-1] 2020-03-16 
>>>> 12:33:42
>>>> ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg :U: |LOG:  DB node
>>>> id: 0 backend pid: 17021 statement: SELECT attname, 
>>>> pg_get_expr(d.adbin, d.adrelid), coalesce((pg_get_expr(d.adbin,
>>>> d.adrelid) LIKE '%now()%' OR pg_get_expr(d.adbin, d.adrelid) LIKE 
>>>> '%''now''::text%' OR pg_get_expr(d.adbin, d.adrelid) LIKE 
>>>> '%CURRENT_TIMESTAMP%' OR pg_get_expr(d.adbin, d.adrelid) LIKE 
>>>> '%CURRENT_TIME%' OR pg_get_expr(d.adbin, d.adrelid) LIKE 
>>>> '%CURRENT_DATE%' OR pg_get_expr(d.adbin, d.adrelid) LIKE '%LOCALTIME%'
>>>> OR pg_get_expr(d.adbin, d.adrelid) LIKE '%LOCALTIMESTAMP%') AND 
>>>> (a.atttypid = 'timestamp'::regtype::oid OR a.atttypid = 'timestamp 
>>>> with time zone'::regtype::oid OR a.atttypid = 'date'::regtype::oid 
>>>> OR a.atttypid = 'time'::regtype::oid OR a.atttypid = 'time with time
>>>> zone'::regtype::oid) , false) FROM pg_catalog.pg_class c, 
>>>> pg_catalog.pg_attribute a  LEFT JOIN Mar 16 12:33:42 AISGPGP01
>>>> pgpool[21402]: [3110-2]  pg_catalog.pg_attrdef d ON (a.attrelid = 
>>>> d.adrelid AND a.attnum = d.adnum) WHERE c.oid = a.attrelid AND 
>>>> a.attnum >= 1 AND a.attisdropped = 'f' AND c.oid =
>>>> to_regclass('"t2"') ORDER BY a.attnum Mar 16 12:33:42 AISGPGP01
>>>> pgpool[21402]: [3111-1]
>>>> 2020-03-16 12:33:42 ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg
>>>> :U: |LOG:  DB node id: 0 backend pid: 17021 statement: insert into 
>>>> t2 values(1);
>>>> 
>>>> 6) Compare content of the table between nodes:
>>>> 
>>>> psql -h aisgdbd01.cca.cz -p 5432 -d aisgdvyv -U aisg
>>>> 
>>>> aisgdvyv=> select count(*) from t2;
>>>>  count
>>>> -------
>>>>      1
>>>> (1 row)
>>>> 
>>>> psql -h aisgdbd02.cca.cz -p 5432 -d aisgdvyv -U aisg
>>>> 
>>>> aisgdvyv=> select count(*) from t2;
>>>>  count
>>>> -------
>>>>      0
>>>> (1 row)
>>>> 
>>>> 
>>>> However, I don't remember ever seeing the line:
>>>> 
>>>> DETAIL:  postmaster on DB node 1 was shutdown by administrative 
>>>> command
>>>> 
>>>> In the log, when my test script failed. I will try to watch the log specifically for this in the coming days and let you know what I find.
>>>> 
>>>> Vladimír Láznička
>>>> 
>>>> -----Original Message-----
>>>> From: Tatsuo Ishii <ishii at sraoss.co.jp>
>>>> Sent: Wednesday, March 11, 2020 11:26 PM
>>>> To: Láznička Vladimír <Vladimir.Laznicka at cca.cz>
>>>> Cc: pengbo at sraoss.co.jp; pgpool-general at pgpool.net
>>>> Subject: Re: [pgpool-general: 6896] Re: Inconsistency between DB 
>>>> nodes with native replication
>>>> 
>>>>> Hello,
>>>>> 
>>>>>> So my question is, did you use pg_terminate_backend(), or did you kill the postgres backend process on node 1 using the kill command?
>>>>> 
>>>>> Definitely not in the case of my test script. I am trying to replicate the process which you described below, but I don't have the load balance node on node 0:
>>>>> 
>>>>> aisgdvyv=> show pool_nodes;
>>>>>  node_id |     hostname     | port | status | lb_weight |  role  | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
>>>>> ---------+------------------+------+--------+-----------+--------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
>>>>>  0       | aisgdbd01.cca.cz | 5432 | up     | 0.500000  | master | 340        | false             | 0                 |                   |                        | 2020-03-11 14:19:06
>>>>>  1       | aisgdbd02.cca.cz | 5432 | up     | 0.500000  | slave  | 749        | true              | 0                 |                   |                        | 2020-03-11 14:19:06
>>>>> 
>>>>> Can I force it or change some settings so that I have the load balance node on node 0?
>>>> 
>>>> The "load_balance_node" column in the show pool_nodes command just shows the load balance node of the local session. So the next time you start a session, "load_balance_node" may show a different value, because the load balance node is chosen at the startup of each session.
>>>> 
>>>> To force the load balance node to be node 0, you can set backend_weight1 to 0.
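
That is, in pgpool.conf (the weights are relative, so any non-zero/zero
pair has the same effect):

```
backend_weight0 = 1   # node 0 receives all load-balanced reads
backend_weight1 = 0   # node 1 is never chosen as the load balance node
```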
>>>> 
>>>>> Thank you.
>>>>> 
>>>>> Vladimír Láznička
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Tatsuo Ishii <ishii at sraoss.co.jp>
>>>>> Sent: Tuesday, March 10, 2020 1:18 AM
>>>>> To: Láznička Vladimír <Vladimir.Laznicka at cca.cz>
>>>>> Cc: pengbo at sraoss.co.jp; pgpool-general at pgpool.net
>>>>> Subject: Re: [pgpool-general: 6896] Re: Inconsistency between DB 
>>>>> nodes with native replication
>>>>> 
>>>>> I have found a case which could explain your problem. See following Pgpool-II log.
>>>>> 
>>>>> 1. Connect to Pgpool-II using psql. Make sure that load balance node is node 0.
>>>>> 
>>>>> test=# show pool_nodes;
>>>>>  node_id | hostname | port  | status | lb_weight |  role  | 
>>>>> select_cnt
>>>>> | load_balance_node | replication_delay | replication_state |
>>>>> replication_sync_state | last_status_change
>>>>> ---------+----------+-------+--------+-----------+--------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
>>>>>  0       | /tmp     | 11002 | up     | 1.000000  | master | 1          | true              | 0                 |                   |                        | 2020-03-10 09:02:16
>>>>>  1       | /tmp     | 11003 | up     | 0.000000  | slave  | 0          | false             | 0                 |                   |                        | 2020-03-10 09:02:16
>>>>> (2 rows)
>>>>> 
>>>>> 2. CREATE a table "t2".
>>>>> 2020-03-10 09:07:07: pid 12361: LOG:  DB node id: 0 backend pid: 
>>>>> 12459
>>>>> statement: BEGIN
>>>>> 2020-03-10 09:07:07: pid 12361: LOG:  DB node id: 1 backend pid: 
>>>>> 12460
>>>>> statement: BEGIN
>>>>> 2020-03-10 09:07:07: pid 12361: LOG:  DB node id: 0 backend pid: 
>>>>> 12459
>>>>> statement: create table t2(i int);
>>>>> 2020-03-10 09:07:07: pid 12361: LOG:  DB node id: 1 backend pid: 
>>>>> 12460
>>>>> statement: create table t2(i int);
>>>>> 2020-03-10 09:07:07: pid 12361: LOG:  DB node id: 1 backend pid: 
>>>>> 12460
>>>>> statement: COMMIT
>>>>> 2020-03-10 09:07:07: pid 12361: LOG:  DB node id: 0 backend pid: 
>>>>> 12459
>>>>> statement: COMMIT
>>>>> 
>>>>> 3. Issue pg_terminate_backend(12460) to kill the PostgreSQL process on node 1 from a different terminal.
>>>>> 
>>>>> 4. Pgpool-II detects it and sets the local node 1 status to down.
>>>>> 
>>>>> 2020-03-10 09:08:20: pid 12361: LOG:  reading and processing 
>>>>> packets
>>>>> 2020-03-10 09:08:20: pid 12361: DETAIL:  postmaster on DB node 1 
>>>>> was shutdown by administrative command
>>>>> 
>>>>> 5. Issue an INSERT to see that it is forwarded only to node 0, in the same session as steps 1 and 2.
>>>>> 
>>>>> 2020-03-10 09:08:36: pid 12361: LOG:  DB node id: 0 backend pid: 
>>>>> 12459
>>>>> statement: SELECT attname, pg_get_expr(d.adbin, d.adrelid), 
>>>>> coalesce((pg_get_expr(d.adbin, d.adrelid) LIKE '%now()%' OR 
>>>>> pg_get_expr(d.adbin, d.adrelid) LIKE '%''now''::text%' OR 
>>>>> pg_get_expr(d.adbin, d.adrelid) LIKE '%CURRENT_TIMESTAMP%' OR 
>>>>> pg_get_expr(d.adbin, d.adrelid) LIKE '%CURRENT_TIME%' OR 
>>>>> pg_get_expr(d.adbin, d.adrelid) LIKE '%CURRENT_DATE%' OR 
>>>>> pg_get_expr(d.adbin, d.adrelid) LIKE '%LOCALTIME%' OR 
>>>>> pg_get_expr(d.adbin, d.adrelid) LIKE '%LOCALTIMESTAMP%') AND 
>>>>> (a.atttypid = 'timestamp'::regtype::oid OR a.atttypid = 'timestamp 
>>>>> with time zone'::regtype::oid OR a.atttypid = 'date'::regtype::oid 
>>>>> OR a.atttypid = 'time'::regtype::oid OR a.atttypid = 'time with 
>>>>> time
>>>>> zone'::regtype::oid) , false) FROM pg_catalog.pg_class c, 
>>>>> pg_catalog.pg_attribute a  LEFT JOIN pg_catalog.pg_attrdef d ON 
>>>>> (a.attrelid = d.adrelid AND a.attnum = d.adnum) WHERE c.oid = 
>>>>> a.attrelid AND a.attnum >= 1 AND a.attisdropped = 'f' AND c.oid =
>>>>> to_regclass('"t2"') ORDER BY a.attnum
>>>>> 2020-03-10 09:08:36: pid 12361: LOG:  DB node id: 0 backend pid: 
>>>>> 12459
>>>>> statement: insert into t2 values(1);
>>>>> 
>>>>> So my question is, did you use pg_terminate_backend(), or did you kill the postgres backend process on node 1 using the kill command?
>>>>> 
>>>>>>> Have you seen any failover event of node 1 followed by a failback during the test? If process 29684 copied the down status of node 1 when the failover happened but failed to change the local status back to "up" when node 1 failed back, then this could be an explanation.
>>>>>> 
>>>>>> Not as far as I remember. If node 1 went down, it would always stay down until I manually used some kind of recovery (as in, I never saw anything in the log about a failback).
>>>>>> 
>>>>>>> If your test environment's network is more unstable than the data center, it is likely to have more chance of failover. Of course this is just a guess.
>>>>>> 
>>>>>> I asked our sysadmin to move the second DB server (node 1) to the same physical machine in the virtualization environment as the first DB server (node 0), so I'll see if there is any difference.
>>>>>> 
>>>>>> Vladimír Láznička
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Tatsuo Ishii <ishii at sraoss.co.jp>
>>>>>> Sent: Thursday, March 5, 2020 1:18 PM
>>>>>> To: Láznička Vladimír <Vladimir.Laznicka at cca.cz>
>>>>>> Cc: pengbo at sraoss.co.jp; pgpool-general at pgpool.net
>>>>>> Subject: Re: [pgpool-general: 6896] Re: Inconsistency between DB 
>>>>>> nodes with native replication
>>>>>> 
>>>>>>> Do you think there is a possibility that only one specific child 
>>>>>>> process of pgpool thinks that one of the DB nodes is dead?
>>>>>> 
>>>>>> Yes. Pgpool-II maintains the backend status (up, down, quarantine, etc.) in shared memory, and each pgpool process also has a local copy of it (usually each pgpool process looks at the local copy). So if for some reason the local status becomes out of sync with the shared memory status, we might see this phenomenon.
>>>>>> 
>>>>>> Have you seen any failover event of node 1 followed by a failback during the test? If process 29684 copied the down status of node 1 when the failover happened but failed to change the local status back to "up" when node 1 failed back, then this could be an explanation.
>>>>>> 
>>>>>>> Is there a way to find out about it? I guess I could set the logging 
>>>>>>> to some debug level, but I don't even want to imagine the log 
>>>>>>> volume in that case, since it is already hundreds of megabytes in size.
>>>>>> 
>>>>>> As far as I know there's no way to detect this phenomenon without turning on the debug log. But, yes, that would create a huge amount of log output.
>>>>>> 
>>>>>>> On a side note, I also have the test running in another 
>>>>>>> environment (in the customer's data center) and it has not failed 
>>>>>>> so far (since the 25th of February), so the problem may come from 
>>>>>>> some specific setting on those servers (pgpool.conf is the same 
>>>>>>> apart from the IP addresses and the number of child processes 
>>>>>>> generated at startup).
>>>>>> 
>>>>>> If your test environment's network is more unstable than the data center, it is likely to have more chance of failover. Of course this is just a guess.
>>>>>> 
>>>>>> Best regards,
>>>>>> --
>>>>>> Tatsuo Ishii
>>>>>> SRA OSS, Inc. Japan
>>>>>> English: http://www.sraoss.co.jp/index_en.php
>>>>>> Japanese: http://www.sraoss.co.jp

