[pgpool-general: 6913] Re: Inconsistency between DB nodes with native replication

Láznička Vladimír Vladimir.Laznicka at cca.cz
Thu Mar 5 19:00:16 JST 2020


Hello,

As for the process 29684, there are no other actions mentioned in the log, only those:

Mar  2 13:29:01 AISGPGP01 pgpool[29684]: [102356-1] 2020-03-02 13:29:01 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  new connection received
Mar  2 13:29:01 AISGPGP01 pgpool[29684]: [102356-2] 2020-03-02 13:29:01 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |DETAIL:  connecting host=dbstest.cca.cz port=57116
Mar  2 13:29:02 AISGPGP01 pgpool[29684]: [102357-1] 2020-03-02 13:29:02 ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg :U: |LOG:  DB node id: 0 backend pid: 27333 statement: SELECT attname, pg_get_expr(d.adbin, d.adrelid), coalesce((pg_get_expr(d.adbin, d.adrelid) LIKE '%now()%' OR pg_get_expr(d.adbin, d.adrelid) LIKE '%''now''::text%' OR pg_get_expr(d.adbin, d.adrelid) LIKE '%CURRENT_TIMESTAMP%' OR pg_get_expr(d.adbin, d.adrelid) LIKE '%CURRENT_TIME%' OR pg_get_expr(d.adbin, d.adrelid) LIKE '%CURRENT_DATE%' OR pg_get_expr(d.adbin, d.adrelid) LIKE '%LOCALTIME%' OR pg_get_expr(d.adbin, d.adrelid) LIKE '%LOCALTIMESTAMP%') AND (a.atttypid = 'timestamp'::regtype::oid OR a.atttypid = 'timestamp with time zone'::regtype::oid OR a.atttypid = 'date'::regtype::oid OR a.atttypid = 'time'::regtype::oid OR a.atttypid = 'time with time zone'::regtype::oid) , false) FROM pg_catalog.pg_class c, pg_catalog.pg_attribute a  LEFT JOIN
Mar  2 13:29:02 AISGPGP01 pgpool[29684]: [102357-2]  pg_catalog.pg_attrdef d ON (a.attrelid = d.adrelid AND a.attnum = d.adnum) WHERE c.oid = a.attrelid AND a.attnum >= 1 AND a.attisdropped = 'f' AND c.oid = to_regclass('"replication_test"') ORDER BY a.attnum
Mar  2 13:29:02 AISGPGP01 pgpool[29684]: [102358-1] 2020-03-02 13:29:02 ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg :U: |LOG:  DB node id: 0 backend pid: 27333 statement: INSERT INTO "replication_test"("pk_id", "int_number", "custom_text", "custom_date") VALUES (2,"pg_catalog"."date_part"('dow',"pg_catalog"."timestamptz"('2020-03-02 13:29:02.016628+01'::text)), ('Current date/time is '||"pg_catalog"."timestamptz"('2020-03-02 13:29:02.016628+01'::text) ),"pg_catalog"."timestamptz"('2020-03-02 13:29:02.016628+01'::text))
Mar  2 13:29:02 AISGPGP01 pgpool[29684]: [102359-1] 2020-03-02 13:29:02 ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg :U: |LOG:  DB node id: 0 backend pid: 27333 statement:  DISCARD ALL

Do you think there is a possibility that only one specific child process of pgpool thinks that one of the DB nodes is dead? Is there a way to find out about it? I guess I could set the logging to some debug level, but I don't even want to imagine the log volume in that case, since the log is already hundreds of megabytes in size.
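
(I can check what pgpool as a whole reports with the command below, run through a psql session connected to pgpool, but as far as I can tell that reflects the shared backend status, not what an individual child process believes:)

show pool_nodes;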

On a side note, I also have the test running in another environment (in the customer's data center) and it has not failed so far (since the 25th of February), so the issue may come from some specific setting on those servers (pgpool.conf is set the same apart from IP addresses and the number of child processes generated at startup).

Vladimír Láznička
 

-----Original Message-----
From: Tatsuo Ishii <ishii at sraoss.co.jp> 
Sent: Thursday, March 5, 2020 10:38 AM
To: Láznička Vladimír <Vladimir.Laznicka at cca.cz>
Cc: pengbo at sraoss.co.jp; pgpool-general at pgpool.net
Subject: Re: [pgpool-general: 6896] Re: Inconsistency between DB nodes with native replication

Hello,

> Hello,
> 
> After some further testing it seems that the source of our issue lies somewhere other than one of the nodes becoming quarantined. After applying the latest patch, the test script managed to create an inconsistency after a few days of running. Based on that I decided to change the configuration in pgpool.conf so that it degenerates one of the nodes when an inconsistency is found, by turning on these parameters:
> 
> replication_stop_on_mismatch = on
> failover_if_affected_tuples_mismatch = on
> 
> Since some of our developers use tools that send SELECT queries with comments (such as pgAdmin), I also had to turn on this parameter in the load-balancing section, so that such a query does not cause a backend failover when it queries something like a DB object by OID (which can differ between the nodes):
> 
> allow_sql_comments = on
> 
> So far it seems to be doing what I would expect - an inconsistency happens, the next UPDATE or DELETE discovers the fact, pgPool degenerates the second (non-master) node and subsequent commands proceed normally on the remaining DB node. I can then run Online Recovery on the degenerated node (which seems to be working without an issue).
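>
> (For reference, I run the Online Recovery through the pcp_recovery_node tool, roughly as below; the host, port, user and node id are placeholders for our setup:)
>
> pcp_recovery_node -h <pgpool host> -p 9898 -U <pcp user> -n 1
>
> I have a few questions though: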
> 
> Does replication_stop_on_mismatch cover all the situations that failover_if_affected_tuples_mismatch does, plus some others, or do I need to have both of these parameters turned on if I want to cover every possible case of inconsistency? 

Those parameters cover different situations. replication_stop_on_mismatch covers the case where the packets and their kinds are not the same across DB nodes. For example, if a SELECT is sent to all nodes and some of them return a different number of rows, replication_stop_on_mismatch will detect it, since the node returning the fewest rows will send its "Command Complete" packet while the other nodes are still sending row data packets, which are of a different kind.
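
(As an illustration, assuming SELECTs are replicated to all nodes, for example with replicate_select = on, a table that has drifted apart between the nodes would trip this check:)

SELECT * FROM replication_test;
-- e.g. node 0 returns 2 rows while node 1 returns 1 row; node 1's "Command
-- Complete" packet arrives where node 0 is still sending a row: a kind mismatch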

failover_if_affected_tuples_mismatch covers the case where DML (UPDATE/INSERT/DELETE) returns a different number of affected rows on different nodes. In this case the number and kinds of packets are the same, but the contents differ (namely, the number of affected rows).
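
(A minimal illustration, using the replication_test table from your log; when the nodes have drifted apart, the same UPDATE reports a different affected-row count on each node:)

UPDATE replication_test SET custom_text = 'consistency check' WHERE pk_id = 2;
-- node 0 reports UPDATE 1 but node 1 reports UPDATE 0 if the row only made it
-- to node 0; with failover_if_affected_tuples_mismatch = on this is detected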

So if you want the maximum protection, you would want to turn on both parameters.

> From what I have seen so far, it seems that pgPool always degenerates the second (non-master) node, which actually works well for our case, since the command that causes the inconsistency always seems to skip the second node. Can we always expect this behavior in the case we have two DB nodes? 

Pgpool votes when one of the situations above happens, in order to choose the majority nodes. However, if there is an even number of DB nodes (e.g. 2, 4, ...), it is possible that no side gets a majority. In that case Pgpool regards the master node as correct. If there are only 2 DB nodes, obviously neither side can get a majority, so the master always wins.

> As for the testing of a possible inconsistency with my script...
> 
> Since the change in pgpool.conf I have changed the script so that it doesn't send the control SELECT with a comment to the pgPool service (it would just get load-balanced now); instead I connect directly to each of the DB nodes and compare the result of the SELECT between them. The last case of inconsistency happened yesterday; I am sending the specific chunk of the log as an attachment. From what I can tell, it happens the same way as before: one of the INSERT commands gets performed on the master node (node 0) only, so each node has a different number of rows in the table, which gets picked up by the next UPDATE command. However, I also noticed that in the case where said INSERT is performed on both nodes, there is also a SELECT on pg_catalog.pg_attrdef and pg_catalog.pg_class beforehand (on the master node):
> 
> SELECT count(*) FROM pg_catalog.pg_attrdef AS d, pg_catalog.pg_class 
> AS c WHERE d.adrelid = c.oid AND pg_get_expr(d.adbin, d.adrelid) ~ 
> 'nextval' AND c.oid = pg_catalog.to_regclass('"replication_test"')
> 
> This SELECT, however, is not present in the cases where the INSERT is not performed on the second (non-master) node. Do you think this could be some kind of lead to where the issue comes from? Please let me know in case you have any idea where to look next for the source of the issue. I am also sending the latest pgpool.conf as an attachment, in case there is some parameter setting that could be the cause of the behavior after the latest patch.
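>
> (The per-node comparison itself is nothing elaborate; I run something along these lines directly against each node's own port and compare the outputs:)
>
> SELECT count(*) FROM replication_test;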

Not sure at this point. However, I noticed that the pgpool process (29684) sends the DISCARD command to only 1 node. This is unusual, since pgpool should send DISCARD at the end of the session to all DB nodes. The only explanations are either 1) process 29684 believed that there is only 1 DB node, or 2) node 1 was dead. I am still thinking about the possibility that node 1's status was dead (or quarantine) from process 29684's point of view at the time when the process started to handle the client's request (INSERT). Do you have more log of process 29684 before this?
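
(For comparison, given the log format above, a session that ends with both nodes attached should log the DISCARD once per node, roughly like this; the node 1 pid is hypothetical:)

... |LOG:  DB node id: 0 backend pid: 27333 statement:  DISCARD ALL
... |LOG:  DB node id: 1 backend pid: <backend pid on node 1> statement:  DISCARD ALL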

Mar  2 13:29:01 AISGPGP01 pgpool[29684]: [102356-1] 2020-03-02 13:29:01 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  new connection received
Mar  2 13:29:01 AISGPGP01 pgpool[29684]: [102356-2] 2020-03-02 13:29:01 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |DETAIL:  connecting host=dbstest.cca.cz port=57116
Mar  2 13:29:02 AISGPGP01 pgpool[29684]: [102357-1] 2020-03-02 13:29:02 ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg :U: |LOG:  DB node id: 0 backend pid: 27333 statement: SELECT attname, pg_get_expr(d.adbin, d.adrelid), coalesce((pg_get_expr(d.adbin, d.adrelid) LIKE '%now()%' OR pg_get_expr(d.adbin, d.adrelid) LIKE '%''now''::text%' OR pg_get_expr(d.adbin, d.adrelid) LIKE '%CURRENT_TIMESTAMP%' OR pg_get_expr(d.adbin, d.adrelid) LIKE '%CURRENT_TIME%' OR pg_get_expr(d.adbin, d.adrelid) LIKE '%CURRENT_DATE%' OR pg_get_expr(d.adbin, d.adrelid) LIKE '%LOCALTIME%' OR pg_get_expr(d.adbin, d.adrelid) LIKE '%LOCALTIMESTAMP%') AND (a.atttypid = 'timestamp'::regtype::oid OR a.atttypid = 'timestamp with time zone'::regtype::oid OR a.atttypid = 'date'::regtype::oid OR a.atttypid = 'time'::regtype::oid OR a.atttypid = 'time with time zone'::regtype::oid) , false) FROM pg_catalog.pg_class c, pg_catalog.pg_attribute a  LEFT JOIN
Mar  2 13:29:02 AISGPGP01 pgpool[29684]: [102357-2]  pg_catalog.pg_attrdef d ON (a.attrelid = d.adrelid AND a.attnum = d.adnum) WHERE c.oid = a.attrelid AND a.attnum >= 1 AND a.attisdropped = 'f' AND c.oid = to_regclass('"replication_test"') ORDER BY a.attnum
Mar  2 13:29:02 AISGPGP01 pgpool[29684]: [102358-1] 2020-03-02 13:29:02 ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg :U: |LOG:  DB node id: 0 backend pid: 27333 statement: INSERT INTO "replication_test"("pk_id", "int_number", "custom_text", "custom_date") VALUES (2,"pg_catalog"."date_part"('dow',"pg_catalog"."timestamptz"('2020-03-02 13:29:02.016628+01'::text)), ('Current date/time is '||"pg_catalog"."timestamptz"('2020-03-02 13:29:02.016628+01'::text) ),"pg_catalog"."timestamptz"('2020-03-02 13:29:02.016628+01'::text))
Mar  2 13:29:02 AISGPGP01 pgpool[29684]: [102359-1] 2020-03-02 13:29:02 ::: APP - psql :A: DB - aisgdvyv :D: USER - aisg :U: |LOG:  DB node id: 0 backend pid: 27333 statement:  DISCARD ALL

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

