0000659: Failover issue in replication mode - Pgpool-II Bug Tracker

ID	Project	Category	View Status	Date Submitted	Last Update

0000659	Pgpool-II	Bug	public	2020-11-14 00:35	2020-12-22 19:53

Reporter	tmartincpp	Assigned To	t-ishii
Priority	normal	Severity	minor	Reproducibility	have not tried
Status	feedback	Resolution	open
Product Version	4.1.4

Summary	0000659: Failover issue in replication mode
Description	Hello!

We use pgpool in replication and load balancing mode with 2 nodes, and we often have failover issues with SELECT statements. The PostgreSQL version is 11.9.

Here is our configuration:

listen_backlog_multiplier = 2
serialize_accept = on
enable_pool_hba = on
pool_passwd = 'pool_passwd'
authentication_timeout = 60
ssl = off
num_init_children = 280
max_pool = 3
child_life_time = 0
child_max_connections = 0
connection_life_time = 1
client_idle_limit = 0
connection_cache = on
reset_query_list = 'ABORT; DISCARD ALL'
replication_mode = on
replicate_select = off
insert_lock = on
lobj_lock_table = ''
replication_stop_on_mismatch = on
failover_if_affected_tuples_mismatch = off
load_balance_mode = on
ignore_leading_white_space = on
white_function_list = ''
black_function_list = 'nextval,setval'
database_redirect_preference_list = ''
app_name_redirect_preference_list = ''
allow_sql_comments = off
failover_on_backend_error = on
relcache_expire = 0
relcache_size = 256
check_temp_table = on
check_unlogged_table = off
enable_shared_relcache = on

pgpool logs:

Nov 12 23:53:39 host pgpool-II[30574]: [65630-1] LOG: DB node id: 1 backend pid: 12650 statement: SELECT X FROM Y WHERE true ;
Nov 12 23:53:39 host pgpool-II[30574]: [65631-1] LOG: statement: DISCARD ALL
Nov 12 23:53:39 host pgpool-II[30574]: [65632-1] LOG: DB node id: 0 backend pid: 7992 statement: DISCARD ALL
Nov 12 23:53:39 host pgpool-II[30574]: [65633-1] LOG: DB node id: 1 backend pid: 12650 statement: DISCARD ALL
Nov 13 00:00:12 host pgpool-II[30574]: [78709-1] ERROR: unable to write data to frontend
Nov 13 00:00:12 host pgpool-II[30574]: [78709-2] DETAIL: pool_flush failed
Nov 13 00:00:12 host pgpool-II[30574]: [78715-1] FATAL: failed to read kind from backend
Nov 13 00:00:12 host pgpool-II[30574]: [78715-2] DETAIL: kind mismatch among backends. Possible last query was: " DISCARD ALL" kind details are: 0[C] 1[D]
Nov 13 00:00:12 host pgpool-II[30574]: [78715-3] HINT: check data consistency among db nodes

node0 logs:

2020-11-12 23:53:38.936 CET:10.101.24.11:user@database:[5fadbcf2.1f38]:[7992]: LOG: connection authorized: user=X database=Y
2020-11-12 23:53:39.396 CET:10.101.24.11:user@database:[5fadbcf2.1f38]:[7992]: LOG: duration: 0.087 ms statement: DISCARD ALL
2020-11-12 23:53:40.396 CET:10.101.24.11:user@database:[5fadbcf2.1f38]:[7992]: LOG: disconnection: session time: 0:00:01.461 user=user database=database host=x.x.x.x port=46838

node1 logs:

2020-11-12 23:53:39.018 CET:10.101.24.11:user@database:[5fadbcf2.316a]:[12650]: LOG: duration: 21.532 ms statement: SELECT X FROM Y WHERE true ;
2020-11-12 23:53:39.396 CET:10.101.24.11:user@database:[5fadbcf2.316a]:[12650]: LOG: duration: 0.044 ms statement: DISCARD ALL
2020-11-12 23:53:40.396 CET:10.101.24.11:user@database:[5fadbcf2.316a]:[12650]: LOG: disconnection: session time: 0:00:01.461 user=user database=database host=10.101.24.11 port=32914

SELECT statements are not replicated in our configuration, so the DISCARD gets different return codes and pgpool degenerates a node. In this context node1 is queried. There were no WRITE statements.

I don't understand why pgpool connects to node0 and issues a discard to this node.

It is really strange that the degeneration happens minutes later for the same PID. This suggests the connection was working fine until then.

I can provide full obfuscated logs if necessary.
Tags	No tags attached.

t-ishii 2020-11-17 12:02 developer ~0003597	> There were no WRITE statements.
> I don't understand why pgpool connects to node0 and issues a discard to this node.

DISCARD is issued because you have it in reset_query_list. It's perfectly normal.

> It is really strange that the degeneration happens minutes later for the same PID.

Yeah, it's strange. Can you show me how to reliably reproduce the error (failover)?
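The point about reset_query_list can be checked against the pgpool log excerpt in the description. The following is a minimal Python sketch, not part of pgpool; the helper name `statements_by_node` and the regular expression are assumptions made for illustration. It groups the per-node statement lines by node id and shows that the SELECT went to node 1 only, while DISCARD ALL went to both nodes.

```python
import re
from collections import defaultdict

# Matches pgpool-II per-node statement log lines such as:
#   LOG: DB node id: 1 backend pid: 12650 statement: DISCARD ALL
# (pattern derived from the log excerpt in this report, not from pgpool docs)
NODE_STMT = re.compile(
    r"DB node id: (?P<node>\d+) backend pid: (?P<pid>\d+) statement: (?P<stmt>.*)$"
)


def statements_by_node(log_lines):
    """Group logged statements by backend node id (hypothetical helper)."""
    seen = defaultdict(list)
    for line in log_lines:
        m = NODE_STMT.search(line)
        if m:  # lines without a node id (client-side statement log) are skipped
            seen[int(m.group("node"))].append(m.group("stmt").strip())
    return dict(seen)


# Log lines copied from the description above.
LOG = [
    "Nov 12 23:53:39 host pgpool-II[30574]: [65630-1] LOG: DB node id: 1 backend pid: 12650 statement: SELECT X FROM Y WHERE true ;",
    "Nov 12 23:53:39 host pgpool-II[30574]: [65631-1] LOG: statement: DISCARD ALL",
    "Nov 12 23:53:39 host pgpool-II[30574]: [65632-1] LOG: DB node id: 0 backend pid: 7992 statement: DISCARD ALL",
    "Nov 12 23:53:39 host pgpool-II[30574]: [65633-1] LOG: DB node id: 1 backend pid: 12650 statement: DISCARD ALL",
]

for node, stmts in sorted(statements_by_node(LOG).items()):
    print(node, stmts)
```

Run against the full log, the same grouping would show whether any session received a statement on one node that the other node never saw.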

tmartincpp 2020-11-21 00:25 reporter ~0003606	Oh, I thought the DISCARD command was only sent to the nodes that received WRITE queries.

So when we have this log:

Nov 13 00:00:12 host pgpool-II[30574]: [78715-2] DETAIL: kind mismatch among backends. Possible last query was: " DISCARD ALL" kind details are: 0[C] 1[D]

isn't it a problem that the kind details are different for each node?

Otherwise, I'm trying to reproduce the issue, but with no success so far. I suspect a "brutal" disconnection is causing this behavior (a program not properly closing its connection).

t-ishii 2020-12-22 18:29 developer ~0003679	Sorry for the delay.

> it's not a problem that the kind details are different for each node ?

Yes, it's a problem.

> Can you show me how to reliably reproduce the error (failover)?

Can you share how to reproduce the error?
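To make the "kind details" line easier to read, here is a small Python sketch that decodes it. It assumes the bracketed letters are PostgreSQL frontend/backend protocol message type bytes (C is CommandComplete, D is DataRow); the helper `parse_kind_details` and the lookup table are illustrative, not pgpool code. Under that assumption, node 0 had finished a command while node 1 was still returning a row, which is one plausible reading of the mismatch and is not confirmed in this thread.

```python
import re

# Subset of PostgreSQL backend message type bytes (protocol version 3).
BACKEND_KINDS = {
    "C": "CommandComplete",
    "D": "DataRow",
    "E": "ErrorResponse",
    "N": "NoticeResponse",
    "T": "RowDescription",
    "Z": "ReadyForQuery",
}

# One "<node id>[<kind byte>]" entry, e.g. "0[C]".
KIND = re.compile(r"(\d+)\[(.)\]")


def parse_kind_details(detail):
    """Return {node_id: message_name} from a 'kind details are: ...' fragment."""
    # Only look at the text after the marker; no marker means no entries.
    _, _, tail = detail.partition("kind details are:")
    return {
        int(node): BACKEND_KINDS.get(kind, "unknown (%s)" % kind)
        for node, kind in KIND.findall(tail)
    }


# DETAIL line copied from the report.
detail = (
    'kind mismatch among backends. Possible last query was: " DISCARD ALL" '
    "kind details are: 0[C] 1[D]"
)
kinds = parse_kind_details(detail)
print(kinds)
print("mismatch" if len(set(kinds.values())) > 1 else "consistent")
```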

Date Modified	Username	Field	Change
2020-11-14 00:35	tmartincpp	New Issue
2020-11-17 10:36	t-ishii	Assigned To	=> t-ishii
2020-11-17 10:36	t-ishii	Status	new => assigned
2020-11-17 12:02	t-ishii	Note Added: 0003597
2020-11-17 12:03	t-ishii	Status	assigned => feedback
2020-11-21 00:25	tmartincpp	Note Added: 0003606
2020-11-21 00:25	tmartincpp	Status	feedback => assigned
2020-12-22 18:29	t-ishii	Note Added: 0003679
2020-12-22 19:53	t-ishii	Status	assigned => feedback