View Issue Details

ID: 0000659
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2020-12-22 19:53
Reporter: tmartincpp
Assigned To: t-ishii
Priority: normal
Severity: minor
Reproducibility: have not tried
Status: feedback
Resolution: open
Product Version: 4.1.4
Summary: 0000659: Failover issue in replication mode
Description:
Hello!
We use pgpool in replication and load-balancing mode with 2 nodes, and we often have failover issues with SELECT statements.

The PostgreSQL version is 11.9.
Here is our configuration:

listen_backlog_multiplier = 2
serialize_accept = on
enable_pool_hba = on

pool_passwd = 'pool_passwd'
authentication_timeout = 60

ssl = off

num_init_children = 280
max_pool = 3

child_life_time = 0
child_max_connections = 0
connection_life_time = 1
client_idle_limit = 0

connection_cache = on
reset_query_list = 'ABORT; DISCARD ALL'


replication_mode = on
replicate_select = off

insert_lock = on
lobj_lock_table = ''

replication_stop_on_mismatch = on

failover_if_affected_tuples_mismatch = off

load_balance_mode = on
ignore_leading_white_space = on
white_function_list = ''
black_function_list = 'nextval,setval'

database_redirect_preference_list = ''

app_name_redirect_preference_list = ''
allow_sql_comments = off

failover_on_backend_error = on

relcache_expire = 0
relcache_size = 256

check_temp_table = on

check_unlogged_table = off

enable_shared_relcache = on


pgpool logs:
Nov 12 23:53:39 host pgpool-II[30574]: [65630-1] LOG: DB node id: 1 backend pid: 12650 statement: SELECT X FROM Y WHERE true ;
Nov 12 23:53:39 host pgpool-II[30574]: [65631-1] LOG: statement: DISCARD ALL
Nov 12 23:53:39 host pgpool-II[30574]: [65632-1] LOG: DB node id: 0 backend pid: 7992 statement: DISCARD ALL
Nov 12 23:53:39 host pgpool-II[30574]: [65633-1] LOG: DB node id: 1 backend pid: 12650 statement: DISCARD ALL
Nov 13 00:00:12 host pgpool-II[30574]: [78709-1] ERROR: unable to write data to frontend
Nov 13 00:00:12 host pgpool-II[30574]: [78709-2] DETAIL: pool_flush failed
Nov 13 00:00:12 host pgpool-II[30574]: [78715-1] FATAL: failed to read kind from backend
Nov 13 00:00:12 host pgpool-II[30574]: [78715-2] DETAIL: kind mismatch among backends. Possible last query was: " DISCARD ALL" kind details are: 0[C] 1[D]
Nov 13 00:00:12 host pgpool-II[30574]: [78715-3] HINT: check data consistency among db nodes

node0 logs:
2020-11-12 23:53:38.936 CET:10.101.24.11:user@database:[5fadbcf2.1f38]:[7992]: LOG: connection authorized: user=X database=Y
2020-11-12 23:53:39.396 CET:10.101.24.11:user@database:[5fadbcf2.1f38]:[7992]: LOG: duration: 0.087 ms statement: DISCARD ALL
2020-11-12 23:53:40.396 CET:10.101.24.11:user@database:[5fadbcf2.1f38]:[7992]: LOG: disconnection: session time: 0:00:01.461 user=user database=database host=x.x.x.x port=46838

node1 logs:
2020-11-12 23:53:39.018 CET:10.101.24.11:user@database:[5fadbcf2.316a]:[12650]: LOG: duration: 21.532 ms statement: SELECT X FROM Y WHERE true ;
2020-11-12 23:53:39.396 CET:10.101.24.11:user@database:[5fadbcf2.316a]:[12650]: LOG: duration: 0.044 ms statement: DISCARD ALL
2020-11-12 23:53:40.396 CET:10.101.24.11:user@database:[5fadbcf2.316a]:[12650]: LOG: disconnection: session time: 0:00:01.461 user=user database=database host=10.101.24.11 port=32914

SELECT statements are not replicated in our configuration, so the DISCARD gets different return codes and pgpool degenerates a node.
In this context node1 is queried.
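
To make the "kind details" part of the pgpool log easier to read, here is a minimal Python sketch that extracts the per-node message kind from such a DETAIL line. The letters are PostgreSQL protocol message types ('C' is CommandComplete, 'D' is DataRow). The function name `parse_kind_details` and the `KIND_NAMES` table are made up for this illustration and are not part of pgpool.

```python
import re

# Backend message type bytes from the PostgreSQL frontend/backend protocol.
# Only a few common ones are listed; this table is illustrative, not complete.
KIND_NAMES = {
    "C": "CommandComplete",
    "D": "DataRow",
    "E": "ErrorResponse",
    "Z": "ReadyForQuery",
}

# Matches items such as "0[C]" or "1[D]" (node id, then the kind in brackets).
KIND_RE = re.compile(r"(\d+)\[(.)\]")


def parse_kind_details(log_line):
    """Return {node_id: kind} from a pgpool 'kind mismatch' DETAIL line.

    Hypothetical helper: it only looks at the text after 'kind details are:'.
    """
    marker = "kind details are:"
    if marker not in log_line:
        return {}
    tail = log_line.split(marker, 1)[1]
    return {int(node): kind for node, kind in KIND_RE.findall(tail)}


line = (
    "Nov 13 00:00:12 host pgpool-II[30574]: [78715-2] DETAIL: kind mismatch "
    'among backends. Possible last query was: " DISCARD ALL" '
    "kind details are: 0[C] 1[D]"
)
kinds = parse_kind_details(line)
for node, kind in sorted(kinds.items()):
    print(f"node {node}: {kind} ({KIND_NAMES.get(kind, 'unknown')})")
# node 0: C (CommandComplete)
# node 1: D (DataRow)
```

So in the line above, node 0 answered with a CommandComplete message while pgpool read a DataRow message from node 1.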

There were no WRITE statements.
I don't understand why pgpool connects to node0 and issues a discard to this node.

It is really strange that the degeneration happens minutes later for the same PID.
It implies that the connection was working well until then.
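
For reference, the delay can be measured directly from the pgpool log. The following is a rough Python sketch that computes the time between the last LOG line of a pgpool child and its first ERROR or FATAL line. The helper names (`parse_line`, `idle_gap_before_error`) are invented for this example, and the year is an assumption (syslog lines carry no year; 2020 is taken from the node logs).

```python
import re
from datetime import datetime

# Syslog-style pgpool line: "<Mon DD HH:MM:SS> <host> pgpool-II[<pid>]: [<n>-<m>] <message>"
SYSLOG_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ pgpool-II\[(\d+)\]: \[\d+-\d+\] (.*)$"
)


def parse_line(line, year=2020):
    """Return (timestamp, pgpool pid, message), or None if the line does not match."""
    m = SYSLOG_RE.match(line)
    if m is None:
        return None
    stamp, pid, message = m.groups()
    # The year is assumed, since syslog timestamps do not include it.
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, int(pid), message


def idle_gap_before_error(lines, year=2020):
    """Return (pid, seconds) for the first ERROR/FATAL line of a child that
    logged a LOG line earlier; None if there is no such pair."""
    last_log = {}
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        when, pid, message = parsed
        if message.startswith(("ERROR:", "FATAL:")):
            if pid in last_log:
                return pid, (when - last_log[pid]).total_seconds()
        elif message.startswith("LOG:"):
            last_log[pid] = when
    return None


sample = [
    "Nov 12 23:53:39 host pgpool-II[30574]: [65633-1] LOG: DB node id: 1 backend pid: 12650 statement: DISCARD ALL",
    "Nov 13 00:00:12 host pgpool-II[30574]: [78709-1] ERROR: unable to write data to frontend",
]
result = idle_gap_before_error(sample)
print(result)  # (30574, 393.0)
```

For the excerpt above this gives 393 seconds (about six and a half minutes) between the last DISCARD ALL and the error in the same pgpool process.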

I can provide full obfuscated logs if necessary.
Tags: No tags attached.

Activities

t-ishii

2020-11-17 12:02

developer   ~0003597

> There were no WRITE statements.
> I don't understand why pgpool connects to node0 and issues a discard to this node.
The reason why DISCARD is issued is you have it in the reset_query_list. It's perfectly normal.
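
As an illustration of which statements this setting produces, here is a small Python sketch that splits the reporter's `reset_query_list` value into individual commands. It is a simplified stand-in written for this example (the function `reset_queries` is hypothetical), not pgpool's actual configuration parser.

```python
def reset_queries(conf_line):
    """Split a "reset_query_list = '...'" line into its individual SQL commands.

    Simplified assumption: the value is a single-quoted, semicolon-separated list.
    """
    key, _, value = conf_line.partition("=")
    if key.strip() != "reset_query_list":
        raise ValueError("not a reset_query_list setting")
    value = value.strip().strip("'")
    # Drop empty fragments, e.g. from a trailing semicolon.
    return [q.strip() for q in value.split(";") if q.strip()]


queries = reset_queries("reset_query_list = 'ABORT; DISCARD ALL'")
print(queries)  # ['ABORT', 'DISCARD ALL']
```

With the configuration from the report, the list contains ABORT and DISCARD ALL, which matches the DISCARD ALL entries logged for both node 0 and node 1.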

> It is really strange that the degeneration happens minutes later for the same PID.
Yeah, it's strange. Can you show me how to reliably reproduce the error (failover)?

tmartincpp

2020-11-21 00:25

reporter   ~0003606

Oh I thought the DISCARD command was only sent to all the nodes which sent WRITE queries.

So when we have this log:
Nov 13 00:00:12 host pgpool-II[30574]: [78715-2] DETAIL: kind mismatch among backends. Possible last query was: " DISCARD ALL" kind details are: 0[C] 1[D]

it's not a problem that the kind details are different for each node ?

Meanwhile, I'm trying to reproduce the issue, but with no success so far.
I suspect that an abrupt ("brutal") disconnection is causing this behavior (a program not properly closing its connection).

t-ishii

2020-12-22 18:29

developer   ~0003679

Sorry for the delay.

> it's not a problem that the kind details are different for each node ?
Yes, it's a problem.

> Can you show me how to reliably reproduce the error (failover)?
Can you share how to reproduce the error?

Issue History

Date Modified Username Field Change
2020-11-14 00:35 tmartincpp New Issue
2020-11-17 10:36 t-ishii Assigned To => t-ishii
2020-11-17 10:36 t-ishii Status new => assigned
2020-11-17 12:02 t-ishii Note Added: 0003597
2020-11-17 12:03 t-ishii Status assigned => feedback
2020-11-21 00:25 tmartincpp Note Added: 0003606
2020-11-21 00:25 tmartincpp Status feedback => assigned
2020-12-22 18:29 t-ishii Note Added: 0003679
2020-12-22 19:53 t-ishii Status assigned => feedback