View Issue Details

ID: 0000672
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2021-02-19 09:42
Reporter: wjuszko
Assigned To: t-ishii
Priority: high
Severity: major
Reproducibility: always
Status: closed
Resolution: open
Platform: Linux
OS: Ubuntu
OS Version: 20.04
Product Version: 4.1.5
Target Version: 4.1.6
Fixed in Version: 4.1.6
Summary: 0000672: Child connection counter issue after the primary node is reattached (switch back from the secondary)
Description: I managed to set up pgpool2 with two databases (a primary and a secondary). I've noticed that if the primary goes down, the client switches to the secondary (works fine in RO mode), but when the primary is up again and I reattach it to the pool, pgpool fails to decrease the client connection counter, eventually causing a "FATAL: sorry, too many clients already" error.
Steps To Reproduce: 1. Set up 2 backends (0: ALWAYS_MASTER|ALLOW_TO_FAILOVER, 1: ALLOW_TO_FAILOVER) and num_init_children = 8, then start pgpool (a configuration sketch follows the steps below)
2. Check everything works fine:
     - Connect to the pool with psql and run SELECT 1;
     - Run pcp_node_info to check the primary backend is up and running (pcp_node_info --user=pcp --host=localhost --port=9898 -vv 0)
     - Run pcp_node_info to check the secondary backend is up and running (pcp_node_info --user=pcp --host=localhost --port=9898 -vv 1)
     - If the log level is DEBUG5, pgpool logs connection_count_up (number of connected children: 1)

3. Then shut down the primary DB
4. Check the state of the pool:
     - Run pcp_node_info to check the primary backend is down
     - Run pcp_node_info to check the secondary backend is still up and running
5. Check the SQL command still runs on the RO secondary (SELECT 1;)
6. Check the pgpool logs: neither connection_count_up nor connection_count_down is logged

7. Start the primary DB and reattach it to the pool (with pcp_attach_node)
8. Check the state of the pool:
     - Run pcp_node_info to check the primary backend is up (or waiting for the first call)
     - Run pcp_node_info to check the secondary backend is still up and running
9. Check the SQL command runs fine on the primary again (SELECT 1;)
10. Check the pgpool logs: no connection_count_down, but connection_count_up (number of connected children: 2)

11. Repeat the steps until the client connection counter reaches the num_init_children value and the pool refuses connections with: "FATAL: sorry, too many clients already"
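
As an illustration only (host names are hypothetical; the flags and num_init_children come from step 1), the relevant pgpool.conf fragment would look roughly like this:

num_init_children = 8                      # preforked child processes; the counter that eventually saturates

backend_hostname0 = 'db-primary'           # hypothetical host
backend_port0 = 5432
backend_weight0 = 1
backend_flag0 = 'ALWAYS_MASTER|ALLOW_TO_FAILOVER'

backend_hostname1 = 'db-secondary'         # hypothetical host
backend_port1 = 5432
backend_weight1 = 1
backend_flag1 = 'ALLOW_TO_FAILOVER'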
Tags: No tags attached.

Activities

t-ishii

2020-12-12 09:44

developer   ~0003655

Last edited: 2020-12-12 09:55

Reproduced here. The minimal reproducing procedure is:
pgpool_setup # streaming replication, 2 PostgreSQL nodes
failback_command = '' # disable failback command
search_primary_node_timeout = 5 # optional. to reduce test time
./startall # start whole cluster
pg_ctl -D data0 stop # stop primary node and let pgpool failover
(in another terminal)
psql -p 11000 test
select 1;
(and leave the psql connecting)
pg_ctl -D data0 start # restart primary
pcp_attach_node -w -p 11001 0 # re-attach primary
(return to the psql terminal)
select 1;
At this point the pgpool child process segfaults, and the connection counter will never be counted down for this session.

t-ishii

2020-12-12 10:05

developer   ~0003656

Here is the backtrace:
(gdb) bt
#0 0x000055aa4c585d41 in query_ps_status (query=query@entry=0x55aa4d414288 "select 1;", backend=backend@entry=0x7f07d69da660)
    at protocol/pool_process_query.c:3853
#1  0x000055aa4c59296c in SimpleQuery (frontend=frontend@entry=0x55aa4d40e268, backend=backend@entry=0x7f07d69da660, len=10,
    contents=contents@entry=0x55aa4d414288 "select 1;") at protocol/pool_proto_modules.c:205
#2  0x000055aa4c596510 in ProcessFrontendResponse (frontend=frontend@entry=0x55aa4d40e268, backend=backend@entry=0x7f07d69da660)
    at protocol/pool_proto_modules.c:2664
#3  0x000055aa4c587f72 in read_packets_and_process (frontend=frontend@entry=0x55aa4d40e268, backend=backend@entry=0x7f07d69da660,
    reset_request=reset_request@entry=0, state=state@entry=0x7fffeb82106c, num_fields=num_fields@entry=0x7fffeb82106a,
    cont=cont@entry=0x7fffeb821074 "\001U") at protocol/pool_process_query.c:4956
#4  0x000055aa4c5887e3 in pool_process_query (frontend=0x55aa4d40e268, backend=0x7f07d69da660, reset_request=reset_request@entry=0)
    at protocol/pool_process_query.c:291
#5  0x000055aa4c581420 in do_child (fds=fds@entry=0x55aa4d40d0c0) at protocol/child.c:439
#6  0x000055aa4c559207 in fork_a_child (fds=0x55aa4d40d0c0, id=26) at main/pgpool_main.c:671
#7  0x000055aa4c55db33 in failover () at main/pgpool_main.c:1933
#8  0x000055aa4c55e95b in sigusr1_interupt_processor () at main/pgpool_main.c:1284
#9  0x000055aa4c560133 in PgpoolMain (discard_status=<optimized out>, clear_memcache_oidmaps=<optimized out>) at main/pgpool_main.c:465
#10 0x000055aa4c557390 in main (argc=<optimized out>, argv=0x7fffeb82e9b8) at main/main.c:363

query_ps_status(char *query, POOL_CONNECTION_POOL * backend)
{
:
:
      sp = MAIN_CONNECTION(backend)->sp;

(gdb) p *backend
$1 = {info = 0x7f07d3d76000, slots = {0x0, 0x7f07d69d4ab0, 0x0 <repeats 126 times>}}

So slots[0] is NULL; this is the cause of the segfault.
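
For illustration, here is a minimal standalone C sketch of that failure pattern. The struct shapes and the MAIN_CONNECTION macro below are simplified assumptions for demonstration, not pgpool-II's exact definitions:

#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-ins for pgpool-II's structures (assumed shapes). */
typedef struct { char *database; } StartupPacket;
typedef struct { StartupPacket *sp; } POOL_CONNECTION_POOL_SLOT;
typedef struct { POOL_CONNECTION_POOL_SLOT *slots[128]; } POOL_CONNECTION_POOL;

/* Assume the main node is node 0, as in the backtrace above. */
#define MAIN_CONNECTION(b) ((b)->slots[0])

int main(void)
{
    POOL_CONNECTION_POOL backend = {{NULL}};

    /* After the failover only slots[1] (the standby) exists; slots[0]
     * is never re-established, matching the gdb dump above. */
    backend.slots[1] = calloc(1, sizeof(POOL_CONNECTION_POOL_SLOT));

    /* The same dereference that crashes in query_ps_status():
     * MAIN_CONNECTION(backend)->sp with slots[0] == NULL segfaults. */
    printf("%p\n", (void *) MAIN_CONNECTION(&backend)->sp);
    return 0;
}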

When Pgpool-II performs failback (i.e., when pcp_attach_node is executed), it should have restarted all child processes, because the child processes' backend connection to node 0 is still NULL after the previous failover. You can observe this by looking at a log line something like:
"Do not restart children because we are failing back node id %d host: %s port: %d and we are in streaming replication mode and not all backends were down"

Attached is a patch that fixes this by restarting the child processes if the former primary node does not exist.
failback.diff (972 bytes)   
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index c1529f29..ec45afa3 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -1635,7 +1635,17 @@ failover(void)
 		 * See bug 248 for more details.
 		 */
 
-		if (STREAM && reqkind == NODE_UP_REQUEST && all_backend_down == false)
+		/*
+		 * We also need to consider the case where the former primary node
+		 * did not exist. In this case we need to restart all children as
+		 * well. For example, suppose the previous primary node id is 0: it
+		 * goes down, is restarted, and is re-attached without a promotion.
+		 * The existing child processes lose the connection slot to node 0
+		 * and keep it NULL even after node 0 comes back. This could result
+		 * in a segfault later on in the child process.
+		 */
+		if (STREAM && reqkind == NODE_UP_REQUEST && all_backend_down == false &&
+			Req_info->primary_node_id >= 0)
 		{
 			/*
 			 * The decision to restart/no-restart children for update status

t-ishii

2020-12-12 10:12

developer   ~0003657

Note that usually, after:
pg_ctl -D data0 stop
a new primary is elected during the failover process, and the issue does not happen. I think that's why it has not been found until today. In other words, the issue only happens when the failover script is disabled.
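
For context, a hedged sketch of what the enabled case looks like in pgpool.conf (the script path is hypothetical; the %-placeholders are pgpool's standard substitutions, e.g. %d = failed node id, %H = new main node host):

failback_command = ''        # disabled, as in the reproduction above
failover_command = '/etc/pgpool2/failover.sh %d %h %p %D %m %H %M %P'

With a failover_command that promotes a standby, a new primary is elected, so the NULL slot of the old primary is never consulted as the main connection.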

wjuszko

2020-12-13 22:59

reporter   ~0003658

Thank you for your time and help.

Unfortunately the fix doesn't work for me. I'm not sure, but didn't you mean to check for
  Req_info->primary_node_id > 0
instead of
  Req_info->primary_node_id >= 0
to skip the if block (which sets need_to_restart_children = false) when primary_node_id is 0?

t-ishii

2020-12-14 08:41

developer   ~0003659

> Unfortunately the fix doesn't work for me.
Oh, I see. Can you share the pgpool log from when the segfault occurred?

> I'm not sure, but didn't you mean to check for
> Req_info->primary_node_id > 0
> instead of
> Req_info->primary_node_id >= 0
> to skip the if block (which sets need_to_restart_children = false) when primary_node_id is 0?
The intention of the patch is to catch the case where the former primary node does not exist; in that case
Req_info->primary_node_id is -1. In my test case it does work, but in yours it does not. I suspect your issue has a different cause. I hope I can find it in your pgpool log.

wjuszko

2020-12-14 19:16

reporter  

pgppol2.log (1,325,243 bytes)

wjuszko

2020-12-14 19:16

reporter   ~0003660

After the patch I checked the value of primary_node_id and it was 0. I thought that meant the primary node was down. I see now.

This is my pgpool conf and log (level DEBUG5) from before the patch (and a docker-compose file to set up the databases).
I followed the steps until the "FATAL: sorry, too many clients already" message.

Please let me know if you need anything else.
pgpool.conf (43,572 bytes)
docker-compose.yml (1,022 bytes)

t-ishii

2020-12-14 19:39

developer   ~0003661

Thank you for the additional information. It seems I missed that you set the ALWAYS_MASTER flag on node 0. Let me investigate more.

t-ishii

2020-12-14 21:17

developer   ~0003662

Ok, it appears that if you set the ALWAYS_MASTER flag, then you need to set DISALLOW_TO_FAILOVER as well, i.e.:

backend_flag0 = 'ALWAYS_MASTER|DISALLOW_TO_FAILOVER'

Otherwise, pgpool is happy to disconnect the primary (node 0) while leaving the primary node id set to 0. Later on, pgpool children access node 0 whenever they need the primary node, and segfault.

Probably pgpool needs to check at startup whether ALWAYS_MASTER is set without DISALLOW_TO_FAILOVER, and refuse to start in that case, I think.
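
As a hedged sketch of that startup check (the function, flag constants, and error handling here are hypothetical illustrations, not pgpool-II's actual code):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical flag bits mirroring the pgpool.conf backend_flag values. */
#define FLAG_ALWAYS_MASTER        (1u << 0)
#define FLAG_DISALLOW_TO_FAILOVER (1u << 1)

/* Refuse to start if any backend has ALWAYS_MASTER without
 * DISALLOW_TO_FAILOVER, the unsafe combination discussed above. */
static void check_backend_flags(int num_backends, const unsigned int *flags)
{
    int i;

    for (i = 0; i < num_backends; i++)
    {
        if ((flags[i] & FLAG_ALWAYS_MASTER) &&
            !(flags[i] & FLAG_DISALLOW_TO_FAILOVER))
        {
            fprintf(stderr,
                    "backend %d: ALWAYS_MASTER requires DISALLOW_TO_FAILOVER\n", i);
            exit(1);
        }
    }
}

int main(void)
{
    unsigned int flags[2] = {FLAG_ALWAYS_MASTER, 0};    /* node 0 misconfigured */

    check_backend_flags(2, flags);    /* exits with an error message */
    return 0;
}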

wjuszko

2020-12-14 23:32

reporter   ~0003663

Sorry, I'm getting lost. Here is my full story:
Some years ago (0000396:0002014, version 3.1.1 I guess) I managed to set up this architecture and behavior thanks to pgpool2:

- machine 1: primary RW DB
- machine 2: a secondary RO DB with pgpool and my app
- machine 3: a secondary RO DB with pgpool and my app

Both my apps (on machines 2 and 3) normally worked against the primary DB (on machine 1). When something happened and the connection to the primary was lost, pgpool2 switched to the secondary node (running on the same machine) and my app could continue working without noticing the change (write operations failed, but I could deal with that because the workload was mostly reads). A cron job checked whether the master was back and attached it to the pool again. It worked fine.

I'm trying to do the same now, with pgpool2 version 4.1.5, but no luck. A similar setup almost works with:

backend_flag0 = 'ALWAYS_MASTER|ALLOW_TO_FAILOVER'

But after some reattachments I get the "FATAL: sorry, too many clients already" message. If I understand you right, ALWAYS_MASTER and ALLOW_TO_FAILOVER should not be set at the same time, but:

- If I set backend_flag0 = 'ALWAYS_MASTER|DISALLOW_TO_FAILOVER', pgpool refuses to connect even if the secondary node is available.
- If I set backend_flag0 = 'ALLOW_TO_FAILOVER', my client times out.

So neither is an option for me. Maybe I'm missing something. Please advise. I attached the old v3.1.1 pgpool.conf.
pgpool-2.conf (19,530 bytes)

t-ishii

2020-12-15 09:34

developer   ~0003664

Last edited: 2020-12-15 09:35

Ok, after digging a little harder, I think I found a solution with "backend_flag0 = 'ALWAYS_MASTER|ALLOW_TO_FAILOVER'".

In the previous patch:

    if (STREAM && reqkind == NODE_UP_REQUEST && all_backend_down == false &&
            Req_info->primary_node_id >= 0)

I added "&& Req_info->primary_node_id != node_id":

    if (STREAM && reqkind == NODE_UP_REQUEST && all_backend_down == false &&
            Req_info->primary_node_id >= 0 && Req_info->primary_node_id != node_id)

The intention of this is to let pgpool restart all children if the requested failback node (node_id) is the same as the previous primary node (Req_info->primary_node_id). Can you please try this out?
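
Spelled out clause by clause (the comments paraphrase the discussion and are not part of the patch; as noted in ~0003658, the guarded block is the one that sets need_to_restart_children = false):

    if (STREAM &&                                 /* streaming replication mode */
        reqkind == NODE_UP_REQUEST &&             /* a failback, not a failover */
        all_backend_down == false &&              /* at least one backend stayed up */
        Req_info->primary_node_id >= 0 &&         /* a primary existed before the failback */
        Req_info->primary_node_id != node_id)     /* and it is not the node being re-attached */
    {
        /* skip restarting children (need_to_restart_children = false) */
    }
    else
    {
        /* restart all children, discarding any stale NULL connection slots */
    }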

wjuszko

2020-12-15 18:04

reporter   ~0003665

Thanks! It works great.

Will You release it with 4.1.6?

t-ishii

2020-12-15 20:02

developer   ~0003666

Yes! It is supposed to be released on February 18, 2021.
https://pgpool.net/mediawiki/index.php/Roadmap

Issue History

Date Modified Username Field Change
2020-12-12 00:40 wjuszko New Issue
2020-12-12 09:33 t-ishii Assigned To => t-ishii
2020-12-12 09:33 t-ishii Status new => assigned
2020-12-12 09:44 t-ishii Note Added: 0003655
2020-12-12 09:55 t-ishii Note Edited: 0003655
2020-12-12 10:05 t-ishii File Added: failback.diff
2020-12-12 10:05 t-ishii Note Added: 0003656
2020-12-12 10:12 t-ishii Note Added: 0003657
2020-12-12 11:05 t-ishii Status assigned => feedback
2020-12-12 11:05 t-ishii Target Version => 4.1.6
2020-12-13 22:59 wjuszko Note Added: 0003658
2020-12-13 22:59 wjuszko Status feedback => assigned
2020-12-14 08:41 t-ishii Note Added: 0003659
2020-12-14 08:41 t-ishii Status assigned => feedback
2020-12-14 19:16 wjuszko File Added: pgppol2.log
2020-12-14 19:16 wjuszko File Added: pgpool.conf
2020-12-14 19:16 wjuszko File Added: docker-compose.yml
2020-12-14 19:16 wjuszko Note Added: 0003660
2020-12-14 19:16 wjuszko Status feedback => assigned
2020-12-14 19:39 t-ishii Note Added: 0003661
2020-12-14 21:17 t-ishii Note Added: 0003662
2020-12-14 21:18 t-ishii Status assigned => feedback
2020-12-14 23:32 wjuszko File Added: pgpool-2.conf
2020-12-14 23:32 wjuszko Note Added: 0003663
2020-12-14 23:32 wjuszko Status feedback => assigned
2020-12-15 09:34 t-ishii Note Added: 0003664
2020-12-15 09:34 t-ishii Note Edited: 0003664
2020-12-15 09:35 t-ishii Note Edited: 0003664
2020-12-15 09:35 t-ishii Status assigned => feedback
2020-12-15 18:04 wjuszko Note Added: 0003665
2020-12-15 18:04 wjuszko Status feedback => assigned
2020-12-15 20:02 t-ishii Note Added: 0003666
2020-12-17 06:59 t-ishii Status assigned => resolved
2021-02-19 09:42 administrator Status resolved => closed
2021-02-19 09:42 administrator Fixed in Version => 4.1.6