View Issue Details

ID: 0000672
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2021-02-19 09:42
Reporter: wjuszko
Assigned To: t-ishii
Priority: high
Severity: major
Reproducibility: always
Status: closed
Resolution: open
Platform: Linux
OS: Ubuntu
OS Version: 20.04
Product Version: 4.1.5
Target Version: 4.1.6
Fixed in Version: 4.1.6
Summary: 0000672: Child connection counter issue after the primary node is reattached (switch back from the secondary)
Description: I managed to set up pgpool2 with two databases (a primary and a secondary). I've noticed that if the primary goes down, the client switches to the secondary (works fine in RO mode), but when the primary is up again and I reattach it to the pool, pgpool fails to decrease the client connection counter, eventually causing a "FATAL: sorry, too many clients already" error.
Steps To Reproduce: 1. Set up 2 backends (0: ALWAYS_MASTER|ALLOW_TO_FAILOVER, 1: ALLOW_TO_FAILOVER) and num_init_children = 8, then start pgpool (a configuration sketch follows the steps below)
2. Check everything works fine:
     - Connect to the pool with psql and run SELECT 1;
     - Run pcp_node_info to check the primary backend is up and running (pcp_node_info --user=pcp --host=localhost --port=9898 -vv 0)
     - Run pcp_node_info to check the secondary backend is up and running (pcp_node_info --user=pcp --host=localhost --port=9898 -vv 1)
     - If the log level is DEBUG5, pgpool logs connection_count_up (number of connected children: 1)

3. Then shut down the primary DB
4. Check the state of the pool:
     - Run pcp_node_info to check the primary backend is down
     - Run pcp_node_info to check the secondary backend is still up and running
5. Check the SQL command still runs on the RO secondary (SELECT 1;)
6. Check the pgpool logs: neither connection_count_up nor connection_count_down is logged

7. Start the primary DB and reattach it to the pool (with pcp_attach_node)
8. Check the state of the pool:
     - Run pcp_node_info to check the primary backend is up (or waiting for the first call)
     - Run pcp_node_info to check the secondary backend is still up and running
9. Check the SQL command runs fine on the primary again (SELECT 1;)
10. Check the pgpool logs: no connection_count_down, but connection_count_up (number of connected children: 2)

11. Repeat the steps until the client connection counter reaches the num_init_children value and the pool refuses connections with: "FATAL: sorry, too many clients already"
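
As an illustration only (host names are hypothetical; the flags and num_init_children come from step 1), the relevant pgpool.conf fragment would look roughly like this:

num_init_children = 8                      # preforked child processes; the counter that eventually saturates

backend_hostname0 = 'db-primary'           # hypothetical host
backend_port0 = 5432
backend_weight0 = 1
backend_flag0 = 'ALWAYS_MASTER|ALLOW_TO_FAILOVER'

backend_hostname1 = 'db-secondary'         # hypothetical host
backend_port1 = 5432
backend_weight1 = 1
backend_flag1 = 'ALLOW_TO_FAILOVER'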
Tags: No tags attached.

Activities

t-ishii

2020-12-12 09:44

developer   ~0003655

Last edited: 2020-12-12 09:55

Reproduced here. The minimal reproducing procedure is:
pgpool_setup # streaming replication, 2 PostgreSQL nodes
failback_command = '' # disable failback command
search_primary_node_timeout = 5 # optional. to reduce test time
./startall # start whole cluster
pg_ctl -D data0 stop # stop primary node and let pgpool failover
(in another terminal)
psql -p 11000 test
select 1;
(and leave the psql connecting)
pg_ctl -D data0 start # restart primary
pcp_attach_node -w -p 11001 0 # re-attach primary
(return to the psql terminal)
select 1;
At this point the pgpool child process segfaults, and the connection counter will never be counted down for this session.

t-ishii

2020-12-12 10:05

developer   ~0003656

Here is the backtrace:
(gdb) bt
#0 0x000055aa4c585d41 in query_ps_status (query=query@entry=0x55aa4d414288 "select 1;", backend=backend@entry=0x7f07d69da660)
    at protocol/pool_process_query.c:3853
#1  0x000055aa4c59296c in SimpleQuery (frontend=frontend@entry=0x55aa4d40e268, backend=backend@entry=0x7f07d69da660, len=10,
    contents=contents@entry=0x55aa4d414288 "select 1;") at protocol/pool_proto_modules.c:205
#2  0x000055aa4c596510 in ProcessFrontendResponse (frontend=frontend@entry=0x55aa4d40e268, backend=backend@entry=0x7f07d69da660)
    at protocol/pool_proto_modules.c:2664
#3  0x000055aa4c587f72 in read_packets_and_process (frontend=frontend@entry=0x55aa4d40e268, backend=backend@entry=0x7f07d69da660,
    reset_request=reset_request@entry=0, state=state@entry=0x7fffeb82106c, num_fields=num_fields@entry=0x7fffeb82106a,
    cont=cont@entry=0x7fffeb821074 "\001U") at protocol/pool_process_query.c:4956
#4  0x000055aa4c5887e3 in pool_process_query (frontend=0x55aa4d40e268, backend=0x7f07d69da660, reset_request=reset_request@entry=0)
    at protocol/pool_process_query.c:291
#5  0x000055aa4c581420 in do_child (fds=fds@entry=0x55aa4d40d0c0) at protocol/child.c:439
#6  0x000055aa4c559207 in fork_a_child (fds=0x55aa4d40d0c0, id=26) at main/pgpool_main.c:671
#7  0x000055aa4c55db33 in failover () at main/pgpool_main.c:1933
#8  0x000055aa4c55e95b in sigusr1_interupt_processor () at main/pgpool_main.c:1284
#9  0x000055aa4c560133 in PgpoolMain (discard_status=<optimized out>, clear_memcache_oidmaps=<optimized out>) at main/pgpool_main.c:465
#10 0x000055aa4c557390 in main (argc=<optimized out>, argv=0x7fffeb82e9b8) at main/main.c:363

query_ps_status(char *query, POOL_CONNECTION_POOL * backend)
{
:
:
      sp = MAIN_CONNECTION(backend)->sp;

(gdb) p *backend
$1 = {info = 0x7f07d3d76000, slots = {0x0, 0x7f07d69d4ab0, 0x0 <repeats 126 times>}}

So slots[0] is NULL; this is the cause of the segfault.
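
For illustration, here is a minimal standalone C sketch of that failure pattern. The struct shapes and the MAIN_CONNECTION macro below are simplified assumptions for demonstration, not pgpool-II's exact definitions:

#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-ins for pgpool-II's structures (assumed shapes). */
typedef struct { char *database; } StartupPacket;
typedef struct { StartupPacket *sp; } POOL_CONNECTION_POOL_SLOT;
typedef struct { POOL_CONNECTION_POOL_SLOT *slots[128]; } POOL_CONNECTION_POOL;

/* Assume the main node is node 0, as in the backtrace above. */
#define MAIN_CONNECTION(b) ((b)->slots[0])

int main(void)
{
    POOL_CONNECTION_POOL backend = {{NULL}};

    /* After the failover only slots[1] (the standby) exists; slots[0]
     * is never re-established, matching the gdb dump above. */
    backend.slots[1] = calloc(1, sizeof(POOL_CONNECTION_POOL_SLOT));

    /* The same dereference that crashes in query_ps_status():
     * MAIN_CONNECTION(backend)->sp with slots[0] == NULL segfaults. */
    printf("%p\n", (void *) MAIN_CONNECTION(&backend)->sp);
    return 0;
}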

When Pgpool-II performs failback (i.e., when pcp_attach_node is executed), it should have restarted all child processes, because the child processes' backend connection to node 0 is still NULL after the previous failover. You can observe this by looking at a log line something like:
"Do not restart children because we are failing back node id %d host: %s port: %d and we are in streaming replication mode and not all backends were down"

Attached is a patch that fixes this by restarting the child processes if the former primary node does not exist.
failback.diff (972 bytes)   
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index c1529f29..ec45afa3 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -1635,7 +1635,17 @@ failover(void)
 		 * See bug 248 for more details.
 		 */
 
-		if (STREAM && reqkind == NODE_UP_REQUEST && all_backend_down == false)
+		/*
+		 * We also need to consider the case where the former primary node
+		 * did not exist. In this case we need to restart all children as
+		 * well. For example, suppose the previous primary node id is 0: it
+		 * goes down, is restarted, and is re-attached without a promotion.
+		 * The existing child processes lose the connection slot to node 0
+		 * and keep it NULL even after node 0 comes back. This could result
+		 * in a segfault later on in the child process.
+		 */
+		if (STREAM && reqkind == NODE_UP_REQUEST && all_backend_down == false &&
+			Req_info->primary_node_id >= 0)
 		{
 			/*
 			 * The decision to restart/no-restart children for update status

t-ishii

2020-12-12 10:12

developer   ~0003657

Note that usually, after:
pg_ctl -D data0 stop
a new primary is elected during the failover process, and the issue does not happen. I think that's why it has not been found until today. In other words, the issue only happens when the failover script is disabled.
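
For context, a hedged sketch of what the enabled case looks like in pgpool.conf (the script path is hypothetical; the %-placeholders are pgpool's standard substitutions, e.g. %d = failed node id, %H = new main node host):

failback_command = ''        # disabled, as in the reproduction above
failover_command = '/etc/pgpool2/failover.sh %d %h %p %D %m %H %M %P'

With a failover_command that promotes a standby, a new primary is elected, so the NULL slot of the old primary is never consulted as the main connection.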

wjuszko

2020-12-13 22:59

reporter   ~0003658

Thank you for your time and help.

Unfortunately the fix doesn't work for me. I'm not sure, but didn't you mean to check for
  Req_info->primary_node_id > 0
instead of
  Req_info->primary_node_id >= 0
to skip the if block (which sets need_to_restart_children = false) when primary_node_id is 0?

t-ishii

2020-12-14 08:41

developer   ~0003659

> Unfortunately the fix doesn't work for me.
Oh, I see. Can you share the pgpool log from when the segfault occurred?

> I'm not sure, but didn't you mean to check for
> Req_info->primary_node_id > 0
> instead of
> Req_info->primary_node_id >= 0
> to skip the if block (which sets need_to_restart_children = false) when primary_node_id is 0?
The intention of the patch is to catch the case where the former primary node does not exist; in that case
Req_info->primary_node_id is -1. In my test case it does work, but in yours it does not. I suspect your issue has a different cause. I hope I can find it in your pgpool log.

wjuszko

2020-12-14 19:16

reporter  

pgppol2.log (1,325,243 bytes)

wjuszko

2020-12-14 19:16

reporter   ~0003660

After the patch I checked the value of primary_node_id and it was 0. I thought that meant the primary node was down. I see now.

This is my pgpool conf and log (level DEBUG5) from before the patch (and a docker-compose file to set up the databases).
I followed the steps until the "FATAL: sorry, too many clients already" message.

Please let me know if you need anything else.
pgpool.conf (43,572 bytes)
docker-compose.yml (1,022 bytes)

t-ishii

2020-12-14 19:39

developer   ~0003661

Thank you for the additional information. It seems I missed that you set the ALWAYS_MASTER flag on node 0. Let me investigate more.

t-ishii

2020-12-14 21:17

developer   ~0003662

Ok, it appears that if you set the ALWAYS_MASTER flag, then you need to set DISALLOW_TO_FAILOVER as well, i.e.:

backend_flag0 = 'ALWAYS_MASTER|DISALLOW_TO_FAILOVER'

Otherwise, pgpool is happy to disconnect the primary (node 0) while leaving the primary node id set to 0. Later on, pgpool children access node 0 whenever they need the primary node, and segfault.

Probably pgpool needs to check at startup whether ALWAYS_MASTER is set without DISALLOW_TO_FAILOVER, and refuse to start in that case, I think.
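
As a hedged sketch of that startup check (the function, flag constants, and error handling here are hypothetical illustrations, not pgpool-II's actual code):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical flag bits mirroring the pgpool.conf backend_flag values. */
#define FLAG_ALWAYS_MASTER        (1u << 0)
#define FLAG_DISALLOW_TO_FAILOVER (1u << 1)

/* Refuse to start if any backend has ALWAYS_MASTER without
 * DISALLOW_TO_FAILOVER, the unsafe combination discussed above. */
static void check_backend_flags(int num_backends, const unsigned int *flags)
{
    int i;

    for (i = 0; i < num_backends; i++)
    {
        if ((flags[i] & FLAG_ALWAYS_MASTER) &&
            !(flags[i] & FLAG_DISALLOW_TO_FAILOVER))
        {
            fprintf(stderr,
                    "backend %d: ALWAYS_MASTER requires DISALLOW_TO_FAILOVER\n", i);
            exit(1);
        }
    }
}

int main(void)
{
    unsigned int flags[2] = {FLAG_ALWAYS_MASTER, 0};    /* node 0 misconfigured */

    check_backend_flags(2, flags);    /* exits with an error message */
    return 0;
}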

wjuszko

2020-12-14 23:32

reporter   ~0003663

Sorry, I'm getting lost. Here is my full story:
Some years ago (0000396:0002014, version 3.1.1 I guess) I managed to set up this architecture and behavior thanks to pgpool2:

- machine 1: primary RW DB
- machine 2: a secondary RO DB with pgpool and my app
- machine 3: a secondary RO DB with pgpool and my app

Both my apps (on machines 2 and 3) normally worked against the primary DB (on machine 1). When something happened and the connection to the primary was lost, pgpool2 switched to the secondary node (running on the same machine) and my app could continue working without noticing the change (write operations failed, but I could deal with that because the workload was mostly reads). A cron job checked whether the master was back and attached it to the pool again. It worked fine.

I'm trying to do the same now, with pgpool2 version 4.1.5, but no luck. A similar setup almost works with:

backend_flag0 = 'ALWAYS_MASTER|ALLOW_TO_FAILOVER'

But after some reattachments I get the "FATAL: sorry, too many clients already" message. If I understand you right, ALWAYS_MASTER and ALLOW_TO_FAILOVER should not be set at the same time, but:

- If I set backend_flag0 = 'ALWAYS_MASTER|DISALLOW_TO_FAILOVER', pgpool refuses to connect even if the secondary node is available.
- If I set backend_flag0 = 'ALLOW_TO_FAILOVER', my client times out.

So neither is an option for me. Maybe I'm missing something. Please advise. I attached the old v3.1.1 pgpool.conf.
pgpool-2.conf (19,530 bytes)

t-ishii

2020-12-15 09:34

developer   ~0003664

Last edited: 2020-12-15 09:35

Ok, after digging a little harder, I think I found a solution with "backend_flag0 = 'ALWAYS_MASTER|ALLOW_TO_FAILOVER'".

In the previous patch:

    if (STREAM && reqkind == NODE_UP_REQUEST && all_backend_down == false &&
            Req_info->primary_node_id >= 0)

I added "&& Req_info->primary_node_id != node_id":

    if (STREAM && reqkind == NODE_UP_REQUEST && all_backend_down == false &&
            Req_info->primary_node_id >= 0 && Req_info->primary_node_id != node_id)

The intention of this is to let pgpool restart all children if the requested failback node (node_id) is the same as the previous primary node (Req_info->primary_node_id). Can you please try this out?
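
Spelled out clause by clause (the comments paraphrase the discussion and are not part of the patch; as noted in ~0003658, the guarded block is the one that sets need_to_restart_children = false):

    if (STREAM &&                                 /* streaming replication mode */
        reqkind == NODE_UP_REQUEST &&             /* a failback, not a failover */
        all_backend_down == false &&              /* at least one backend stayed up */
        Req_info->primary_node_id >= 0 &&         /* a primary existed before the failback */
        Req_info->primary_node_id != node_id)     /* and it is not the node being re-attached */
    {
        /* skip restarting children (need_to_restart_children = false) */
    }
    else
    {
        /* restart all children, discarding any stale NULL connection slots */
    }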

wjuszko

2020-12-15 18:04

reporter   ~0003665

Thanks! It works great.

Will You release it with 4.1.6?

t-ishii

2020-12-15 20:02

developer   ~0003666

Yes! It is supposed to be released on February 18, 2021.
https://pgpool.net/mediawiki/index.php/Roadmap

Issue History

Date Modified Username Field Change
2020-12-12 00:40 wjuszko New Issue
2020-12-12 09:33 t-ishii Assigned To => t-ishii
2020-12-12 09:33 t-ishii Status new => assigned
2020-12-12 09:44 t-ishii Note Added: 0003655
2020-12-12 09:55 t-ishii Note Edited: 0003655
2020-12-12 10:05 t-ishii File Added: failback.diff
2020-12-12 10:05 t-ishii Note Added: 0003656
2020-12-12 10:12 t-ishii Note Added: 0003657
2020-12-12 11:05 t-ishii Status assigned => feedback
2020-12-12 11:05 t-ishii Target Version => 4.1.6
2020-12-13 22:59 wjuszko Note Added: 0003658
2020-12-13 22:59 wjuszko Status feedback => assigned
2020-12-14 08:41 t-ishii Note Added: 0003659
2020-12-14 08:41 t-ishii Status assigned => feedback
2020-12-14 19:16 wjuszko File Added: pgppol2.log
2020-12-14 19:16 wjuszko File Added: pgpool.conf
2020-12-14 19:16 wjuszko File Added: docker-compose.yml
2020-12-14 19:16 wjuszko Note Added: 0003660
2020-12-14 19:16 wjuszko Status feedback => assigned
2020-12-14 19:39 t-ishii Note Added: 0003661
2020-12-14 21:17 t-ishii Note Added: 0003662
2020-12-14 21:18 t-ishii Status assigned => feedback
2020-12-14 23:32 wjuszko File Added: pgpool-2.conf
2020-12-14 23:32 wjuszko Note Added: 0003663
2020-12-14 23:32 wjuszko Status feedback => assigned
2020-12-15 09:34 t-ishii Note Added: 0003664
2020-12-15 09:34 t-ishii Note Edited: 0003664
2020-12-15 09:35 t-ishii Note Edited: 0003664
2020-12-15 09:35 t-ishii Status assigned => feedback
2020-12-15 18:04 wjuszko Note Added: 0003665
2020-12-15 18:04 wjuszko Status feedback => assigned
2020-12-15 20:02 t-ishii Note Added: 0003666
2020-12-17 06:59 t-ishii Status assigned => resolved
2021-02-19 09:42 administrator Status resolved => closed
2021-02-19 09:42 administrator Fixed in Version => 4.1.6