[pgpool-general: 8734] Re: Clients disconnection when slave node is off

Wed Apr 12 08:36:00 JST 2023

> Thank you for your explanation Tatsuo.
> 
> I also reproduced the issue with a script that executes "select 1" via psql
> 10 times a second (please find attached it).
> Besides, when I have pgadmin4 tool connected to the VIP, I can observe how
> pgadmin4 is disconnected.
> 
> I'm working in a workaround that consists in set DISALLOW_FAILOVER in
> backend1 to change it to ALLOW_FAILOVER in the failover script (via sed
> command) and reload the pgpool configuration during failover. I don't like
> it but I need to avoid client disconnection.

Yeah, it's painful.

> Do you think we should report it as a pgpool bug to work on it in future
> releases?

No, we accept reports in the mailing lists as official reports.
I will continue to work on this issue.

BTW, after thinking more, I came up to an idea: Currently as soon as
pgpool finds failover is going on, it exits and disconnects session.
Instead of this, pgpool could wait for the failover to
finish. Attached is the patch to implement the idea. With this, I
don't see session disconnection any more and pgbench finishes without
error.

However a downside of this is, while failover clients cannot process
queries or at least slow down processing. Below is the log from
pgbench using "-P 1" option to show progress. As you can see from 170
s pgbench starts to slow down and recovers at 194 s. That is, the
slowdown continued for 24 seconds.

progress: 166.0 s, 560.0 tps, lat 17.738 ms stddev 13.628, 0 failed
progress: 167.0 s, 548.0 tps, lat 18.306 ms stddev 13.485, 0 failed
progress: 168.0 s, 488.0 tps, lat 20.491 ms stddev 14.539, 0 failed
progress: 169.0 s, 103.0 tps, lat 63.388 ms stddev 117.504, 0 failed
progress: 170.0 s, 2.0 tps, lat 510.080 ms stddev 389.100, 0 failed
progress: 171.0 s, 1.0 tps, lat 1187.785 ms stddev 0.000, 0 failed
progress: 172.0 s, 93.1 tps, lat 336.584 ms stddev 930.328, 0 failed
progress: 173.0 s, 520.0 tps, lat 18.625 ms stddev 13.354, 0 failed
progress: 174.0 s, 16.0 tps, lat 400.134 ms stddev 214.655, 0 failed
progress: 175.0 s, 13.0 tps, lat 731.237 ms stddev 387.539, 0 failed
progress: 176.0 s, 15.0 tps, lat 602.090 ms stddev 320.367, 0 failed
progress: 177.0 s, 370.0 tps, lat 41.239 ms stddev 129.628, 0 failed
progress: 178.0 s, 228.9 tps, lat 28.355 ms stddev 40.748, 0 failed
progress: 179.0 s, 3.0 tps, lat 1348.203 ms stddev 145.777, 0 failed
progress: 180.0 s, 12.0 tps, lat 1049.585 ms stddev 763.708, 0 failed
progress: 181.0 s, 16.0 tps, lat 671.118 ms stddev 737.817, 0 failed
progress: 182.0 s, 19.0 tps, lat 663.073 ms stddev 569.615, 0 failed
progress: 183.0 s, 10.0 tps, lat 693.354 ms stddev 287.951, 0 failed
progress: 184.0 s, 17.0 tps, lat 677.589 ms stddev 494.581, 0 failed
progress: 185.0 s, 237.0 tps, lat 64.185 ms stddev 219.713, 0 failed
progress: 186.0 s, 526.0 tps, lat 18.992 ms stddev 13.189, 0 failed
progress: 187.0 s, 403.0 tps, lat 20.286 ms stddev 21.798, 0 failed
progress: 188.0 s, 12.0 tps, lat 476.790 ms stddev 333.039, 0 failed
progress: 189.0 s, 172.0 tps, lat 93.301 ms stddev 277.075, 0 failed
progress: 190.0 s, 450.0 tps, lat 19.282 ms stddev 16.146, 0 failed
progress: 191.0 s, 9.0 tps, lat 549.026 ms stddev 299.587, 0 failed
progress: 192.0 s, 12.0 tps, lat 842.556 ms stddev 641.569, 0 failed
progress: 193.0 s, 16.0 tps, lat 839.678 ms stddev 565.659, 0 failed
progress: 194.0 s, 420.0 tps, lat 30.650 ms stddev 68.103, 0 failed
progress: 195.0 s, 517.0 tps, lat 19.338 ms stddev 13.230, 0 failed
progress: 196.0 s, 471.0 tps, lat 21.200 ms stddev 18.464, 0 failed

Maybe this is ok, because while failover is performed, any connection
attempt to pgpool is refused. I think slow processing is better than
connection refusing.

However a question is remaining. From the pgpool.log, the failover
started at 22:44:04 and finished at 22:44:05.

2023-04-11 22:44:04.481: main pid 327143: LOG:  === Starting degeneration. shutdown host localhost(11003) ===
2023-04-11 22:44:04.661: main pid 327143: LOG:  Do not restart children because we are switching over node id 1 host: localhost port: 11003 and we are in streaming replication mode
2023-04-11 22:44:04.662: main pid 327143: LOG:  execute command: /home/t-ishii/work/Pgpool-II/current/x/etc/failover.sh  1 localhost  11003 /home/t-ishii/work/Pgpool-II/current/x/data1  0 localhost 0 0 11002 /home/t-ishii/work/Pgpool-II/current/x/data0 localhost 11002
2023-04-11 22:44:04.670: main pid 327143: LOG:  failover: set new primary node: 0
2023-04-11 22:44:04.670: main pid 327143: LOG:  failover: set new main node: 0
2023-04-11 22:44:04.670: main pid 327143: LOG:  === Failover done. shutdown host localhost(11003) ===
2023-04-11 22:44:04.670: sr_check_worker pid 327797: ERROR:  Failed to check replication time lag
2023-04-11 22:44:04.670: sr_check_worker pid 327797: DETAIL:  No persistent db connection for the node 1
2023-04-11 22:44:04.670: sr_check_worker pid 327797: HINT:  check sr_check_user and sr_check_password
2023-04-11 22:44:04.670: sr_check_worker pid 327797: CONTEXT:  while checking replication time lag
2023-04-11 22:44:04.670: sr_check_worker pid 327797: LOG:  worker process received restart request
2023-04-11 22:44:05.502: pgbench pid 327800: LOG:  DB node id: 0 backend pid: 327896 statement: BEGIN;

That is, the failover only took 1 second. Why pgbench went slow for 24
seconds?

I will continue research on this.

> Thanks in advance.
> 
> Best,
> Jesús
> 
> El mar, 11 abr 2023 8:49, Tatsuo Ishii <ishii at sraoss.co.jp> escribió:
> 
>> Ok, I think I have found the cause of the issue. Unfortunately I am
>> not able to find any solution for this at the moment. Let me explain.
>>
>> First of all, I was able to reproduce the issue by using pgbench and
>> pgpool_setup (a test tool coming with pgpool to create pgbench +
>> PostgreSQL cluster).
>>
>> $ pgpool_setup
>> $ echo "load_balance_mode = off" >> etc/pgpool.conf
>> $ echo "failover_on_backend_error = off" >> etc/pgpool.conf
>> $ ./startall
>> $ pgbench -p 11000 -n -c 1 -T 300 test
>>
>> At this point I hoped a standby node shutdown does not affect pgbench,
>> because load balance mode is off, pgpool should only access to primary
>> node.
>>
>> So I entered:
>> $ pg_ctl -D data1 stop
>>
>> Then pgbench aborted shortly:
>>
>> pgbench: error: client 0 aborted in command 5 (SQL) of script 0; perhaps
>> the backend died while processing
>> [snip]
>>
>> After debugging pgpool using gdb, I got stack trace.
>>
>> #6  0x000056538ef05fc4 in child_exit (code=code at entry=1) at
>> protocol/child.c:1336
>> #7  0x000056538ef2ba6a in pool_virtual_main_db_node_id () at
>> context/pool_query_context.c:341
>> #8  0x000056538ef0e42b in read_packets_and_process (frontend=frontend at entry=0x5653900cb4c8,
>> backend=backend at entry=0x7f732f21fcd8, reset_request=reset_request at entry=0,
>>
>>     state=state at entry=0x7ffcef518b6c, num_fields=num_fields at entry=0x7ffcef518b6a,
>> cont=cont at entry=0x7ffcef518b74 "\001V") at
>> protocol/pool_process_query.c:5098
>> #9  0x000056538ef0f1e3 in pool_process_query (frontend=0x5653900cb4c8,
>> backend=0x7f732f21fcd8, reset_request=reset_request at entry=0)
>>     at protocol/pool_process_query.c:298
>> #10 0x000056538ef0734a in do_child (fds=fds at entry=0x565390107ba0) at
>> protocol/child.c:455
>> #11 0x000056538eedb906 in fork_a_child (fds=0x565390107ba0, id=26) at
>> main/pgpool_main.c:842
>> #12 0x000056538eee3e2a in PgpoolMain (discard_status=<optimized out>,
>> clear_memcache_oidmaps=<optimized out>) at main/pgpool_main.c:541
>> #13 0x000056538eed9826 in main (argc=<optimized out>, argv=<optimized
>> out>) at main/main.c:365
>>
>> As the child process exits and the socket connection closed, and
>> pgbench complains it. Why pgpool exited? Pgpool maintains information
>> including, which node is primary, which node is up or
>> down. pool_virtual_main_db_node_id() returns such that
>> information. Unfortunately the information is changing while
>> failover. If pgpool process uses unstable information, it would cause
>> bad results including segfaults (that actually happend in the
>> past. See mantis bug track no. 481 and 482). Thus if
>> pool_virtual_main_db_node_id() finds that failover is ongoing, it
>> exits the process and session is disconnected. One solution for this
>> is, locking the information by pgpool process. However, this is not
>> only hard to implement, but could result in dead lock. So I hesitate
>> to employ it.
>>
>> BTW, the reson why the issue happens in pgbench, and does not happen
>> in psql is, I think pgbench produces extremely busy transactions which
>> increases the chance to reproduce the problem.
>>
>> Best reagards,
>> --
>> Tatsuo Ishii
>> SRA OSS LLC
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>> > Hi,
>> >
>> > Thanks. I will looking into this.
>> >
>> > Best reagards,
>> > --
>> > Tatsuo Ishii
>> > SRA OSS LLC
>> > English: http://www.sraoss.co.jp/index_en/
>> > Japanese:http://www.sraoss.co.jp
>> >
>> >> Hi,
>> >>
>> >> Please find attached logs.
>> >>
>> >> Thank you.
>> >>
>> >> Best,
>> >> Jesús
>> >>
>> >> El lun, 3 abr 2023 11:06, Tatsuo Ishii <ishii at sraoss.co.jp> escribió:
>> >>
>> >>> Hi,
>> >>>
>> >>> Can you share pgpool log?
>> >>>
>> >>> > Hi Tatsuo,
>> >>> >
>> >>> > Should I test anything else to try to solve my problem?
>> >>> >
>> >>> > Thank you.
>> >>> >
>> >>> > Best,
>> >>> > Jesús
>> >>> >
>> >>> > El vie, 31 mar 2023 12:10, Jesús Campoy <jesuscampoy at gmail.com>
>> >>> escribió:
>> >>> >
>> >>> >> Hi,
>> >>> >>
>> >>> >> I just performed the same tests with auto_failback to off but the
>> >>> >> behaviour is the same (clients are disconnected).
>> >>> >>
>> >>> >> Thank you Tatsuo.
>> >>> >>
>> >>> >> Best,
>> >>> >> Jesús
>> >>> >>
>> >>> >> El vie, 31 mar 2023 2:31, Tatsuo Ishii <ishii at sraoss.co.jp>
>> escribió:
>> >>> >>
>> >>> >>> Hi Jesús,
>> >>> >>>
>> >>> >>> Can you try again after setting "auto_failback = off"? I suspect
>> >>> >>> auto_failback confuses pgpool.
>> >>> >>>
>> >>> >>> Best reagards,
>> >>> >>> --
>> >>> >>> Tatsuo Ishii
>> >>> >>> SRA OSS LLC
>> >>> >>> English: http://www.sraoss.co.jp/index_en/
>> >>> >>> Japanese:http://www.sraoss.co.jp
>> >>> >>>
>> >>> >>> > Hi Tatsuo,
>> >>> >>> >
>> >>> >>> > When I connect to pgpool with psql this session is not
>> disconnected.
>> >>> >>> > However, I've performed a test with pgbench inserting data with
>> 30
>> >>> >>> clients
>> >>> >>> > in the database and when I shutdown server2 some clients of
>> pgbench
>> >>> are
>> >>> >>> > disconnected.
>> >>> >>> > Please find attached a zip file with pgpool logs, pgbench log and
>> >>> >>> > configurations of pgpool and postgres.
>> >>> >>> >
>> >>> >>> > Thank you for your assistance in this matter.
>> >>> >>> >
>> >>> >>> > Best,
>> >>> >>> > Jesús
>> >>> >>> >
>> >>> >>> > El mié, 22 mar 2023 a las 2:05, Tatsuo Ishii (<
>> ishii at sraoss.co.jp>)
>> >>> >>> > escribió:
>> >>> >>> >
>> >>> >>> >> > Please find attached my pgpool config and a log file when the
>> >>> standby
>> >>> >>> >> > (server2) is powered off.
>> >>> >>> >> > Thank you for your help.
>> >>> >>> >>
>> >>> >>> >> I assume server2 = host B.
>> >>> >>> >>
>> >>> >>> >> I have looked into the log file but failed to find log lines
>> related
>> >>> >>> >> to user sessions which were diconnected. I was looking for such
>> log
>> >>> >>> >> lines because you said:
>> >>> >>> >>
>> >>> >>> >> > Sometimes I have to power off the host B and then, the clients
>> >>> >>> connected
>> >>> >>> >>
>> >>> >>> >> If such an event occurs, there should be such log lines.
>> >>> >>> >> Many log lines like:
>> >>> >>> >>
>> >>> >>> >> 2023-03-20 10:59:27.725: [unknown] pid 31499: LOG:  failover or
>> >>> >>> failback
>> >>> >>> >> event detected
>> >>> >>> >> 2023-03-20 10:59:27.725: [unknown] pid 31499: DETAIL:
>> restarting
>> >>> >>> myself
>> >>> >>> >> 2023-03-20 10:59:27.726: main pid 30237: LOG:  child process
>> with
>> >>> pid:
>> >>> >>> >> 31499 exits with status 256
>> >>> >>> >> 2023-03-20 10:59:27.727: main pid 30237: LOG:  fork a new child
>> >>> process
>> >>> >>> >> with pid: 29017
>> >>> >>> >>
>> >>> >>> >> just show that process 31499 is not related to any client
>> session
>> >>> >>> >> ([unknown] indicates this) and even if the process exited, any
>> >>> client
>> >>> >>> >> will not be affected.
>> >>> >>> >>
>> >>> >>> >> Can you connect to pgpool using psql and shutdown server2 so
>> that
>> >>> log
>> >>> >>> >> lines I am expecting are recorded?
>> >>> >>> >>
>> >>> >>> >> Best reagards,
>> >>> >>> >> --
>> >>> >>> >> Tatsuo Ishii
>> >>> >>> >> SRA OSS LLC
>> >>> >>> >> English: http://www.sraoss.co.jp/index_en/
>> >>> >>> >> Japanese:http://www.sraoss.co.jp
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> > Best,
>> >>> >>> >> > Jesús
>> >>> >>> >> >
>> >>> >>> >> > El vie, 17 mar 2023 a las 0:13, Tatsuo Ishii (<
>> ishii at sraoss.co.jp
>> >>> >)
>> >>> >>> >> > escribió:
>> >>> >>> >> >
>> >>> >>> >> >> > Ok, I will send you the log ASAP.
>> >>> >>> >> >> > I forget to indicate that we are running two instances of
>> >>> pgpool
>> >>> >>> using
>> >>> >>> >> >> > watchdog and VIP.
>> >>> >>> >> >> >
>> >>> >>> >> >> > I mean, in host A is running pgpool (active) and primary
>> >>> >>> database. In
>> >>> >>> >> >> host
>> >>> >>> >> >> > B is running the other instance of pgpool and the standby
>> >>> >>> database.
>> >>> >>> >> >> > Sometimes I have to power off the host B and then, the
>> clients
>> >>> >>> >> connected
>> >>> >>> >> >> to
>> >>> >>> >> >> > pgpool in VIP are disconnected.
>> >>> >>> >> >> >
>> >>> >>> >> >> > I have the same pgpool.conf for both pgpool instances. Do
>> you
>> >>> >>> need It?
>> >>> >>> >> >>
>> >>> >>> >> >> No, one pgpool.conf is enough.
>> >>> >>> >> >>
>> >>> >>> >> >> > Thanks for your help!
>> >>> >>> >> >>
>> >>> >>> >> >> You are welcome.
>> >>> >>> >> >> --
>> >>> >>> >> >> Tatsuo Ishii
>> >>> >>> >> >> SRA OSS LLC
>> >>> >>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >>> >>> >> >> Japanese:http://www.sraoss.co.jp
>> >>> >>> >> >>
>> >>> >>> >>
>> >>> >>>
>> >>> >>
>> >>>
>> > _______________________________________________
>> > pgpool-general mailing list
>> > pgpool-general at pgpool.net
>> > http://www.pgpool.net/mailman/listinfo/pgpool-general
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: failover.patch
Type: text/x-patch
Size: 1632 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20230412/8f4f695f/attachment.bin>