[pgpool-hackers: 3792] Re: pg_terminate_backend() does not work in native replication mode

Tue Aug 25 20:19:52 JST 2020

Hi Usama,

I lean toward the second solution. The third solution sounds ideal but
actually we have exactly the same problem with pg_cancel_backend()
too, and probably we have more. So instead creating new pgpool local
functions each time we find problem, let's explain the issue in the
documentation as more general problem descripton/soltion.

> Hi Ishii-San
> 
> I have been thinking about this issue in the past days and unfortunately
> couldn't think of a good solution.
> 
> As you mentioned in native replication mode and with snapshot isolation
> as well almost all queries are sent to multiple PostgreSQL backends and
> even when we issue pg_terminate_backend for one backend process
> the Pgpool-II keeps on waiting for the other backends to finish their tasks.
> Now there could be few solutions to this problem but each has its own
> drawback.
> 
> The first solution would be to identify all the relevant backend PIDs
> and issue individual pg_terminate_backend function to each backend
> node after substituting the appropriate PID argument to the function.
> But the problem with this solution is not only this is complex to
> implement but also has some fundamental differences from the
> actual functionality provided by the pg_terminate_backend(),
> Since the pg_terminate_backend is supposed to kill a single troubling
> backend process that was identified by the user from pg_stat_activity or any
> other similar means so the user might not expect the function to influence
> or kill some other backend on a different PG server. SO that might
> come as a surprise for some users.
> 
> The second solution is to clearly mention this behavior in the
> documentation and informs the users about how to proceed
> with pg_terminate_backend in native replication and snapshot
> isolation modes.
> i.e.
> step1: get the PIDs of all the backend process related to the particular
> pgpool child process
> step2: issue pg_terminate_backend for each PID individually.
> 
> The third solution I can think of is to add a new pgpool-II local
> function for the purpose.
> For example, we can add a new function in Pgpool-II say
> pool_terminate_all_backends(pgpool-child-pid) and that new function
> issues pg_terminate_backed() from within for each backend process.
> 
> What are your thoughts on these possible solutions and on a way forward?
> 
> Thanks
> 
> 
> On Thu, Aug 20, 2020 at 11:01 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> Another fundamental problem with pg_terminate_backend() with native
>> replication mode (and snapshot isolation mode) is, pgpool needs to
>> send pg_termnaite_backend() with different argument because the
>> argument is a process id, which is not incosistent among backends.
>>
>> > Hi Usma,
>> >
>> > While looking into the 073.pg_terminate_backend test failure I found
>> > interesting issue.
>> >
>> > Supoose we execute following SQL in native replication mode:
>> >
>> > ssesion 1: select pg_sleep(60); /* at time 't1' */
>> >
>> > session 2: select pg_terminate_backend('7615');       /* at time 't2' */
>> >
>> > The pg_sleep() should be canceled at time t2, but actually it is
>> > canceled at t2 + 60 seconds. Also after the cancel we get:
>> >
>> > WARNING:  packet kind of backend 1 ['D'] does not match with
>> master/majority nodes packet kind ['E']
>> > WARNING:  write on backend 0 failed with error :"Success"
>> > DETAIL:  while trying to write data from offset: 0 wlen: 5
>> > FATAL:  failed to read kind from backend
>> > DETAIL:  kind mismatch among backends. Possible last query was: "select
>> pg_sleep(60);" kind details are: 0[E: terminating connection due to
>> administrator command] 1[D]
>> > HINT:  check data consistency among db nodes
>> >
>> > What actually happening here is:
>> >
>> > 2020-08-20 13:01:46: psql pid 7603: LOG:  DB node id: 0 backend pid:
>> 7615 statement: BEGIN
>> > 2020-08-20 13:01:46: psql pid 7603: LOG:  DB node id: 1 backend pid:
>> 7616 statement: BEGIN
>> > 2020-08-20 13:01:46: psql pid 7603: LOG:  DB node id: 0 backend pid:
>> 7615 statement: select pg_sleep(60); <-- pgpool 7603 waiting for response
>> from backend 0.
>> > 2020-08-20 13:02:06: psql pid 7598: LOG:  DB node id: 0 backend pid:
>> 7632 statement: SELECT version()
>> > 2020-08-20 13:02:06: psql pid 7598: LOG:  DB node id: 0 backend pid:
>> 7632 statement: SELECT count(*) FROM pg_catalog.pg_proc AS p,
>> pg_catalog.pg_namespace AS n WHERE p.proname = 'pg_terminate_backend' AND
>> n.oid = p.pronamespace AND n.nspname ~ '.*' AND p.provolatile = 'v'
>> > 2020-08-20 13:02:06: psql pid 7598: LOG:  found the pg_terminate_backend
>> request for backend pid:7615 on backend node:0
>> > 2020-08-20 13:02:06: psql pid 7598: DETAIL:  setting the connection flag
>> > 2020-08-20 13:02:06: psql pid 7598: LOG:  DB node id: 0 backend pid:
>> 7632 statement: select pg_terminate_backend(7615);
>> > 2020-08-20 13:02:06: psql pid 7603: LOG:  DB node id: 1 backend pid:
>> 7616 statement: select pg_sleep(60);  <--- pgpool 7603 got response because
>> pg_terminate_backend executed. pgpool 7603 started to wait for response
>> from backend 1.
>> > 2020-08-20 13:03:06: psql pid 7603: WARNING:  packet kind of backend 1
>> ['D'] does not match with master/majority nodes packet kind ['E'] <-- after
>> 60 seconds passed, pgpool 7603 got response from bacnend 0 and 1. <-- since
>> backend 0 got error while backend 1 sucessfully executed pg_sleep(60),
>> there were difference in packet kind.
>> > 2020-08-20 13:03:06: psql pid 7603: FATAL:  failed to read kind from
>> backend <-- and pgpool get angry!
>> > 2020-08-20 13:03:06: psql pid 7603: DETAIL:  kind mismatch among
>> backends. Possible last query was: "select pg_sleep(60);" kind details are:
>> 0[E: terminating connection due to administrator command] 1[D]
>> > 2020-08-20 13:03:06: psql pid 7603: HINT:  check data consistency among
>> db nodes
>> > 2020-08-20 13:03:06: psql pid 7603: WARNING:  write on backend 0 failed
>> with error :"Success"
>> > 2020-08-20 13:03:06: psql pid 7603: DETAIL:  while trying to write data
>> from offset: 0 wlen: 5
>> > 2020-08-20 13:03:06: main pid 7572: LOG:  child process with pid: 7603
>> exits with status 512
>> >
>> >
>> > Any idea how to deal with this problem?
>> >
>> > Best regards,
>> > --
>> > Tatsuo Ishii
>> > SRA OSS, Inc. Japan
>> > English: http://www.sraoss.co.jp/index_en.php
>> > Japanese:http://www.sraoss.co.jp
>> > _______________________________________________
>> > pgpool-hackers mailing list
>> > pgpool-hackers at pgpool.net
>> > http://www.pgpool.net/mailman/listinfo/pgpool-hackers
>>
> 
> 
> -- 
> ...
> Muhammad Usama
> Highgo Software (Canada/China/Pakistan)
> URL : http://www.highgo.ca
> ADDR: 10318 WHALLEY BLVD, Surrey, BC