[pgpool-hackers: 4221] Re: Possible bugs

Fri Nov 4 17:44:47 JST 2022

Hi,

> Hello Tatsuo
> 
> Hope you are doing well.
> 
> We are running pgpool in our infrastructure and have been so far satisfied with the performance in general failure scenarios and we wanted to thank you for that. But we performed some additional stress testing in bit more complex scenarios and saw a few differences from the expected behaviour. I would like to bring them to your notice and check if these are bugs or as expected features. :))
> 
>   1.  Reads fail while failover is being performed for the replica even when the load balancing is off - Ideally, if the load balancing is disabled, the failure of a replica should be completely transparent to the clients. But what we observed is when the replica goes down and pgpool is performing a failover, the client reads fail at that time.

This happens because pgpool currently connects to the replica even if
load_balance_mode is off. Maybe we could avoid connecting to the
replica when load_balance_mode is off, but this would make pgpool's
code more complicated. I am not sure this is worth the trouble, since
most pgpool users enable load_balance_mode.

>   2.  Problems with replica’s storage shouldn’t cause any issues in the working of the cluster if load_balance_mode is off - I checked with you earlier<https://www.sraoss.jp/pipermail/pgpool-hackers/2021-July/003968.html> and you told me that the way pgpool health check works for PostgreSQL nodes does not always ensure if the storage of the host where PostgreSQL is running is reachable or not. But then, we tested the same thing for a replica - we disabled the storage of the replica host and we expected that when the load balancing is disabled, even if pgpool isn’t able to detect any problems with the replica storage, the cluster should still work fine. But this was not the case, pgpool got stuck.

Probably disabling the storage somehow affects the connection between
pgpool and PostgreSQL. I cannot say more without details.

>   3.  Pgpool doesn’t respect the health check time period if PostgreSQL node is killed - If we kill a PostgreSQL node by shutting it down ourselves or even by a kill -9 command, pgpool performs a failover instantly and doesn’t wait for the health check period before performing the failover. If pgpool can always recognise when a PostgreSQL node was shut down by admins, then the first case is still okay, but if the PostgreSQL node is killed by kill -9 command, we expected that pgpool would wait for the timeout period. We are afraid that the lack of this waiting period might trigger a failover in false positive scenarios (if the PostgreSQL node is still working but there was some temporary intermittent issue). Can you please clarify if this is a real risk or throw some light on when pgpool decides to perform an instantaneous failover instead of waiting for the timeout period?

Please upgrade to Pgpool-II 4.3, which allows you to disable the
instantaneous failover when a backend goes down by using the new
parameter "failover_on_backend_shutdown".
See:
https://www.pgpool.net/docs/latest/en/html/runtime-config-failover.html
for more details.
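
As a rough sketch only, a pgpool.conf fragment along these lines would
leave failure detection to the health check instead of failing over
immediately. The health check values are illustrative assumptions, not
recommendations, and whether you also want to turn off
failover_on_backend_error (which covers broken connections, such as
after kill -9) depends on your environment, so please check the
documentation above before using it.

```
# Pgpool-II 4.3 or later. All values below are illustrative assumptions.

# Do not trigger failover just because a session sees a backend shutdown.
failover_on_backend_shutdown = off

# Assumption: also do not fail over on backend socket errors;
# rely on the health check to detect a dead node instead.
failover_on_backend_error = off

# Health check decides when a node is really down.
health_check_period = 10        # seconds between checks (0 disables)
health_check_timeout = 20       # seconds before a check gives up
health_check_max_retries = 3    # retries before the node is marked down
health_check_retry_delay = 5    # seconds between retries
```

With failover on errors turned off, the health check should be enabled,
otherwise a dead node will not be detached.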

> I know some of these scenarios are not something that would happen frequently but some of these we have already seen in our infrastructure and we just want to make sure that there are no data inconsistencies or loss if they occur again. Thank you for taking the time to go through these.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp