[pgpool-hackers: 4220] Possible bugs

Thu Nov 3 23:55:19 JST 2022

Hello Tatsuo

Hope you are doing well.

We are running pgpool in our infrastructure and have so far been satisfied with how it behaves in general failure scenarios, and we wanted to thank you for that. However, we performed some additional stress testing in somewhat more complex scenarios and saw a few differences from the expected behaviour. I would like to bring them to your notice and check whether they are bugs or expected behaviour. :))

  1.  Reads fail while failover is being performed for the replica, even when load balancing is off - Ideally, if load balancing is disabled, the failure of a replica should be completely transparent to the clients. What we observed instead is that when the replica goes down and pgpool is performing the failover, client reads fail during that time.

  2.  Problems with the replica’s storage shouldn’t cause any issues in the working of the cluster if load_balance_mode is off - I checked with you earlier<https://www.sraoss.jp/pipermail/pgpool-hackers/2021-July/003968.html> and you told me that the way the pgpool health check works for PostgreSQL nodes does not always detect whether the storage of the host where PostgreSQL is running is reachable. We then tested the same thing for a replica: we disabled the storage of the replica host. We expected that, with load balancing disabled, the cluster would keep working even if pgpool was unable to detect any problem with the replica's storage. This was not the case; pgpool got stuck.

  3.  Pgpool doesn’t respect the health check time period if a PostgreSQL node is killed - If we stop a PostgreSQL node, either by shutting it down ourselves or with a kill -9 command, pgpool performs a failover instantly and doesn’t wait for the health check period first. If pgpool can always recognise when a PostgreSQL node was shut down by an admin, then the first case is fine, but when the PostgreSQL node is killed with kill -9 we expected pgpool to wait for the timeout period. We are afraid that the lack of this waiting period might trigger a failover in false-positive scenarios (for example, if the PostgreSQL node is still working but there was a temporary, intermittent issue). Can you please clarify whether this is a real risk, or shed some light on when pgpool decides to perform an instantaneous failover instead of waiting for the timeout period?
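
To make the waiting period in point 3 concrete, below is a rough Python sketch of how it could be estimated. The comments name the pgpool.conf health check parameters the fields are meant to mirror; the formula and the example values are assumptions made for illustration, not a description of pgpool's actual logic or of our configuration.

```python
from dataclasses import dataclass


@dataclass
class HealthCheckSettings:
    """Hypothetical container mirroring pgpool.conf health check settings."""
    period_s: int       # health_check_period: seconds between health checks
    timeout_s: int      # health_check_timeout: seconds before one check gives up
    max_retries: int    # health_check_max_retries: retries after the first failure
    retry_delay_s: int  # health_check_retry_delay: seconds between retries


def worst_case_detection_s(cfg: HealthCheckSettings) -> int:
    """Upper bound (assumed model) on the time before a failed node is
    declared down by the health check alone.

    Assumes the node fails right after a successful check, every attempt
    runs into the full timeout, and every retry waits the full delay.
    """
    attempts = cfg.max_retries + 1
    return (
        cfg.period_s
        + attempts * cfg.timeout_s
        + cfg.max_retries * cfg.retry_delay_s
    )


# Illustrative values only.
example = HealthCheckSettings(period_s=10, timeout_s=20, max_retries=3, retry_delay_s=1)
print(worst_case_detection_s(example))  # 10 + 4*20 + 3*1 = 93
```

Under this model, a failover that happens within a second or two of the kill -9 is clearly not being driven by the health check timing, which is what prompted the question.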

I know these scenarios are not something that would happen frequently, but we have already seen some of them in our infrastructure and we want to make sure that there is no data inconsistency or loss if they occur again. Thank you for taking the time to go through these.

Cheers!

Anirudh
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot 2022-11-03 at 3.37.32 PM.png
Type: image/png
Size: 87761 bytes
Desc: Screenshot 2022-11-03 at 3.37.32 PM.png
URL: <http://www.pgpool.net/pipermail/pgpool-hackers/attachments/20221103/aced392e/attachment-0001.png>