[pgpool-hackers: 4201] Re: Dynamic spare process management of Pgpool-II children

Bo Peng pengbo at sraoss.co.jp
Tue Oct 18 15:12:56 JST 2022


Hi Usama,

> after applying the patch, I have run regression test and encountered 7
> timeout.
> 
> testing 008.dbredirect...timeout.
> testing 018.detach_primary...timeout.
> testing 033.prefer_lower_standby_delay...timeout.
> testing 034.promote_node...timeout.
> testing 075.detach_primary_left_down_node...timeout.
> testing 076.copy_hang...timeout.
> testing 077.invalid_failover_node...timeout.

I found that some of the tests above failed when running pgpool_setup with 3 nodes.

It seems that after running pcp_recovery_node, all child processes exited
with the message "exited with success and will not be restarted".

It can be reproduced by running the "pgpool_setup -m s -n 3" command.
The log is attached.
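
For reference, a minimal reproduction sketch (the PCP port, pcp options,
and log path below assume pgpool_setup defaults; adjust to your
environment):

$ mkdir /tmp/spare_test && cd /tmp/spare_test
$ pgpool_setup -m s -n 3            # streaming replication mode, 3 nodes
$ ./startall                        # start the PostgreSQL nodes and pgpool
$ pcp_recovery_node -p 11001 -w 1   # trigger online recovery of node 1
$ grep "will not be restarted" log/pgpool.log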

On Wed, 14 Sep 2022 06:48:57 +0900 (JST)
Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> Hi Usama,
> 
> after applying the patch, I have run regression test and encountered 7
> timeout.
> 
> testing 008.dbredirect...timeout.
> testing 018.detach_primary...timeout.
> testing 033.prefer_lower_standby_delay...timeout.
> testing 034.promote_node...timeout.
> testing 075.detach_primary_left_down_node...timeout.
> testing 076.copy_hang...timeout.
> testing 077.invalid_failover_node...timeout.
> 
> > Hi Ishii San,
> > 
> > Please find the rebased version attached.
> > 
> > Best regards
> > Muhammad Usama
> > 
> > On Tue, Sep 13, 2022 at 9:06 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> > 
> >> Hi Usama,
> >>
> >> > Thanks!
> >> >
> >> > I will look into this and get back to you.
> >>
> >> Unfortunately your patch does not apply any more because of a recent
> >> commit. Can you please rebase it?
> > 
> > 
> >> $ git apply ~/dynamic_spare_process_management.diff
> >> error: patch failed: src/main/pgpool_main.c:115
> >> error: src/main/pgpool_main.c: patch does not apply
> >>
> >> > Best regards,
> >> > --
> >> > Tatsuo Ishii
> >> > SRA OSS LLC
> >> > English: http://www.sraoss.co.jp/index_en/
> >> > Japanese:http://www.sraoss.co.jp
> >> >
> >> >> Hi Hackers.
> >> >>
> >> >> A few years back we had a discussion about implementing on-demand
> >> >> child process spawning, and "zhoujianshen at highgo.com" shared a
> >> >> patch for that.
> >> >> Ref:
> >> >> https://www.sraoss.jp/pipermail/pgpool-hackers/2020-September/003831.html
> >> >>
> >> >> The patch had a few issues and open review comments, and somehow or
> >> >> other it never made it to a committable state. So I decided to take
> >> >> it up and rework it.
> >> >>
> >> >> A little background:
> >> >> The motivation behind this feature is that when deciding the value
> >> >> of the num_init_children configuration, the administrator has to
> >> >> figure out the maximum number of concurrent client connections that
> >> >> the setup needs to support, even if that maximum might be hit only
> >> >> once a day or even once a month depending on the type of setup,
> >> >> while 90% of the time only 5-10% of the connections are required.
> >> >> But because Pgpool-II always spawns num_init_children child
> >> >> processes at startup, in such setups a huge number of child
> >> >> processes sit idle most of the time and consume system resources.
> >> >> This approach is suboptimal in terms of system resource usage, and
> >> >> in some cases it also causes problems like the 'thundering herd'
> >> >> (although we do have serialize_accept to get around that).
> >> >>
> >> >> So the idea is to keep the number of spare child processes
> >> >> (processes sitting idle in the 'waiting for connection' state)
> >> >> within the configured limits, and to scale the number of child
> >> >> processes up or down depending on the connected client count.
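> >> >>
> >> >> Roughly, the decision the main process makes looks like this (the
> >> >> function and helper names here are illustrative, not the actual
> >> >> identifiers in the patch):
> >> >>
> >> >> extern int num_init_children;     /* existing config: hard limit */
> >> >> extern int min_spare_children;    /* new config */
> >> >> extern int max_spare_children;    /* new config */
> >> >> extern void fork_new_children(int n);      /* hypothetical helper */
> >> >> extern void retire_idle_children(int n);   /* hypothetical helper */
> >> >>
> >> >> /* Periodic check run by the main process. */
> >> >> static void
> >> >> adjust_spare_processes(int total_children, int connected_children)
> >> >> {
> >> >>     /* children idle in "waiting for connection" state */
> >> >>     int spares = total_children - connected_children;
> >> >>
> >> >>     if (spares < min_spare_children && total_children < num_init_children)
> >> >>     {
> >> >>         /* scale up, but never beyond num_init_children */
> >> >>         int want = min_spare_children - spares;
> >> >>
> >> >>         if (want > num_init_children - total_children)
> >> >>             want = num_init_children - total_children;
> >> >>         fork_new_children(want);
> >> >>     }
> >> >>     else if (spares > max_spare_children)
> >> >>     {
> >> >>         /* scale down per the configured strategy */
> >> >>         retire_idle_children(spares - max_spare_children);
> >> >>     }
> >> >> }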
> >> >>
> >> >> Attached re-worked patch:
> >> >> The original patch on the topic had a few shortcomings, but my
> >> >> biggest concern was the approach it used to scale down the child
> >> >> processes. IMHO the patch was too aggressive in bringing down child
> >> >> processes once the spare count exceeded max_spare_child, and the
> >> >> victim process identification was not smart. Secondly, the
> >> >> responsibility for managing the spare children was not properly
> >> >> segregated and was shared between the main and child processes.
> >> >>
> >> >> So I took up the patch and basically redesigned it from the ground
> >> >> up. The attached version of the patch gives the responsibility for
> >> >> keeping track of spare processes and scaling them to the main
> >> >> process, and also implements three strategies for scaling down. On
> >> >> top of that it adds a switch that can be used to turn off this auto
> >> >> scaling feature and bring back the current behaviour.
> >> >>
> >> >> Moreover, instead of adding a new configuration parameter
> >> >> (max_children, as in the original patch), the attached one uses the
> >> >> existing num_init_children config to keep backward compatibility.
> >> >>
> >> >> To summarise, the patch adds the following new config parameters to
> >> >> control the process scaling (a sample pgpool.conf sketch follows
> >> >> the list):
> >> >>
> >> >> -- process_management_mode (default = static)
> >> >> Can be set to either static or dynamic. static keeps the current
> >> >> behaviour, while dynamic enables the auto scaling of spare
> >> >> processes.
> >> >>
> >> >> -- process_management_strategy (default = gentle)
> >> >> Configures the process management strategy used to satisfy the
> >> >> spare process count.
> >> >> Valid options:
> >> >> lazy:
> >> >> In this mode the scale down is performed gradually
> >> >> and only gets triggered when the excessive spare process count
> >> >> remains high for more than 5 minutes.
> >> >> gentle:
> >> >> In this mode the scale down is performed gradually
> >> >> and only gets triggered when the excessive spare process count
> >> >> remains high for more than 2 minutes.
> >> >> aggressive:
> >> >> In this mode the scale down is performed aggressively
> >> >> and gets triggered more frequently when the spare process count is
> >> >> high. This mode uses faster but slightly less smart process
> >> >> selection criteria to identify the child processes that can be
> >> >> shut down to satisfy max_spare_children.
> >> >>
> >> >> -- min_spare_children
> >> >> Minimum number of spare child processes to keep in the 'waiting for
> >> >> connection' state.
> >> >> This works only in the dynamic process management mode.
> >> >>
> >> >> -- max_spare_children
> >> >> Maximum number of spare child processes to keep in the 'waiting for
> >> >> connection' state.
> >> >> This works only in the dynamic process management mode.
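> >> >>
> >> >> For illustration, a pgpool.conf sketch using the new parameters
> >> >> (the values are arbitrary examples, not recommendations):
> >> >>
> >> >> # pgpool.conf excerpt (hypothetical example values)
> >> >> num_init_children           = 200      # hard upper limit on children
> >> >> process_management_mode     = dynamic  # enable auto scaling
> >> >> process_management_strategy = gentle   # lazy | gentle | aggressive
> >> >> min_spare_children          = 5        # scale up below this idle count
> >> >> max_spare_children          = 20       # scale down above this idle count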
> >> >>
> >> >> Furthermore, the patch relies on the existing conn_counter to keep
> >> >> track of the connected children count, which means it does not add
> >> >> any additional overhead for computing that information. The
> >> >> documentation updates are still not part of the patch; I will add
> >> >> those once we have an agreement on the approach and usability of
> >> >> the feature.
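> >> >>
> >> >> A conceptual sketch of that bookkeeping (the struct, array, and
> >> >> function names below are illustrative stand-ins, not the actual
> >> >> pgpool-II shared-memory structures the patch touches):
> >> >>
> >> >> #include <sys/types.h>          /* pid_t */
> >> >>
> >> >> /* Hypothetical per-child slot kept in shared memory. */
> >> >> typedef struct ChildSlot
> >> >> {
> >> >>     pid_t pid;                  /* 0 if the slot is unused */
> >> >>     int   conn_counter;         /* > 0 while serving a client */
> >> >> } ChildSlot;
> >> >>
> >> >> extern ChildSlot *child_slots;  /* illustrative shared array */
> >> >> extern int        child_slot_count;
> >> >>
> >> >> /* Count children idle in "waiting for connection" state using the
> >> >>  * counters that are maintained anyway, so no extra overhead. */
> >> >> static int
> >> >> count_spare_children(void)
> >> >> {
> >> >>     int spares = 0;
> >> >>
> >> >>     for (int i = 0; i < child_slot_count; i++)
> >> >>     {
> >> >>         if (child_slots[i].pid != 0 && child_slots[i].conn_counter == 0)
> >> >>             spares++;
> >> >>     }
> >> >>     return spares;
> >> >> }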
> >> >>
> >> >>
> >> >> Meanwhile, I am trying to figure out a way to benchmark whether
> >> >> this feature adds any performance benefits, but I haven't been able
> >> >> to work out how to do that yet. So any suggestions on this topic
> >> >> are welcome.
> >> >>
> >> >> Thanks
> >> >> Best regards
> >> >> Muhammad Usama


-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS LLC
https://www.sraoss.co.jp/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool.log
Type: application/octet-stream
Size: 21872 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-hackers/attachments/20221018/6efc6f4a/attachment-0001.obj>

