[pgpool-hackers: 2545] (no subject)

Tatsuo Ishii ishii at sraoss.co.jp
Thu Sep 21 15:18:27 JST 2017


To test a problem reported in [pgpool-general: 5728], I was running
./startall failover and ./shutdown loop in a cluster created by
pgpool_setup. I found sometimes ./shutdown doesn't finish, and the
cause was pgpool parent waiting for pcp_main finishes. I attached gdb
to the pcp_main process to see what's going on. Here's the backtrace.

(gdb) bt
#0  __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007f78d47eac76 in __tz_convert (timer=0x7f78d4af3ac0 <tzset_lock>, timer at entry=0x7ffd01c8fbe8, 
    use_localtime=use_localtime at entry=1, tp=tp at entry=0x7f78d4af64a0 <_tmbuf>) at tzset.c:616
#2  0x00007f78d47e8941 in __GI_localtime (t=t at entry=0x7ffd01c8fbe8) at localtime.c:39
#3  0x000000000045c386 in log_line_prefix (buf=buf at entry=0x7ffd01c8fcc0, line_prefix=<optimized out>, 
    edata=0x75d9a0 <errordata+384>) at ../../src/utils/error/elog.c:2059
#4  0x000000000045efb5 in send_message_to_server_log (edata=0x75d9a0 <errordata+384>)
    at ../../src/utils/error/elog.c:2084
#5  EmitErrorReport () at ../../src/utils/error/elog.c:1129
#6  0x000000000045d13e in errfinish (dummy=<optimized out>) at ../../src/utils/error/elog.c:434
#7  0x00000000004094d5 in exit_handler (sig=15) at main/pgpool_main.c:1415
#8  <signal handler called>
#9  __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93
#10 0x00007f78d47eac76 in __tz_convert (timer=0x7f78d4af3ac0 <tzset_lock>, timer at entry=0x7ffd01c90428, 
    use_localtime=use_localtime at entry=1, tp=tp at entry=0x7f78d4af64a0 <_tmbuf>) at tzset.c:616
#11 0x00007f78d47e8941 in __GI_localtime (t=t at entry=0x7ffd01c90428) at localtime.c:39
#12 0x000000000045c386 in log_line_prefix (buf=buf at entry=0x7ffd01c90500, line_prefix=<optimized out>, 
    edata=0x75d8e0 <errordata+192>) at ../../src/utils/error/elog.c:2059
#13 0x000000000045efb5 in send_message_to_server_log (edata=0x75d8e0 <errordata+192>)
    at ../../src/utils/error/elog.c:2084
#14 EmitErrorReport () at ../../src/utils/error/elog.c:1129
#15 0x000000000045d13e in errfinish (dummy=<optimized out>) at ../../src/utils/error/elog.c:434
#16 0x00000000004094d5 in exit_handler (sig=2) at main/pgpool_main.c:1415
#17 <signal handler called>
#18 __tzfile_compute (timer=1505947017, use_localtime=use_localtime at entry=1, 
    leap_correct=leap_correct at entry=0x7ffd01c90bb8, leap_hit=leap_hit at entry=0x7ffd01c90bb4, 
    tp=tp at entry=0x7f78d4af64a0 <_tmbuf>) at tzfile.c:648
#19 0x00007f78d47ead7f in __tz_convert (timer=timer at entry=0x7ffd01c90c08, 
    use_localtime=use_localtime at entry=1, tp=tp at entry=0x7f78d4af64a0 <_tmbuf>) at tzset.c:624
#20 0x00007f78d47e8941 in __GI_localtime (t=t at entry=0x7ffd01c90c08) at localtime.c:39
#21 0x000000000045c386 in log_line_prefix (buf=buf at entry=0x7ffd01c90ce0, line_prefix=<optimized out>, 
    edata=0x75d820 <errordata>) at ../../src/utils/error/elog.c:2059
#22 0x000000000045efb5 in send_message_to_server_log (edata=0x75d820 <errordata>)
    at ../../src/utils/error/elog.c:2084
#23 EmitErrorReport () at ../../src/utils/error/elog.c:1129
#24 0x000000000045d13e in errfinish (dummy=<optimized out>) at ../../src/utils/error/elog.c:434
#25 0x0000000000416415 in pcp_main (unix_fd=unix_fd at entry=8, inet_fd=inet_fd at entry=9)
    at pcp_con/pcp_child.c:104
#26 0x000000000040a7b4 in pcp_fork_a_child (unix_fd=8, inet_fd=9, pcp_conf_file=<optimized out>)
    at main/pgpool_main.c:709
#27 0x000000000040d073 in failover () at main/pgpool_main.c:2221
#28 0x000000000040d84e in sigusr1_interrupt_processor () at main/pgpool_main.c:1588
#29 0x000000000040da2d in pool_sleep (second=<optimized out>) at main/pgpool_main.c:2752
#30 0x000000000040fc37 in PgpoolMain (discard_status=discard_status at entry=1 '\001', 
    clear_memcache_oidmaps=clear_memcache_oidmaps at entry=0 '\000') at main/pgpool_main.c:536
#31 0x00000000004081d5 in main (argc=<optimized out>, argv=<optimized out>) at main/main.c:313

It seems failover() in pgpool main forks pcp_main, then it is
immediately interrupted at line 104 (this is Pgpool-II 3.6.6):

	ereport(DEBUG1,
			(errmsg("I am PCP child with pid:%d",getpid())));

then jump to the signal handler: exit_handler in *pgpool main* because
it was interrupted by signal 2 (SIGINT by fast shutdown). In the
signal handler, ereport was called and ereport called tz_convert which
acquired a lock. While the process holding the lock, it was interrupted
again by signal 15 (SIGTERM), which I don't know why. Anyway then the
signal handler called ereport thus tz_convert again, which waited
forever for acquiring a lock because the lock was already acquired. I
saw multiple problems here.

1) pcp main should not call ereport until their own exception handler
   is established. I think we can just move the ereport call after the
   exit handler established. i.e. After this:

	/* We can now handle ereport(ERROR) */
	PG_exception_stack = &local_sigjmp_buf;

2) We should eliminate calling ereport/elog calls in signal handlers
   to prevent this kind of tz_convert race condition.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


More information about the pgpool-hackers mailing list