binutils-gdb/gdb/testsuite/gdb.base/valgrind-infcall.exp

135 lines
3.7 KiB
Plaintext
Raw Normal View History

# Copyright 2012-2017 Free Software Foundation, Inc.
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
if [is_remote target] {
# The test always runs locally.
return 0
}
standard_testfile .c
if {[build_executable $testfile.exp $testfile $srcfile {debug}] == -1} {
return -1
}
set test "spawn valgrind"
set cmd "valgrind --vgdb-error=0 $binfile"
Remove superfluous semicolons from testsuite throughout. A few months ago semicolons after "return" were removed throughout the testsuite. However, as I pointed out in review, they're unnecessary not just after "return", but pretty much after any tcl command. ';' is the command separator, and you only need it if there's another command on the same line afterwards. This patch was written by running: $ find . -name "*.exp" | xargs grep -l ";\s*$" | xargs sed -i 's/\([^#][^\s*;]*\)\s*;\s*$/\1/' and then undoing changes to comments, and lib/future.exp. Tested on x86_64 Fedora 17. gdb/testsuite/ 2013-06-07 Pedro Alves <palves@redhat.com> * boards/native-extended-gdbserver.exp: Remove semicolon. * config/arm-ice.exp: Likewise. * config/bfin.exp: Likewise. * config/cygmon.exp: Likewise. * config/h8300.exp: Likewise. * config/monitor.exp: Likewise. * config/sid.exp: Likewise. * config/sim.exp: Likewise. * config/slite.exp: Likewise. * config/vx.exp: Likewise. * gdb.arch/i386-bp_permanent.exp: Likewise. * gdb.asm/asm-source.exp: Likewise. * gdb.base/args.exp: Likewise. * gdb.base/attach-pie-misread.exp: Likewise. * gdb.base/auxv.exp: Likewise. * gdb.base/bigcore.exp: Likewise. * gdb.base/bitfields2.exp: Likewise. * gdb.base/bitfields.exp: Likewise. * gdb.base/break.exp: Likewise. * gdb.base/break-interp.exp: Likewise. * gdb.base/callfuncs.exp: Likewise. * gdb.base/call-sc.exp: Likewise. * gdb.base/commands.exp: Likewise. * gdb.base/corefile.exp: Likewise. * gdb.base/dbx.exp: Likewise. * gdb.base/ending-run.exp: Likewise. * gdb.base/exprs.exp: Likewise. * gdb.base/funcargs.exp: Likewise. * gdb.base/hbreak2.exp: Likewise. * gdb.base/huge.exp: Likewise. * gdb.base/list.exp: Likewise. * gdb.base/memattr.exp: Likewise. * gdb.base/overlays.exp: Likewise. * gdb.base/printcmds.exp: Likewise. * gdb.base/recurse.exp: Likewise. * gdb.base/remotetimeout.exp: Likewise. * gdb.base/reread.exp: Likewise. * gdb.base/savedregs.exp: Likewise. * gdb.base/scope.exp: Likewise. * gdb.base/sepdebug.exp: Likewise. * gdb.base/setshow.exp: Likewise. * gdb.base/setvar.exp: Likewise. * gdb.base/sigaltstack.exp: Likewise. * gdb.base/siginfo-addr.exp: Likewise. * gdb.base/siginfo.exp: Likewise. * gdb.base/siginfo-obj.exp: Likewise. * gdb.base/sigrepeat.exp: Likewise. * gdb.base/sigstep.exp: Likewise. * gdb.base/structs.exp: Likewise. * gdb.base/testenv.exp: Likewise. * gdb.base/twice.exp: Likewise. * gdb.base/valgrind-db-attach.exp: Likewise. * gdb.base/valgrind-infcall.exp: Likewise. * gdb.base/varargs.exp: Likewise. * gdb.base/watchpoint.exp: Likewise. * gdb.cp/gdb1355.exp: Likewise. * gdb.cp/misc.exp: Likewise. * gdb.disasm/hppa.exp: Likewise. * gdb.disasm/t01_mov.exp: Likewise. * gdb.disasm/t02_mova.exp: Likewise. * gdb.disasm/t03_add.exp: Likewise. * gdb.disasm/t04_sub.exp: Likewise. * gdb.disasm/t05_cmp.exp: Likewise. * gdb.disasm/t06_ari2.exp: Likewise. * gdb.disasm/t07_ari3.exp: Likewise. * gdb.disasm/t08_or.exp: Likewise. * gdb.disasm/t09_xor.exp: Likewise. * gdb.disasm/t10_and.exp: Likewise. * gdb.disasm/t11_logs.exp: Likewise. * gdb.disasm/t12_bit.exp: Likewise. * gdb.disasm/t13_otr.exp: Likewise. * gdb.gdb/selftest.exp: Likewise. * gdb.hp/gdb.base-hp/callfwmall.exp: Likewise. * gdb.mi/mi-reverse.exp: Likewise. * gdb.pascal/floats.exp: Likewise. * gdb.python/py-inferior.exp: Likewise. * gdb.threads/attach-into-signal.exp: Likewise. * gdb.threads/pthreads.exp: Likewise. * gdb.threads/thread_events.exp: Likewise. * gdb.threads/watchthreads.exp: Likewise. * gdb.trace/actions-changed.exp: Likewise. * gdb.trace/actions.exp: Likewise. * gdb.trace/ax.exp: Likewise. * gdb.trace/backtrace.exp: Likewise. * gdb.trace/change-loc.exp: Likewise. * gdb.trace/deltrace.exp: Likewise. * gdb.trace/disconnected-tracing.exp: Likewise. * gdb.trace/ftrace.exp: Likewise. * gdb.trace/infotrace.exp: Likewise. * gdb.trace/passc-dyn.exp: Likewise. * gdb.trace/passcount.exp: Likewise. * gdb.trace/pending.exp: Likewise. * gdb.trace/qtro.exp: Likewise. * gdb.trace/range-stepping.exp: Likewise. * gdb.trace/report.exp: Likewise. * gdb.trace/save-trace.exp: Likewise. * gdb.trace/status-stop.exp: Likewise. * gdb.trace/strace.exp: Likewise. * gdb.trace/tfile.exp: Likewise. * gdb.trace/tfind.exp: Likewise. * gdb.trace/trace-break.exp: Likewise. * gdb.trace/tracecmd.exp: Likewise. * gdb.trace/trace-mt.exp: Likewise. * gdb.trace/tspeed.exp: Likewise. * gdb.trace/tsv.exp: Likewise. * gdb.trace/while-stepping.exp: Likewise. * lib/gdb.exp: Likewise. * lib/gdbserver-support.exp: Likewise. * lib/java.exp: Likewise. * lib/mi-support.exp: Likewise. * lib/pascal.exp: Likewise. * lib/prompt.exp: Likewise. * lib/trace-support.exp: Likewise.
2013-06-07 19:31:09 +02:00
set res [remote_spawn host $cmd]
if { $res < 0 || $res == "" } {
verbose -log "Spawning $cmd failed."
unsupported $test
return -1
}
pass $test
# Declare GDB now as running.
Introduce gdb_interact in testsuite gdb_interact is a small utility that we have found quite useful to debug test cases. Putting gdb_interact in a test suspends it and allows to interact with gdb to inspect whatever you want. You can then type ">>>" to resume the test execution. Of course, this is only for gdb devs. It wouldn't make sense to leave a gdb_interact permanently in a test case. When starting the interaction with the user, the script prints this banner: +------------------------------------------+ | Script interrupted, you can now interact | | with by gdb. Type >>> to continue. | +------------------------------------------+ Notes: * When gdb is launched, the gdb_spawn_id variable (lib/gdb.exp) is assigned -1. Given the name, I would expect it to contain the gdb expect spawn id, which is needed for interact. I changed all places that set gdb_spawn_id to -1 to set it to the actual gdb spawn id instead. * When entering the "interact" mode, the last (gdb) prompt is already eaten by expect, so it doesn't show up on the terminal. Subsequent prompts do appear though. We tried to print "(gdb)" just before the interact to replace it. However, it could be misleading if you are debugging an MI test case, it makes you think that you are typing in a CLI prompt, when in reality it's MI. In the end I decided that since the feature is for developers who know what they're doing and that one is normally consciously using gdb_interact, the script doesn't need to babysit the user. * There are probably some quirks depending on where in the script gdb_interact appears (e.g. it could interfere with following commands and make them fail), but it works for most cases. Quirks can always be fixed later. The idea and original implementation was contributed by Anders Granlund, a colleague of mine. Thanks to him. gdb/testsuite/ChangeLog: * gdb.base/statistics.exp: Assign spawn id to gdb_spawn_id. * gdb.base/valgrind-db-attach.exp: Same. * gdb.base/valgrind-infcall.exp: Same. * lib/mi-support.exp (default_mi_gdb_start): Same. * lib/prompt.exp (default_prompt_gdb_start): Same. * lib/gdb.exp (default_gdb_spawn): Same. (gdb_interact): New.
2015-01-22 20:33:04 +01:00
set gdb_spawn_id $res
# GDB started by vgdb stops already after the startup is executed, like with
# non-extended gdbserver. It is also not correct to run/attach the inferior.
set use_gdb_stub 1
set test "valgrind started"
# The trailing '.' differs for different memcheck versions.
gdb_test_multiple "" $test {
-re "Memcheck, a memory error detector\\.?\r\n" {
pass $test
}
-re "valgrind: failed to start tool 'memcheck' for platform '.*': No such file or directory" {
unsupported $test
return -1
}
-re "valgrind: wrong ELF executable class" {
unsupported $test
return -1
}
-re "command not found" {
# The spawn succeeded, but then valgrind was not found - e.g. if
# we spawned SSH to a remote system.
unsupported $test
return -1
}
-re "valgrind: Bad option.*--vgdb-error=0" {
# valgrind is not >= 3.7.0.
unsupported $test
return -1
}
}
set test "vgdb prompt"
# The trailing '.' differs for different memcheck versions.
gdb_test_multiple "" $test {
-re " (target remote | \[^\r\n\]*/vgdb \[^\r\n\]*)\r\n" {
set vgdbcmd $expect_out(1,string)
pass $test
}
}
# Do not kill valgrind.
testsuite: tcl exec& -> 'kill -9 $pid' is racy (attach-many-short-lived-thread.exp races and others) The buildbots show that attach-many-short-lived-thread.exp is racy. But after staring at debug logs and playing with SystemTap scripts for a (long) while, I figured out that neither GDB, nor the kernel nor the test's program itself are at fault. The problem is simply that the testsuite machinery is currently subject to PID-reuse races. The attach-many-short-lived-threads.c test program just happens to be much more susceptible to trigger this race because threads and processes share the same number space on Linux, and the test spawns many many short lived threads in succession, thus enlarging the race window a lot. Part of the problem is that several tests spawn processes with "exec&" (in order to test the "attach" command) , and then at the end of the test, to make sure things are cleaned up, issue a 'remote_spawn "kill -p $testpid"'. Since with tcl's "exec&", tcl itself is responsible for reaping the process's exit status, when we go kill the process, testpid may have already exited _and_ its status may have (and often has) been reaped already. Thus it can happen that another process meanwhile reuses $testpid, and that "kill" command kills the wrong process... Frequently, that happens to be attach-many-short-lived-thread, but this explains other test's races as well. In the attach-many-short-lived-threads test, it sometimes manifests like this: (gdb) file /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads Reading symbols from /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads...done. (gdb) Loaded /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads into /home/pedro/gdb/mygit/build/gdb/testsuite/../../gdb/gdb attach 5940 Attaching to program: /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads, process 5940 warning: process 5940 is a zombie - the process has already terminated ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ptrace: Operation not permitted. (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: attach info threads No threads. (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: no new threads set breakpoint always-inserted on (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: set breakpoint always-inserted on Other times the process dies while the test is ongoing (the process is ptrace-stopped): (gdb) print again = 1 Cannot access memory at address 0x6020cc (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: reset timer in the inferior (Recall that on Linux, SIGKILL is not interceptable) And other times it dies just while we're detaching: $4 = 319 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 2: print seconds_left detach Can't detach Thread 0x7fb13b7de700 (LWP 1842): No such process (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: detach GDB mishandles the latter (it should ignore ESRCH while detaching just like when continuing), but that's another story. The fix here is to change spawn_wait_for_attach to use Expect's 'spawn' command instead of Tcl's 'exec&' to spawn programs, because with spawn we control when to wait for/reap the process. That allows killing the process by PID without being subject to pid-reuse races, because even if the process is already dead, the kernel won't reuse the process's PID until the zombie is reaped. The other part of the problem lies in DejaGnu itself, unfortunately. I have occasionally seen tests (attach-many-short-lived-threads included, but not only that one) die with a random inexplicable SIGTERM too, and that too is caused by the same reason, except that in that case, the rogue SIGTERM is sent from this bit in DejaGnu's remote.exp: exec sh -c "exec > /dev/null 2>&1 && (kill -2 $pgid || kill -2 $pid) && sleep 5 && (kill $pgid || kill $pid) && sleep 5 && (kill -9 $pgid || kill -9 $pid) &" ... catch "wait -i $shell_id" Even if the program exits promptly, that whole cascade of kills carries on in the background, thus potentially killing the poor process that manages to reuse $pid... I sent a fix for that to the DejaGnu list: http://lists.gnu.org/archive/html/dejagnu/2015-07/msg00000.html With both patches in place, I haven't seen attach-many-short-lived-threads.exp fail again. Tested on x86_64 Fedora 20, native, gdbserver and extended-gdbserver. gdb/testsuite/ChangeLog: 2015-07-31 Pedro Alves <palves@redhat.com> * gdb.base/attach-pie-misread.exp: Rename $res to $test_spawn_id. Use spawn_id_get_pid. Wait for spawn id after eof. Use kill_wait_spawned_process instead of explicit "kill -9". * gdb.base/attach-pie-noexec.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/attach-twice.exp: Likewise. * gdb.base/attach.exp: Likewise. (do_command_attach_tests): Use gdb_spawn_with_cmdline_opts and gdb_test_multiple. * gdb.base/solib-overlap.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/valgrind-infcall.exp: Likewise. * gdb.multi/multi-attach.exp: Likewise. * gdb.python/py-prompt.exp: Likewise. * gdb.python/py-sync-interp.exp: Likewise. * gdb.server/ext-attach.exp: Likewise. * gdb.threads/attach-into-signal.exp (corefunc): Use spawn_wait_for_attach, spawn_id_get_pid and kill_wait_spawned_process. * gdb.threads/attach-many-short-lived-threads.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.threads/attach-stopped.exp (corefunc): Use spawn_wait_for_attach, spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/break-interp.exp: Rename $res to $test_spawn_id. Use spawn_id_get_pid. Wait for spawn id after eof. Use kill_wait_spawned_process instead of explicit "kill -9". * lib/gdb.exp (can_spawn_for_attach): Adjust comment. (kill_wait_spawned_process, spawn_id_get_pid): New procedures. (spawn_wait_for_attach): Use spawn instead of exec to spawn processes. Don't map cygwin/windows pids here. Now returns a spawn id list.
2015-07-31 21:06:24 +02:00
set valgrind_spawn_id [board_info host fileid]
unset gdb_spawn_id
set board [host_info name]
unset_board_info fileid
clean_restart $testfile
# Make sure we're disconnected, in case we're testing with the
# native-extended-gdbserver board, where gdb_start/gdb_load spawn
# gdbserver and connect to it.
gdb_test "disconnect" ".*"
gdb_test "$vgdbcmd" " in \\.?_start .*" "target remote for vgdb"
gdb_test "monitor v.set gdb_output" "valgrind output will go to gdb.*"
set continue_count 1
set loop 1
while {$loop && $continue_count < 100} {
set test "continue #$continue_count"
gdb_test_multiple "continue" "" {
-re "Invalid free\\(\\).*: main .*\r\n$gdb_prompt $" {
pass $test
set loop 0
}
-re "Remote connection closed.*\r\n$gdb_prompt $" {
fail "$test (remote connection closed)"
# Only if valgrind got stuck.
testsuite: tcl exec& -> 'kill -9 $pid' is racy (attach-many-short-lived-thread.exp races and others) The buildbots show that attach-many-short-lived-thread.exp is racy. But after staring at debug logs and playing with SystemTap scripts for a (long) while, I figured out that neither GDB, nor the kernel nor the test's program itself are at fault. The problem is simply that the testsuite machinery is currently subject to PID-reuse races. The attach-many-short-lived-threads.c test program just happens to be much more susceptible to trigger this race because threads and processes share the same number space on Linux, and the test spawns many many short lived threads in succession, thus enlarging the race window a lot. Part of the problem is that several tests spawn processes with "exec&" (in order to test the "attach" command) , and then at the end of the test, to make sure things are cleaned up, issue a 'remote_spawn "kill -p $testpid"'. Since with tcl's "exec&", tcl itself is responsible for reaping the process's exit status, when we go kill the process, testpid may have already exited _and_ its status may have (and often has) been reaped already. Thus it can happen that another process meanwhile reuses $testpid, and that "kill" command kills the wrong process... Frequently, that happens to be attach-many-short-lived-thread, but this explains other test's races as well. In the attach-many-short-lived-threads test, it sometimes manifests like this: (gdb) file /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads Reading symbols from /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads...done. (gdb) Loaded /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads into /home/pedro/gdb/mygit/build/gdb/testsuite/../../gdb/gdb attach 5940 Attaching to program: /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads, process 5940 warning: process 5940 is a zombie - the process has already terminated ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ptrace: Operation not permitted. (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: attach info threads No threads. (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: no new threads set breakpoint always-inserted on (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: set breakpoint always-inserted on Other times the process dies while the test is ongoing (the process is ptrace-stopped): (gdb) print again = 1 Cannot access memory at address 0x6020cc (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: reset timer in the inferior (Recall that on Linux, SIGKILL is not interceptable) And other times it dies just while we're detaching: $4 = 319 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 2: print seconds_left detach Can't detach Thread 0x7fb13b7de700 (LWP 1842): No such process (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: detach GDB mishandles the latter (it should ignore ESRCH while detaching just like when continuing), but that's another story. The fix here is to change spawn_wait_for_attach to use Expect's 'spawn' command instead of Tcl's 'exec&' to spawn programs, because with spawn we control when to wait for/reap the process. That allows killing the process by PID without being subject to pid-reuse races, because even if the process is already dead, the kernel won't reuse the process's PID until the zombie is reaped. The other part of the problem lies in DejaGnu itself, unfortunately. I have occasionally seen tests (attach-many-short-lived-threads included, but not only that one) die with a random inexplicable SIGTERM too, and that too is caused by the same reason, except that in that case, the rogue SIGTERM is sent from this bit in DejaGnu's remote.exp: exec sh -c "exec > /dev/null 2>&1 && (kill -2 $pgid || kill -2 $pid) && sleep 5 && (kill $pgid || kill $pid) && sleep 5 && (kill -9 $pgid || kill -9 $pid) &" ... catch "wait -i $shell_id" Even if the program exits promptly, that whole cascade of kills carries on in the background, thus potentially killing the poor process that manages to reuse $pid... I sent a fix for that to the DejaGnu list: http://lists.gnu.org/archive/html/dejagnu/2015-07/msg00000.html With both patches in place, I haven't seen attach-many-short-lived-threads.exp fail again. Tested on x86_64 Fedora 20, native, gdbserver and extended-gdbserver. gdb/testsuite/ChangeLog: 2015-07-31 Pedro Alves <palves@redhat.com> * gdb.base/attach-pie-misread.exp: Rename $res to $test_spawn_id. Use spawn_id_get_pid. Wait for spawn id after eof. Use kill_wait_spawned_process instead of explicit "kill -9". * gdb.base/attach-pie-noexec.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/attach-twice.exp: Likewise. * gdb.base/attach.exp: Likewise. (do_command_attach_tests): Use gdb_spawn_with_cmdline_opts and gdb_test_multiple. * gdb.base/solib-overlap.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/valgrind-infcall.exp: Likewise. * gdb.multi/multi-attach.exp: Likewise. * gdb.python/py-prompt.exp: Likewise. * gdb.python/py-sync-interp.exp: Likewise. * gdb.server/ext-attach.exp: Likewise. * gdb.threads/attach-into-signal.exp (corefunc): Use spawn_wait_for_attach, spawn_id_get_pid and kill_wait_spawned_process. * gdb.threads/attach-many-short-lived-threads.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.threads/attach-stopped.exp (corefunc): Use spawn_wait_for_attach, spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/break-interp.exp: Rename $res to $test_spawn_id. Use spawn_id_get_pid. Wait for spawn id after eof. Use kill_wait_spawned_process instead of explicit "kill -9". * lib/gdb.exp (can_spawn_for_attach): Adjust comment. (kill_wait_spawned_process, spawn_id_get_pid): New procedures. (spawn_wait_for_attach): Use spawn instead of exec to spawn processes. Don't map cygwin/windows pids here. Now returns a spawn id list.
2015-07-31 21:06:24 +02:00
kill_wait_spawned_process $valgrind_spawn_id
return -1
}
-re "The program is not being run\\.\r\n$gdb_prompt $" {
fail "$test (valgrind vgdb has terminated)"
# Only if valgrind got stuck.
testsuite: tcl exec& -> 'kill -9 $pid' is racy (attach-many-short-lived-thread.exp races and others) The buildbots show that attach-many-short-lived-thread.exp is racy. But after staring at debug logs and playing with SystemTap scripts for a (long) while, I figured out that neither GDB, nor the kernel nor the test's program itself are at fault. The problem is simply that the testsuite machinery is currently subject to PID-reuse races. The attach-many-short-lived-threads.c test program just happens to be much more susceptible to trigger this race because threads and processes share the same number space on Linux, and the test spawns many many short lived threads in succession, thus enlarging the race window a lot. Part of the problem is that several tests spawn processes with "exec&" (in order to test the "attach" command) , and then at the end of the test, to make sure things are cleaned up, issue a 'remote_spawn "kill -p $testpid"'. Since with tcl's "exec&", tcl itself is responsible for reaping the process's exit status, when we go kill the process, testpid may have already exited _and_ its status may have (and often has) been reaped already. Thus it can happen that another process meanwhile reuses $testpid, and that "kill" command kills the wrong process... Frequently, that happens to be attach-many-short-lived-thread, but this explains other test's races as well. In the attach-many-short-lived-threads test, it sometimes manifests like this: (gdb) file /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads Reading symbols from /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads...done. (gdb) Loaded /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads into /home/pedro/gdb/mygit/build/gdb/testsuite/../../gdb/gdb attach 5940 Attaching to program: /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads, process 5940 warning: process 5940 is a zombie - the process has already terminated ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ptrace: Operation not permitted. (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: attach info threads No threads. (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: no new threads set breakpoint always-inserted on (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: set breakpoint always-inserted on Other times the process dies while the test is ongoing (the process is ptrace-stopped): (gdb) print again = 1 Cannot access memory at address 0x6020cc (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: reset timer in the inferior (Recall that on Linux, SIGKILL is not interceptable) And other times it dies just while we're detaching: $4 = 319 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 2: print seconds_left detach Can't detach Thread 0x7fb13b7de700 (LWP 1842): No such process (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: detach GDB mishandles the latter (it should ignore ESRCH while detaching just like when continuing), but that's another story. The fix here is to change spawn_wait_for_attach to use Expect's 'spawn' command instead of Tcl's 'exec&' to spawn programs, because with spawn we control when to wait for/reap the process. That allows killing the process by PID without being subject to pid-reuse races, because even if the process is already dead, the kernel won't reuse the process's PID until the zombie is reaped. The other part of the problem lies in DejaGnu itself, unfortunately. I have occasionally seen tests (attach-many-short-lived-threads included, but not only that one) die with a random inexplicable SIGTERM too, and that too is caused by the same reason, except that in that case, the rogue SIGTERM is sent from this bit in DejaGnu's remote.exp: exec sh -c "exec > /dev/null 2>&1 && (kill -2 $pgid || kill -2 $pid) && sleep 5 && (kill $pgid || kill $pid) && sleep 5 && (kill -9 $pgid || kill -9 $pid) &" ... catch "wait -i $shell_id" Even if the program exits promptly, that whole cascade of kills carries on in the background, thus potentially killing the poor process that manages to reuse $pid... I sent a fix for that to the DejaGnu list: http://lists.gnu.org/archive/html/dejagnu/2015-07/msg00000.html With both patches in place, I haven't seen attach-many-short-lived-threads.exp fail again. Tested on x86_64 Fedora 20, native, gdbserver and extended-gdbserver. gdb/testsuite/ChangeLog: 2015-07-31 Pedro Alves <palves@redhat.com> * gdb.base/attach-pie-misread.exp: Rename $res to $test_spawn_id. Use spawn_id_get_pid. Wait for spawn id after eof. Use kill_wait_spawned_process instead of explicit "kill -9". * gdb.base/attach-pie-noexec.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/attach-twice.exp: Likewise. * gdb.base/attach.exp: Likewise. (do_command_attach_tests): Use gdb_spawn_with_cmdline_opts and gdb_test_multiple. * gdb.base/solib-overlap.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/valgrind-infcall.exp: Likewise. * gdb.multi/multi-attach.exp: Likewise. * gdb.python/py-prompt.exp: Likewise. * gdb.python/py-sync-interp.exp: Likewise. * gdb.server/ext-attach.exp: Likewise. * gdb.threads/attach-into-signal.exp (corefunc): Use spawn_wait_for_attach, spawn_id_get_pid and kill_wait_spawned_process. * gdb.threads/attach-many-short-lived-threads.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.threads/attach-stopped.exp (corefunc): Use spawn_wait_for_attach, spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/break-interp.exp: Rename $res to $test_spawn_id. Use spawn_id_get_pid. Wait for spawn id after eof. Use kill_wait_spawned_process instead of explicit "kill -9". * lib/gdb.exp (can_spawn_for_attach): Adjust comment. (kill_wait_spawned_process, spawn_id_get_pid): New procedures. (spawn_wait_for_attach): Use spawn instead of exec to spawn processes. Don't map cygwin/windows pids here. Now returns a spawn id list.
2015-07-31 21:06:24 +02:00
kill_wait_spawned_process $valgrind_spawn_id
return -1
}
-re "\r\n$gdb_prompt $" {
pass "$test (false warning)"
}
}
set continue_count [expr $continue_count + 1]
}
set test "p gdb_test_infcall ()"
gdb_test_multiple $test $test {
-re "unhandled instruction bytes.*\r\n$gdb_prompt $" {
fail $test
}
-re "Continuing \\.\\.\\..*\r\n\\\$1 = 2\r\n$gdb_prompt $" {
pass $test
}
}
# Only if valgrind got stuck.
testsuite: tcl exec& -> 'kill -9 $pid' is racy (attach-many-short-lived-thread.exp races and others) The buildbots show that attach-many-short-lived-thread.exp is racy. But after staring at debug logs and playing with SystemTap scripts for a (long) while, I figured out that neither GDB, nor the kernel nor the test's program itself are at fault. The problem is simply that the testsuite machinery is currently subject to PID-reuse races. The attach-many-short-lived-threads.c test program just happens to be much more susceptible to trigger this race because threads and processes share the same number space on Linux, and the test spawns many many short lived threads in succession, thus enlarging the race window a lot. Part of the problem is that several tests spawn processes with "exec&" (in order to test the "attach" command) , and then at the end of the test, to make sure things are cleaned up, issue a 'remote_spawn "kill -p $testpid"'. Since with tcl's "exec&", tcl itself is responsible for reaping the process's exit status, when we go kill the process, testpid may have already exited _and_ its status may have (and often has) been reaped already. Thus it can happen that another process meanwhile reuses $testpid, and that "kill" command kills the wrong process... Frequently, that happens to be attach-many-short-lived-thread, but this explains other test's races as well. In the attach-many-short-lived-threads test, it sometimes manifests like this: (gdb) file /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads Reading symbols from /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads...done. (gdb) Loaded /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads into /home/pedro/gdb/mygit/build/gdb/testsuite/../../gdb/gdb attach 5940 Attaching to program: /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads, process 5940 warning: process 5940 is a zombie - the process has already terminated ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ptrace: Operation not permitted. (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: attach info threads No threads. (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: no new threads set breakpoint always-inserted on (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: set breakpoint always-inserted on Other times the process dies while the test is ongoing (the process is ptrace-stopped): (gdb) print again = 1 Cannot access memory at address 0x6020cc (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: reset timer in the inferior (Recall that on Linux, SIGKILL is not interceptable) And other times it dies just while we're detaching: $4 = 319 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 2: print seconds_left detach Can't detach Thread 0x7fb13b7de700 (LWP 1842): No such process (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: detach GDB mishandles the latter (it should ignore ESRCH while detaching just like when continuing), but that's another story. The fix here is to change spawn_wait_for_attach to use Expect's 'spawn' command instead of Tcl's 'exec&' to spawn programs, because with spawn we control when to wait for/reap the process. That allows killing the process by PID without being subject to pid-reuse races, because even if the process is already dead, the kernel won't reuse the process's PID until the zombie is reaped. The other part of the problem lies in DejaGnu itself, unfortunately. I have occasionally seen tests (attach-many-short-lived-threads included, but not only that one) die with a random inexplicable SIGTERM too, and that too is caused by the same reason, except that in that case, the rogue SIGTERM is sent from this bit in DejaGnu's remote.exp: exec sh -c "exec > /dev/null 2>&1 && (kill -2 $pgid || kill -2 $pid) && sleep 5 && (kill $pgid || kill $pid) && sleep 5 && (kill -9 $pgid || kill -9 $pid) &" ... catch "wait -i $shell_id" Even if the program exits promptly, that whole cascade of kills carries on in the background, thus potentially killing the poor process that manages to reuse $pid... I sent a fix for that to the DejaGnu list: http://lists.gnu.org/archive/html/dejagnu/2015-07/msg00000.html With both patches in place, I haven't seen attach-many-short-lived-threads.exp fail again. Tested on x86_64 Fedora 20, native, gdbserver and extended-gdbserver. gdb/testsuite/ChangeLog: 2015-07-31 Pedro Alves <palves@redhat.com> * gdb.base/attach-pie-misread.exp: Rename $res to $test_spawn_id. Use spawn_id_get_pid. Wait for spawn id after eof. Use kill_wait_spawned_process instead of explicit "kill -9". * gdb.base/attach-pie-noexec.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/attach-twice.exp: Likewise. * gdb.base/attach.exp: Likewise. (do_command_attach_tests): Use gdb_spawn_with_cmdline_opts and gdb_test_multiple. * gdb.base/solib-overlap.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/valgrind-infcall.exp: Likewise. * gdb.multi/multi-attach.exp: Likewise. * gdb.python/py-prompt.exp: Likewise. * gdb.python/py-sync-interp.exp: Likewise. * gdb.server/ext-attach.exp: Likewise. * gdb.threads/attach-into-signal.exp (corefunc): Use spawn_wait_for_attach, spawn_id_get_pid and kill_wait_spawned_process. * gdb.threads/attach-many-short-lived-threads.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.threads/attach-stopped.exp (corefunc): Use spawn_wait_for_attach, spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/break-interp.exp: Rename $res to $test_spawn_id. Use spawn_id_get_pid. Wait for spawn id after eof. Use kill_wait_spawned_process instead of explicit "kill -9". * lib/gdb.exp (can_spawn_for_attach): Adjust comment. (kill_wait_spawned_process, spawn_id_get_pid): New procedures. (spawn_wait_for_attach): Use spawn instead of exec to spawn processes. Don't map cygwin/windows pids here. Now returns a spawn id list.
2015-07-31 21:06:24 +02:00
kill_wait_spawned_process $valgrind_spawn_id