I'm unable to get the dsh use case to work. It just hangs. System is Linux FC11.
Steps to reproduce:
1. Build and install SCI (from HEAD) in /usr/local/sci
2. Add /usr/local/sci/bin to PATH
3. Add /usr/local/sci/lib to LD_LIBRARY_PATH
4. Change to usecase/dsh and run "make all_32"
5. Create "host.list" with one line "localhost"
6. Run the command ./use_ext_launcher
Output:
    Start back ends ...
    export SCI_JOB_KEY=19323; export SCI_USE_EXTLAUNCHER=yes; export SCI_CLIENT_ID=0; ./dsh_be
    Wait 5 seconds ...
    Start front end ...
    export SCI_JOB_KEY=19323; export SCI_USE_EXTLAUNCHER=yes; ./dsh_fe
I can't do anything further at this point. After about a minute, I see the following:
    *** glibc detected *** ./dsh_be: double free or corruption (fasttop): 0x094d0230 ***
    ======= Backtrace: =========
    /lib/libc.so.6[0xc162a1]
    /lib/libc.so.6(freeaddrinfo+0x38)[0xc68b98]
    /usr/local/sci/lib/libsci.so.0(_ZN6Socket7connectEPKct+0x244)[0x28e7c4]
    /usr/local/sci/lib/libsci.so.0(_ZN6Stream4initEPKct+0x72)[0x28faa2]
    /usr/local/sci/lib/libsci.so.0(_ZN11Initializer9initExtBEEi+0x1af)[0x2734ef]
    /usr/local/sci/lib/libsci.so.0(_ZN11Initializer6initBEEv+0x652)[0x2743d2]
    /usr/local/sci/lib/libsci.so.0(_ZN11Initializer4initEv+0x1e9)[0x274e39]
    /usr/local/sci/lib/libsci.so.0(SCI_Initialize+0x78)[0x26e888]
    ./dsh_be[0x8048971]
    /lib/libc.so.6(__libc_start_main+0xe6)[0xbbca66]
    ./dsh_be[0x8048751]
    ======= Memory map: ========
    00259000-0029b000 r-xp 00000000 fd:00 166867 /usr/local/sci/lib/libsci.so.0.0.0
    0029b000-0029c000 rw-p 00042000 fd:00 166867 /usr/local/sci/lib/libsci.so.0.0.0
    0029c000-0029d000 rw-p 0029c000 00:00 0
    00479000-0047a000 r-xp 00479000 00:00 0 [vdso]
    007a0000-007ab000 r-xp 00000000 fd:00 143260 /lib/libnss_files-2.10.1.so
    007ab000-007ac000 r--p 0000a000 fd:00 143260 /lib/libnss_files-2.10.1.so
    007ac000-007ad000 rw-p 0000b000 fd:00 143260 /lib/libnss_files-2.10.1.so
    00add000-00b07000 r-xp 00000000 fd:00 145982 /lib/libgcc_s-4.4.1-20090729.so.1
    00b07000-00b08000 rw-p 00029000 fd:00 145982 /lib/libgcc_s-4.4.1-20090729.so.1
    00b82000-00ba2000 r-xp 00000000 fd:00 131765 /lib/ld-2.10.1.so
    00ba2000-00ba3000 r--p 0001f000 fd:00 131765 /lib/ld-2.10.1.so
    00ba3000-00ba4000 rw-p 00020000 fd:00 131765 /lib/ld-2.10.1.so
    00ba6000-00d11000 r-xp 00000000 fd:00 131766 /lib/libc-2.10.1.so
    00d11000-00d13000 r--p 0016b000 fd:00 131766 /lib/libc-2.10.1.so
    00d13000-00d14000 rw-p 0016d000 fd:00 131766 /lib/libc-2.10.1.so
    00d14000-00d17000 rw-p 00d14000 00:00 0
    00d19000-00d3f000 r-xp 00000000 fd:00 145816 /lib/libm-2.10.1.so
    00d3f000-00d40000 r--p 00025000 fd:00 145816 /lib/libm-2.10.1.so
    00d40000-00d41000 rw-p 00026000 fd:00 145816 /lib/libm-2.10.1.so
    00d43000-00d46000 r-xp 00000000 fd:00 131772 /lib/libdl-2.10.1.so
    00d46000-00d47000 r--p 00002000 fd:00 131772 /lib/libdl-2.10.1.so
    00d47000-00d48000 rw-p 00003000 fd:00 131772 /lib/libdl-2.10.1.so
    00d4a000-00d60000 r-xp 00000000 fd:00 131767 /lib/libpthread-2.10.1.so
    00d60000-00d61000 ---p 00016000 fd:00 131767 /lib/libpthread-2.10.1.so
    00d61000-00d62000 r--p 00016000 fd:00 131767 /lib/libpthread-2.10.1.so
    00d62000-00d63000 rw-p 00017000 fd:00 131767 /lib/libpthread-2.10.1.so
    00d63000-00d65000 rw-p 00d63000 00:00 0
    00d9c000-00da3000 r-xp 00000000 fd:00 131768 /lib/librt-2.10.1.so
    00da3000-00da4000 r--p 00006000 fd:00 131768 /lib/librt-2.10.1.so
    00da4000-00da5000 rw-p 00007000 fd:00 131768 /lib/librt-2.10.1.so
    02000000-020e3000 r-xp 00000000 fd:00 8252 /usr/lib/libstdc++.so.6.0.12
    020e3000-020e7000 r--p 000e2000 fd:00 8252 /usr/lib/libstdc++.so.6.0.12
    020e7000-020e9000 rw-p 000e6000 fd:00 8252 /usr/lib/libstdc++.so.6.0.12
    020e9000-020ef000 rw-p 020e9000 00:00 0
    08048000-08049000 r-xp 00000000 fd:00 166948 /home/greg/org.eclipse.ptp.sci/usecase/dsh/dsh_be
    08049000-0804a000 rw-p 00000000 fd:00 166948 /home/greg/org.eclipse.ptp.sci/usecase/dsh/dsh_be
    094cb000-094ed000 rw-p 094cb000 00:00 0 [heap]
    b7f85000-b7f87000 rw-p b7f85000 00:00 0
    b7f95000-b7f96000 rw-p b7f95000 00:00 0
    bfd81000-bfd96000 rw-p bffeb000 00:00 0 [stack]
Update: I've been unable to get any of the use cases to work. They all seem to have the same problem.
This issue has been fixed. Jessica will attach the fix very soon and I will process it.
The main reason for this is that SCID only listens on an IPv4 address and it should listen on all the addresses.
Created attachment 183514: Enable the listener to listen on both the IPv4 and IPv6 ports
I have applied the patch. Jessica, you don't have to comment out the unused code; deleting it is fine.
The patch appears to have fixed the double free problem; however, the dsh use case still does not work.
Also, please assign the bug to someone (not ptp-inbox) before marking it as fixed. Thanks.
Have you tried it? The deletion of 'delet' should enable SCID to listen on all the IP addresses.
I have tried the patch with no success. I enabled logging and this is the output from dsh_fe:
    101122-09:22:28[DEBUG] I am a front end, my handle is -1 (initializer.cpp:105|3086940656)
    101122-09:22:28[DEBUG] Hostlist is: (topology.cpp:109|3086940656)
    101122-09:22:28[DEBUG] localhost (topology.cpp:125|3086940656)
    101122-09:22:28[DEBUG] Processor Handler: started (processor.cpp:54|3086936944)
    101122-09:22:28[DEBUG] Processor Router: started (processor.cpp:54|3074423664)
    101122-09:22:28[DEBUG] Processor UpstreamFilter: started (processor.cpp:54|3061840752)
    101122-09:22:28[DEBUG] Processor Router: processing a message, type=-1001, filter ID=-1, group=-1, size=120 (processor.cpp:69|3074423664)
    101122-09:22:28[DEBUG] Launcher: env(;LD_LIBRARY_PATH=/usr/local/sci/lib;SCI_AGENT_PATH=/usr/local/sci/bin/scia;SCI_ENABLE_FAILOVER=no;SCI_JOB_KEY=23800;SCI_LOG_DIRECTORY=/tmp;SCI_LOG_LEVEL=4;SCI_REMOTE_SHELL=;SCI_USE_EXTLAUNCHER=yes;SCI_WORK_DIRECTORY=/home/greg/org.eclipse.ptp.sci/usecase/dsh) (launcher.cpp:191|3074423664)
    101122-09:22:28[DEBUG] listener binded to port 55580 (listener.cpp:72|3074423664)
    101122-09:22:28[DEBUG] Launch client: localhost: /home/greg/org.eclipse.ptp.sci/usecase/dsh/dsh_be (launcher.cpp:316|3074423664)
It looks to me like this is still trying to launch dsh_be, even though it is started manually. If I run dsh_fe directly, it also hangs.
The following debug output is generated:
    101122-09:26:10[DEBUG] I am a front end, my handle is -1 (initializer.cpp:105|3087935984)
    101122-09:26:10[DEBUG] Hostlist is: (topology.cpp:109|3087935984)
    101122-09:26:10[DEBUG] localhost (topology.cpp:125|3087935984)
    101122-09:26:10[DEBUG] Processor Handler: started (processor.cpp:54|3087932272)
    101122-09:26:10[DEBUG] Processor UpstreamFilter: started (processor.cpp:54|3064982384)
    101122-09:26:10[DEBUG] Processor Router: started (processor.cpp:54|3075472240)
    101122-09:26:10[DEBUG] Processor Router: processing a message, type=-1001, filter ID=-1, group=-1, size=120 (processor.cpp:69|3075472240)
    101122-09:26:10[DEBUG] Launcher: env(;LD_LIBRARY_PATH=/usr/local/sci/lib;SCI_AGENT_PATH=/usr/local/sci/bin/scia;SCI_ENABLE_FAILOVER=no;SCI_JOB_KEY=413387725;SCI_LOG_DIRECTORY=/tmp;SCI_LOG_LEVEL=4;SCI_REMOTE_SHELL=;SCI_USE_EXTLAUNCHER=no;SCI_WORK_DIRECTORY=/home/greg/org.eclipse.ptp.sci/usecase/dsh) (launcher.cpp:191|3075472240)
    101122-09:26:10[DEBUG] Launch client: localhost: /home/greg/org.eclipse.ptp.sci/usecase/dsh/dsh_be (launcher.cpp:316|3075472240)
That's really weird. Did you recompile scid and restart it? Can you try netstat -nap | grep scid to see what addresses it is listening on?
I'm able to run dsh after starting the scid. However, I'm trying to start it using the use_ext_launcher script (which sets SCI_USE_EXTLAUNCHER=yes) with no scid running. This is when it is hanging.
Any suggestions on why this doesn't work? Unless I can get the external launch mode working, SCI is useless for the PTP debugger.
Sorry, I thought you had gotten things working after you started scid. scid must be running, especially when using the external launching mode. If it is not, the front end will keep retrying, up to 200 times, before it finishes.
Why is the scid required? Unless the dependency on scid can be removed, this means that SCI will not be usable for the PTP debugger.
The external launching mode means the back ends are launched by a third-party launcher and need to connect back to their parents. The scid is used by the back ends to look up their parents' hostname and port number, which they have to know. It is possible to bypass this, as POE does with MDCR, but that is complicated because users need to find their own way to tell the back ends where to connect. If you do want to use it that way, I can tell you how.
Yes please let me know how. PTP will not be able to use SCI unless it's possible to launch without scid.
Yes, please describe how it works.
To connect back to their parents, all the back ends need to know are some environment variables: SCI_PARENT_HOSTNAME, SCI_PARENT_PORT and SCI_PARENT_ID, besides SCI_CLIENT_ID and SCI_JOB_KEY. What kind of mechanism do you want to use to pass that information to the back ends? If you don't want to use scid, do you want to use scia or the embedded agents? If you tell me your requirements, Jessica and I can find the best way for you.
I would like to be able to use both scia and embedded agents. My backend knows the parent hostname, port and ID along with its client ID. Some questions:
1. Will it work to set these environment variables prior to calling SCI_Initialize in the backend?
2. Do I use a parent ID of -1 to specify the frontend?
3. Does the frontend need to be running before the backends are launched?
I don't see this described in the documentation anywhere. Please update the documentation with a description of how it works. Thanks!
4. How does an agent determine the port number it will listen on (equivalent to SCI_PARENT_PORT)?
scid is designed for the external launching mode, so it is the most convenient way, as users do not have to do anything special. If you do not want to use scid, that is also possible, but it is not really recommended, so there isn't much documentation describing it for now. I can update the document once you are satisfied with the following example.
For a standalone agent:
1. Write a script, e.g. runbe.sh, which takes an IP address, an environment string and a client path as input arguments:
    #!/bin/bash
    # Usage: runbe.sh REMOTE_IP ENVSTR CLIENT_PATH
    REMOTE_IP=$1
    ENVSTR=$2
    CLIENT_PATH=$3
    # Left unquoted on purpose: each NAME=VALUE pair in ENVSTR is exported
    export $ENVSTR
    if [ "$SCI_CLIENT_ID" -ge 0 ]; then
        # Non-negative client ID: start the client on the remote node, detached
        ssh "$REMOTE_IP" -n "$ENVSTR $CLIENT_PATH >&- 2>&- <&- &"
    else
        # Negative client ID: save the environment string on the remote node
        ssh "$REMOTE_IP" "echo $ENVSTR > /tmp/$SCI_JOB_KEY.$SCI_CLIENT_ID"
    fi
2. Set the env SCI_REMOTE_SHELL=/path/to/runbe.sh together with the env SCI_USE_EXTLAUNCHER=yes. (You need to configure password-less ssh logon, or replace ssh with rsh in the script if you prefer and it is password-less.)
3. Before the back end issues SCI_Initialize, it must source the envs in the file /tmp/$SCI_JOB_KEY.$SCI_CLIENT_ID (the file its parent generated).
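Step 3 above could be handled by a small wrapper around the back end. The following is only a sketch: load_sci_env is a hypothetical helper (not part of SCI), and it assumes the file holds whitespace-separated NAME=VALUE tokens with no spaces inside the values, which is what the unquoted "export $ENVSTR" in runbe.sh suggests but the thread does not confirm.

```shell
#!/bin/bash
# load_sci_env FILE: export every NAME=VALUE token found in FILE.
# Hypothetical helper; assumes whitespace-separated tokens whose values
# contain no spaces.
load_sci_env() {
    local file=$1 token
    local -a tokens
    # Fail early if the parent has not written the file yet.
    [ -r "$file" ] || return 1
    # read -a splits each line into words without glob expansion; the
    # second test keeps a final line that lacks a trailing newline.
    while read -r -a tokens || [ "${#tokens[@]}" -gt 0 ]; do
        for token in "${tokens[@]}"; do
            case $token in
                [A-Za-z_]*=*) export "$token" ;;  # skip anything else
            esac
        done
    done < "$file"
}
```

A wrapper would then run something like load_sci_env "/tmp/$SCI_JOB_KEY.$SCI_CLIENT_ID" && exec ./dsh_be, with SCI_JOB_KEY and SCI_CLIENT_ID already set by the launcher.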
The PTP debugger must be user installable, so the use of scid or any daemon that requires system privileges is not possible. This script appears to require the frontend to ssh to every node. If this is the case, then it is not scalable enough for use with PTP. It also seems like a huge kludge. Let me describe what we need, then perhaps you can modify SCI to support this launch method. Otherwise I don't think SCI will be suitable for use with PTP.
1. The frontend is launched and waits for incoming connections from backends/agents on a well known port. This port could be specified by an environment variable or passed to SCI_Initialize.
2. The backends/agents are launched using an external launcher. They have no knowledge of how they are launched and they are not launched by the frontend.
3. The backends/agents have access to routing information that tells each backend/agent its ID, the port number to listen for children on (if it's not a leaf), the ID of its parent, the node the parent is running on, and the port number the parent is listening on.
4. If the backend/agent is not a leaf, it listens for children on the provided port.
5. Each backend/agent connects to its parent using its parent's node name and port number.
6. The frontend waits until all the backends/agents have connected (either to their parents or to the frontend).
It occurs to me that there might be another way around this problem. If scid did not need to run as root, then rather than using mpirun to launch the debugger backends/agents, it could be used to launch scids instead. The debugger launch would then consist of the following steps:
1. The scids are launched using mpirun.
2. The frontend is launched with a hostlist containing all the hosts that the scids were launched on.
3. Each scid forks a backend/agent, which then establishes communication using the usual SCI mechanism.
4. The environment is passed from the scid to the backend/agent.
5. The scids shut down once the backends/agents have started.
I've tried this for the simple case on a local machine using the dsh use case and it appears to work. Is there any particular reason that scid needs to run as root if it is only going to be starting an agent for a single user? Also, is there any way to tell the scids to shut down?
That script doesn't mean the front end has to ssh to every node. Instead, the front end uses the script to launch its direct children (the first-layer agents if the topology has more than one layer), the second layer uses the script to launch its own direct children, and so on until the leaves are launched. In fact, the script mechanism follows exactly the same topology as when using scid; it just replaces the message exchanges with scid. Before scid was implemented, SCI used SCI_REMOTE_SHELL=ssh/rsh to launch agents and back ends. I think using scid should be simpler. You are right that if users only need to launch their own jobs, scid doesn't need root privilege; the code can be changed easily. To shut down scid, can you use mpirun to kill it? The only issue that comes to mind is the port number: scid must use a well-known port number, which can be specified through SCI_DAEMON_PORT. Both the library and the daemon use that env.
I tested the external launching mode without scid and didn't find any problem. Below are the envs I used; there is only one entry in my host.list.
    $ cat host.list
    localhost
FE:
    $ env | grep SCI
    SCI_USE_EXTLAUNCHER=yes
    SCI_REMOTE_SHELL=true
    SCI_AGENT_PATH=/opt/sci/bin
    SCI_LIB_PATH=/opt/sci/lib64
    SCI_LISTENER_PORT=44444
    SCI_JOB_KEY=12345
    SCI_EMBED_AGENT=yes
BE:
    SCI_USE_EXTLAUNCHER=yes
    SCI_PARENT_ID=-1
    SCI_PARENT_HOSTNAME=localhost
    SCI_PARENT_PORT=44444
    SCI_CLIENT_ID=0
    SCI_JOB_KEY=12345
    SCI_EMBED_AGENT=yes
The result is the dsh_be64 hanging, and the dsh_fe got:
    >>> ls
    0: buildit
    0: CVS
    0: ddsh_fe64
    0: ddsh_fe.c
    0: dsh_be64
    0: dsh_be.c
    0: dsh_fe64
    0: dsh_fe.c
    0: edsh_fe64
    0: edsh_fe.cpp
    0: gdsh_fe64
    0: gdsh_fe.cpp
    0: host.list
    0: Makefile
    0: Makefile.aix
    0: use_ext_launcher
    0: use_ext_launcher2
    >>>
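In the settings above, the back end's SCI_PARENT_PORT matches the front end's SCI_LISTENER_PORT, and both sides share SCI_JOB_KEY. A minimal sketch that generates both sets from the same values might look like the following. fe_env and be_env are hypothetical helper names, not part of SCI, and the sketch leaves out SCI_REMOTE_SHELL, SCI_AGENT_PATH, SCI_LIB_PATH and SCI_EMBED_AGENT from the listing, since the thread does not say which of those are strictly required.

```shell
#!/bin/bash
# Hypothetical helpers that print NAME=VALUE lines for a scid-less
# external launch; the variable names are taken from the listing above.

# fe_env PORT JOB_KEY: settings for the front end.
fe_env() {
    printf 'SCI_USE_EXTLAUNCHER=yes\n'
    printf 'SCI_LISTENER_PORT=%s\n' "$1"
    printf 'SCI_JOB_KEY=%s\n' "$2"
}

# be_env FE_HOST PORT JOB_KEY CLIENT_ID: settings for a back end whose
# parent is the front end (parent ID -1, as in the listing above).
be_env() {
    printf 'SCI_USE_EXTLAUNCHER=yes\n'
    printf 'SCI_PARENT_HOSTNAME=%s\n' "$1"
    printf 'SCI_PARENT_PORT=%s\n' "$2"
    printf 'SCI_PARENT_ID=%s\n' "-1"
    printf 'SCI_JOB_KEY=%s\n' "$3"
    printf 'SCI_CLIENT_ID=%s\n' "$4"
}
```

The output can be handed to env(1), for example env $(be_env localhost 44444 12345 0) ./dsh_be64, so that the port and job key are typed only once.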
Yes, I am able to get this to work now. The main problem was that my host.list file had two entries for localhost, which was causing the fe to hang (presumably waiting for a second be). It would be good if this could be documented somewhere (what environment variables are required, and why). Thanks!
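A quick way to catch this before launching is to look for repeated entries in host.list. The sketch below uses only standard tools; dup_hosts is a hypothetical helper name. Whether a repeated host is ever intentional (for example, to run several back ends on one node) is not stated in this thread, so treat a hit as something to check rather than as an error.

```shell
#!/bin/bash
# dup_hosts FILE: print each entry that appears more than once in FILE.
# Hypothetical helper for checking a host list before launching.
dup_hosts() {
    # Drop blank lines, sort so equal entries are adjacent, then let
    # uniq -d report only the repeated ones.
    grep -v '^[[:space:]]*$' "$1" | sort | uniq -d
}
```

Running dup_hosts host.list prints nothing when every entry is unique.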
We can update the SCI introduction document at http://wiki.eclipse.org/PTP/designs/SCI#Introduction on the wiki. A few changes are needed. Should I point them out to you, or may I update it directly?
Please update it directly. Thanks!
I have added descriptions of the following environment variables, as below. Please help review. Thanks. As I'm not a committer yet, I do not have permission to update it directly myself.
1) SCI_DAEMON_NAME: the name of the sci daemon. This environment variable needs to be specified for both the sci library and the sci daemon. If it is set, the library and the daemon will use the specified daemon name to query the entry in '/etc/services' to get the daemon port.
2) SCI_PARENT_HOSTNAME: the host name of the parent (the front end or the agent). No default value.
3) SCI_PARENT_ID: the id of the parent (the front end or the agent). This value should be set to the parent's id, such as '-1' (for the front end) or '-2' (for one agent). No default value.
4) SCI_PARENT_PORT: the port which the parent is listening on. The child (the back end or the child agent) can use this port to connect back to its parent. No default value.
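To see which port a given daemon name would resolve to, the services file can be inspected by hand. The sketch below is not SCI's own lookup code: service_port is a hypothetical helper that reads a file in the usual services format (name, then port/protocol) and prints the port of the first matching entry.

```shell
#!/bin/bash
# service_port NAME [FILE]: print the port of the first entry called NAME
# in a services-format file (default /etc/services).
# Hypothetical helper; it does not reproduce how SCI itself does the query.
service_port() {
    local name=$1 file=${2:-/etc/services}
    # Field 1 is the service name, field 2 is "port/protocol".
    awk -v n="$name" '$1 == n { split($2, a, "/"); print a[1]; exit }' "$file"
}
```

For example, service_port "$SCI_DAEMON_NAME" prints nothing if no entry with that name exists.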
You don't need to be a committer to update the wiki. You can log into the wiki using your bugzilla ID.
Hi Greg, I think I have fixed the problem with more than one address (IPv4 and IPv6 together) after seeing your comments in the code. Could you verify it on your Mac OS system when you have time? Thanks. I also changed the 'Launch modes' section in the wiki page http://wiki.eclipse.org/PTP/designs/SCI#Launch_modes. It would be nice if you could help reword the sentences to improve readability where necessary. Thanks in advance. Jessica, I think you can add all the descriptions of the environment variables from the 5XX design to this wiki.
The changes have already been committed into the branch.