Bug 330721 - Unable to get dsh use case to work
Summary: Unable to get dsh use case to work
Status: CLOSED FIXED
Alias: None
Product: PTP
Classification: Tools
Component: SCI
Version: 5.0
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: rong li CLA
Reported: 2010-11-19 17:34 EST by Greg Watson CLA
Modified: 2013-05-27 09:09 EDT

Attachments
To enable the listener to listen both the ipv4/ ipv6 port (147 bytes, patch)
2010-11-20 03:43 EST, rong li CLA
g.watson: iplog+

Description Greg Watson CLA 2010-11-19 17:34:05 EST
I'm unable to get the dsh use case to work. It just hangs. System is Linux FC11.

Steps to reproduce:

1. Build and install SCI (from HEAD) in /usr/local/sci
2. Add /usr/local/sci/bin to PATH
3. Add /usr/local/sci/lib to LD_LIBRARY_PATH
4. Change to usecase/dsh and run "make all_32"
5. Create "host.list" with one line "localhost"
6. Run the command ./use_ext_launcher

Output:

	Start back ends ...
	export SCI_JOB_KEY=19323; export SCI_USE_EXTLAUNCHER=yes; export SCI_CLIENT_ID=0; ./dsh_be
	Wait 5 seconds ...
	Start front end ...
	export SCI_JOB_KEY=19323; export SCI_USE_EXTLAUNCHER=yes; ./dsh_fe

I can't do anything further at this point. After about a minute, I see the following:

*** glibc detected *** ./dsh_be: double free or corruption (fasttop): 0x094d0230 ***
======= Backtrace: =========
/lib/libc.so.6[0xc162a1]
/lib/libc.so.6(freeaddrinfo+0x38)[0xc68b98]
/usr/local/sci/lib/libsci.so.0(_ZN6Socket7connectEPKct+0x244)[0x28e7c4]
/usr/local/sci/lib/libsci.so.0(_ZN6Stream4initEPKct+0x72)[0x28faa2]
/usr/local/sci/lib/libsci.so.0(_ZN11Initializer9initExtBEEi+0x1af)[0x2734ef]
/usr/local/sci/lib/libsci.so.0(_ZN11Initializer6initBEEv+0x652)[0x2743d2]
/usr/local/sci/lib/libsci.so.0(_ZN11Initializer4initEv+0x1e9)[0x274e39]
/usr/local/sci/lib/libsci.so.0(SCI_Initialize+0x78)[0x26e888]
./dsh_be[0x8048971]
/lib/libc.so.6(__libc_start_main+0xe6)[0xbbca66]
./dsh_be[0x8048751]
======= Memory map: ========
00259000-0029b000 r-xp 00000000 fd:00 166867     /usr/local/sci/lib/libsci.so.0.0.0
0029b000-0029c000 rw-p 00042000 fd:00 166867     /usr/local/sci/lib/libsci.so.0.0.0
0029c000-0029d000 rw-p 0029c000 00:00 0 
00479000-0047a000 r-xp 00479000 00:00 0          [vdso]
007a0000-007ab000 r-xp 00000000 fd:00 143260     /lib/libnss_files-2.10.1.so
007ab000-007ac000 r--p 0000a000 fd:00 143260     /lib/libnss_files-2.10.1.so
007ac000-007ad000 rw-p 0000b000 fd:00 143260     /lib/libnss_files-2.10.1.so
00add000-00b07000 r-xp 00000000 fd:00 145982     /lib/libgcc_s-4.4.1-20090729.so.1
00b07000-00b08000 rw-p 00029000 fd:00 145982     /lib/libgcc_s-4.4.1-20090729.so.1
00b82000-00ba2000 r-xp 00000000 fd:00 131765     /lib/ld-2.10.1.so
00ba2000-00ba3000 r--p 0001f000 fd:00 131765     /lib/ld-2.10.1.so
00ba3000-00ba4000 rw-p 00020000 fd:00 131765     /lib/ld-2.10.1.so
00ba6000-00d11000 r-xp 00000000 fd:00 131766     /lib/libc-2.10.1.so
00d11000-00d13000 r--p 0016b000 fd:00 131766     /lib/libc-2.10.1.so
00d13000-00d14000 rw-p 0016d000 fd:00 131766     /lib/libc-2.10.1.so
00d14000-00d17000 rw-p 00d14000 00:00 0 
00d19000-00d3f000 r-xp 00000000 fd:00 145816     /lib/libm-2.10.1.so
00d3f000-00d40000 r--p 00025000 fd:00 145816     /lib/libm-2.10.1.so
00d40000-00d41000 rw-p 00026000 fd:00 145816     /lib/libm-2.10.1.so
00d43000-00d46000 r-xp 00000000 fd:00 131772     /lib/libdl-2.10.1.so
00d46000-00d47000 r--p 00002000 fd:00 131772     /lib/libdl-2.10.1.so
00d47000-00d48000 rw-p 00003000 fd:00 131772     /lib/libdl-2.10.1.so
00d4a000-00d60000 r-xp 00000000 fd:00 131767     /lib/libpthread-2.10.1.so
00d60000-00d61000 ---p 00016000 fd:00 131767     /lib/libpthread-2.10.1.so
00d61000-00d62000 r--p 00016000 fd:00 131767     /lib/libpthread-2.10.1.so
00d62000-00d63000 rw-p 00017000 fd:00 131767     /lib/libpthread-2.10.1.so
00d63000-00d65000 rw-p 00d63000 00:00 0 
00d9c000-00da3000 r-xp 00000000 fd:00 131768     /lib/librt-2.10.1.so
00da3000-00da4000 r--p 00006000 fd:00 131768     /lib/librt-2.10.1.so
00da4000-00da5000 rw-p 00007000 fd:00 131768     /lib/librt-2.10.1.so
02000000-020e3000 r-xp 00000000 fd:00 8252       /usr/lib/libstdc++.so.6.0.12
020e3000-020e7000 r--p 000e2000 fd:00 8252       /usr/lib/libstdc++.so.6.0.12
020e7000-020e9000 rw-p 000e6000 fd:00 8252       /usr/lib/libstdc++.so.6.0.12
020e9000-020ef000 rw-p 020e9000 00:00 0 
08048000-08049000 r-xp 00000000 fd:00 166948     /home/greg/org.eclipse.ptp.sci/usecase/dsh/dsh_be
08049000-0804a000 rw-p 00000000 fd:00 166948     /home/greg/org.eclipse.ptp.sci/usecase/dsh/dsh_be
094cb000-094ed000 rw-p 094cb000 00:00 0          [heap]
b7f85000-b7f87000 rw-p b7f85000 00:00 0 
b7f95000-b7f96000 rw-p b7f95000 00:00 0 
bfd81000-bfd96000 rw-p bffeb000 00:00 0          [stack]
Comment 1 Greg Watson CLA 2010-11-19 17:55:20 EST
Update: I've been unable to get any of the use cases to work. They all seem to have the same problem.
Comment 2 Tu Hong Jun CLA 2010-11-20 00:08:33 EST
This issue has been fixed. Jessica will attach the fix very soon and I will process it.
Comment 3 Tu Hong Jun CLA 2010-11-20 00:09:58 EST
The main reason for this is that SCID only listens on an IPv4 address and it should listen on all the addresses.
Comment 4 rong li CLA 2010-11-20 03:43:45 EST
Created attachment 183514
To enable the listener to listen both the ipv4/ ipv6 port
Comment 5 Tu Hong Jun CLA 2010-11-21 00:26:06 EST
I have applied the patch. Jessica, you don't have to comment out the unused code; deleting it is fine.
Comment 6 Greg Watson CLA 2010-11-21 10:23:40 EST
The patch appears to have fixed the double free problem; however, the dsh use case still does not work.
Comment 7 Greg Watson CLA 2010-11-21 10:24:13 EST
Also, please assign the bug to someone (not ptp-inbox) before marking it as fixed. Thanks.
Comment 9 Tu Hong Jun CLA 2010-11-21 20:48:28 EST
Have you tried it? Deleting 'delet' should enable SCID to listen on all the IP addresses.
Comment 10 Greg Watson CLA 2010-11-22 09:33:18 EST
I have tried the patch with no success. I enabled logging and this is the output from dsh_fe:

101122-09:22:28[DEBUG] I am a front end, my handle is -1 (initializer.cpp:105|3086940656)
101122-09:22:28[DEBUG] Hostlist is:  (topology.cpp:109|3086940656)
101122-09:22:28[DEBUG] localhost (topology.cpp:125|3086940656)
101122-09:22:28[DEBUG] Processor Handler: started (processor.cpp:54|3086936944)
101122-09:22:28[DEBUG] Processor Router: started (processor.cpp:54|3074423664)
101122-09:22:28[DEBUG] Processor UpstreamFilter: started (processor.cpp:54|3061840752)
101122-09:22:28[DEBUG] Processor Router: processing a message, type=-1001, filter ID=-1, group=-1, size=120 (processor.cpp:69|3074423664)
101122-09:22:28[DEBUG] Launcher: env(;LD_LIBRARY_PATH=/usr/local/sci/lib;SCI_AGENT_PATH=/usr/local/sci/bin/scia;SCI_ENABLE_FAILOVER=no;SCI_JOB_KEY=23800;SCI_LOG_DIRECTORY=/tmp;SCI_LOG_LEVEL=4;SCI_REMOTE_SHELL=;SCI_USE_EXTLAUNCHER=yes;SCI_WORK_DIRECTORY=/home/greg/org.eclipse.ptp.sci/usecase/dsh) (launcher.cpp:191|3074423664)
101122-09:22:28[DEBUG] listener binded to port 55580 (listener.cpp:72|3074423664)
101122-09:22:28[DEBUG] Launch client: localhost: /home/greg/org.eclipse.ptp.sci/usecase/dsh/dsh_be (launcher.cpp:316|3074423664)

It looks to me like this is still trying to launch dsh_be, even though it is started manually.

If I run dsh_fe directly, it also hangs. The following debug output is generated:

101122-09:26:10[DEBUG] I am a front end, my handle is -1 (initializer.cpp:105|3087935984)
101122-09:26:10[DEBUG] Hostlist is:  (topology.cpp:109|3087935984)
101122-09:26:10[DEBUG] localhost (topology.cpp:125|3087935984)
101122-09:26:10[DEBUG] Processor Handler: started (processor.cpp:54|3087932272)
101122-09:26:10[DEBUG] Processor UpstreamFilter: started (processor.cpp:54|3064982384)
101122-09:26:10[DEBUG] Processor Router: started (processor.cpp:54|3075472240)
101122-09:26:10[DEBUG] Processor Router: processing a message, type=-1001, filter ID=-1, group=-1, size=120 (processor.cpp:69|3075472240)
101122-09:26:10[DEBUG] Launcher: env(;LD_LIBRARY_PATH=/usr/local/sci/lib;SCI_AGENT_PATH=/usr/local/sci/bin/scia;SCI_ENABLE_FAILOVER=no;SCI_JOB_KEY=413387725;SCI_LOG_DIRECTORY=/tmp;SCI_LOG_LEVEL=4;SCI_REMOTE_SHELL=;SCI_USE_EXTLAUNCHER=no;SCI_WORK_DIRECTORY=/home/greg/org.eclipse.ptp.sci/usecase/dsh) (launcher.cpp:191|3075472240)
101122-09:26:10[DEBUG] Launch client: localhost: /home/greg/org.eclipse.ptp.sci/usecase/dsh/dsh_be (launcher.cpp:316|3075472240)
Comment 11 Tu Hong Jun CLA 2010-11-23 05:46:01 EST
That's really weird. Did you recompile scid and restart it? Can you try "netstat -nap | grep scid" to see which addresses it is listening on?
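The netstat check can also be scripted. What follows is only a minimal sketch, assuming Linux net-tools output in which listening TCP sockets appear as "tcp" or "tcp6" lines with the local address in the fourth column and the state in the sixth. The helper name listen_families is hypothetical and not part of SCI, and the port to pass is whatever scid was configured to use.

```shell
#!/bin/bash
# Hypothetical helper (not part of SCI): report which address families have a
# TCP listener on a given port, reading `netstat -nlt` style output on stdin.
listen_families() {
    awk -v port="$1" '
        # Column 1 is the protocol, column 4 the local address, column 6 the state.
        $6 == "LISTEN" && $4 ~ (":" port "$") {
            if ($1 == "tcp6") v6 = 1
            else if ($1 == "tcp") v4 = 1
        }
        END {
            if (v4 && v6) print "ipv4+ipv6"
            else if (v4)  print "ipv4"
            else if (v6)  print "ipv6"
            else          print "none"
        }'
}
```

It would be fed the output of "netstat -nlt" on standard input, with the port as its only argument. Note that on Linux a single tcp6 listener bound to :: may also accept IPv4 connections unless IPV6_V6ONLY is set, so an "ipv6" result does not by itself mean IPv4 clients are refused.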
Comment 12 Greg Watson CLA 2010-11-23 11:29:40 EST
I'm able to run dsh after starting the scid. However, I'm trying to start it using the use_ext_launcher script (which sets SCI_USE_EXTLAUNCHER=yes) with no scid running. This is when it is hanging.
Comment 13 Greg Watson CLA 2010-11-29 15:04:54 EST
Any suggestions on why this doesn't work? Unless I can get the external launch mode working, SCI is useless for the PTP debugger.
Comment 14 Tu Hong Jun CLA 2010-11-30 20:49:48 EST
Sorry, I thought you had gotten things working after you started scid. scid must be started, especially when using the external launching mode. If it is not, the front end will keep retrying 200 times before finishing.
Comment 15 Greg Watson CLA 2010-12-01 08:28:12 EST
Why is the scid required? Unless the dependency on scid can be removed, this means that SCI will not be usable for the PTP debugger.
Comment 16 Tu Hong Jun CLA 2010-12-01 20:55:48 EST
The external launching mode means the back ends are launched by a third-party launcher and need to connect back to their parents. scid is used for them to look up their parents' hostname and port number, which they have to know. It is possible to bypass this, as POE does with MDCR, but that is complicated because users need to find their own way to tell the back ends where to connect. If you do want to use it that way, I can tell you how.
Comment 17 Greg Watson CLA 2010-12-02 08:58:03 EST
Yes please let me know how. PTP will not be able to use SCI unless it's possible to launch without scid.
Comment 19 Greg Watson CLA 2010-12-02 20:47:17 EST
Yes, please describe how it works.
Comment 20 Tu Hong Jun CLA 2010-12-02 21:15:36 EST
To connect back to their parents, all the back ends need to know is a set of environment variables: SCI_PARENT_HOSTNAME, SCI_PARENT_PORT and SCI_PARENT_ID, besides SCI_CLIENT_ID and SCI_JOB_KEY. What kind of mechanism do you want to use to pass that information to the back ends? If you don't want to use scid, do you want to use scia or the embedded agents? If you could tell me your requirements, Jessica and I can find the best way for you.
Comment 21 Greg Watson CLA 2010-12-03 09:54:38 EST
I would like to be able to use both scia and embedded agents. 

My backend knows the parent hostname, port and ID along with its client ID. Some questions:

1. Will it work to set these environment variables prior to calling SCI_Initialize in the backend? 
2. Do I use a parent ID of -1 to specify the frontend? 
3. Does the frontend need to be running before the backends are launched?

I don't see this described in the documentation anywhere. Please update the documentation with a description of how it works.

Thanks!
Comment 22 Greg Watson CLA 2010-12-03 10:15:06 EST
4. How does an agent determine the port number it will listen on (equivalent to SCI_PARENT_PORT)?
Comment 23 Tu Hong Jun CLA 2010-12-03 13:34:40 EST
scid is designed for the external launching mode, so it is the most convenient way, as users do not have to do anything special. It is also possible not to use scid, but that is not really recommended, so there isn't much documentation describing it for now. I can update the documentation once you are satisfied with the following example.
For a standalone agent:
1. Write a script, e.g. runbe.sh, which takes an IP, an environment string (envStr) and a path as input arguments:

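#!/bin/bash
# Launch helper that SCI invokes through SCI_REMOTE_SHELL.
# Arguments: remote host, environment string, path of the client to start.

REMOTE_IP=$1
ENVSTR=$2
CLIENT_PATH=$3

# Left unquoted on purpose so that each NAME=VALUE word is exported separately.
export $ENVSTR

if [ "$SCI_CLIENT_ID" -ge 0 ]; then
    # Non-negative client ID: start the client on the remote host in the
    # background with stdin, stdout and stderr closed.
    ssh -n "$REMOTE_IP" "$ENVSTR $CLIENT_PATH >&- 2>&- <&- &"
else
    # Negative client ID: only write the environment string to a file on the
    # remote host.
    ssh "$REMOTE_IP" "echo $ENVSTR > /tmp/$SCI_JOB_KEY.$SCI_CLIENT_ID"
fi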

2. Set SCI_REMOTE_SHELL=/path/to/runbe.sh together with SCI_USE_EXTLAUNCHER=yes. (You need to configure password-less ssh login, or replace ssh with rsh in the script if you prefer, provided it is also password-less.)

3. Before the back end calls SCI_Initialize, it must source the environment variables in the file /tmp/$SCI_JOB_KEY.$SCI_CLIENT_ID (the file its parent generated).
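Step 3 could be handled by a small wrapper around the back end. This is a minimal sketch, assuming the file written by runbe.sh holds space-separated NAME=VALUE pairs on one line (as the "echo $ENVSTR" in the script suggests) and that the values contain no spaces or shell metacharacters. load_sci_env is a hypothetical helper name, not part of SCI.

```shell
#!/bin/bash
# Hypothetical helper (not part of SCI): export every NAME=VALUE pair found in
# the environment file that the parent wrote for this back end.
load_sci_env() {
    local env_file="$1"
    if [ ! -r "$env_file" ]; then
        echo "cannot read $env_file" >&2
        return 1
    fi
    set -a          # mark every variable assigned from here on for export
    . "$env_file"   # a line of NAME=VALUE words is a list of plain assignments
    set +a
}
```

A wrapper script would call load_sci_env "/tmp/$SCI_JOB_KEY.$SCI_CLIENT_ID" and then exec the real back end, so that the variables are in place before SCI_Initialize runs. SCI_JOB_KEY and SCI_CLIENT_ID are assumed to be supplied by the external launcher. Because the file is sourced, it should only be read from a location other users cannot write to.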
Comment 24 Greg Watson CLA 2010-12-03 14:49:46 EST
The PTP debugger must be user installable, so the use of scid or any daemon that requires system privileges is not possible.

This script appears to require the frontend to ssh to every node. If this is the case, then it is not scalable enough for use with PTP. It also seems like a huge kludge.

Let me describe what we need, then perhaps you can modify SCI to support this launch method. Otherwise I don't think SCI will be suitable for use with PTP.

1. The frontend is launched and waits for incoming connections from backends/agents on a well known port. This port could be specified by an environment variable or passed to SCI_Initialize.
2. The backends/agents are launched using an external launcher. They have no knowledge of how they are launched and they are not launched by the frontend. 
3. The backends/agents have access to routing information that tells each backend/agent its ID, the port number to listen for children on (if it's not a leaf), the ID of its parent, the node the parent is running on, and the port number the parent is listening on.
4. If the backend/agent is not a leaf, it listens for children on the provided port.
5. Each backend/agent connects to its parent using its parent's node name and port number.
6. The frontend waits until all the backends/agents have connected (either to their parents or to the frontend).
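To make the routing information in step 3 concrete, here is a rough sketch of how a backend/agent could look up its own entry. Everything in it is an assumption made for illustration: the file layout (one line per process with ID, parent ID, parent host, parent port, and the port to listen on, or "-" for a leaf), the route_for helper and the output names are hypothetical, and none of this is an existing SCI feature.

```shell
#!/bin/bash
# Hypothetical routing lookup (illustration only, not an SCI feature).
# Routing file layout, one process per line:
#   <id> <parent_id> <parent_host> <parent_port> <listen_port, or - for a leaf>
route_for() {
    # $1 = routing file, $2 = ID of the calling backend/agent
    awk -v id="$2" '
        $1 == id {
            line = "ID=" $1 " PARENT_ID=" $2 " PARENT_HOST=" $3 " PARENT_PORT=" $4
            if ($5 != "-") line = line " LISTEN_PORT=" $5
            print line
            found = 1
            exit
        }
        END { exit !found }   # non-zero status when the ID is not listed
    ' "$1"
}
```

With such a table, a non-leaf process would open LISTEN_PORT for its children and then connect to PARENT_HOST:PARENT_PORT, while a leaf would only connect to its parent.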
Comment 25 Greg Watson CLA 2010-12-03 18:31:35 EST
It occurs to me that there might be another way around this problem.

If scid did not need to run as root, then rather than using mpirun to launch the debugger backends/agents, it could be used to launch scid's instead. The debugger launch would then consist of the following steps:

1. The scid's are launched using mpirun.
2. The frontend is launched with a hostlist containing all the hosts that the scid's were launched on.
3. Each scid forks a backend/agent which then establishes communication using the usual SCI mechanism.
4. The environment is passed from the scid to the backend/agent.
5. The scid's shut down once the backend/agents have started.

I've tried this for the simple case on a local machine using the dsh use case and it appears to work.

Is there any particular reason that scid needs to run as root if it is only going to be starting an agent for a single user? Also, is there any way to tell the scid's to shut down?
Comment 26 Tu Hong Jun CLA 2010-12-04 12:41:49 EST
That script doesn't mean the front end has to ssh to every node. Instead, the front end uses the script to launch its direct children (the first-layer agents if the topology has more than one layer), the second layer uses the script to launch its own direct children, and so on until the leaves are launched. In fact, the script mechanism follows exactly the same topology as when using scid; it just replaces the message exchanges with scid. Before scid was implemented, SCI used SCI_REMOTE_SHELL=ssh/rsh to launch agents/back ends.

I think using scid should be simpler. You are right: if users only need to launch their own jobs, scid doesn't need root privileges, and the code can be changed easily. To shut down scid, can you use mpirun to kill it? The only issue that comes to mind is the port number: scid must use a well-known port number, which can be specified through SCI_DAEMON_PORT. Both the library and the daemon use that environment variable.
Comment 27 Tu Hong Jun CLA 2010-12-06 05:25:44 EST
When I tested the external launching mode without scid, I didn't find any problem. Below are the environment variables I used. There is only one entry in my host.list:
$ cat host.list
localhost

FE:
$ env | grep SCI
SCI_USE_EXTLAUNCHER=yes
SCI_REMOTE_SHELL=true
SCI_AGENT_PATH=/opt/sci/bin
SCI_LIB_PATH=/opt/sci/lib64
SCI_LISTENER_PORT=44444
SCI_JOB_KEY=12345
SCI_EMBED_AGENT=yes

BE:
SCI_USE_EXTLAUNCHER=yes
SCI_PARENT_ID=-1
SCI_PARENT_HOSTNAME=localhost
SCI_PARENT_PORT=44444
SCI_CLIENT_ID=0
SCI_JOB_KEY=12345
SCI_EMBED_AGENT=yes

The result is that dsh_be64 sits waiting and dsh_fe gets:
>>> ls
0: buildit
0: CVS
0: ddsh_fe64
0: ddsh_fe.c
0: dsh_be64
0: dsh_be.c
0: dsh_fe64
0: dsh_fe.c
0: edsh_fe64
0: edsh_fe.cpp
0: gdsh_fe64
0: gdsh_fe.cpp
0: host.list
0: Makefile
0: Makefile.aix
0: use_ext_launcher
0: use_ext_launcher2
>>>
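The back-end settings listed above can be wrapped in a small launcher so they are not exported by hand each time. This is only a sketch built from the values in this comment: run_be is a hypothetical helper name, and SCI_PARENT_ID is fixed at -1 on the assumption that the parent is the front end, as in the single-host test above.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of SCI): run a back end with the environment
# used in the scid-less external-launch test above.
run_be() {
    # $1 = parent host, $2 = parent port, $3 = client ID, $4 = job key,
    # remaining arguments = the back-end command line
    local host="$1" port="$2" client="$3" key="$4"
    shift 4
    env SCI_USE_EXTLAUNCHER=yes \
        SCI_PARENT_ID=-1 \
        SCI_PARENT_HOSTNAME="$host" \
        SCI_PARENT_PORT="$port" \
        SCI_CLIENT_ID="$client" \
        SCI_JOB_KEY="$key" \
        SCI_EMBED_AGENT=yes \
        "$@"
}
```

For the test above this would be invoked as run_be localhost 44444 0 12345 ./dsh_be64, with the port and job key matching the front end's SCI_LISTENER_PORT and SCI_JOB_KEY.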
Comment 28 Greg Watson CLA 2010-12-06 18:54:04 EST
Yes, I am able to get this to work now. The main problem was that my host.list file had two entries for localhost, which was causing the fe to hang (presumably waiting for a second be).

It would be good if this could be documented somewhere (what environment variables are required, and why).

Thanks!
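A quick check of the host list before starting the front end would have caught this. The sketch below assumes host.list holds one host name per line and that the front end presumably waits for one back end per entry; host_entries and repeated_hosts are hypothetical helper names, not part of SCI.

```shell
#!/bin/bash
# Hypothetical pre-flight checks for an SCI host list (not part of SCI).

# Count the non-blank entries, i.e. how many back ends the front end would
# presumably wait for.
host_entries() {
    grep -c -v '^[[:space:]]*$' "$1"
}

# Print every host name that appears more than once.
repeated_hosts() {
    grep -v '^[[:space:]]*$' "$1" | sort | uniq -d
}
```

Running both against host.list and comparing the count with the number of back ends actually launched shows whether the front end will be left waiting.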
Comment 29 Tu Hong Jun CLA 2010-12-08 02:09:18 EST
We can update the SCI introduction document on the wiki: http://wiki.eclipse.org/PTP/designs/SCI#Introduction. A few changes are needed. I can point them out to you, or may I update it directly?
Comment 30 Greg Watson CLA 2010-12-08 06:20:49 EST
Please update it directly. Thanks!
Comment 31 rong li CLA 2010-12-08 08:50:52 EST
I have added descriptions of the following environment variables, as below. Please help review. Thanks.
As I'm still not a committer right now, I don't have permission to update it directly myself.

1). SCI_DAEMON_NAME: the name of the sci daemon. This environment variable needs to be specified for both the sci library and the sci daemon. If it is set, the library and the daemon will use the specified daemon name to look up the entry in '/etc/services' and get the daemon port.

2). SCI_PARENT_HOSTNAME: the host name of the parent (the front end or the agent). No default value.

3). SCI_PARENT_ID: the ID of the parent (the front end or the agent). This value should be set to the parent's ID, such as '-1' (for the front end) or '-2' (for an agent). No default value.

4). SCI_PARENT_PORT: the port on which the parent is listening. The child (the back end or the child agent) can use this port to connect back to its parent. No default value.
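To see which port a given SCI_DAEMON_NAME would resolve to, the services file can be queried from the shell. This is a minimal sketch that parses the standard '/etc/services' layout (name, then port/protocol); service_port is a hypothetical helper name, it does not handle service aliases, and it only approximates whatever lookup SCI performs internally.

```shell
#!/bin/bash
# Hypothetical helper (not part of SCI): print the port registered for a
# service name in a services-format file.
service_port() {
    # $1 = service name, $2 = services file (defaults to /etc/services)
    awk -v name="$1" '
        /^[[:space:]]*#/ { next }            # skip comment lines
        $1 == name {
            split($2, pp, "/")               # "port/protocol" -> port
            print pp[1]
            found = 1
            exit
        }
        END { exit !found }                  # non-zero status when not found
    ' "${2:-/etc/services}"
}
```

On glibc systems "getent services NAME" performs a comparable lookup through the system's name service configuration.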
Comment 32 Greg Watson CLA 2010-12-08 16:04:35 EST
You don't need to be a committer to update the wiki. You can log into the wiki using your bugzilla ID.
Comment 33 Tu Hong Jun CLA 2010-12-12 11:20:25 EST
Hi Greg,

I think I have fixed the problem with more than one address (IPv4 and IPv6 together), as I saw your comments in the code. Could you try to verify it on your Mac OS machine when you have time? Thanks.

I also changed the 'Launch modes' section in the wiki page http://wiki.eclipse.org/PTP/designs/SCI#Launch_modes. It would be nice if you could help re-word the sentences to improve readability if necessary. Thanks in advance.

Jessica, I think you can add all the descriptions of the environment variables from the 5XX design to this wiki.
Comment 34 rong li CLA 2012-11-06 05:03:54 EST
The changes have already been committed into the branch.