
Assist SGI to port to Intel MPI with Hydra launcher #14

Closed
dongahn opened this issue Apr 30, 2016 · 55 comments


dongahn commented Apr 30, 2016

There is an out-of-band communication effort to port LaunchMON to Intel MPI with the Hydra environment. Created this ticket to capture any significant issues that may arise during that effort.


dongahn commented May 6, 2016

The hostlist file fix added in PR #18 can help with this environment as well.


dongahn commented May 7, 2016

I pushed a commit to my fork to start assisting James Southern ([email protected]) with porting STAT/LaunchMON to Intel Hydra for AWE: the commit is here.


dongahn commented May 7, 2016

As you can see here, the LaunchMON backend API expects its options to appear at the end of the command line. So if mpiexec.hydra appends anything else to the backend launch string, launchmon will not proceed.

I guess that's sort of the case from your email:

I realised that I can get Intel MPI to print its own command line arguments via its “-v” flag, so I can make some progress with debugging what is going on. At the moment, I set the following lines in rm_intel_hydra.conf:

RM=intel_hydra
RM_MPIR=STD
RM_launcher=mpiexec.hydra
RM_launcher_id=RM_launcher|sym|i_mpi_hyd_cr_init
RM_jobid=RM_launcher|sym|totalview_jobid|string
RM_launch_helper=mpiexec.hydra
RM_signal_for_kill=SIGINT|SIGINT
RM_fail_detection=true
RM_launch_str=-v -f %l -n %n %d %o --lmonsharedsec=%s --lmonsecchk=%c

This results in the following command line for mpiexec.hydra when running LaunchMON:

mpiexec.hydra -f /nas/store/jsouthern/STAT/hostnamefn.30456 -n 1 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /store/jsouthern/STAT --exec-args 3 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161

I can see that the file hostnamefn. is created in src/linux/sdbg_linux_launchmon.cxx from the proctable, so I guess that these are the places where I need to insert my nodelist. However, there does seem to be an error with the command line. For one, it appears to specify the STATD executable twice. Should this be the case?

When I try to run the mpiexec command myself, it appears that the command line as specified above results in errors (see below). When I remove the second call to STATD, however, there are no errors (although I can’t tell whether or not the daemons attach successfully since the call just waits – presumably for the next part of the LaunchMON code).

jsouthern@r2i7n11 ~/STAT $ mpiexec.hydra -hosts r2i7n11 -n 1 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /store/jsouthern/STAT --exec-args 3 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161
<May 06 06:06:46> (ERROR): LaunchMON-specific arguments have not been passed to the daemon through the command-line arguments.
<May 06 06:06:46> (ERROR): the command line that the user provided could have been truncated.
^C[mpiexec@r2i7n11] Sending Ctrl-C to processes as requested
[mpiexec@r2i7n11] Press Ctrl-C again to force abort
jsouthern@r2i7n11 ~/STAT $
jsouthern@r2i7n11 ~/STAT $
jsouthern@r2i7n11 ~/STAT $ mpiexec.hydra -hosts r2i7n11 -n 1 /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /store/jsouthern/STAT --exec-args 3
^C[mpiexec@r2i7n11] Sending Ctrl-C to processes as requested
[mpiexec@r2i7n11] Press Ctrl-C again to force abort


stonydon commented May 9, 2016

@jsthrn: testing for your GitHub id.


jsthrn commented May 10, 2016

The test of my ID worked. I got the email and the link points to my profile.

James


jsthrn commented May 10, 2016

I checked out the intel_hydra_prelim branch. Unfortunately I can't get it to build. After updating autotools, I now see the following output:

jsouthern@cy001 ~/launchmon $ CPP="gcc -E -P" CPPFLAGS="-I/store/jsouthern/tmp/install/include -I/store/jsouthern/packages/boost/1.60.0/include" LDFLAGS="-L/store/jsouthern/tmp/install/lib" ./configure --prefix=/store/jsouthern/tmp/install --with-myboost=/store/jsouthern/packages/boost/1.60.0
configure: WARNING: unrecognized options: --with-myboost
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for pkg-config... /store/jsouthern/packages/pkg-config/0.29.1/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking whether to enable maintainer-specific portions of Makefiles... no
checking whether to turn on a workaround for slurm's MPIR_partitial_attach_ok bug... no
checking whether to enable debug codes... no
checking whether to enable verbose codes... no
./configure: line 3950: syntax error near unexpected token `1.2.0,'
./configure: line 3950: `AM_PATH_LIBGCRYPT(1.2.0,'

Is this something that you have seen before? I can see that there was a version of libgcrypt in the tools/ directory previously, but now that is missing. Do I need to install a version elsewhere (and then provide a way for automake to see it)?


jsthrn commented May 10, 2016

Regarding mpiexec.hydra appending its own flags to the backend, I can certainly see that could be possible (the "--exec-<>" ones). However, there are also two copies of "/store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161" in the command line. One of these is the very last thing, so that would suggest that things should actually be ok.

The full daemon command line (copied from above, but a bit more readable here!) is:

mpiexec.hydra -f /nas/store/jsouthern/STAT/hostnamefn.30456 -n 1 \
    /store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161 \
    --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 \
    --exec-wdir /store/jsouthern/STAT --exec-args 3 /store/jsouthern/tmp/install/bin/STATD \
    --lmonsharedsec=2082992184 --lmonsecchk=548371161

So, this does have the Launchmon options right at the end as required.

Note that for another application I get the following (which also has two copies of the executable - again with one at the end, so maybe that is correct?):

mpiexec.hydra -v -n 4 ./simple  --exec --exec-appnum 0 --exec-proc-count 4 \
    --exec-local-env 0 --exec-wdir /store/jsouthern/STAT --exec-args 1 ./simple


dongahn commented May 10, 2016

Is this something that you have seen before? I can see that there was a version of libgcrypt in the tools/ directory previously, but now that is missing. Do I need to install a version elsewhere (and then provide a way for automake to see it)?

The bundled gcrypt has been deprecated, as the bundled version was getting old and caused problems for various packaging systems. As long as you have a decent gcrypt package installed on your system, this should be okay.

CPP="gcc -E -P" CPPFLAGS="-I/store/jsouthern/tmp/install/include -I/store/jsouthern/packages/boost/1.60.0/include" LDFLAGS="-L/store/jsouthern/tmp/install/lib" ./configure --prefix=/store/jsouthern/tmp/install --with-myboost=/store/jsouthern/packages/boost/1.60.0
configure: WARNING: unrecognized options: --with-myboost

--with-myboost has also been deprecated, and a version of boost is now a requirement to build launchmon. Can you make sure the following packages are installed on your system? (What Linux distribution are you using?)

  • libelf-dev
  • libboost-dev
  • munge (this is required for secure handshake. There is a config option that allows you to test LaunchMON without this though)

What happens if you just run the following once these requirements are satisfied?

% bootstrap
% CPP="gcc -E -P" ./configure --prefix=/store/jsouthern/tmp/install


dongahn commented May 10, 2016

./configure: line 3950: syntax error near unexpected token `1.2.0,'
./configure: line 3950: `AM_PATH_LIBGCRYPT(1.2.0,'

Did bootstrap give you any error message about AM_PATH_LIBGCRYPT?
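
If it did not expand AM_PATH_LIBGCRYPT, the usual reason is that aclocal never saw libgcrypt.m4, which the libgcrypt development package installs. A quick check along these lines may help (only a sketch; the paths depend on how libgcrypt was installed on your system):

aclocal --print-ac-dir                        # where aclocal looks for third-party .m4 files
ls "$(aclocal --print-ac-dir)"/libgcrypt.m4   # present only if a dev package put it there
# If libgcrypt lives under a non-default prefix, point aclocal at its share/aclocal
# directory (ACLOCAL_PATH is honored by reasonably recent automake) and re-run bootstrap:
export ACLOCAL_PATH=/store/jsouthern/tmp/install/share/aclocal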


dongahn commented May 10, 2016

Regarding mpiexec.hydra appending its own flags to the backend, I can certainly see that could be possible (the "--exec-<>" ones). However, there are also two copies of "/store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161" in the command line. One of these is the very last thing, so that would suggest that things should actually be ok.

OK. Thanks. Once you get to the point where you can reproduce the original problem using LaunchMON's own simple test with the new version, let's tease this problem apart as well.


jsthrn commented May 11, 2016

So, after building various packages and updating the Launchmon build, it looks like I can now reproduce the original problem. Output (with "-V" switched off in mpiexec.hydra) is:

jsouthern@r1i3n22 ~/STAT $ ps -u jsouthern
  PID TTY          TIME CMD
51257 pts/0    00:00:00 bash
51258 pts/0    00:00:00 pbs_demux
51317 pts/0    00:00:00 mpirun
51322 pts/0    00:00:00 mpiexec.hydra
51323 pts/0    00:00:00 pmi_proxy
51327 pts/0    00:00:19 simple
51328 pts/0    00:00:19 simple
51329 pts/0    00:00:19 simple
51330 pts/0    00:00:19 simple
51333 ?        00:00:00 sshd
51334 pts/1    00:00:00 bash
51389 pts/1    00:00:00 ps
jsouthern@r1i3n22 ~/STAT $
jsouthern@r1i3n22 ~/STAT $ stat-cl 51322
STAT started at 2016-05-11-06:56:20
Attaching to job launcher (null):51322 and launching tool daemons...

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 51398 RUNNING AT r1i3n22.ib0.smc-default.americas.sgi.com
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 51398 RUNNING AT r1i3n22.ib0.smc-default.americas.sgi.com
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Aborted

Full output (with "-V" enabled) is shown in this file


dongahn commented May 12, 2016

So, after building various packages and updating the Launchmon build, it looks like I can now reproduce the original problem.

@jsthrn: Progress!

It is kind of difficult to see where the backend daemons die or whether they have even been launched.

Could you quickly run the configure again with the following config option and rebuild?

--enable-verbose=<log_dir>

If this works (and the daemons are indeed launched and then fail), running your test should dump some output files into <log_dir>. Could you please post them here?

Also kind of curious who's returning 6 as the exit code.


jsthrn commented May 12, 2016

I ran with --enable-verbose. The stdout file is attached.

It looks to me like there is a problem with my munge install (which presumably isn't what we saw with the release version of Launchmon, as that doesn't use munge!). I will have a look at this and see whether I can work out why the munge.socket.2 file is missing on my system.

By the way, @dongahn we are making progress with getting you access to a test system with our software stack enabled (it will be very old hardware, but that shouldn't be an issue).


jsthrn commented May 12, 2016

So, it turned out that I hadn't started the munge daemon, which explains why that didn't work! Once I do that I get more output - and no exit code 6.

Here are the updated be.stdout and be.stderr files.

These now look more like the errors I was seeing previously, with "proc control initialization failed" error messages.


jsthrn commented May 12, 2016

@dongahn, I am requesting an account for you on a system now. I've already verified that Launchmon (and the rest of the STAT toolchain) builds and runs on the system.

Please let me know your preferred shell (bash, csh, tcsh, ksh, zsh) and I will submit the request.


dongahn commented May 12, 2016

great. tcsh should work.


jsthrn commented May 12, 2016

Thanks. I submitted the request. Hopefully they should come back to you direct with the logon details. If not then I will forward them to you when I have them.


dongahn commented May 12, 2016

OK. I looked at the trace and you are much farther along with the munge fix.

Apparently the error is coming out here. And this is because of an error percolating up from the backend's procctl layer here.

Procctl is the layer responsible for normalizing resource manager (RM)-specific synchronization mechanisms between the target MPI job and the tools. RMs implement the MPIR debug interface for this purpose, but how they implement it differs across RMs. So LaunchMON introduced the procctl layer.

Two things:

  1. Your current test case is STAT's attach mode. This is the simpler case to handle in terms of MPI-tool synchronization. In fact, I think the error you are seeing can be addressed by adding a Hydra-specific case within the switch statements across the procctl functions.
  2. To complete the port, however, we will need to address launch mode, which is a bit more complex.

I will take a wild guess and add the case statements to help you address 1 first. Once you get past that, you may want to do a feasibility check that STAT can attach to a hung job.

Then, let's discuss what needs to be done for 2. This could be as simple as you educating me about hydra's MPI-tool synchronization mechanisms and me choosing the right procctl primitives to adjust LaunchMON to hydra.


dongahn commented May 12, 2016

@jsthrn: By the way, once this port is all done, it would be nice if you can provide us with your environments. As part of #25, @mcfadden8 wants to investigate how much RM-specific stuff we can integrate into Travis CI (as a separate testing instance), and ideally we want to be able to do this for as many of the RMs that LaunchMON supports as possible.

Does Intel MPI require a license to use?

@lee218llnl

@dongahn Intel MPI does not require a license to run, just install. FYI, we do have it locally on LC systems (use impi-5.1.3 or peruse /usr/local/tools/impi-5.1.3).


dongahn commented May 12, 2016

Cool!


dongahn commented May 12, 2016

@jsthrn: OK. I pushed the changes to the intel_hydra_prelim branch of my fork. Please fetch and rebase. Let me know if this helps you pass the current failure.


dongahn commented May 12, 2016

Drat... somehow Travis doesn't like my changes. Let me look.


dongahn commented May 12, 2016

I need to rebase intel_hydra_prelim onto the current upstream master to pick up .travis.yml.


dongahn commented May 12, 2016

OK. Travis is happy now.


jsthrn commented May 13, 2016

@dongahn, we have set up an account for you on one of our development machines. I will send the details by email (don't want the password to be visible on the web!).


jsthrn commented May 13, 2016

@dongahn, I just tested your latest version of the code on the test system. Looks like things have moved forward. On a single-node job, STAT daemons attached to the application, obtained its samples and detached successfully.

The stdout file (from --enable-verbose) is here (the stderr file is empty).

For a multi-node job, however, there still seem to be issues. For this, STAT seems to hang just after reporting a completed server handshake (although I don't know whether that is on both nodes or just the local one). The stdout file for that run is here (stderr was empty again).


dongahn commented May 13, 2016

@dongahn, we have set up an account for you on one of our development machines. I will send the details by email (don't want the password to be visible on the web!).

Great! Thanks.


dongahn commented May 13, 2016

@dongahn, I just tested your latest version of the code on the test system. Looks like things have moved forward. On a single-node job, STAT daemons attached to the application, obtained its samples and detached successfully.

More progress!

For a multi-node job, however, there still seem to be issues. For this, STAT seems to hang just after reporting a completed server handshake (although I don't know whether that is on both nodes or just the local one).

If the remote one also launched, there should be two stdout files. Do you see both?


jsthrn commented May 13, 2016

In that case, the other one was empty. I thought that I'd run it twice by mistake and that was why there were two files.


dongahn commented May 13, 2016

BTW, I see lots of

couldn't find an entry with an alias r01n01... trying the next alias

I see these error messages on a system where the hostname that the launcher (mpiexec.hydra in this case) fills into MPIR_Proctable doesn't match what comes out of gethostname() on a back-end node.

I will have to check, but I think I have logic that parses /etc/hosts to test the match against all of the aliases; in the end, though, we need to see the message

found an entry with an alias

meaning that MPIR_Proctable's hostname matches at least one of the aliases, which is a requirement for the BE to be successful.

We are probably not out of the woods yet.
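
A quick way to see what a back end can actually resolve is something like this, run on a compute node (just a sketch; the hostname is the one from your log):

hostname                    # what gethostname() returns on the back end
getent hosts $(hostname)    # the aliases /etc/hosts (or NSS) associates with that name
grep -i r01n01 /etc/hosts   # does the hosts file list every alias the launcher might report?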


dongahn commented May 14, 2016

@jsthrn:

So, I poked around your system a bit, and I now believe that you can produce a reasonable port for your environment. However, I discovered that there is a system issue you will have to address and that you will need to add some new code to complete an Intel hydra port.

As I suspected above, this system has hostname consistency issues. As you can see from here, the launchmon backend API runtime tries hard to collect as many hostname aliases as possible for the host where it is running.

Despite this, it turns out mpiexec.hydra generates unmatchable backend hostnames for MPIR_Proctable -- they don't match any of these aliases. For example, on the first node, hydra generates r01n01.ib0.smc-default.sgi.com as the hostname, but the back-end-collected hostname aliases don't include it. The aliases that the backend tried to match are captured in a log file:

couldn't find an entry with an alias r01n01... trying the next alias
couldn't find an entry with an alias 10.148.0.2... trying the next alias
couldn't find an entry with an alias r01n01.smc-default.sgi.com... trying the next alias
couldn't find an entry with an alias service1... trying the next alias

It has r01n01.smc-default.sgi.com but not r01n01.ib0.smc-default.sgi.com.

I have to think this is fixable... I am not sure if you can fix this issue by adding this ib0 alias to /etc/hosts on each remote node, but it seems worth trying. Nevertheless, this is a system issue as opposed to a LaunchMON issue.
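
If you do try that route, the entry would look something like the line below -- a sketch only: the address is just the one from the log above, and the ib0 interface may well use a different IP, so please verify it before rolling anything out:

10.148.0.2   r01n01.ib0.smc-default.sgi.com   r01n01.smc-default.sgi.com   r01n01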

In addition, it appears that you will also need to augment the bulk launching string within LaunchMON to adapt it to hydra's launching options.

As is, the daemon launch string is expanded into something like:

mpiexec.hydra -v -f \
/nas/store/dahn/workspace/launchmon-72933d7/build/test/src/hostnamefn.8839 \
-n 2 /store/dahn/workspace/launchmon-72933d7/build/test/src/be_kicker 10 \
--lmonsharedsec=705078152 --lmonsecchk=22873882

But because of how hydra works, this will launch both of the tool daemon processes onto the first node specified in hostnamefn.8839. I believe you can overcome this by using the -machine option instead, which takes a file containing an explicit machine-to-process-count mapping. But this format isn't something that LaunchMON supports yet.

mpiexec.hydra -v -machine \
/nas/store/dahn/workspace/launchmon-72933d7/build/test/src/hostnamefn.8839 \
-n 2 /store/dahn/workspace/launchmon-72933d7/build/test/src/be_kicker 10 \
--lmonsharedsec=705078152 --lmonsecchk=22873882

cat hostnamefn.8839
r01n01:1
r01n02:1

This will require a new launch string option beyond %l, such as %m, which would then be expanded into the filename containing that machine-to-process mapping info.
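
With such an option, the launch string in rm_intel_hydra.conf could then look along these lines (a sketch only; %m does not exist yet and the name is only a proposal):

RM_launch_str=-v -machine %m -n %n %d %o --lmonsharedsec=%s --lmonsecchk=%c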

Some of the relevant code can be found here and here.

If you create a patch and submit a PR, I will review and merge it.

There will also be miscellaneous work items like adding Intel Hydra-specific code into the test codes to complete the port. Here is an example of test/src/test.attach_1 that I manually modified:

RM_TYPE=RC_intel_hydra
NUMNODES=1

if test "x$RM_TYPE" = "xRC_bglrm" -o "x$RM_TYPE" = "xRC_bgprm"; then
  rm -f nohup.out
fi

NUMTASKS=`expr $NUMNODES \* 16`

WAITAMOUNT=$NUMNODES 
if test $NUMNODES -lt 20 ; then 
  WAITAMOUNT=20
fi 

SIGNUM=10
MPI_JOB_LAUNCHER_PATH=/sw/sdev/intel/parallel_studio_xe_2016_update2/impi/5.1.3.181/intel64/bin/mpiexec.hydra
export LMON_LAUNCHMON_ENGINE_PATH=/store/dahn/workspace/stage/bin/launchmon
if test "x/store/dahn/workspace/launchmon-1c5c420/build/workspace/stage" != "x0"; then
    export LMON_PREFIX=/store/dahn/workspace/stage
else
    export LMON_RM_CONFIG_DIR=0
    export LMON_COLOC_UTIL_DIR=0
fi


if test "x$RM_TYPE" = "xRC_slurm" ; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  $MPI_JOB_LAUNCHER_PATH -n$NUMTASKS -N$NUMNODES -ppdebug `pwd`/hang_on_SIGUSR1 &
elif test "x$RM_TYPE" = "xRC_bglrm" -o "x$RM_TYPE" = "xRC_bgprm"; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  nohup $MPI_JOB_LAUNCHER_PATH -verbose 1 -np $NUMTASKS -exe `pwd`/hang_on_SIGUSR1 -cwd `pwd` &
elif test "x$RM_TYPE" = "xRC_bgqrm"; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  $MPI_JOB_LAUNCHER_PATH --verbose 4 --np $NUMTASKS --exe `pwd`/hang_on_SIGUSR1 --cwd `pwd` --env-all &
elif test "x$RM_TYPE" = "xRC_bgq_slurm"; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  $MPI_JOB_LAUNCHER_PATH -N$NUMNODES -n $NUMTASKS `pwd`/hang_on_SIGUSR1 &
elif test "x$RM_TYPE" = "xRC_alps" ; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  $MPI_JOB_LAUNCHER_PATH -n $NUMTASKS `pwd`/hang_on_SIGUSR1 &
elif test "x$RM_TYPE" = "xRC_orte" ; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  $MPI_JOB_LAUNCHER_PATH -mca debugger mpirx -np $NUMTASKS `pwd`/hang_on_SIGUSR1 &
elif test "x$RM_TYPE" = "xRC_intel_hydra" ; then
  WAITAMOUNT=`expr $WAITAMOUNT`
  $MPI_JOB_LAUNCHER_PATH -np $NUMTASKS `pwd`/hang_on_SIGUSR1 &
else
  echo "This RM is not supported yet" 
fi

PID=`echo $!` 

sleep $WAITAMOUNT #wait until the job gets stalled 

./fe_attach_smoketest $PID `pwd`/be_kicker $SIGNUM 

sleep $WAITAMOUNT

Finally, you will also need to add some config m4 scripts to be able to configure and build the test codes for Intel hydra. Please look at m4 files like here and here.

Hope this helps!


jsthrn commented May 16, 2016

Thanks for the very comprehensive instructions! I will try to give this a go, but it might take me a while to get something working.

On the hostname issue, at least for SGI systems the "correct" name (or at least one that will be valid) is always the bit before the first dot (e.g. r1i0n0). Would it be possible to trim the hostname returned by hydra and then use that? I just worry that it will be difficult for individual users to change files like /etc/hostname to include the full alias. Alternatively, since the .ib0. part of the hostname comes from the PBS nodefile, maybe we can parse that before running the application?
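
For illustration, the trim I have in mind is nothing more than this (shell parameter expansion; purely a sketch):

full=r01n01.ib0.smc-default.sgi.com   # name as reported by hydra
short=${full%%.*}                     # -> r01n01, the part before the first dot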


dongahn commented May 16, 2016

There is always a danger if you do the match test based only on the first name component: two different hosts could be matched as identical.

It feels to me that we probably don't want to introduce that as the default match test. But it seems ok if you add this as an additional partial test and only do it if the fully qualified tests all fail?

It would also be nice if we could make such a test a config-time option through platform_compat.


jsthrn commented May 26, 2016

So, it looks like the hostname issue can be "fixed" by modifying the nodefile created by PBS. While not ideal, this can be done by the user, whereas /etc/hosts is auto-generated for each node on the system by SGI Management Center. This also would not require changes to use (potentially dangerous) partial matches for hostnames in Launchmon.

If I run like this:

cat $PBS_NODEFILE | sed 's/.ib0//g' > nodefile
export PBS_NODEFILE=${PWD}/nodefile
mpirun -n 4 ./simple

Then the log file entries change to:

couldn't find an entry with an alias r01n03... trying the next alias
couldn't find an entry with an alias 10.148.0.4... trying the next alias
found an entry with an alias r01n03.smc-default.sgi.com.


jsthrn commented May 26, 2016

To launch the daemons on the correct nodes, I think that -ppn 1 can be used instead of -machine. This specifies that one process should be launched on each host specified in the hostfile - which I think is what is required.

By altering the rm_intel_hydra.conf file to use this option, I can see the daemons launching on the correct nodes. However, the daemon launched on the remote node does not seem to run properly. The output looks like this:

[proxy:0:0@r01n03] Start PMI_proxy 0
[proxy:0:0@r01n03] STDIN will be redirected to 1 fd(s): 25
[handshake.c:186] - Starting handshake from client
[handshake.c:1125] - Looking up server and client addresses for socket 7
[handshake.c:1156] - Sending sig 845d96c1 on network
[handshake.c:1163] - Receiving sig from network
[handshake.c:308] - Creating outgoing packet for handshake
[handshake.c:319] - Encoded packet: server_port = 34126, client_port = 64470, uid = 48837, gid = 100, session_id = 10, signature = 9b1cc028
[handshake.c:324] - Encrypting outgoing packet
[handshake.c:461] - Server encrypting packet with munge
[handshake.c:548] - Munge encoded packet successfully
[handshake.c:331] - Encrypted packet to buffer of size 212
[handshake.c:1182] - Sending packet size on network
[handshake.c:1190] - Sending packet on network
[handshake.c:1205] - Receiving packet size from network
[handshake.c:1211] - Received packet size 212
[handshake.c:1224] - Received packet from network
[handshake.c:358] - Creating an expected packet
[handshake.c:371] - Decrypting and checking packet
[handshake.c:825] - Decrypting and checking packet with munge
[handshake.c:1071] - Packets compared equal.
[handshake.c:379] - Successfully completed initial handshake
[handshake.c:1094] - Sharing handshake result 0 with peer
[handshake.c:1102] - Reading peer result
[handshake.c:1108] - Peer reported result of 0
[handshake.c:277] - Completed server handshake.  Result = 0
[proxy:0:1@r01n04] Start PMI_proxy 1

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 25233 RUNNING AT r01n04.smc-default.sgi.com
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
<May 26 08:26:02> <LMON FE API> (ERROR): Received an invalid LMONP msg: Front-end back-end protocol mismatch? or back-end disconnected?
<May 26 08:26:02> <LMON FE API> (ERROR):   A proper msg of {Class(lmonp_febe_security_chk),Type(32767),LMON_payload_size()} is expected.lmonp_fetobe
<May 26 08:26:02> <LMON FE API> (ERROR):   A msg of {Class((null)),Type((null)),LMON_payload_size(6361488)} has been received.
<May 26 08:26:02> <STAT_FrontEnd.C: 586> STAT returned error type STAT_LMON_ERROR: Failed to attach to job launcher and spawn daemons
<May 26 08:26:02> <STAT_FrontEnd.C: 442> STAT returned error type STAT_LMON_ERROR: Failed to attach and spawn daemons
<May 26 08:26:02> <STAT.C: 152> STAT returned error type STAT_LMON_ERROR: Failed to launch MRNet tree()
<May 26 08:26:02> <STAT_FrontEnd.C: 3294> STAT returned error type STAT_FILE_ERROR: Output directory not created.  Performance results not written.
<May 26 08:26:02> <STAT_FrontEnd.C: 3417> STAT returned error type STAT_FILE_ERROR: Failed to dump performance results

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 25233 RUNNING AT r01n04.smc-default.sgi.com
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

Looking at the output file from the remote node, it looks like the problem is with munge:

[handshake.c:180] - Starting handshake from server
[handshake.c:1125] - Looking up server and client addresses for socket 6
[handshake.c:1156] - Sending sig 845d96c1 on network
[handshake.c:1163] - Receiving sig from network
[handshake.c:308] - Creating outgoing packet for handshake
[handshake.c:319] - Encoded packet: server_port = 34126, client_port = 48772, uid = 48837, gid = 100, session_id = 10, signature = 67ad047e
[handshake.c:324] - Encrypting outgoing packet
[handshake.c:461] - Server encrypting packet with munge
ERROR: [handshake.c:541] - Munge failed to encrypt packet with error: Failed to connect to "/store/jsouthern/packages/munge/0.5.12/var/run/munge/munge.socket.2": Connection refused
[handshake.c:327] - Error in server encrypting outgoing packet[handshake.c:1094] - Sharing handshake result -2 with peer
[handshake.c:1102] - Reading peer result
[handshake.c:1108] - Peer reported result of 212
[handshake.c:277] - Completed server handshake.  Result = -1

It seems that I can't start munge on more than one node as I get errors like:

jsouthern@r01n04:~/STAT $ /store/jsouthern/packages/munge/0.5.12/etc/init.d/munge start
redirecting to systemctl start .service
Starting MUNGE: munged                                                           failed
munged: Error: Found inconsistent state for lock "/store/jsouthern/packages/munge/0.5.12/var/run/munge/munge.socket.2.lock"

Is this something that you have seen before @dongahn? Is there a way to start munged across all nodes at the same time?
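
For what it's worth, starting munged on every node of the allocation would presumably look something like this (a sketch only -- it assumes passwordless ssh and that the munged path matches the prefix above). One guess, not verified, is that because the install prefix under /store is shared across nodes, every munged sees the same socket and lock files, so each node would likely need node-local socket and state file locations for this to work:

MUNGED=/store/jsouthern/packages/munge/0.5.12/sbin/munged
for host in $(sort -u "$PBS_NODEFILE"); do
  ssh "$host" "$MUNGED" &    # start the munge daemon on each unique node
done
wait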


dongahn commented May 26, 2016

So, it looks like the hostname issue can be "fixed" by modifying the nodefile created by PBS. While not ideal, this can be done by the user, whereas /etc/hosts is auto-generated for each node on the system by SGI Management Center. This also would not require changes to use (potentially dangerous) partial matches for hostnames in Launchmon.

Is there any way to make this transparent for the users? Users having to remember this seems like a usability problem.


dongahn commented May 26, 2016

Is this something that you have seen before @dongahn? Is there a way to start munged across all nodes at the same time?

I actually removed the secure handshake from tools/handshake for my quick validation on your system, so I haven't seen this. You can see the #if 0 macros in the source file under that directory if you have access to my local copy on your system.

Actually, the --enable-sec-none config option should disable the secure handshake for quick testing. But somehow I wasn't able to get this option to work on your system; I only tried it once and didn't spend time looking at what was wrong. It was implemented by @mplegendre, so if you see issues with that option, please send them along.

For quick testing/progress, though, I recommend you manually disable the secure handshake like I did in my local copy.
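
If you do want to give the config-option route a try first, it would just mean re-running configure with that flag added to your earlier invocation, e.g. (a sketch):

CPP="gcc -E -P" ./configure --prefix=/store/jsouthern/tmp/install --enable-sec-none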


jsthrn commented May 26, 2016

Yeah, having users make manual alterations to PBS_NODEFILE does seem to be a bit fragile. Long term I think that the solution will be to get the hostname including ib0 included in /etc/hosts. But I can see that being a slow process in terms of rolling out the software to do that - especially for existing customers who probably don't update very often. So, maybe I do need to go back and look at falling back to a partial match.

I will have a look at --enable-sec-none to disable the secure handshake and get back to you with any progress.


dongahn commented May 26, 2016

@jsthrn: Thanks James!


jsthrn commented May 26, 2016

It looks like the modified code runs to completion when configured with --enable-sec-none. And I get plots that look like this:

00_simple.pdf

So, I think that is successful... :-)

@lee218llnl

Very nice, that STAT output looks correct. Good job!


dongahn commented May 26, 2016

Ditto!


dongahn commented May 26, 2016

BTW when you say the modified code, did you mean my local copy with some sections in the handshake src commented out? In theory --enable-sec-none should not require code mods. Did you try this w/o the mods?


jsthrn commented May 27, 2016

The modified code is my local copy. So no sections in the handshake commented out. The only code modification I have made is to add the -ppn option to etc/rm_intel_hydra.conf (I also run with the modified nodefile as discussed above).

jsouthern@cy013:~/launchmon $ git status
On branch intel_hydra_prelim
Your branch is up-to-date with 'origin/intel_hydra_prelim'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   etc/rm_intel_hydra.conf

no changes added to commit (use "git add" and/or "git commit -a")
jsouthern@cy013:~/launchmon $ git --no-pager diff
diff --git a/etc/rm_intel_hydra.conf b/etc/rm_intel_hydra.conf
index 8fe5248..653f509 100644
--- a/etc/rm_intel_hydra.conf
+++ b/etc/rm_intel_hydra.conf
@@ -51,4 +51,4 @@ RM_launcher_id=RM_launcher|sym|i_mpi_hyd_cr_init
 RM_launch_helper=mpiexec.hydra
 RM_signal_for_kill=SIGINT|SIGINT
 RM_fail_detection=true
-RM_launch_str=-v -f %l -n %n %d %o --lmonsharedsec=%s --lmonsecchk=%c
+RM_launch_str=-v -f %l -n %n -ppn 1 %d %o --lmonsharedsec=%s --lmonsecchk=%c


jsthrn commented May 27, 2016

@dongahn, I have some commits on the intel_hydra_prelim branch that implement adding SGI hostnames and enabling the use of these via a configure flag.

This completes the port (I think), although not the miscellaneous tests. I'm not sure how to go about submitting a pull request? I'd like to be able to do it by pushing my commits on the branch and then selecting the "Pull Request" option above with the relevant branches. However, I don't seem to have permissions to push to the repository. Is it possible to enable that for me please?


jsthrn commented May 31, 2016

@dongahn, I have been looking at modifying the tests for use with Intel MPI today. It seems like the tests of attaching to a running process work - although I am not 100% sure what the expected output is in some cases - but there is still an error when launching an application via Launchmon (so, e.g. test.launch_1 fails).

The launch tests fail with errors like:

[mpiexec@r01n01] HYDU_parse_hostfile (../../utils/args/args.c:535): unable to open host file: nodelist

So, it looks like mpiexec.hydra is looking for a nodelist (command line argument -f) which is not present.

All my previous work has been looking at attaching to a running process. Is there something obvious in etc/rm_intel_hydra.conf that I can change in order to make a launch via LaunchMON not use a nodelist, while still using one when attaching?

jsouthern@cy013:~/launchmon $ cat etc/rm_intel_hydra.conf
## $Header: $
##
## rm_intel_hydra.conf
##
##--------------------------------------------------------------------------------
## Copyright (c) 2008, Lawrence Livermore National Security, LLC. Produced at
## the Lawrence Livermore National Laboratory. Written by Dong H. Ahn <[email protected]>.
## LLNL-CODE-409469. All rights reserved.
##
## This file is part of LaunchMON. For details, see
## https://computing.llnl.gov/?set=resources&page=os_projects
##
## Please also read LICENSE -- Our Notice and GNU Lesser General Public License.
##
##
## This program is free software; you can redistribute it and/or modify it under the
## terms of the GNU General Public License (as published by the Free Software
## Foundation) version 2.1 dated February 1999.
##
## This program is distributed in the hope that it will be useful, but WITHOUT ANY
## WARRANTY; without even the IMPLIED WARRANTY OF MERCHANTABILITY or
## FITNESS FOR A PARTICULAR PURPOSE. See the terms and conditions of the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU Lesser General Public License along
## with this program; if not, write to the Free Software Foundation, Inc., 59 Temple
## Place, Suite 330, Boston, MA 02111-1307 USA
##--------------------------------------------------------------------------------
##
##  Update Log:
##        May 05 2016 DHA: Created file.
##
##
## RM: the name of Resource Manager
## RM_launcher: the name of the launcher command
## RM_launcher_id: the rule to get the launcher id
## (e.g., RM_launcher|sym|srun says the launcher is identify by testing
##        RM_launcher's symbol by the name of srun)
## RM_jobid: the rule to get the target jobid
## (e.g., RM_jobid=RM_launcher|sym|totalview_jobid|string says
##        jobid can be obtained from the launcher's symbol, totalview_jobid,
##        interpreting that as the string type.
## RM_launcher_helper= method or command to launch daemons
## RM_launch_str= options and arguements used for RM_launch_mth.
##

RM=intel_hydra
RM_MPIR=STD
RM_launcher=mpiexec.hydra
RM_launcher_id=RM_launcher|sym|i_mpi_hyd_cr_init
RM_launch_helper=mpiexec.hydra
RM_signal_for_kill=SIGINT|SIGINT
RM_fail_detection=true
RM_launch_str=-f %l -n %n -ppn 1 %d %o --lmonsharedsec=%s --lmonsecchk=%c


dongahn commented May 31, 2016

@dongahn, I have some commits on the intel_hydra_prelim branch that implement adding SGI hostnames and enabling the use of these via a configure flag.

This completes the port (I think), although not the miscellaneous tests. I'm not sure how to go about submitting a pull request? I'd like to be able to do it by pushing my commits on the branch and then selecting the "Pull Request" option above with the relevant branches. However, I don't seem to have permissions to push to the repository. Is it possible to enable that for me please?

@jsthrn: Sorry for the late response. So I sent you a collaborator request. Upon accepting it, you should have push privileges, I think.


dongahn commented May 31, 2016

@dongahn, I have been looking at modifying the tests for use with Intel MPI today. It seems like the tests of attaching to a running process work - although I am not 100% sure what the expected output is in some cases - but there is still an error when launching an application via Launchmon (so, e.g. test.launch_1 fails).

The launch tests fail with errors like:

So, it looks like mpiexec.hydra is looking for a nodelist (command line argument -f) which is not present.

All my previous work has been looking at attaching to a running process. Is there something obvious in etc/rm_intel_hydra.conf that I can change in order to cause a launch via LaunchMON to not use a nodelist, while still using one when attaching.

The rm configuration file looks reasonable to me, although you will probably want to test whether sending two consecutive SIGINTs is the right sequence to kill the target job cleanly in Hydra. Different RMs can have different ways to "cleanly" kill the job, and you have to adjust your configuration for Hydra.
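
A minimal way to check that outside of LaunchMON would be something like this (a sketch; the PID would be mpiexec.hydra's, as in your earlier ps listing):

kill -INT <mpiexec_pid>; sleep 1; kill -INT <mpiexec_pid>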

In addition, test.launch_6_engine_failure should allow you to manually test the various failure semantics. The semantics are documented here.

Now, when I tested for feasibility on your system for launch mode, I was able to get test.launch_1 to work. So I don't think there is anything fundamentally wrong. At the point where this test is ready to launch the tool daemons, the hostname file should have been generated and -f %l should be expanded into a valid string.

If the complaint about -f comes from the launch string of the target application itself, in other words the MPI application, that's a different story.

The front-end test code (test/src/fe_launch_smoketest.cxx) I used for testing actually used -f nodelist to test whether mpiexec.hydra knows how to launch a job using the manually written nodelist.

Your port shouldn't use that flag. Instead, whatever set of flags you would use to launch an MPI application under an interactive batch allocation are the ones you should put into the front-end test code. Hope this helps...


jsthrn commented Jun 1, 2016

Thanks. I changed test/src/fe_launch_smoketest.cxx to launch with mpiexec.hydra -n <numprocs>. I think that this is the correct set of flags under an interactive batch allocation (it works for me). The only slight issue might be with cases like test.launch_2_half, where all the MPI processes run on the first (of two) nodes. I am not sure if they are supposed to be split equally between the two.

Pressing <Ctrl-C> twice does seem to be the correct sequence to kill the target cleanly.

I have submitted a pull request containing my changes. I am not sure exactly what the correct behaviour for all of the tests is, but I think that most pass. Issues that I am aware of include:

  • test.attach_1_pdebugmax: Runs (and passes the test), but does not terminate (basically keeps printing APP (INFO): stall for 3 secs indefinitely).
  • test.launch_mw_1_hostlist and test.launch_mw_5_mixall: Complete initial handshake, but then respond with cy013.ib0.smc-default.sgi.com: Connection refused and the tests do not appear to continue (although the application does resume). cy013 is the cluster head node (where compilation occurs, but no MPI processes run). This behaviour is not seen for test.launch_mw_2_coloc, which does pass.
  • test.attach_3_*: All fail with output including <LMON FE API> (ERROR): the launchmon engine encountered an error while parsing its command line. and <LMON FE API> (ERROR): LMON_fe_acceptEngine failed. However, these look like they may be expected fails.
  • test.launch_3_invalid_dmonpath: Also may be an expected fail. Test outputs <OptionParser> (ERROR): the path[/invalid/be_kicker] does not exit. and then fails.


dongahn commented Jun 1, 2016

Thanks. I changed test/src/fe_launch_smoketest.cxx to launch with mpiexec.hydra -n <numprocs>. I think that this is the correct set of flags under an interactive batch allocation (it works for me). The only slight issue might be with cases like test.launch_2_half, where all the MPI processes run on the first (of two) nodes. I am not sure if they are supposed to be split equally between the two.

I think however the halved set of processes is distributed should be ok, as long as half the number of processes are loaded and the launching works under the RM.

Pressing <Ctrl-C> twice does seem to be the correct sequence to kill the target cleanly.

I have submitted a pull request containing my changes. I am not sure exactly what the correct behaviour for all of the tests is

Automatic testing is one of the areas for improvement. Hopefully, #25 can move LaunchMON in a good direction on this.

test.attach_1_pdebugmax: Runs (and passes the test), but does not terminate (basically keeps printing APP (INFO): stall for 3 secs indefinitely).

This test should be skipped for Hydra.

test.launch_mw_1_hostlist and test.launch_mw_5_mixall: Complete initial handshake, but then respond with cy013.ib0.smc-default.sgi.com: Connection refused and the tests do not appear to continue (although the application does resume). cy013 is the cluster head node (where compilation occurs, but no MPI processes run). This behaviour is not seen for test.launch_mw_2_coloc, which does pass.

This can be adjusted by explicitly specifying the names of the nodes where the middleware processes should be launched. I think the config option for this is --with-test-mw-hostlist; please do ./configure --help. Essentially, the test codes use rsh or ssh to launch the middleware daemons and connect them to the rest of the tool daemons.
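
For example (a sketch only -- please confirm the exact option name and value format with ./configure --help):

./configure --prefix=/store/jsouthern/tmp/install --with-test-mw-hostlist="r01n03 r01n04"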

test.launch_mw_2_coloc won't have this issue because it uses "COLOC" mode, which uses the already running back-end daemons to spawn the middleware daemons.

test.attach_3_*: All fail with output including <LMON FE API> (ERROR): the launchmon engine encountered an error while parsing its command line. and <LMON FE API> (ERROR): LMON_fe_acceptEngine failed. However, these look like they may be expected fails.
test.launch_3_invalid_dmonpath: Also may be an expected fail. Test outputs <OptionParser> (ERROR): the path[/invalid/be_kicker] does not exit. and then fails.

I think these are expected failures. Again the testing results should be improved as part of our future efforts.

I believe you have come a long way. Thanks, and moving on to your PR.


dongahn commented Jun 1, 2016

@jsthrn: FYI -- my review comments for your PR are in my LaunchMON fork. Thanks.


jsthrn commented Jun 3, 2016

Hi @dongahn. I will address the comments on my PR. However, I guess that this will be next week now.


dongahn commented Jun 3, 2016

Thanks @jsthrn!


dongahn commented Jun 6, 2016

@jsthrn: is there any other work you plan to do on LaunchMON? If not, I can close this issue.

@dongahn dongahn closed this as completed Jul 1, 2016