Assist SGI to port to Intel MPI with Hydra launcher #14
The hostlist file fix added in PR #18 can help with this environment as well. |
I pushed a commit into my fork just to start assisting James Southern with porting STAT/LaunchMON to Intel Hydra for AWE: the commit is here |
As you can see from here, the LaunchMON backend API expects its options to be found at the end of the command line. So if mpiexec hydra also appends other things to the backend launch string, LaunchMON will not proceed. I guess that's sort of the case from your email:
|
@jsthrn: testing for your GitHub id. |
The test of my ID worked. I got the email and the link points to my profile. James |
I checked out the intel_hydra_prelim branch. Unfortunately I can't get it to build. After updating autotools, I now see the following output:
Is this something that you have seen before? I can see that there was a version of libgcrypt in the tools/ directory previously, but now that is missing. Do I need to install a version elsewhere (and then provide a way for automake to see it)? |
Regarding mpiexec.hydra appending its own flags to the backend, I can certainly see that could be possible (the "--exec-<>" ones). However, there's also the two copies of "/store/jsouthern/tmp/install/bin/STATD --lmonsharedsec=2082992184 --lmonsecchk=548371161" in the command line. One of these is the very last thing, so that would suggest that things should actually be ok. The full daemon command line (copied from above, but a bit more readable here!) is:
So, this does have the Launchmon options right at the end as required. Note that for another application I get the following (which also has two copies of the executable - again with one at the end, so maybe that is correct?):
|
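The "options must come last" requirement discussed above can be illustrated with a small check. This is only a sketch: the function name `lmon_options_are_trailing` and the placeholder flag `--some-launcher-flag` are made up for illustration, and it is not how LaunchMON itself parses its arguments. The only names taken from the thread are the `--lmonsharedsec=` and `--lmonsecchk=` option prefixes and the STATD path.

```python
# Hypothetical helper, not part of LaunchMON: checks whether a daemon
# command line ends with a contiguous run of LaunchMON options.
LMON_PREFIXES = ("--lmonsharedsec=", "--lmonsecchk=")


def lmon_options_are_trailing(argv, prefixes=LMON_PREFIXES):
    """Return True if argv ends with LaunchMON options covering every prefix."""
    trailing = []
    # Walk backwards and collect arguments until the first non-LaunchMON one.
    for arg in reversed(argv):
        if arg.startswith(prefixes):
            trailing.append(arg)
        else:
            break
    # Every expected option must appear somewhere in that trailing run.
    return all(any(a.startswith(p) for a in trailing) for p in prefixes)


if __name__ == "__main__":
    daemon = "/store/jsouthern/tmp/install/bin/STATD"
    good = [daemon, "--lmonsharedsec=2082992184", "--lmonsecchk=548371161"]
    # "--some-launcher-flag" is a placeholder for anything a launcher appends.
    bad = good + ["--some-launcher-flag"]
    print(lmon_options_are_trailing(good))
    print(lmon_options_are_trailing(bad))
```

With a check like this, a command line that repeats the daemon and its options but still ends with them passes, while one where the launcher appends something after them does not.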
The bundled gcrypt has been deprecated, as the bundled version was getting old and has caused problems for various packaging systems. As long as you have a decent gcrypt package installed on your system, this should be okay.
--with-myboost has also been deprecated, and a version of boost is now a requirement to build LaunchMON. Can you make sure the following packages are installed on your system? (What Linux distribution are you using?)
What happens if you just run once these requirements are satisfied?
|
Did |
OK. Thanks. Once you get to the point where you can reproduce the original problem using LaunchMON's own simple test of the new version, let's tease apart this problem as well. |
So, after building various packages and updating the Launchmon build, it looks like I can now reproduce the original problem. Output (with "-V" switched off in mpiexec.hydra) is:
Full output (with "-V" enabled) is shown in this file |
@jsthrn: Progress! It is kind of difficult to see where the backend daemons die or whether they have even been launched. Could you quickly run the configure again with the following config option and rebuild?
If this works (and daemons are indeed launched and failed), running your test should dump some output files into Also kind of curious who's returning 6 as the exit code. |
I ran with It looks to me like there is a problem with my munge install (which presumably isn't what we saw with the release version of Launchmon, as that doesn't use munge!). I will have a look at this and see whether I can work out why the munge.socket.2 file is missing on my system. By the way, @dongahn we are making progress with getting you access to a test system with our software stack enabled (it will be very old hardware, but that shouldn't be an issue). |
So, it turned out that I hadn't started the munge daemon, which explains why that didn't work! Once I do that I get more output - and no exit code 6. Here are the updated be.stdout and be.stderr files. These now look more like the errors I was seeing previously, with " |
@dongahn, I am requesting an account for you on a system now. I've already verified that Launchmon (and the rest of the STAT toolchain) builds and runs on the system. Please let me know your preferred shell (bash, csh, tcsh, ksh, zsh) and I will submit the request. |
great. tcsh should work. |
Thanks. I submitted the request. Hopefully they should come back to you direct with the logon details. If not then I will forward them to you when I have them. |
OK. I looked at the trace and you are much farther along with the munge fix. Apparently the error is coming from here. And this is because of the error percolating up from the backend's procctl layer from here. Procctl is the layer responsible for normalizing resource manager (RM)-specific synchronization mechanisms between the target MPI job and the tools. RMs implement the MPIR debug interface for this purpose, but how they implement it differs across RMs. So LaunchMON introduced the procctl layer. Two things:
I will take a wild guess and add the case statements to help you address 1 first. Once you get past that, you may want to check whether it is feasible for STAT to attach to a hung job. Then, let's discuss what needs to be done for 2. This could be as simple as you educating me about hydra's MPI-tool synchronization mechanisms and me choosing the right procctl primitives to adjust LaunchMON to hydra. |
@jsthrn: By the way, once this port is all done, it will be nice if you can provide us with your environments. As part of #25, @mcfadden8 wants to investigate how much RM-specific stuff we can integrate into Travis CI (as a separate testing instance) and ideally we want to be able to do this for as many RMs as possible, which LaunchMON supports. Does Intel MPI require a license to use? |
@dongahn Intel MPI does not require a license to run, just install. FYI, we do have it locally on LC systems ( |
Cool! |
@jsthrn: OK. I pushed the changes to the |
Drat... somehow Travis doesn't like my changes. Let me look. |
I need to rebase the |
OK. Travis is happy now. |
@dongahn, we have set up an account for you on one of our development machines. I will send the details by email (don't want the password to be visible on the web!). |
@dongahn, I just tested your latest version of the code on the test system. Looks like things have moved forward. On a single-node job, STAT daemons attached to the application, obtained its samples and detached successfully. The stdout file (from For a multi-node job, however, there still seem to be issues. For this, STAT seems to hang just after reporting a completed server handshake (although I don't know whether that is on both nodes or just the local one). The stdout file for that run is here (stderr was empty again). |
Great! Thanks. |
More progress!
If the remote one also launched, there should be two stdout files. Do you see both? |
In that case, the other one was empty. I thought that I'd run it twice by mistake and that was why there were two files. |
BTW, I see lots of
I see these error messages on a system where the launcher ( I will have to check, but I think I have a logic that parses /etc/hosts to test the match with all of the aliases, but in the end we probably need to see the message
if MPIR_Proctable's hostname matches at least one of the aliases, which is a requirement for the BE to be successful. We are probably not out of the woods yet. |
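The alias test described above (collect the aliases for the local host from /etc/hosts, then check whether the hostname recorded in the MPIR proctable is one of them) can be sketched roughly as follows. This is an illustration under assumptions, not LaunchMON's actual code: the function names, the sample hosts content and the `node1`/`node2` names are all hypothetical.

```python
# Hypothetical sketch of an /etc/hosts-based alias match; not LaunchMON code.
SAMPLE_HOSTS = """
127.0.0.1   localhost
10.0.0.5    node1.example.com node1 node1-ib   # made-up compute node
10.0.0.6    node2.example.com node2
"""


def aliases_from_hosts(hosts_text, name):
    """Collect every name that shares a hosts-file line with `name`."""
    aliases = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields.
        fields = line.split("#", 1)[0].split()
        # fields[0] is the IP address; the rest are the canonical name and aliases.
        names = fields[1:]
        if name in names:
            aliases.update(names)
    return aliases


def proctable_host_matches(proctable_host, hosts_text, local_name):
    """True if the proctable hostname is the local name or one of its aliases."""
    known = aliases_from_hosts(hosts_text, local_name) | {local_name}
    return proctable_host in known


if __name__ == "__main__":
    print(sorted(aliases_from_hosts(SAMPLE_HOSTS, "node1")))
    print(proctable_host_matches("node1-ib", SAMPLE_HOSTS, "node1"))
    print(proctable_host_matches("node2", SAMPLE_HOSTS, "node1"))
```

If the launcher records a name in the proctable that appears on no hosts-file line for the local node, a test of this kind fails even though the daemon is running on the right machine, which is the kind of inconsistency being discussed here.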
So, I poked around your system a bit, and I now believe that you can produce a reasonable port for your environment. However, I discovered that there is a system issue you will have to address and that you will need to add some new code to complete an Intel hydra port. As I suspected above, this system has hostname consistency issues. As you can see from here, the launchmon backend API runtime tries hard to collect as many hostname aliases as possible for the host where it is running. Despite this, it turned out,
It has I have to think this is fixable... I am not sure if you can fix this issue by adding this In addition, it appears that you will also need to augment the bulk launching string within LaunchMON to adapt it to hydra's launching options. As is, the daemon launch string is expanded into something like:
But because of how hydra works, this will launch both of the tool daemon processes onto the first node specified in
This will require a new launching string option beyond Some of the relevant code can be found here and here. If you create a patch and submit a PR, I will review and merge it. There will also be miscellaneous work items like adding Intel hydra-specific code into the test codes to complete the port. An example
Finally, you will also need to add some config m4 scripts to be able to configure and build the test codes for Intel hydra. Please look at m4 files like here and here. Hope this helps! |
Thanks for the very comprehensive instructions! I will try to give this a go, but it might take me a while to get something working. On the hostname issue, at least for SGI systems the "correct" name (or at least one that will be valid) is always the bit before the first dot (e.g. |
There is always a danger if you do the match test based only on the first name: two different names can be matched as identical. It feels to me that we probably don't want to introduce that as a default match test. But it seems OK if you add this as an additional partial test and only do this if the fully qualified tests all fail. It would also be nice if we could make such a test a config-time option through platform_compat |
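The trade-off described here (exact matching first, with the short-name comparison only as an opt-in fallback) can be sketched as below. The names `short_name`, `hosts_match` and `allow_short_fallback` are hypothetical, as are the example hostnames; the flag simply stands in for whatever config-time option might be used, and none of this is LaunchMON's actual implementation.

```python
# Hypothetical sketch of a two-stage hostname match; not LaunchMON code.
def short_name(host):
    """Return the part of a hostname before the first dot."""
    return host.split(".", 1)[0]


def hosts_match(candidate, known_names, allow_short_fallback=False):
    """Match exactly first; compare short names only if explicitly enabled."""
    # Stage 1: the full-name test against every known name or alias.
    if candidate in known_names:
        return True
    # Stage 2: optional partial test, used only when stage 1 has failed.
    if allow_short_fallback:
        return short_name(candidate) in {short_name(n) for n in known_names}
    return False


if __name__ == "__main__":
    known = {"node1.site-a.example"}
    print(hosts_match("node1", known))
    print(hosts_match("node1", known, allow_short_fallback=True))
    # The danger mentioned above: two different hosts share a short name.
    print(hosts_match("node1.site-b.example", known, allow_short_fallback=True))
```

The last call shows why the partial test is risky as a default: two distinct fully qualified names are treated as the same host once only the first component is compared.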
So, it looks like the hostname issue can be "fixed" by modifying the nodefile created by PBS. While not ideal, this can be done by the user, whereas If I run like this:
Then the log file entries change to:
|
To launch the daemons on the correct nodes, I think that By altering the
Looking at the output file from the remote node, it looks like the problem is with munge:
It seems that I can't start munge on more than one node as I get errors like:
Is this something that you have seen before @dongahn? Is there a way to start |
Is there any way to make this transparent for the users? Users having to remember this seems like a usability problem. |
I actually removed the secure handshake from Actually For quick testing/progress, though, I recommend that you manually disable the secure handshake like I did in my local copy. |
Yeah, having users make manual alterations to I will have a look at |
@jsthrn: Thanks James! |
It looks like the modified code runs to completion when configured with So, I think that is successful... :-) |
Very nice, that STAT output looks correct. Good job! |
Ditto! |
BTW when you say the modified code, did you mean my local copy with some section in the handshake src commented out? In theory --enable-sec-none should not require code mods. Did you try this w/o the mods? |
The modified code is my local copy. So no sections in the handshake commented out. The only code modification I have made is to add the
|
@dongahn, I have some commits on the This completes the port (I think), although not the miscellaneous tests. I'm not sure how to go about submitting a pull request? I'd like to be able to do it by pushing my commits on the branch and then selecting the "Pull Request" option above with the relevant branches. However, I don't seem to have permissions to push to the repository. Is it possible to enable that for me please? |
@dongahn, I have been looking at modifying the tests for use with Intel MPI today. It seems like the tests of attaching to a running process work - although I am not 100% sure what the expected output is in some cases - but there is still an error when launching an application via Launchmon (so, e.g. The launch tests fail with errors like:
So, it looks like All my previous work has been looking at attaching to a running process. Is there something obvious in
|
@jsthrn: Sorry for the late response. I sent you a collaborator request. Upon accepting it, you should have push privileges, I think. |
The rm configuration file looks reasonable to me, although you will probably want to test whether sending two consecutive In addition, Now, when I tested for feasibility on your system for launch mode, I was able to get If the complaint about The front-end test code ( Your port shouldn't use that flag. Instead, whatever set of flags you use to launch an MPI application under an interactive batch allocation should be the ones you type into the front-end test code. Hope this helps... |
Thanks. I changed Pressing I have submitted a pull request containing my changes. I am not sure exactly what the correct behaviour for all of the tests is, but I think that most pass. Issues that I am aware of include:
|
I think that however the processes are split in half should be OK, as long as half of the processes are loaded and the launching works under the RM.
Automatic testing is one of the areas for improvements. Hopefully, #25 can help LaunchMON into a good direction for this.
This test should be skipped for Hydra.
This can be adjusted by explicitly specifying the names of the node where the middleware processes should be launched. I think the config option for this is
I think these are expected failures. Again the testing results should be improved as part of our future efforts. I believe you have come a long way. Thanks and moving onto your PR. |
@jsthrn: FYI -- my review comments for your PR are in my LaunchMON fork. Thanks. |
Hi @dongahn. I will implement the comments for my PR. However, I guess that this will be next week now. |
Thanks @jsthrn! |
@jsthrn: is there any other work you plan to do on LaunchMON? If not, I can close this issue. |
There is an out-of-band communication about porting LaunchMON to an Intel MPI with Hydra environment. I created this ticket to capture any significant issues that may arise from that effort.