This repository has been archived by the owner on Sep 30, 2022. It is now read-only.

Level Setting: What are we running now? #1

Open
jjhursey opened this issue Feb 24, 2020 · 6 comments
Labels
question Further information is requested

Comments

@jjhursey
Member

During the Open MPI face-to-face we discussed moving some of the CI checks to AWS, which can harness parallel instances to speed things up. Each organization can then focus on testing special configurations in its own environment.

To start this discussion, I'd like to see what tests the various organizations are running in their CI. Once we have the list, we can work on removing duplicated effort. We can use this repo as needed to help facilitate this coordination.

Please reply with a comment listing what you are testing now.

@jjhursey jjhursey added the question Further information is requested label Feb 24, 2020
@jjhursey
Member Author

jjhursey commented Feb 24, 2020

  • @artemry-mlnx Can you list what Mellanox is testing in its CI?
  • @bwbarrett or @wckzhang Can you list what AWS is testing in its CI?
  • @hppritcha Can you list what the Cray builder is testing in it's CI?
  • I'll add what IBM is testing.
  • Is there anyone else that should be in the conversation?

What I'm looking for is:

  • Platform (including architecture, networking, and specialized hardware)
  • Configure options (including any components that you want to make sure are built)
  • Build options
  • Other build types (e.g., make distcheck)
  • Test runs (what are you testing for?)

@jjhursey
Copy link
Member Author

jjhursey commented Feb 24, 2020

IBM CI

IBM CI machines are located behind a firewall. As a result, our backend Jenkins service polls GitHub about every 5 minutes for changes to PRs. After a test finishes running we have custom scripts that push the results back as a Gist for the community to see.
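(For reference, the Gist push boils down to a single call to the GitHub Gists API. The sketch below is illustrative rather than our exact script; it assumes jq is available and a personal access token is exported in GITHUB_TOKEN, and results.txt is just a placeholder file name.)

shell$ jq -n --arg body "$(cat results.txt)" \
         '{description: "Open MPI CI results", public: true, files: {"results.txt": {content: $body}}}' \
       | curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" -d @- https://api.github.com/gists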

Our Open MPI testing is not too specialized at this point. We do have a "virtual cluster" capability available to the community that can scale to 254 nodes on demand. We currently limit the community to 160 nodes, but that can be adjusted.

Platform

  • ppc64le, either Power8 or Power9.
  • InfiniBand and TCP networking. Currently, our CI only tests TCP.
  • NVIDIA GPUs
  • Compilers:
    • GNU 4.8.5
    • IBM XL V16.1.1
    • PGI 19.10-0

Configure options

We run three concurrent builds. Currently, PGI is disabled but will be re-enabled soon.

  • GNU Build: ./configure --prefix=/workspace/exports/ompi
  • XL Build: ./configure --prefix=/workspace/exports/ompi --disable-dlopen CC=xlc_r CXX=xlC_r FC=xlf_r
  • PGI Build: ./configure --prefix=/workspace/exports/ompi --without-x CC=pgcc18 CXX=pgc++ FC=pgfortran

Build options

  • make -j 20

Other build types

  • None

Tests run

We run across 10 machines with the GNU build, and 2 machines with the other builds. The goals of this testing are to:

  1. Verify that the build is functional and can pass messages, so we test C and Fortran with both non-communicating and communicating programs.
  2. Verify that the multi-host launch works correctly.

We run the following tests:

  • Open MPI examples
    • hello_c
    • hello_mpifh
    • hello_usempi
    • ring_c
    • ring_mpifh
    • ring_usempi

All of the examples are run like this:

shell$ /workspace/exports/ompi/bin/mpirun --hostfile /workspace/hostfile.txt --map-by ppr:4:node  --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca pml ob1 --mca osc ^ucx --mca btl tcp,vader,self ring_c
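(In practice the six examples are built and launched in a loop roughly like the sketch below; the build step and paths are illustrative, not our exact harness, and assume the freshly installed compiler wrappers are in PATH.)

shell$ cd examples && make
shell$ for t in hello_c hello_mpifh hello_usempi ring_c ring_mpifh ring_usempi ; do
           /workspace/exports/ompi/bin/mpirun --hostfile /workspace/hostfile.txt --map-by ppr:4:node \
               --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca pml ob1 \
               --mca osc ^ucx --mca btl tcp,vader,self ./$t
       done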

Timing

A successful run through CI will take about 12-15 minutes. Most of that is building OMPI.

GNU:
autogen      : (0:03:54)
configure    : (0:02:22)
make         : (0:03:22)
make install : (0:00:58)

XL:
autogen      : (0:03:52)
configure    : (0:03:41)
make         : (0:05:29)
make install : (0:00:35)

PGI:
autogen      : (0:03:33)
configure    : (0:06:05)
make         : (0:10:01)
make install : (0:01:08)

@artemry-nv

artemry-nv commented Feb 28, 2020

Mellanox Open MPI CI

Scope

Mellanox Open MPI CI is intended to verify Open MPI with recent Mellanox SW components (Mellanox OFED, UCX and other HPC-X components) in the Mellanox lab environment.

CI is managed by Azure Pipelines service.

Mellanox Open MPI CI includes:

  • Building Open MPI with internal stable engineering versions of UCX and HCOLL. The build runs in a Docker-based environment.
  • Sanity functional testing.

Related materials:

Platform

  • CI runs on a virtual machine in a Docker environment (everything runs within one Docker container)
  • x86_64 (Intel Xeon E312xx (Sandy Bridge, IBRS update), 15 cores)
  • Mellanox InfiniBand MT27800 Family [ConnectX-5]
  • OS: RHEL 7.6 (CI runs under Docker in CentOS 7.6.1810)
  • Compiler: gcc 4.8.5
  • Using UCX, HCOLL from daily HPC-X builds

CI Scenarios

Configure options

Specific configure options (combinations may be used):

--with-platform=contrib/platform/mellanox/optimized
--with-ompi-param-check
--enable-picky
--enable-mpi-thread-multiple
--enable-opal-multi-threads

Build options

make -j$(nproc)

Build scenario:

./autogen.sh
./configure ...
make -j$(nproc) install
make -j$(nproc) check

Tests run

Sanity tests (over UCX/HCOLL):

  • hello_c, ring_c
  • hello_oshmem, oshmem_circular_shift, oshmem_shmalloc, oshmem_strided_puts, oshmem_symmetric_data
  • tune test (--mca mca_base_env_list, --tune, --am)
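(For reference, the tune test exercises mpirun's ability to pick up MCA parameters and environment variables from a file. A minimal sketch follows; the file format is recalled from memory and the specific parameters are chosen purely as examples, not taken from the CI scripts.)

shell$ cat tune.conf
-x UCX_TLS=rc_x
--mca coll_hcoll_enable 1
shell$ mpirun -np 2 --tune tune.conf ./hello_c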

Timing

CI takes ~18-20 minutes (mostly spent building Open MPI).

@jjhursey
Member Author

jjhursey commented Mar 3, 2020

Thanks @artemry-mlnx for that information. Do you test with oshmem as well?

Should we be running make distcheck in CI? Are there other OMPI integrity checks that we should be running on a routine basis?

I'm going to be out for a week, but don't let that stop progress.

@jsquyres
Member

jsquyres commented Mar 3, 2020

Should we be running make distcheck in CI?

Yes! But do it in parallel with other CI jobs, because distcheck takes a while. Make sure to use the appropriate Automake flag to pass a good -j value down to make inside the process so that the multiple builds distcheck performs aren't done serially. This can significantly speed up overall distcheck time.
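Something along these lines (an untested sketch; whether AM_MAKEFLAGS is the right knob for our generated Makefiles is worth double-checking):

shell$ make -j 8 AM_MAKEFLAGS="-j 8" distcheck

With GNU make the top-level -j is also propagated to recursive $(MAKE) invocations through the jobserver, so the AM_MAKEFLAGS setting may be redundant, but it makes the intent explicit.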

Are there other OMPI integrity checks that we should be running on a routine basis?

make check?

@wckzhang

wckzhang commented Mar 4, 2020

Currently, our testing includes MTT runs with EFA and TCP. This tests v2.x, v3.0.x, v3.1.x, v4.0.x, and master. These are the configure options:

--oversubscribe --enable-picky --enable-debug --enable-mpirun-prefix-by-default --disable-dlopen --enable-io-romio --enable-mca-no-build=io-ompio,common-ompio,sharedfp,fbtl,fcoll CC=xlc_r CXX=xlC_r FC=xlf_r --with-ofi=/opt/amazon/efa/

CFLAGS=-pipe --enable-picky --enable-debug --with-ofi=/opt/amazon/efa/

--enable-static --disable-shared

In our nightly, canary, and CI tests for libfabric, we only use Open MPI 4.0.2 (soon to be switched to 4.0.3). We use the release versions rather than pulling from the GitHub branch directly. These tests mainly run on our network-optimized instance types, such as the c5n types: https://aws.amazon.com/ec2/instance-types/
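(For reference, a release-tarball build against the EFA libfabric install looks roughly like the sketch below; the download URL pattern and install prefix are illustrative assumptions, not our exact scripts.)

shell$ wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.2.tar.bz2
shell$ tar xjf openmpi-4.0.2.tar.bz2 && cd openmpi-4.0.2
shell$ ./configure --prefix=/opt/openmpi --with-ofi=/opt/amazon/efa/ CFLAGS=-pipe --enable-picky --enable-debug
shell$ make -j$(nproc) install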
