ATRIP: An MPI-asynchronous implementation of CCSD(T)

Introduction

The algorithm uses two main data types: the Slice, and the SliceUnion, which acts as a container and resource manager for Slices.

Tests

We also test some programmatic functionality in atrip. For now all tests are appended to a single file, though this might change for special-purpose test functionality.

#include <atrip.hpp>
#include <cassert>
<<testcase-headers>>

<<testcase-prologs>>

#define TESTCASE(_name, ...) {                  \
    std::cout << "\x1b[35m-> \x1b[0m"           \
              <<  _name                         \
              << std::endl;                     \
    __VA_ARGS__                                 \
    }

int main() {

  <<testcase>>

  return 0;
}

Therefore, small tests can be written like

#+begin_src c++ :noweb-ref testcase
// your c++ snippet of code
#+end_src
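
For instance, a minimal sketch of such a test (PartialTuple comes from Tuples.hpp, and the TESTCASE body is spliced into main through the noweb reference):

#+begin_src c++ :noweb-ref testcase
TESTCASE("partial tuples compare elementwise",
  PartialTuple t = {11, 8};
  assert((t == PartialTuple{11, 8}));
  assert(!(t == PartialTuple{8, 11}));
)
#+end_src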

The slice

The following section introduces the idea of a slice.

Prolog

#pragma once
#include <iostream>
#include <algorithm>
#include <vector>
#include <mpi.h>

#include <atrip/Tuples.hpp>
#include <atrip/Utils.hpp>

namespace atrip {



template <typename F=double>
struct Slice {

Introduction

A slice is the concept of a subset of the values of a given tensor. As an example, for the doubles amplitudes \( T^{ab}_{ij} \) one needs two kinds of objects:

  • the object \( \mathsf{T}(a)^b_{ij} \), which for every \( a \) gets assigned the \( N_\mathrm{o}^2 \times N_\mathrm{v} \)-sized tensor \( T^{ab}_{ij} \) with fixed \( a \)
  • the object \( \mathsf{T}(a,b)_{ij} \), which to every pair \( (a, b) \) assigns the \( N_\mathrm{o}^2 \)-sized tensor \( T^{ab}_{ij} \) with fixed \( a \) and \( b \)

Location

Every slice set, for instance, $$ S_k = \left\{ a \mapsto \mathsf{T}(a)^b_{ij} \mid a \in A_k \right\} $$ where \( A_k \) is some subset of \( \{1, \dots, N_\mathrm{v}\} \), gets stored in some rank \( k \). In general, however, the number of elements in \( A_k \) can be bigger than the number of processes \( n_p \). Therefore, in order to uniquely identify a given slice in \( S_k \), we need two identifiers: the rank \( k \), which tells us in which core's memory the slice is allocated, and an additional tag which we will call the source.

The datatype that simply models this state of affairs is therefore a simple structure:

struct Location { size_t rank; size_t source; };

Type

Due to the permutation operators in the equations, it is apparent that for every one-dimensional slice and triple \( (a,b,c) \) $$ a \mapsto \mathsf{t}(a) $$ one needs at the same time \( \mathsf{t}(a) \), \( \mathsf{t}(b) \) and \( \mathsf{t}(c) \). For two-dimensional slices, i.e., slices of the form $$ (a,b) \mapsto \mathsf{t}(a,b), $$ the equations require the slices \( \mathsf{t}(a,b) \), \( \mathsf{t}(b,c) \) and \( \mathsf{t}(a,c) \). In addition, in the case of diagrams where the integral \( V^{ab}_{ci} \) appears, we additionally need the permuted slices from before, i.e., \( \mathsf{t}(b,a) \), \( \mathsf{t}(c,b) \) and \( \mathsf{t}(c,a) \).

This means that every slice has a type associated with it which denotes which permutation it holds.

enum Type
  { A = 10
  , B
  , C
  // Two-parameter slices
  , AB = 20
  , BC
  , AC
  // for abci and the doubles
  , CB
  , BA
  , CA
  // The non-typed slice
  , Blank = 404
  };

State

Every slice can be in one of several states; each state denotes which function the slice currently provides and how it relates to the other states.

Fetch
A slice is in state Fetch when it has a valid data pointer that **must** be written to. A Fetch slice should not live very long: after the database send and receive phase, Fetch slices should be changed into Dispatched in order to start the process of writing to the data pointer from some other rank.
Dispatched
A Dispatched slice indicates that at some point send and receive MPI calls have been dispatched in order to get the data. However, the calls have only been dispatched and there is no guarantee that the data have arrived; for that, the slice must be unwrapped.
Ready
Ready means that the data pointer can be read from directly.
SelfSufficient
A slice is SelfSufficient when its contents are located in the same rank where it lives, so that it does not have to fetch from any other rank. This is important in order to handle the data pointers correctly and to save calls to the MPI receive and send functions.
Recycled
Recycled means that the slice gets its data pointer from another slice, so it should not be written to.
Acceptor
Acceptor means that the slice can accept a new slice; it is the counterpart of the Blank type, but for states.

Again the implementation is a simple enum type.

enum State {
  Fetch = 0,
  Dispatched = 2,
  Ready = 1,
  SelfSufficient = 911,
  Recycled = 123,
  Acceptor = 405
};
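
For orientation, the typical lifecycle of a slice can be sketched as follows (SelfSufficient and Recycled are assigned directly during the database phase and never pass through Fetch or Dispatched):

// typical slice lifecycle (sketch)
// Acceptor/Blank --(database build)--------> Fetch
// Fetch          --(MPI_Irecv dispatched)--> Dispatched
// Dispatched     --(unwrap, MPI_Wait)------> Ready
// any state      --(garbage collection)----> Acceptor/Blank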

Data pointer

The data pointer type is an abstraction of where the data of the slice live. On a CPU architecture it is simply F*, whereas on a GPU architecture it is a pointer to a GPU address (a CUdeviceptr for CUDA).

#pragma once
#include <atrip/Complex.hpp>
#include <atrip/Atrip.hpp>

namespace atrip {

template <typename F>
struct DataField;

template <>
struct DataField<double> {
  using type = double;
};

#if defined(HAVE_CUDA)

template <typename F>
using DataPtr = CUdeviceptr;
#define DataNullPtr 0x00

template <>
struct DataField<Complex> {
  using type = cuDoubleComplex;
};


#else

template <typename F>
using DataPtr = F*;
#define DataNullPtr nullptr

template <>
struct DataField<Complex> {
  using type = Complex;
};

#endif


template <typename F>
using DataFieldType = typename DataField<F>::type;

}
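
As a quick compile-time sanity check of these aliases, a minimal sketch for a CPU build (i.e., without HAVE_CUDA; the static_asserts are illustrative, not part of the header):

#include <type_traits>
#include <atrip/Complex.hpp>

static_assert(std::is_same<atrip::DataPtr<double>, double*>::value,
              "on CPU builds the data pointer is a plain F*");
static_assert(std::is_same<atrip::DataFieldType<atrip::Complex>,
                           atrip::Complex>::value,
              "on CPU builds the field type of Complex is Complex itself");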

The Info structure

Every slice has an information structure associated with it that keeps track of its **variable** properties, such as the type, the state, and so on.

struct Info {
  // which part of a,b,c the slice holds
  PartialTuple tuple;
  // The type of slice for the user to retrieve the correct one
  Type type;
  // What is the state of the slice
  State state;
  // Where the slice is to be retrieved
  Location from;
  // If the data are actually to be found in this other slice
  Type recycling;

  Info() : tuple{0,0}
          , type{Blank}
          , state{Acceptor}
          , from{0,0}
          , recycling{Blank}
          {}
};

using Ty_x_Tu = std::pair< Type, PartialTuple >;

Name

In this algorithm, CCSD(T) needs 5 types of tensor slices, namely \( V^{ij}_{ka} \), \( V^{ab}_{ci} \), \( V^{ab}_{ij} \) and two copies of \( T^{ab}_{ij} \). The reason why we need the doubles amplitudes twice is that in the doubles contribution to the energy, the \( T \) amplitudes are sliced through one parameter for the particle contribution and through two parameters for the hole contribution.

enum Name
  { TA    = 100
  , VIJKA = 101
  , VABCI = 200
  , TABIJ = 201
  , VABIJ = 202
  };

Database

The database is a simple representation of the slices of a slice union. Every element of the database is given by the name of the tensor it represents and the internal information structure.

struct LocalDatabaseElement {
  Slice<F>::Name name;
  Slice<F>::Info info;
};

A local database (of a given rank) and the global database are thus simply vectors of these elements.

using LocalDatabase = std::vector<LocalDatabaseElement>;
using Database = LocalDatabase;

MPI Types

struct mpi {

  static MPI_Datatype vector(size_t n, MPI_Datatype const& DT) {
    MPI_Datatype dt;
    MPI_Type_vector(n, 1, 1, DT, &dt);
    MPI_Type_commit(&dt);
    return dt;
  }

  static MPI_Datatype slice_location () {
    constexpr int n = 2;
    // create a slice_location to measure in the current architecture
    // the packing of the struct
    Slice<F>::Location measure;
    MPI_Datatype dt;
    const std::vector<int> lengths(n, 1);
    const MPI_Datatype types[n] = {usize_dt(), usize_dt()};

    static_assert(sizeof(Slice<F>::Location) == 2 * sizeof(size_t),
                  "The Location packing is wrong in your compiler");

    // measure the displacements in the struct
    size_t j = 0;
    MPI_Aint base_address, displacements[n];
    MPI_Get_address(&measure,        &base_address);
    MPI_Get_address(&measure.rank,   &displacements[j++]);
    MPI_Get_address(&measure.source, &displacements[j++]);
    for (size_t i = 0; i < n; i++)
      displacements[i] = MPI_Aint_diff(displacements[i], base_address);

    MPI_Type_create_struct(n, lengths.data(), displacements, types, &dt);
    MPI_Type_commit(&dt);
    return dt;
  }

  static MPI_Datatype usize_dt() { return MPI_UINT64_T; }

  static MPI_Datatype slice_info () {
    constexpr int n = 5;
    MPI_Datatype dt;
    Slice<F>::Info measure;
    const std::vector<int> lengths(n, 1);
    const MPI_Datatype types[n]
      = { vector(2, usize_dt())
        , vector(sizeof(enum Type), MPI_CHAR)
        , vector(sizeof(enum State), MPI_CHAR)
        , slice_location()
        , vector(sizeof(enum Type), MPI_CHAR)
        // TODO: Why this does not work on intel mpi?
        /*, MPI_UINT64_T*/
        };

    static_assert(sizeof(enum Type)  == 4, "Enum type not 4 bytes long");
    static_assert(sizeof(enum State) == 4, "Enum State not 4 bytes long");
    static_assert(sizeof(enum Name)  == 4, "Enum Name not 4 bytes long");

    // create the displacements from the info measurement struct
    size_t j = 0;
    MPI_Aint base_address, displacements[n];
    MPI_Get_address(&measure,             &base_address);
    MPI_Get_address(&measure.tuple[0],    &displacements[j++]);
    MPI_Get_address(&measure.type,        &displacements[j++]);
    MPI_Get_address(&measure.state,       &displacements[j++]);
    MPI_Get_address(&measure.from,        &displacements[j++]);
    MPI_Get_address(&measure.recycling,   &displacements[j++]);
    for (size_t i = 0; i < n; i++)
      displacements[i] = MPI_Aint_diff(displacements[i], base_address);

    MPI_Type_create_struct(n, lengths.data(), displacements, types, &dt);
    MPI_Type_commit(&dt);
    return dt;
  }

  static MPI_Datatype local_database_element () {
    constexpr int n = 2;
    MPI_Datatype dt;
    LocalDatabaseElement measure;
    const std::vector<int> lengths(n, 1);
    const MPI_Datatype types[n]
      = { vector(sizeof(enum Name), MPI_CHAR)
        , slice_info()
        };

    // measure the displacements in the struct
    size_t j = 0;
    MPI_Aint base_address, displacements[n];
    MPI_Get_address(&measure,      &base_address);
    MPI_Get_address(&measure.name, &displacements[j++]);
    MPI_Get_address(&measure.info, &displacements[j++]);
    for (size_t i = 0; i < n; i++)
      displacements[i] = MPI_Aint_diff(displacements[i], base_address);

    static_assert( sizeof(LocalDatabaseElement) == sizeof(measure)
                 , "Measure has bad size");

    MPI_Type_create_struct(n, lengths.data(), displacements, types, &dt);
    MPI_Type_commit(&dt);
    // TODO: write tests in order to know whether the struct datatype
    // works; until then, send the element as a raw byte vector.
    return vector(sizeof(LocalDatabaseElement), MPI_CHAR);
  }

};
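
The factory functions return committed datatypes, so the caller owns them and should eventually free them. A minimal usage sketch (hypothetical values; assumes MPI has been initialized):

// broadcast a Location from rank 0 using the committed struct datatype
Slice<double>::Location loc = {/*rank*/ 1, /*source*/ 3};
MPI_Datatype loc_dt = Slice<double>::mpi::slice_location();
MPI_Bcast(&loc, 1, loc_dt, /*root*/ 0, MPI_COMM_WORLD);
MPI_Type_free(&loc_dt);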

Static utilities

This section presents some functions which are useful to work with slices and are inside the namespace created by the slice struct.

The function subtuple_by_slice gives to every Slice::Type its meaning in terms of the triples \( (a,b,c) \).

Notice that since in general the relation \( a \leq b \leq c \) holds (in our implementation), the case of the one-dimensional parametrizations A, B and C is well defined.

The function should only throw if there is an implementation error where the Slice::Type enum has been expanded and this function has not been updated accordingly.

static
PartialTuple subtuple_by_slice(ABCTuple abc, Type slice_type) {
  switch (slice_type) {
    case AB: return {abc[0], abc[1]};
    case BC: return {abc[1], abc[2]};
    case AC: return {abc[0], abc[2]};
    case CB: return {abc[2], abc[1]};
    case BA: return {abc[1], abc[0]};
    case CA: return {abc[2], abc[0]};
    case  A: return {abc[0], 0};
    case  B: return {abc[1], 0};
    case  C: return {abc[2], 0};
    default: throw "Switch statement not exhaustive!";
  }
}
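
For instance, for the triple \( (a,b,c) = (11, 8, 9) \) one would expect the following (a sketch using the test scaffolding from above):

TESTCASE("subtuple_by_slice maps a slice type to its index pair",
  ABCTuple abc = {11, 8, 9};
  assert((Slice<>::subtuple_by_slice(abc, Slice<>::AB) == PartialTuple{11, 8}));
  assert((Slice<>::subtuple_by_slice(abc, Slice<>::CB) == PartialTuple{9, 8}));
  assert((Slice<>::subtuple_by_slice(abc, Slice<>::B) == PartialTuple{8, 0}));
)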

In the context of cleaning up slices during the main loop, it is important to check whether a given slice has other slices referencing it as recycled slices.

This function therefore returns a vector of pointers to the slices referencing the given slice's info; when the length of the vector is zero, there are no dangling links.

static std::vector<Slice<F>*> has_recycled_referencing_to_it
  ( std::vector<Slice<F>> &slices
  , Info const& info
  ) {
  std::vector<Slice<F>*> result;

  for (auto& s: slices)
    if (  s.info.recycling == info.type
       && s.info.tuple == info.tuple
       && s.info.state == Recycled
       ) result.push_back(&s);

  return result;
}

The remaining functions are utilities for finding a given slice by reference in a vector of slices. Mostly they are merely convenience wrappers around the standard library function std::find_if.

They are named find<...>, where <...> represents some condition, and they always return a reference to the found slice, i.e., Slice&. Atrip relies on these functions to find the sought-after slices, so they throw a std::domain_error if the given slice could not be found.

static Slice<F>& find_one_by_type(std::vector<Slice<F>> &slices, Slice<F>::Type type) {
    const auto slice_it
      = std::find_if(slices.begin(), slices.end(),
                     [&type](Slice<F> const& s) {
                       return type == s.info.type;
                     });
    WITH_CRAZY_DEBUG
    WITH_RANK
      << "\t__ looking for " << type << "\n";
    if (slice_it == slices.end())
      throw std::domain_error("Slice by type not found!");
    return *slice_it;
}
static Slice<F>&
find_recycled_source (std::vector<Slice<F>> &slices, Slice<F>::Info info) {
  const auto slice_it
    = std::find_if(slices.begin(), slices.end(),
                   [&info](Slice<F> const& s) {
                     return info.recycling == s.info.type
                         && info.tuple == s.info.tuple
                         && State::Recycled != s.info.state
                         ;
                   });

  WITH_CRAZY_DEBUG
  WITH_RANK << "__slice__:find: recycling source of "
            << pretty_print(info) << "\n";
  if (slice_it == slices.end())
    throw std::domain_error( "Slice not found: "
                           + pretty_print(info)
                           + " rank: "
                           + pretty_print(Atrip::rank)
                           );
  WITH_RANK << "__slice__:find: " << pretty_print(slice_it->info) << "\n";
  return *slice_it;
}
static Slice<F>& find_type_abc
  ( std::vector<Slice<F>> &slices
  , Slice<F>::Type type
  , ABCTuple const& abc
  ) {
    const auto tuple = Slice<F>::subtuple_by_slice(abc, type);
    const auto slice_it
      = std::find_if(slices.begin(), slices.end(),
                     [&type, &tuple](Slice<F> const& s) {
                       return type == s.info.type
                           && tuple == s.info.tuple
                           ;
                     });
    WITH_CRAZY_DEBUG
    WITH_RANK << "__slice__:find:" << type << " and tuple "
              << pretty_print(tuple)
              << "\n";
    if (slice_it == slices.end())
      throw std::domain_error( "Slice not found: "
                             + pretty_print(tuple)
                             + ", "
                             + pretty_print(type)
                             + " rank: "
                             + pretty_print(Atrip::rank)
                             );
    return *slice_it;
}
static Slice<F>& find_by_info(std::vector<Slice<F>> &slices,
                         Slice<F>::Info const& info) {
  const auto slice_it
    = std::find_if(slices.begin(), slices.end(),
                   [&info](Slice<F> const& s) {
                     // TODO: maybe implement comparison in Info struct
                     return info.type == s.info.type
                         && info.state == s.info.state
                         && info.tuple == s.info.tuple
                         && info.from.rank == s.info.from.rank
                         && info.from.source == s.info.from.source
                          ;
                   });
  WITH_CRAZY_DEBUG
  WITH_RANK << "__slice__:find:looking for " << pretty_print(info) << "\n";
  if (slice_it == slices.end())
    throw std::domain_error( "Slice by info not found: "
                           + pretty_print(info));
  return *slice_it;
}

Attributes

A slice object does not own its data; it is just a container for a pointer to data together with additional bookkeeping facilities.

It includes an info structure with the information about the slice, Type, State etc, which will be later communicated to other ranks.

Info info;

A pointer to the data is also necessary for the Slice, but it does not need to be communicated to other ranks. The Slice should never allocate or deallocate the pointer itself.

    DataPtr<F> data;
#if defined(HAVE_CUDA)
    F* mpi_data;
#endif

An MPI_Request handle is also included so that the slices that are to receive data through MPI can know which request they belong to.

MPI_Request request;

For practical purposes in MPI calls, the number of elements in data is also included.

const size_t size;

Member functions

It is important to note that a Ready slice should not recycle from any other slice, so that it has direct access to its data.

void mark_ready() noexcept {
  info.state = Ready;
  info.recycling = Blank;
}

The following function asks whether the slice has effectively been unwrapped, i.e., whether the data are accessible and already there. This can only happen in two ways: either the slice is Ready, or it is SelfSufficient, i.e., the data pointed to were pre-distributed to the current node.

bool is_unwrapped() const noexcept {
  return info.state == Ready
      || info.state == SelfSufficient
      ;
}

The function is_unwrappable answers which slices can potentially be unwrapped. Unwrapped slices can be unwrapped again idempotently. Recycled slices can also be unwrapped, i.e., the slices pointed to by them will be unwrapped. The only other possibility is that the slice has been dispatched in the past and can be unwrapped. The Dispatched state is the canonical, intuitive case, where a real process of unwrapping, i.e., waiting for the data to come through the network, takes place.

bool is_unwrappable() const noexcept {
  return is_unwrapped()
      || info.state == Recycled
      || info.state == Dispatched
      ;
}

inline bool is_directly_fetchable() const noexcept {
  return info.state == Ready || info.state == Dispatched;
}

void free() noexcept {
  info.tuple      = {0, 0};
  info.type       = Blank;
  info.state      = Acceptor;
  info.from       = {0, 0};
  info.recycling  = Blank;
  data            = DataNullPtr;
}

inline bool is_free() const noexcept {
  return info.tuple       == PartialTuple{0, 0}
      && info.type        == Blank
      && info.state       == Acceptor
      && info.from.rank   == 0
      && info.from.source == 0
      && info.recycling   == Blank
      && data             == DataNullPtr
       ;
}

The function is_recyclable answers the question of which slices can be recycled.

A slice can only be recycled if it is Dispatched, Ready or Fetch and has a valid data pointer.

In particular, SelfSufficient slices are not recyclable, since it is easier to just create another SelfSufficient slice than to deal with the data dependencies.

Furthermore, a Recycled slice is itself not recyclable; if this happens, it is either bad design or a bug.

inline bool is_recyclable() const noexcept {
  return (  info.state == Dispatched
         || info.state == Ready
         || info.state == Fetch
         )
      && has_valid_data_pointer()
      ;
}

The function has_valid_data_pointer describes whether a slice has a valid data pointer.

This is important in order to know whether the slice has some data attached to it. Some structural checks are also performed: the slice should be neither Acceptor nor Blank; if it is, this indicates a bug.

inline bool has_valid_data_pointer() const noexcept {
  return data       != DataNullPtr
      && info.state != Acceptor
      && info.type  != Blank
      ;
}

The function unwrap_and_mark_ready calls the low-level MPI functions in order to wait until the data have actually arrived, whenever the state of the slice allows it. The function should

  • return if the state is Ready, since then there is nothing to be done,
  • throw if the state is not Dispatched, since only a dispatched slice can be unwrapped through MPI,
  • throw if an MPI error happens.
    void unwrap_and_mark_ready() {
      if (info.state == Ready) return;
      if (info.state != Dispatched)
        throw
          std::domain_error("Can't unwrap a non-ready, non-dispatched slice!");
      mark_ready();
      MPI_Status status;
#ifdef HAVE_OCD
        WITH_RANK << "__slice__:mpi: waiting " << "\n";
#endif
      const int error_code = MPI_Wait(&request, &status);
      // NOTE: a successful MPI_Wait already frees the request and sets it
      // to MPI_REQUEST_NULL, so no explicit MPI_Request_free is needed.
      if (error_code != MPI_SUCCESS)
        throw "MPI ERROR HAPPENED....";

#if defined(HAVE_CUDA)
      // copy the retrieved mpi data to the device
      cuMemcpyHtoD(data, (void*)mpi_data, sizeof(F) * size);
      std::free(mpi_data);
#endif

#ifdef HAVE_OCD
      char error_string[MPI_MAX_ERROR_STRING];
      int error_size;
      MPI_Error_string(error_code, error_string, &error_size);

      WITH_RANK << "__slice__:mpi: status "
                << "{ .source="    << status.MPI_SOURCE
                << ", .tag="       << status.MPI_TAG
                << ", .error="     << status.MPI_ERROR
                << ", .errCode="   << error_code
                << ", .err="       << error_string
                << " }"
                << "\n";
#endif
    }

Epilog

    Slice(size_t size_)
      : info({})
      , data(DataNullPtr)
#if defined(HAVE_CUDA)
      , mpi_data(nullptr)
#endif
      , size(size_)
      {}


  }; // struct Slice

Debug

template <typename F=double>
std::ostream& operator<<(std::ostream& out, typename Slice<F>::Location const& v) {
  // TODO: remove me
  out << "{.r(" << v.rank << "), .s(" << v.source << ")};";
  return out;
}

template <typename F=double>
std::ostream& operator<<(std::ostream& out, typename Slice<F>::Info const& i) {
  out << "«t" << i.type << ", s" << i.state << "»"
      << " ⊙ {" << i.from.rank << ", " << i.from.source << "}"
      << " ∴ {" << i.tuple[0] << ", " << i.tuple[1] << "}"
      << " ♲t" << i.recycling
      ;
  return out;
}

} // namespace atrip

Utils

This section presents some utilities.

Prolog

#pragma once
#include <sstream>
#include <string>
#include <map>
#include <chrono>

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wvla"
#pragma GCC diagnostic ignored "-Wint-in-bool-context"
#pragma GCC diagnostic ignored "-Wunused-parameter"
#pragma GCC diagnostic ignored "-Wdeprecated-copy"
#include <ctf.hpp>
#pragma GCC diagnostic pop

#include <atrip/Debug.hpp>


namespace atrip {

Pretty printing

The pretty printing uses the dbg-macro package.

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-parameter"
  template <typename T>
  std::string pretty_print(T&& value) {
    std::stringstream stream;
#if ATRIP_DEBUG > 2
    dbg::pretty_print(stream, std::forward<T>(value));
#endif
    return stream.str();
  }
#pragma GCC diagnostic pop

Chrono

The chrono is just a simple wrapper for a high resolution clock that can be found in the std::chrono namespace of the standard library.

#define WITH_CHRONO(__chrono_name, ...)         \
  Atrip::chrono[__chrono_name].start();         \
  __VA_ARGS__                                   \
  Atrip::chrono[__chrono_name].stop();

struct Timer {
  using Clock = std::chrono::high_resolution_clock;
  using Event = std::chrono::time_point<Clock>;
  std::chrono::duration<double> duration;
  Event _start;
  inline void start() noexcept { _start = Clock::now(); }
  inline void stop() noexcept { duration += Clock::now() - _start; }
  inline void clear() noexcept { duration *= 0; }
  inline double count() const noexcept { return duration.count(); }
};
using Timings = std::map<std::string, Timer>;
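
A usage sketch (the chrono name is arbitrary; Atrip::chrono is the global Timings map used by WITH_CHRONO):

WITH_CHRONO("doubles:contraction",
  // ... work to be timed ...
)
LOG(1,"Atrip") << "doubles took "
               << Atrip::chrono["doubles:contraction"].count()
               << " seconds\n";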

Epilog

}

The rank mapping

This section introduces the concept of rank mapping, which defines how slices will be allocated to every rank.

#pragma once

#include <vector>
#include <algorithm>

#include <atrip/Slice.hpp>
#include <atrip/Tuples.hpp>

namespace atrip {

  template <typename F=double>
  struct RankMap {

    static bool RANK_ROUND_ROBIN;
    std::vector<size_t> const lengths;
    size_t const np, size;
    ClusterInfo const cluster_info;

    RankMap(std::vector<size_t> lens, size_t np_, MPI_Comm comm)
      : lengths(lens)
      , np(np_)
      , size(std::accumulate(lengths.begin(), lengths.end(),
                            1UL, std::multiplies<size_t>()))
      , cluster_info(get_cluster_info(comm))
    { assert(lengths.size() <= 2); }

    size_t find(typename Slice<F>::Location const& p) const noexcept {
      if (RANK_ROUND_ROBIN) {
        return p.source * np + p.rank;
      } else {
        const size_t
          rank_position = p.source * cluster_info.ranks_per_node
                       + cluster_info.rank_infos[p.rank].local_rank
                       ;
        return rank_position * cluster_info.n_nodes
             + cluster_info.rank_infos[p.rank].node_id
             ;
      }
    }

    size_t n_sources() const noexcept {
      return size / np + size_t(size % np != 0);
    }


    bool is_padding_rank(size_t rank) const noexcept {
      return size % np == 0
          ? false
          : rank > (size % np - 1)
          ;
    }

    bool is_source_padding(const size_t rank, const size_t source)
      const noexcept {
      return source == n_sources() && is_padding_rank(rank);
    }

    typename Slice<F>::Location
    find(ABCTuple const& abc, typename Slice<F>::Type slice_type) const {
      // tuple = {11, 8} when abc = {11, 8, 9} and slice_type = AB
      // tuple = {11, 0} when abc = {11, 8, 9} and slice_type = A
      const auto tuple = Slice<F>::subtuple_by_slice(abc, slice_type);

      const size_t index
        = tuple[0]
        + tuple[1] * (lengths.size() > 1 ? lengths[0] : 0)
        ;

      size_t rank, source;

      if (RANK_ROUND_ROBIN) {

        rank = index % np;
        source = index / np;

      } else {

        size_t const

          // the node that will be assigned to
            node_id = index % cluster_info.n_nodes

          // how many times it has been assigned to the node
          , s_n = index / cluster_info.n_nodes

          // which local rank in the node should be
          , local_rank = s_n % cluster_info.ranks_per_node

          // and the local source (how many times we chose this local rank)
          , local_source = s_n / cluster_info.ranks_per_node
          ;

        // find the local_rank-th entry in cluster_info
        auto const& it =
          std::find_if(cluster_info.rank_infos.begin(),
                       cluster_info.rank_infos.end(),
                       [node_id, local_rank](RankInfo const& ri) {
                         return ri.node_id == node_id
                             && ri.local_rank == local_rank
                             ;
                       });
        if (it == cluster_info.rank_infos.end()) {
          throw "FATAL! Error in node distribution of the slices";
        }

        rank = (*it).global_rank;
        source = local_source;

      }

      return
        { rank
        , source
        };
    }

  };

}
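
As a worked example of the round-robin mode, a sketch (assumes RANK_ROUND_ROBIN is set, two-parameter slices with \( N_\mathrm{v} = 20 \), and 4 ranks):

RankMap<double> rank_map({20, 20}, /*np*/ 4, MPI_COMM_WORLD);
auto loc = rank_map.find({11, 8, 9}, Slice<double>::AB);
// the linear index of the AB-subtuple {11, 8} is 11 + 8 * 20 = 171,
// hence loc.rank == 171 % 4 == 3 and loc.source == 171 / 4 == 42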

The slice union

Prolog

#pragma once
#include <atrip/Debug.hpp>
#include <atrip/Slice.hpp>
#include <atrip/RankMap.hpp>

namespace atrip {
  template <typename F=double>
  class SliceUnion {
  public:
    using Tensor = CTF::Tensor<F>;

    virtual void
    slice_into_buffer(size_t iteration, Tensor &to, Tensor const& from) = 0;

    /*
     * This function should enforce an important property of a SliceUnion.
     * Namely, there can be no two Slices of the same nature.
     *
     * This means that there can be at most one slice with a given Ty_x_Tu.
     */
    void check_for_duplicates() const {
      std::vector<typename Slice<F>::Ty_x_Tu> tytus;
      for (auto const& s: slices) {
        if (s.is_free()) continue;
        tytus.push_back({s.info.type, s.info.tuple});
      }

      for (auto const& tytu: tytus) {
        if (std::count(tytus.begin(), tytus.end(), tytu) > 1)
          throw "Invariance violated, more than one slice with same Ty_x_Tu";
      }

    }

    std::vector<typename Slice<F>::Ty_x_Tu> needed_slices_for_tuple(ABCTuple const& abc) {
      std::vector<typename Slice<F>::Ty_x_Tu> needed(slice_types.size());
      // build the needed vector
      std::transform(slice_types.begin(), slice_types.end(),
                     needed.begin(),
                     [&abc](typename Slice<F>::Type const type) {
                       auto tuple = Slice<F>::subtuple_by_slice(abc, type);
                       return std::make_pair(type, tuple);
                     });
      return needed;
    }

    /* build_local_database
     *
     * It should build a database of slices so that we know what is needed
     * to fetch in the next iteration represented by the tuple 'abc'.
     *
     * 1. The algorithm works as follows: we build a database of all
     * the slice types that we need together with their tuples.
     *
     * 2. Look in the SliceUnion if we already have this tuple,
     * if we already have it mark it (TODO)
     *
     * 3. If we don't have the tuple, look for a (state=acceptor, type=blank)
     * slice and mark this slice as type=Fetch with the corresponding type
     * and tuple.
     *
     * NOTE: The algorithm should certify that we always have enough blank
     * slices.
     *
     */
    typename
    Slice<F>::LocalDatabase build_local_database(ABCTuple const& abc) {
      typename Slice<F>::LocalDatabase result;

      auto const needed = needed_slices_for_tuple(abc);

      WITH_RANK << "__db__:needed:" << pretty_print(needed) << "\n";
      // BUILD THE DATABASE
      // we need to loop over all slice_types that this TensorUnion
      // is representing and find out how we will get the corresponding
      // slice for the abc we are considering right now.
      for (auto const& pair: needed) {
        auto const type = pair.first;
        auto const tuple = pair.second;
        auto const from  = rank_map.find(abc, type);

#ifdef HAVE_OCD
        WITH_RANK << "__db__:want:" << pretty_print(pair) << "\n";
        for (auto const& s: slices)
          WITH_RANK << "__db__:guts:ocd "
                    << s.info << " pt " << s.data
                    << "\n";
#endif

#ifdef HAVE_OCD
        WITH_RANK << "__db__: checking if exact match" << "\n";
#endif
        {
          // FIRST: look up if there is already a *Ready* slice matching what we
          // need
          auto const& it
            = std::find_if(slices.begin(), slices.end(),
                           [&tuple, &type](Slice<F> const& other) {
                             return other.info.tuple == tuple
                                 && other.info.type == type
                                    // we only want another slice when it
                                    // has already ready-to-use data
                                 && other.is_unwrappable()
                                 ;
                           });
          if (it != slices.end()) {
            // if we find this slice, it means that we don't have to do anything
            WITH_RANK << "__db__: EXACT: found EXACT in name=" << name
                      << " for tuple " << tuple[0] << ", " << tuple[1]
                      << " ptr " << it->data
                      << "\n";
            result.push_back({name, it->info});
            continue;
          }
        }

#ifdef HAVE_OCD
        WITH_RANK << "__db__: checking if recycle" << "\n";
#endif
        // Try to find a recycling possibility, i.e., find a slice with
        // the same tuple that has a valid data pointer.
        auto const& recycle_it
          = std::find_if(slices.begin(), slices.end(),
                         [&tuple, &type](Slice<F> const& other) {
                           return other.info.tuple == tuple
                               && other.info.type != type
                               && other.is_recyclable()
                               ;
                         });

        // if we find such a recycling candidate, then we look for a Blank
        // slice (which should exist by construction :THINK)
        //
        if (recycle_it != slices.end()) {
          auto& blank = Slice<F>::find_one_by_type(slices, Slice<F>::Blank);
          // TODO: formalize this through a method to copy information
          //       from another slice
          blank.data = recycle_it->data;
          blank.info.type = type;
          blank.info.tuple = tuple;
          blank.info.state = Slice<F>::Recycled;
          blank.info.from = from;
          blank.info.recycling = recycle_it->info.type;
          result.push_back({name, blank.info});
          WITH_RANK << "__db__: RECYCLING: n" << name
                    << " " << pretty_print(abc)
                    << " get " << pretty_print(blank.info)
                    << " from " << pretty_print(recycle_it->info)
                    << " ptr " << recycle_it->data
                    << "\n"
                    ;
          continue;
        }

        // in this case we have to create a new slice
        // this means that we should have a blank slice at our disposal
        // and also the free_pointers should have some elements inside,
        // so we pop a data pointer from the free_pointers container
#ifdef HAVE_OCD
        WITH_RANK << "__db__: none work, doing new" << "\n";
#endif
        {
          WITH_RANK << "__db__: NEW: finding blank in " << name
                    << " for type " << type
                    << " for tuple " << tuple[0] << ", " << tuple[1]
                    << "\n"
                    ;
          auto& blank = Slice<F>::find_one_by_type(slices, Slice<F>::Blank);
          blank.info.type = type;
          blank.info.tuple = tuple;
          blank.info.from = from;

          // Handle self sufficiency
          blank.info.state = Atrip::rank == from.rank
                           ? Slice<F>::SelfSufficient
                           : Slice<F>::Fetch
                           ;
          if (blank.info.state == Slice<F>::SelfSufficient) {
#if defined(HAVE_CUDA)
            blank.mpi_data = sources[from.source].data();
#else
            blank.data = sources[from.source].data();
#endif
          } else {
            if (free_pointers.size() == 0) {
              std::stringstream stream;
              stream << "No more free pointers "
                     << "for type " << type
                     << " and name " << name
                      ;
              throw std::domain_error(stream.str());
            }
            auto data_pointer = free_pointers.begin();
            // NOTE: read the pointer before erasing, since erase
            // invalidates the iterator
            blank.data = *data_pointer;
            free_pointers.erase(data_pointer);
          }

          result.push_back({name, blank.info});
          continue;
        }

      }

#ifdef HAVE_OCD
      for (auto const& s: slices)
        WITH_RANK << "__db__:guts:ocd:__end__ " << s.info << "\n";
#endif


      return result;

    }

    /*
     * Garbage collect slices not needed for the next iteration.
     *
     * It will throw if it tries to gc a slice that has not been
     * previously unwrapped, as a safety mechanism.
     */
    void clear_unused_slices_for_next_tuple(ABCTuple const& abc) {
      auto const needed = needed_slices_for_tuple(abc);

      // CLEAN UP SLICES, FREE THE ONES THAT ARE NOT NEEDED ANYMORE
      for (auto& slice: slices) {
        // if the slice is free, then it was not used anyways
        if (slice.is_free()) continue;


        // try to find the slice in the needed slices list
        auto const found
          = std::find_if(needed.begin(), needed.end(),
                         [&slice] (typename Slice<F>::Ty_x_Tu const& tytu) {
                           return slice.info.tuple == tytu.second
                               && slice.info.type == tytu.first
                               ;
                         });

        // if we did not find slice in needed, then erase it
        if (found == needed.end()) {

          // We have to be careful about the data pointer,
          // for SelfSufficient, the data pointer is a source pointer
          // of the slice, so we should just wipe it.
          //
          // For Ready slices, we have to be careful if there are some
          // recycled slices depending on it.
          bool free_slice_pointer = true;

          // allow to gc unwrapped and recycled, never Fetch,
          // if we have a Fetch slice then something has gone very wrong.
          if (!slice.is_unwrapped() && slice.info.state != Slice<F>::Recycled)
            throw
              std::domain_error("Trying to garbage collect "
                                " a non-unwrapped slice! "
                                + pretty_print(&slice)
                                + pretty_print(slice.info));

          // it can be that our slice is ready, but it has some hanging
          // references lying around in the form of a recycled slice.
          // Of course if we need the recycled slice the next iteration
          // this would be fatal, because we would then free the pointer
          // of the slice and at some point in the future we would
          // overwrite it. Therefore, we must check if slice has some
          // references in slices and if so then
          //
          //  - we should mark those references as the original (since the data
          //    pointer should be the same)
          //
          //  - we should make sure that the data pointer of slice
          //    does not get freed.
          //
          if (slice.info.state == Slice<F>::Ready) {
            WITH_OCD WITH_RANK
              << "__gc__:" << "checking for data recycled dependencies\n";
            auto recycled
              = Slice<F>::has_recycled_referencing_to_it(slices, slice.info);
            if (recycled.size()) {
              Slice<F>* new_ready = recycled[0];
              WITH_OCD WITH_RANK
                << "__gc__:" << "swaping recycled "
                << pretty_print(new_ready->info)
                << " and "
                << pretty_print(slice.info)
                << "\n";
              new_ready->mark_ready();
              assert(new_ready->data == slice.data);
              free_slice_pointer = false;

              for (size_t i = 1; i < recycled.size(); i++) {
                auto new_recycled = recycled[i];
                new_recycled->info.recycling = new_ready->info.type;
                WITH_OCD WITH_RANK
                  << "__gc__:" << "updating recycled "
                  << pretty_print(new_recycled->info)
                  << "\n";
              }

            }
          }

          // if the slice is self sufficient, do not dare touching the
          // pointer, since it is a pointer to our sources in our rank.
          if (  slice.info.state == Slice<F>::SelfSufficient
             || slice.info.state == Slice<F>::Recycled
             ) {
            free_slice_pointer = false;
          }

          // make sure we get its data pointer to be used later
          // only for non-recycled, since it can be that we need
          // for next iteration the data of the slice that the recycled points
          // to
          if (free_slice_pointer) {
            free_pointers.insert(slice.data);
            WITH_RANK << "~~~:cl(" << name << ")"
                      << " added to freePointer "
                      << pretty_print(free_pointers)
                      << "\n";
          } else {
            WITH_OCD WITH_RANK << "__gc__:not touching the free Pointer\n";
          }

          // at this point, let us blank the slice
          WITH_RANK << "~~~:cl(" << name << ")"
                    << " freeing up slice "
                    // TODO: make this possible because of Templates
                    // TODO: there is a deduction error here
                    // << " info " << slice.info
                    << "\n";
          slice.free();
        }

      }
    }

    // CONSTRUCTOR
    SliceUnion( std::vector<typename Slice<F>::Type> slice_types_
              , std::vector<size_t> slice_length_
              , std::vector<size_t> param_length
              , size_t np
              , MPI_Comm child_world
              , MPI_Comm global_world
              , typename Slice<F>::Name name_
              , size_t n_slice_buffers = 4
              )
              : rank_map(param_length, np, global_world)
              , world(child_world)
              , universe(global_world)
              , slice_length(slice_length_)
              , sources(rank_map.n_sources(),
                        std::vector<F>
                          (std::accumulate(slice_length.begin(),
                                           slice_length.end(),
                                           1UL, std::multiplies<size_t>())))
              , name(name_)
              , slice_types(slice_types_)
              , slice_buffers(n_slice_buffers)
              //, slices(2 * slice_types.size(), Slice<F>{ sources[0].size() })
    { // constructor begin

      LOG(0,"Atrip") << "INIT SliceUnion: " << name << "\n";

      for (auto& ptr: slice_buffers)
#if defined(HAVE_CUDA)
        cuMemAlloc(&ptr, sizeof(F) * sources[0].size());
#else
        ptr = (DataPtr<F>)malloc(sizeof(F) * sources[0].size());
#endif

      slices
        = std::vector<Slice<F>>(2 * slice_types.size(), { sources[0].size() });
      // TODO: think exactly    ^------------------- about this number

      // initialize the free_pointers with the pointers to the buffers
      std::transform(slice_buffers.begin(), slice_buffers.end(),
                     std::inserter(free_pointers, free_pointers.begin()),
                     [](DataPtr<F> ptr) { return ptr; });



      LOG(1,"Atrip") << "rank_map.n_sources "
                           << rank_map.n_sources() << "\n";
      LOG(1,"Atrip") << "#slices "
                           << slices.size() << "\n";
      LOG(1,"Atrip") << "#slices[0] "
                           << slices[0].size << "\n";
      LOG(1,"Atrip") << "#sources "
                           << sources.size() << "\n";
      LOG(1,"Atrip") << "#sources[0] "
                           << sources[0].size() << "\n";
      LOG(1,"Atrip") << "#free_pointers "
                           << free_pointers.size() << "\n";
      LOG(1,"Atrip") << "#slice_buffers "
                           << slice_buffers.size() << "\n";
      LOG(1,"Atrip") << "#slice_length "
                           << slice_length.size() << "\n";
      LOG(1,"Atrip") << "#param_length "
                           << param_length.size() << "\n";
      LOG(1,"Atrip") << "GB*" << np << " "
                           << double(sources.size() + slice_buffers.size())
                            * sources[0].size()
                            * 8 * np
                            / 1073741824.0
                           << "\n";
    } // constructor ends

    void init(Tensor const& source_tensor) {

      CTF::World w(world);
      const int rank = Atrip::rank
              , order = slice_length.size()
              ;
      std::vector<int> const syms(order, NS);
      std::vector<int> __slice_length(slice_length.begin(), slice_length.end());
      Tensor to_slice_into(order,
                         __slice_length.data(),
                         syms.data(),
                         w);
      LOG(1,"Atrip") << "slicing... \n";

      // setUp sources
      for (size_t it(0); it < rank_map.n_sources(); ++it) {
        const size_t
          source = rank_map.is_source_padding(rank, it) ? 0 : it;
        WITH_OCD
        WITH_RANK
          << "Init:to_slice_into it-" << it
          << " :: source " << source << "\n";
        slice_into_buffer(source, to_slice_into, source_tensor);
      }

    }

    /**
     * \brief Send asynchronously only if the state is Fetch
     */
    void send( size_t other_rank
             , typename Slice<F>::LocalDatabaseElement const& el
             , size_t tag) const noexcept {
      MPI_Request request;
      bool send_data_p = false;
      auto const& info = el.info;

      if (info.state == Slice<F>::Fetch) send_data_p = true;
      // TODO: remove this because I have SelfSufficient
      if (other_rank == info.from.rank)      send_data_p = false;
      if (!send_data_p) return;

      MPI_Isend( sources[info.from.source].data()
               , sources[info.from.source].size()
               , traits::mpi::datatype_of<F>()
               , other_rank
               , tag
               , universe
               , &request
               );
      WITH_CRAZY_DEBUG
      WITH_RANK << "sent to " << other_rank << "\n";

    }

    /**
     * \brief Receive asynchronously only if the state is Fetch
     */
    void receive(typename Slice<F>::Info const& info, size_t tag) noexcept {
      auto& slice = Slice<F>::find_by_info(slices, info);

      if (Atrip::rank == info.from.rank) return;

      if (slice.info.state == Slice<F>::Fetch) {
        // TODO: do it through the slice class
        slice.info.state = Slice<F>::Dispatched;
#if defined(HAVE_CUDA)
        slice.mpi_data = (F*)malloc(sizeof(F) * slice.size);
        MPI_Irecv( slice.mpi_data
#else
        MPI_Irecv( slice.data
#endif
                 , slice.size
                 , traits::mpi::datatype_of<F>()
                 , info.from.rank
                 , tag
                 , universe
                 , &slice.request
                //, MPI_STATUS_IGNORE
                 );
      }
    }

    void unwrap_all(ABCTuple const& abc) {
      for (auto type: slice_types) unwrap_slice(type, abc);
    }

    DataPtr<F> unwrap_slice(typename Slice<F>::Type type, ABCTuple const& abc) {
      WITH_CRAZY_DEBUG
      WITH_RANK << "__unwrap__:slice " << type << " w n "
                << name
                << " abc" << pretty_print(abc)
                << "\n";
      auto& slice = Slice<F>::find_type_abc(slices, type, abc);
      //WITH_RANK << "__unwrap__:info " << slice.info << "\n";
      switch  (slice.info.state) {
        case Slice<F>::Dispatched:
          WITH_RANK << "__unwrap__:Fetch: " << &slice
                    << " info " << pretty_print(slice.info)
                    << "\n";
          slice.unwrap_and_mark_ready();
          return slice.data;
          break;
        case Slice<F>::SelfSufficient:
          WITH_RANK << "__unwrap__:SelfSufficient: " << &slice
                    << " info " << pretty_print(slice.info)
                    << "\n";
          return slice.data;
          break;
        case Slice<F>::Ready:
          WITH_RANK << "__unwrap__:READY: UNWRAPPED ALREADY" << &slice
                    << " info " << pretty_print(slice.info)
                    << "\n";
          return slice.data;
          break;
        case Slice<F>::Recycled:
          WITH_RANK << "__unwrap__:RECYCLED " << &slice
                    << " info " << pretty_print(slice.info)
                    << "\n";
          return unwrap_slice(slice.info.recycling, abc);
          break;
        case Slice<F>::Fetch:
        case Slice<F>::Acceptor:
          throw std::domain_error("Can't unwrap an acceptor or fetch slice!");
          break;
        default:
          throw std::domain_error("Unknown error unwrapping slice!");
      }
      return slice.data;
    }

    ~SliceUnion() {
      for (auto& ptr: slice_buffers)
#if defined(HAVE_CUDA)
        cuMemFree(ptr);
#else
        std::free(ptr);
#endif
    }

    const RankMap<F> rank_map;
    const MPI_Comm world;
    const MPI_Comm universe;
    const std::vector<size_t> slice_length;
    std::vector< std::vector<F> > sources;
    std::vector< Slice<F> > slices;
    typename Slice<F>::Name name;
    const std::vector<typename Slice<F>::Type> slice_types;
    std::vector< DataPtr<F> > slice_buffers;
    std::set< DataPtr<F> > free_pointers;

  };

  template <typename F=double>
  SliceUnion<F>&
  union_by_name(std::vector<SliceUnion<F>*> const& unions,
              typename Slice<F>::Name name) {
      const auto slice_union_it
        = std::find_if(unions.begin(), unions.end(),
                      [&name](SliceUnion<F> const* s) {
                        return name == s->name;
                      });
      if (slice_union_it == unions.end()) {
        std::stringstream stream;
        stream << "SliceUnion(" << name << ") not found!";
        throw std::domain_error(stream.str());
      }
      return **slice_union_it;
  }
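
A lookup sketch (the vector of unions is hypothetical; in practice it holds one SliceUnion per tensor Name):

std::vector<SliceUnion<double>*> unions = { /* one entry per Name */ };
SliceUnion<double>& vabci = union_by_name(unions, Slice<double>::VABCI);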

Epilog

}

Tuples

This section introduces the types for tuples \( (a,b,c) \) as well as their distribution to nodes and cores.

Prolog

#pragma once

#include <vector>
#include <array>
#include <numeric>

// TODO: remove some
#include <stdio.h>
#include <math.h>
#include <algorithm>
#include <map>
#include <cassert>
#include <chrono>
#include <climits>
#include <mpi.h>

#include <atrip/Utils.hpp>
#include <atrip/Debug.hpp>

namespace atrip {

Tuples types

The main tuple types are simple type aliases for finite-size arrays. A tuple is thus simply 3 natural numbers \( (a,b,c) \) whereas a partial tuple is a two dimensional subset of these three.

using ABCTuple = std::array<size_t, 3>;
using PartialTuple = std::array<size_t, 2>;
using ABCTuples = std::vector<ABCTuple>;

constexpr ABCTuple FAKE_TUPLE = {0, 0, 0};
constexpr ABCTuple INVALID_TUPLE = {1, 1, 1};

Distributing the tuples

In general it is our task to distribute all the tuples \( (a,b,c) \) among the ranks. Every distribution should make sure to allocate the same number of tuples to every rank, padding the list with FAKE_TUPLE elements as necessary.

The interface that we propose for this is simply

struct TuplesDistribution {
  virtual ABCTuples get_tuples(size_t Nv, MPI_Comm universe) = 0;
  virtual bool tuple_is_fake(ABCTuple const& t) { return t == FAKE_TUPLE; }
};

Node information

node_list
List of hostnames, of size \( N_n \) (the number of nodes).
node_infos
List of (hostname, local rank id) pairs, of size \( N_p \), i.e., the number of ranks, where the local rank id runs from 0 up to the number of ranks per node minus one (e.g., 47 on a 48-rank node).

get_node_names gets the node name for every rank; the number of distinct names in the resulting vector gives the number of nodes.

std::vector<std::string> get_node_names(MPI_Comm comm){
  int rank, np;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &np);

  std::vector<std::string> node_list(np);
  char node_name[MPI_MAX_PROCESSOR_NAME];
  char *node_names = (char*)malloc(np * MPI_MAX_PROCESSOR_NAME);
  std::vector<int> name_lengths(np)
                 , off(np)
                 ;
  int name_length;
  MPI_Get_processor_name(node_name, &name_length);
  MPI_Allgather(&name_length,
                1,
                MPI_INT,
                name_lengths.data(),
                1,
                MPI_INT,
                comm);
  for (int i(1); i < np; i++)
    off[i] = off[i-1] + name_lengths[i-1];
  MPI_Allgatherv(node_name,
                 name_lengths[rank],
                 MPI_BYTE,
                 node_names,
                 name_lengths.data(),
                 off.data(),
                 MPI_BYTE,
                 comm);
  for (int i(0); i < np; i++) {
    std::string const s(&node_names[off[i]], name_lengths[i]);
    node_list[i] = s;
  }
  std::free(node_names);
  return node_list;
}

get_node_infos

struct RankInfo {
  const std::string name;
  const size_t node_id;
  const size_t global_rank;
  const size_t local_rank;
  const size_t ranks_per_node;
};

template <typename A>
A unique(A const &xs) {
  auto result = xs;
  std::sort(std::begin(result), std::end(result));
  auto const& last = std::unique(std::begin(result), std::end(result));
  result.erase(last, std::end(result));
  return result;
}

std::vector<RankInfo>
get_node_infos(std::vector<std::string> const& node_names) {
  std::vector<RankInfo> result;
  auto const unique_names = unique(node_names);
  auto const index = [&unique_names](std::string const& s) {
    auto const& it = std::find(unique_names.begin(), unique_names.end(), s);
    return std::distance(unique_names.begin(), it);
  };
  std::vector<size_t> local_ranks(unique_names.size(), 0);
  size_t global_rank = 0;
  for (auto const& name: node_names) {
    const size_t node_id = index(name);
    result.push_back({name,
                      node_id,
                      global_rank++,
                      local_ranks[node_id]++,
                      (size_t)
                      std::count(node_names.begin(),
                                 node_names.end(),
                                 name)
                      });
  }
  return result;
}
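
For example, for two nodes with two ranks each (a sketch):

// get_node_infos({"n0", "n1", "n0", "n1"}) yields, in order,
//   { "n0", /*node_id*/ 0, /*global_rank*/ 0, /*local_rank*/ 0, /*ranks_per_node*/ 2 }
//   { "n1", 1, 1, 0, 2 }
//   { "n0", 0, 2, 1, 2 }
//   { "n1", 1, 3, 1, 2 }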

struct ClusterInfo {
  const size_t n_nodes, np, ranks_per_node;
  const std::vector<RankInfo> rank_infos;
};

ClusterInfo
get_cluster_info(MPI_Comm comm) {
  auto const names = get_node_names(comm);
  auto const rank_infos = get_node_infos(names);

  return ClusterInfo {
    unique(names).size(),
    names.size(),
    rank_infos[0].ranks_per_node,
    rank_infos
  };

}

Naive list

The naive implementation of the global tuples list is simply three for loops creating tuples of the sort \( (a,b,c) \), where the following conditions are met at the same time:

  • \( a \leq b \leq c \)
  • \( \lnot (a = b \land b = c) \), i.e., not all three indices are equal

This means that \( (1, 2, 3) \), \( (1, 1, 3) \) and \( (1, 2, 2) \) are acceptable tuples whereas \( (2, 1, 1) \) and \( (1, 1, 1) \) are not.

ABCTuples get_tuples_list(size_t Nv, size_t rank, size_t np) {

  const size_t
    // total number of tuples for the problem
       n = Nv * (Nv + 1) * (Nv + 2) / 6 - Nv

    // all ranks should have the same number of tuples_per_rank
    , tuples_per_rank = n / np + size_t(n % np != 0)

    // start index for the global tuples list
    , start = tuples_per_rank * rank

    // end index for the global tuples list
    , end = tuples_per_rank * (rank + 1)
    ;

  LOG(1,"Atrip") << "tuples_per_rank = " << tuples_per_rank << "\n";
  WITH_RANK << "start, end = " << start << ", " << end << "\n";
  ABCTuples result(tuples_per_rank, FAKE_TUPLE);

  for (size_t a(0), r(0), g(0); a < Nv; a++)
  for (size_t b(a);             b < Nv; b++)
  for (size_t c(b);             c < Nv; c++){
    if ( a == b && b == c ) continue;
    if ( start <= g && g < end) result[r++] = {a, b, c};
    g++;
  }

  return result;

}

and all tuples would simply be

ABCTuples get_all_tuples_list(const size_t Nv) {
  const size_t n = Nv * (Nv + 1) * (Nv + 2) / 6 - Nv;
  ABCTuples result(n);

  for (size_t a(0), u(0); a < Nv; a++)
  for (size_t b(a); b < Nv; b++)
  for (size_t c(b); c < Nv; c++){
    if ( a == b && b == c ) continue;
    result[u++] = {a, b, c};
  }

  return result;
}
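
For instance, for \( N_\mathrm{v} = 3 \) the formula gives \( n = 3 \cdot 4 \cdot 5 / 6 - 3 = 7 \), and the enumeration order is:

// get_all_tuples_list(3) yields
// {0,0,1} {0,0,2} {0,1,1} {0,1,2} {0,2,2} {1,1,2} {1,2,2}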

With get_tuples_list we can easily define a tuple distribution like

struct NaiveDistribution : public TuplesDistribution {
  ABCTuples get_tuples(size_t Nv, MPI_Comm universe) override {
    int rank, np;
    MPI_Comm_rank(universe, &rank);
    MPI_Comm_size(universe, &np);
    return get_tuples_list(Nv, (size_t)rank, (size_t)np);
  }
};

Group and sort list

Prolog

namespace group_and_sort {

Utils

// Provides the node on which the slice-element is found
// Right now we distribute the slices in a round robin fashion
// over the different nodes (NOTE: not mpi ranks but nodes)
inline
size_t is_on_node(size_t tuple, size_t n_nodes) { return tuple % n_nodes; }


// return the node (or all nodes) where the elements of this
// tuple are located
std::vector<size_t> get_tuple_nodes(ABCTuple const& t, size_t n_nodes) {
  std::vector<size_t>
    n_tuple = { is_on_node(t[0], n_nodes)
             , is_on_node(t[1], n_nodes)
             , is_on_node(t[2], n_nodes)
             };
  return unique(n_tuple);
}
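
For example, with 3 nodes (a sketch):

// for t = {11, 8, 9} and n_nodes = 3,
// is_on_node gives 11 % 3 = 2, 8 % 3 = 2 and 9 % 3 = 0,
// so get_tuple_nodes returns the unique node list {0, 2}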

struct Info {
  size_t n_nodes;
  size_t node_id;
};

Distribution

Wording: a home element is an element which is located on the given node.

  1. we distribute the tuples such that each tuple has at least one home element
  2. we sort each tuple in a way that the home elements are the fastest indices
  3. we sort the list of tuples on every node
  4. we re-sort each tuple so that for every tuple \( (a,b,c) \) it holds that \( a \leq b \leq c \)
ABCTuples special_distribution(Info const& info, ABCTuples const& all_tuples) {

  ABCTuples node_tuples;
  size_t const n_nodes(info.n_nodes);

  std::vector<ABCTuples>
      container1d(n_nodes)
    , container2d(n_nodes * n_nodes)
    , container3d(n_nodes * n_nodes * n_nodes)
    ;

  WITH_DBG if (info.node_id == 0)
    std::cout << "\tGoing through all "
              << all_tuples.size()
              << " tuples in "
              << n_nodes
              << " nodes\n";

  // build container-n-d's
  for (auto const& t: all_tuples) {
    // find on which node(s) the tuple elements are located...
    // put them into the right container
    auto const _nodes = get_tuple_nodes(t, n_nodes);

    switch (_nodes.size()) {
      case 1:
        container1d[_nodes[0]].push_back(t);
        break;
      case 2:
        container2d[ _nodes[0]
                   + _nodes[1] * n_nodes
                   ].push_back(t);
        break;
      case 3:
        container3d[ _nodes[0]
                   + _nodes[1] * n_nodes
                   + _nodes[2] * n_nodes * n_nodes
                   ].push_back(t);
        break;
    }

  }

  WITH_DBG if (info.node_id == 0)
    std::cout << "\tBuilding 1-d containers\n";
  // DISTRIBUTE 1-d containers
  // every tuple which is only located at one node belongs to this node
  {
    auto const& _tuples = container1d[info.node_id];
    node_tuples.resize(_tuples.size(), INVALID_TUPLE);
    std::copy(_tuples.begin(), _tuples.end(), node_tuples.begin());
  }

  WITH_DBG if (info.node_id == 0)
    std::cout << "\tBuilding 2-d containers\n";
  // DISTRIBUTE 2-d containers
  // the tuples which are located on two nodes are split half/half
  // between these two nodes
  for (size_t yx = 0; yx < container2d.size(); yx++) {

    auto const& _tuples = container2d[yx];
      const
    size_t idx = yx % n_nodes
         // remember: yx = idy * n_nodes + idx
         , idy = yx / n_nodes
         , n_half = _tuples.size() / 2
         , size = node_tuples.size()
         ;

    size_t nbeg, nend;
    if (info.node_id == idx) {
      nbeg = 0 * n_half;
      nend = n_half;
    } else if (info.node_id == idy) {
      nbeg = 1 * n_half;
      nend = _tuples.size();
    } else {
      // neither idx nor idy is my node
      continue;
    }

    size_t const nextra = nend - nbeg;
    node_tuples.resize(size + nextra, INVALID_TUPLE);
    std::copy(_tuples.begin() + nbeg,
              _tuples.begin() + nend,
              node_tuples.begin() + size);

  }

  WITH_DBG if (info.node_id == 0)
    std::cout << "\tBuilding 3-d containers\n";
  // DISTRIBUTE 3-d containers
  for (size_t zyx = 0; zyx < container3d.size(); zyx++) {
    auto const& _tuples = container3d[zyx];

      const
    size_t idx = zyx % n_nodes
         , idy = (zyx / n_nodes) % n_nodes
         // remember: zyx = idx + idy * n_nodes + idz * n_nodes^2
         , idz = zyx / n_nodes / n_nodes
         , n_third = _tuples.size() / 3
         , size = node_tuples.size()
         ;

    size_t nbeg, nend;
    if (info.node_id == idx) {
      nbeg = 0 * n_third;
      nend = 1 * n_third;
    } else if (info.node_id == idy) {
      nbeg = 1 * n_third;
      nend = 2 * n_third;
    } else if (info.node_id == idz) {
      nbeg = 2 * n_third;
      nend = _tuples.size();
    } else {
      // neither idx, idy nor idz is my node
      continue;
    }

    size_t const nextra = nend - nbeg;
    node_tuples.resize(size + nextra, INVALID_TUPLE);
    std::copy(_tuples.begin() + nbeg,
              _tuples.begin() + nend,
              node_tuples.begin() + size);

  }


  WITH_DBG if (info.node_id == 0) std::cout << "\tswapping tuples...\n";
  /*
   *  sort part of group-and-sort algorithm
   *  every tuple on a given node is sorted in a way that
   *  the 'home elements' are the fastest index.
   *  1:yyy 2:yyn(x) 3:yny(x) 4:ynn(x) 5:nyy 6:nyn(x) 7:nny 8:nnn
   */
  for (auto &nt: node_tuples){
    if ( is_on_node(nt[0], n_nodes) == info.node_id ){ // 1234
      if ( is_on_node(nt[2], n_nodes) != info.node_id ){ // 24
        size_t const x(nt[0]);
        nt[0] = nt[2];         // switch first and last
        nt[2] = x;
      }
      else if ( is_on_node(nt[1], n_nodes) != info.node_id){ // 3
        size_t const x(nt[0]);
        nt[0] = nt[1];         // switch first two
        nt[1] = x;
      }
    } else {
      if ( is_on_node(nt[1], n_nodes) == info.node_id   // 56
        && is_on_node(nt[2], n_nodes) != info.node_id
        ) { // 6
        size_t const x(nt[1]);
        nt[1] = nt[2];         // switch last two
        nt[2] = x;
      }
    }
  }

  WITH_DBG if (info.node_id == 0) std::cout << "\tsorting list of tuples...\n";
  //now we sort the list of tuples
  std::sort(node_tuples.begin(), node_tuples.end());

  WITH_DBG if (info.node_id == 0) std::cout << "\trestoring tuples...\n";
  // we bring the tuples abc back in the order a<b<c
  for (auto &t: node_tuples)  std::sort(t.begin(), t.end());

#if ATRIP_DEBUG > 1
  WITH_DBG if (info.node_id == 0)
  std::cout << "checking for validity of " << node_tuples.size() << std::endl;
  const bool any_invalid
    = std::any_of(node_tuples.begin(),
                  node_tuples.end(),
                  [](ABCTuple const& t) { return t == INVALID_TUPLE; });
  if (any_invalid) throw "Some tuple is invalid in group-and-sort algorithm";
#endif

  WITH_DBG if (info.node_id == 0) std::cout << "\treturning tuples...\n";
  return node_tuples;

}

Main

The main routine should return the list of tuples to be handled by the current rank.

Let \( N_p \) be the number of ranks or processes. Let \( N_n \) be the number of nodes or sockets.

Then we have, as an example layout:

Global rank  | 0 1 2 3 4 5 6 7 8
key          | global rank
node_id      | 0 1 0 1 1 0 2 2 2
Local rank   | 0 0 1 1 2 2 0 1 2
intra color  | 0 1 0 1 1 0 2 2 2
std::vector<ABCTuple> main(MPI_Comm universe, size_t Nv) {

  int rank, np;
  MPI_Comm_rank(universe, &rank);
  MPI_Comm_size(universe, &np);

  std::vector<ABCTuple> result;

  auto const node_names(get_node_names(universe));
  size_t const n_nodes = unique(node_names).size();
  auto const node_infos = get_node_infos(node_names);

  // We want to construct a communicator which contains only one
  // rank per node
  bool const compute_distribution_p
    = node_infos[rank].local_rank == 0;

  std::vector<ABCTuple>
    node_tuples
      = compute_distribution_p
      ? special_distribution(Info{n_nodes, node_infos[rank].node_id},
                            get_all_tuples_list(Nv))
      : std::vector<ABCTuple>()
      ;

  LOG(1,"Atrip") << "got node_tuples\n";

  // now we have to send the data from **one** rank on each node
  // to all others ranks of this node
    const
  int color = node_infos[rank].node_id
    , key = node_infos[rank].local_rank
    ;


  MPI_Comm INTRA_COMM;
  MPI_Comm_split(universe, color, key, &INTRA_COMM);

Every rank of a node has to receive **at least** node_tuples.size() / node_infos[rank].ranks_per_node tuples, i.e. this quotient rounded up.

We have to communicate this quantity among all nodes.

size_t const
  tuples_per_rank_local
     = node_tuples.size() / node_infos[rank].ranks_per_node
     + size_t(node_tuples.size() % node_infos[rank].ranks_per_node != 0)
     ;

size_t tuples_per_rank_global;

// take the maximum of the local per-rank counts over all ranks...
MPI_Reduce(&tuples_per_rank_local,
           &tuples_per_rank_global,
           1,
           MPI_UINT64_T,
           MPI_MAX,
           0,
           universe);

// ...and make the result known to everyone
MPI_Bcast(&tuples_per_rank_global,
          1,
          MPI_UINT64_T,
          0,
          universe);

LOG(1,"Atrip") << "Tuples per rank: " << tuples_per_rank_global << "\n";
LOG(1,"Atrip") << "ranks per node " << node_infos[rank].ranks_per_node << "\n";
LOG(1,"Atrip") << "#nodes " << n_nodes << "\n";

Now we know how many tuples every rank will hold, namely tuples_per_rank_global.

Before scattering, the tuples in node_tuples have to be sent from the local root rank of every node to all other ranks of that node, and every rank inside a given node must receive the same number of tuples, namely tuples_per_rank_global. The total number of tuples on a node should therefore be tuples_per_rank_global * node_infos[rank].ranks_per_node, which in general does not hold yet due to divisibility issues.

Up to now the node holds exactly node_tuples.size() tuples, so we resize the list and append FAKE_TUPLEs at the end as padding until the condition above is met.

size_t const total_tuples
  = tuples_per_rank_global * node_infos[rank].ranks_per_node;

if (compute_distribution_p) {
  // pad with FAKE_TUPLEs
  node_tuples.insert(node_tuples.end(),
                    total_tuples - node_tuples.size(),
                    FAKE_TUPLE);
}

And now we can simply scatter the tuples in node_tuples, sending tuples_per_rank_global of them to every rank in the node,

{
  // construct the mpi type for an ABCTuple; note that we take the
  // static size of the array type, since node_tuples is empty on
  // ranks which did not compute the distribution and must not be
  // indexed there
  MPI_Datatype MPI_ABCTUPLE;
  MPI_Type_vector(std::tuple_size<ABCTuple>::value, 1, 1, MPI_UINT64_T,
                  &MPI_ABCTUPLE);
  MPI_Type_commit(&MPI_ABCTUPLE);

  LOG(1,"Atrip") << "scattering tuples \n";

  result.resize(tuples_per_rank_global);
  MPI_Scatter(node_tuples.data(),
              tuples_per_rank_global,
              MPI_ABCTUPLE,
              result.data(),
              tuples_per_rank_global,
              MPI_ABCTUPLE,
              0,
              INTRA_COMM);

  MPI_Type_free(&MPI_ABCTUPLE);

}

Since the padding above makes node_tuples homogeneous and divisible by the number of ranks per node, the plain MPI_Scatter with the constant count tuples_per_rank_global shown above suffices. With the MPI function MPI_Scatterv this could be spelled out explicitly: every rank gets tuples_per_rank_global tuples and the displacements are simply the vector $$ \left\{ k \cdot \mathrm{tuples\_per\_rank\_global} \mid k ∈ \left\{ 0 , \ldots , \#\text{ranks in node} - 1 \right\} \right\} $$

and the send-counts vector is the constant vector tuples_per_rank_global of size ranks_per_node.
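
A hedged sketch of that equivalent call (not tangled into the build; it assumes the MPI_ABCTUPLE datatype from above is still committed):

{
  // sketch: the equivalent MPI_Scatterv, counts and displacements in
  // units of MPI_ABCTUPLE
  const int rpn = (int)node_infos[rank].ranks_per_node;
  std::vector<int> send_counts(rpn, (int)tuples_per_rank_global)
                 , displacements(rpn);
  for (int k = 0; k < rpn; k++)
    displacements[k] = k * (int)tuples_per_rank_global;
  MPI_Scatterv(node_tuples.data(),
               send_counts.data(),
               displacements.data(),
               MPI_ABCTUPLE,
               result.data(),
               (int)tuples_per_rank_global,
               MPI_ABCTUPLE,
               0,
               INTRA_COMM);
}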

  return result;

}

Interface

The distribution interface will then simply be

struct Distribution : public TuplesDistribution {
  ABCTuples get_tuples(size_t Nv, MPI_Comm universe) override {
    return main(universe, Nv);
  }
};

Epilog

} // namespace group_and_sort

Epilog

}

Unions

Each tensor appearing in the equations is sliced in its own way; every such slicing therefore gets its own SliceUnion subclass.

#pragma once
#include <cstring> // for memcpy below
#include <atrip/SliceUnion.hpp>

namespace atrip {

  template <typename F=double>
  void slice_into_vector
    ( std::vector<F> &v
    , CTF::Tensor<F> &to_slice
    , std::vector<int64_t> const low
    , std::vector<int64_t> const up
    , CTF::Tensor<F> const& origin
    , std::vector<int64_t> const origin_low
    , std::vector<int64_t> const origin_up
    ) {
    // Thank you CTF for forcing me to do this
    struct { std::vector<int> up, low; }
        to_slice_ = { {up.begin(), up.end()}
                   , {low.begin(), low.end()} }
      , origin_ = { {origin_up.begin(), origin_up.end()}
                  , {origin_low.begin(), origin_low.end()} }
      ;

    WITH_OCD
    WITH_RANK << "slicing into " << pretty_print(to_slice_.up)
                          << "," << pretty_print(to_slice_.low)
              << " from " << pretty_print(origin_.up)
                   << "," << pretty_print(origin_.low)
              << "\n";

#ifndef ATRIP_DONT_SLICE
    to_slice.slice( to_slice_.low.data()
                 , to_slice_.up.data()
                 , 0.0
                 , origin
                 , origin_.low.data()
                 , origin_.up.data()
                 , 1.0);
    memcpy(v.data(), to_slice.data, sizeof(F) * v.size());
#endif

  }


  template <typename F=double>
  struct TAPHH : public SliceUnion<F> {
    TAPHH( CTF::Tensor<F> const& source_tensor
         , size_t No
         , size_t Nv
         , size_t np
         , MPI_Comm child_world
         , MPI_Comm global_world
         ) : SliceUnion<F>( {Slice<F>::A, Slice<F>::B, Slice<F>::C}
                          , {Nv, No, No} // size of the slices
                          , {Nv}
                          , np
                          , child_world
                          , global_world
                          , Slice<F>::TA
                          , 6) {
           this->init(source_tensor);
         }

    void slice_into_buffer(size_t it, CTF::Tensor<F> &to, CTF::Tensor<F> const& from) override
    {
      const int Nv = this->slice_length[0]
              , No = this->slice_length[1]
              , a = this->rank_map.find({static_cast<size_t>(Atrip::rank), it})
              ;


      slice_into_vector<F>( this->sources[it]
                        , to,   {0, 0, 0},    {Nv, No, No}
                        , from, {a, 0, 0, 0}, {a+1, Nv, No, No}
                        );

    }

  };


  template <typename F=double>
  struct HHHA : public SliceUnion<F> {
    HHHA( CTF::Tensor<F> const& source_tensor
        , size_t No
        , size_t Nv
        , size_t np
        , MPI_Comm child_world
        , MPI_Comm global_world
        ) : SliceUnion<F>( {Slice<F>::A, Slice<F>::B, Slice<F>::C}
                         , {No, No, No} // size of the slices
                         , {Nv}         // size of the parametrization
                         , np
                         , child_world
                         , global_world
                         , Slice<F>::VIJKA
                         , 6) {
           this->init(source_tensor);
         }

    void slice_into_buffer(size_t it, CTF::Tensor<F> &to, CTF::Tensor<F> const& from) override
    {

      const int No = this->slice_length[0]
              , a = this->rank_map.find({static_cast<size_t>(Atrip::rank), it})
              ;

      slice_into_vector<F>( this->sources[it]
                        , to,   {0, 0, 0},    {No, No, No}
                        , from, {0, 0, 0, a}, {No, No, No, a+1}
                        );

    }
  };

  template <typename F=double>
  struct ABPH : public SliceUnion<F> {
    ABPH( CTF::Tensor<F> const& source_tensor
        , size_t No
        , size_t Nv
        , size_t np
        , MPI_Comm child_world
        , MPI_Comm global_world
        ) : SliceUnion<F>( { Slice<F>::AB, Slice<F>::BC, Slice<F>::AC
                           , Slice<F>::BA, Slice<F>::CB, Slice<F>::CA
                           }
                         , {Nv, No} // size of the slices
                         , {Nv, Nv} // size of the parametrization
                         , np
                         , child_world
                         , global_world
                         , Slice<F>::VABCI
                         , 2*6) {
           this->init(source_tensor);
         }

    void slice_into_buffer(size_t it, CTF::Tensor<F> &to, CTF::Tensor<F> const& from) override {

      const int Nv = this->slice_length[0]
              , No = this->slice_length[1]
              , el = this->rank_map.find({static_cast<size_t>(Atrip::rank), it})
              , a = el % Nv
              , b = el / Nv
              ;


      slice_into_vector<F>( this->sources[it]
                        , to,   {0, 0},       {Nv, No}
                        , from, {a, b, 0, 0}, {a+1, b+1, Nv, No}
                        );

    }

  };

  template <typename F=double>
  struct ABHH : public SliceUnion<F> {
    ABHH( CTF::Tensor<F> const& source_tensor
        , size_t No
        , size_t Nv
        , size_t np
        , MPI_Comm child_world
        , MPI_Comm global_world
        ) : SliceUnion<F>( {Slice<F>::AB, Slice<F>::BC, Slice<F>::AC}
                         , {No, No} // size of the slices
                         , {Nv, Nv} // size of the parametrization
                         , np
                         , child_world
                         , global_world
                         , Slice<F>::VABIJ
                         , 6) {
           this->init(source_tensor);
         }

    void slice_into_buffer(size_t it, CTF::Tensor<F> &to, CTF::Tensor<F> const& from) override {

      const int Nv = from.lens[0]
              , No = this->slice_length[1]
              , el = this->rank_map.find({static_cast<size_t>(Atrip::rank), it})
              , a = el % Nv
              , b = el / Nv
              ;

      slice_into_vector<F>( this->sources[it]
                        , to,   {0, 0},       {No, No}
                        , from, {a, b, 0, 0}, {a+1, b+1, No, No}
                        );


    }

  };


  template <typename F=double>
  struct TABHH : public SliceUnion<F> {
    TABHH( CTF::Tensor<F> const& source_tensor
         , size_t No
         , size_t Nv
         , size_t np
         , MPI_Comm child_world
         , MPI_Comm global_world
         ) : SliceUnion<F>( {Slice<F>::AB, Slice<F>::BC, Slice<F>::AC}
                          , {No, No} // size of the slices
                          , {Nv, Nv} // size of the parametrization
                          , np
                          , child_world
                          , global_world
                          , Slice<F>::TABIJ
                          , 6) {
           this->init(source_tensor);
         }

    void slice_into_buffer(size_t it, CTF::Tensor<F> &to, CTF::Tensor<F> const& from) override {
      // TODO: maybe generalize this with ABHH

      const int Nv = from.lens[0]
              , No = this->slice_length[1]
              , el = this->rank_map.find({static_cast<size_t>(Atrip::rank), it})
              , a = el % Nv
              , b = el / Nv
              ;

      slice_into_vector<F>( this->sources[it]
                        , to,   {0, 0},       {No, No}
                        , from, {a, b, 0, 0}, {a+1, b+1, No, No}
                        );


    }

  };

}

Equations

This section presents one of the main physical contents of the program, the equations used.

The equations are of two types, energy equations and intermediate tensor contractions.

Prolog

#pragma once

#include<atrip/Atrip.hpp>
#include<atrip/Blas.hpp>
#include<atrip/Utils.hpp>

#if defined(HAVE_CUDA)
#include<thrust/device_vector.h>
#endif


namespace atrip {
using ABCTuple = std::array<size_t, 3>;
using PartialTuple = std::array<size_t, 2>;
using ABCTuples = std::vector<ABCTuple>;
#include<atrip/Equations.hpp>

#if defined(HAVE_CUDA)
#include <cuda.h>
#endif

namespace atrip {

Energy

For the energy we have two kinds of functions: one for tuples \( (a,b,c) \) whose elements are pairwise distinct, and one for tuples in which two elements coincide. They are accordingly called:

template <typename F=double>
double get_energy_distinct
  ( F const epsabc
  , size_t const No
  , F* const epsi
  , F* const Tijk
  , F* const Zijk
  );

template <typename F=double>
double get_energy_same
  ( F const epsabc
  , size_t const No
  , F* const epsi
  , F* const Tijk
  , F* const Zijk
  );

Their implementations follow; they are written in a cache-friendly, blocked style for CPU architectures, with the tile size set by the block_size variable.

template <typename F=double>
double get_energy_distinct
  ( F const epsabc
  , size_t const No
  , F* const epsi
  , F* const Tijk
  , F* const Zijk
  ) {
  constexpr size_t block_size=16;
  F energy(0.);
  for (size_t kk=0; kk<No; kk+=block_size){
    const size_t kend( std::min(No, kk+block_size) );
    for (size_t jj(kk); jj<No; jj+=block_size){
      const size_t jend( std::min( No, jj+block_size) );
      for (size_t ii(jj); ii<No; ii+=block_size){
        const size_t iend( std::min( No, ii+block_size) );
        for (size_t k(kk); k < kend; k++){
          const F ek(epsi[k]);
          const size_t jstart = jj > k ? jj : k;
          for (size_t j(jstart); j < jend; j++){
            F const ej(epsi[j]);
            F const facjk = j == k ? F(0.5) : F(1.0);
            size_t istart = ii > j ? ii : j;
            for (size_t i(istart); i < iend; i++){
              const F
                  ei(epsi[i])
                , facij = i == j ? F(0.5) : F(1.0)
                , denominator(epsabc - ei - ej - ek)
                , U(Zijk[i + No*j + No*No*k])
                , V(Zijk[i + No*k + No*No*j])
                , W(Zijk[j + No*i + No*No*k])
                , X(Zijk[j + No*k + No*No*i])
                , Y(Zijk[k + No*i + No*No*j])
                , Z(Zijk[k + No*j + No*No*i])
                , A(maybe_conjugate<F>(Tijk[i + No*j + No*No*k]))
                , B(maybe_conjugate<F>(Tijk[i + No*k + No*No*j]))
                , C(maybe_conjugate<F>(Tijk[j + No*i + No*No*k]))
                , D(maybe_conjugate<F>(Tijk[j + No*k + No*No*i]))
                , E(maybe_conjugate<F>(Tijk[k + No*i + No*No*j]))
                , _F(maybe_conjugate<F>(Tijk[k + No*j + No*No*i]))
                , value
                  = 3.0 * ( A * U
                            + B * V
                            + C * W
                            + D * X
                            + E * Y
                            + _F * Z )
                 + ( ( U + X + Y )
                   - 2.0 * ( V + W + Z )
                   ) * ( A + D + E )
                 + ( ( V + W + Z )
                   - 2.0 * ( U + X + Y )
                   ) * ( B + C + _F )
                ;
              energy += 2.0 * value / denominator * facjk * facij;
            } // i
          } // j
        } // k
      } // ii
    } // jj
  } // kk
  return std::real(energy);
}


template <typename F=double>
double get_energy_same
  ( F const epsabc
  , size_t const No
  , F* const epsi
  , F* const Tijk
  , F* const Zijk
  ) {
  constexpr size_t block_size = 16;
  F energy = F(0.);
  for (size_t kk=0; kk<No; kk+=block_size){
    const size_t kend( std::min( kk+block_size, No) );
    for (size_t jj(kk); jj<No; jj+=block_size){
      const size_t jend( std::min( jj+block_size, No) );
      for (size_t ii(jj); ii<No; ii+=block_size){
        const size_t iend( std::min( ii+block_size, No) );
        for (size_t k(kk); k < kend; k++){
          const F ek(epsi[k]);
          const size_t jstart = jj > k ? jj : k;
          for(size_t j(jstart); j < jend; j++){
            const F facjk( j == k ? F(0.5) : F(1.0));
            const F ej(epsi[j]);
            const size_t istart = ii > j ? ii : j;
            for(size_t i(istart); i < iend; i++){
              const F
                ei(epsi[i])
              , facij ( i==j ? F(0.5) : F(1.0))
              , denominator(epsabc - ei - ej - ek)
              , U(Zijk[i + No*j + No*No*k])
              , V(Zijk[j + No*k + No*No*i])
              , W(Zijk[k + No*i + No*No*j])
              , A(maybe_conjugate<F>(Tijk[i + No*j + No*No*k]))
              , B(maybe_conjugate<F>(Tijk[j + No*k + No*No*i]))
              , C(maybe_conjugate<F>(Tijk[k + No*i + No*No*j]))
              , value
                = F(3.0) * ( A * U
                           + B * V
                           + C * W
                           )
                - ( A + B + C ) * ( U + V + W )
              ;
              energy += F(2.0) * value / denominator * facjk * facij;
            } // i
          } // j
        } // k
      } // ii
    } // jj
  } // kk
  return std::real(energy);
}

And we explicitly instantiate these templated functions:

// instantiate double
template
double get_energy_distinct
  ( double const epsabc
  , size_t const No
  , double* const epsi
  , double* const Tijk
  , double* const Zijk
  );

template
double get_energy_same
  ( double const epsabc
  , size_t const No
  , double* const epsi
  , double* const Tijk
  , double* const Zijk
  );

// instantiate Complex
template
double get_energy_distinct
  ( Complex const epsabc
  , size_t const No
  , Complex* const epsi
  , Complex* const Tijk
  , Complex* const Zijk
  );

template
double get_energy_same
  ( Complex const epsabc
  , size_t const No
  , Complex* const epsi
  , Complex* const Tijk
  , Complex* const Zijk
  );
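
How these kernels are meant to be driven can be sketched as follows (hedged; it uses the names _epsi, _epsa, Tijk, Zijk and energy of the run routine further below, and double precision for definiteness):

// sketch: accumulating the energy of one (a,b,c) tuple
const size_t a = abc[0], b = abc[1], c = abc[2];
const double epsabc = _epsa[a] + _epsa[b] + _epsa[c];
// tuples with a repeated virtual index need the 'same' kernel
const bool distinct = (a != b) && (b != c);
energy += distinct
        ? get_energy_distinct<double>(epsabc, No, _epsi.data(),
                                      Tijk.data(), Zijk.data())
        : get_energy_same<double>(epsabc, No, _epsi.data(),
                                  Tijk.data(), Zijk.data());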

Contractions

Singles contribution

The first contraction we discuss is the one involving \( t_i^a \) singles amplitudes.

For every \( (a,b,c) \) we construct the object

\begin{align*}
Z_{ijk} &= t^a_i V^{bc}_{jk} \\
        &+ t^b_j V^{ac}_{ik} \\
        &+ t^c_k V^{ab}_{ij}
\end{align*}

and for it we will need the corresponding slices of the \( Vpphh \) integrals, i.e., AB, AC and BC.

  template <typename F=double>
  void singles_contribution
    ( size_t No
    , size_t Nv
    , const ABCTuple &abc
    , F* const Tph
    , F* const VABij
    , F* const VACij
    , F* const VBCij
    , F* Zijk
    ) {
    const size_t a(abc[0]), b(abc[1]), c(abc[2]);
    const size_t NoNo = No*No;
    // TODO: change order of for loops
    for (size_t k = 0; k < No; k++)
    for (size_t i = 0; i < No; i++)
    for (size_t j = 0; j < No; j++) {
      const size_t ijk = i + j*No + k*NoNo;
      Zijk[ijk] += Tph[ a + i * Nv ] * VBCij[ j + k * No ];
      Zijk[ijk] += Tph[ b + j * Nv ] * VACij[ i + k * No ];
      Zijk[ijk] += Tph[ c + k * Nv ] * VABij[ i + j * No ];
    }
  }

// instantiate
  template void singles_contribution( size_t No
                                   , size_t Nv
                                   , const ABCTuple &abc
                                   , double* const Tph
                                   , double* const VABij
                                   , double* const VACij
                                   , double* const VBCij
                                   , double* Zijk
                                   );

  template void singles_contribution( size_t No
                                   , size_t Nv
                                   , const ABCTuple &abc
                                   , Complex* const Tph
                                   , Complex* const VABij
                                   , Complex* const VACij
                                   , Complex* const VBCij
                                   , Complex* Zijk
                                   );

Doubles contribution

Next we build the tensor \( Tijk \) from the contraction of the Coulomb integrals with the doubles amplitudes \( tabij \). There are two kinds of contractions, the hole and the particle contractions, which differ in computational cost: per tuple the former take \( \mathcal{O}(N_\mathrm{o}^4) \) and the latter \( \mathcal{O}(N_\mathrm{o}^3 N_\mathrm{v}) \) operations.

\begin{align*}
T_{ijk} = \sum_{L=0}^{N_\mathrm{o}} \Big(
  &- T^{ab}_{Lj} V^{Lc}_{ik} - T^{ab}_{iL} V^{Lc}_{jk} \\
  &- T^{ac}_{Lk} V^{Lb}_{ij} - T^{ac}_{iL} V^{Lb}_{kj} \\
  &- T^{bc}_{Lk} V^{La}_{ji} - T^{bc}_{jL} V^{La}_{ki} \Big)
\end{align*}

\begin{align*}
T_{ijk} \mathrel{+}= \sum_{e=0}^{N_\mathrm{v}} \Big(
  &\phantom{+} T^{ae}_{ij} V^{bc}_{ek} + T^{ae}_{ik} V^{cb}_{ej} \\
  &+ T^{ce}_{ki} V^{ab}_{ej} + T^{ce}_{kj} V^{ba}_{ei} \\
  &+ T^{be}_{ji} V^{ac}_{ek} + T^{be}_{jk} V^{ca}_{ei} \Big)
\end{align*}

#if defined(HAVE_CUDA)
  __device__
#endif
  template <typename F=double>
  void doubles_contribution
    ( const ABCTuple &abc
    , size_t const No
    , size_t const Nv
    // -- VABCI
    , DataPtr<F> const VABph
    , DataPtr<F> const VACph
    , DataPtr<F> const VBCph
    , DataPtr<F> const VBAph
    , DataPtr<F> const VCAph
    , DataPtr<F> const VCBph
    // -- VHHHA
    , DataPtr<F> const VhhhA
    , DataPtr<F> const VhhhB
    , DataPtr<F> const VhhhC
    // -- TA
    , DataPtr<F> const TAphh
    , DataPtr<F> const TBphh
    , DataPtr<F> const TCphh
    // -- TABIJ
    , DataPtr<F> const TABhh
    , DataPtr<F> const TAChh
    , DataPtr<F> const TBChh
    // -- TIJK
    // , DataPtr<F> Tijk
    , DataFieldType<F>* Tijk_
    );
#if defined(HAVE_CUDA)
  __device__
#endif
  template <typename F=double>
  void doubles_contribution
    ( const ABCTuple &abc
    , size_t const No
    , size_t const Nv
    // -- VABCI
    , DataPtr<F> const VABph
    , DataPtr<F> const VACph
    , DataPtr<F> const VBCph
    , DataPtr<F> const VBAph
    , DataPtr<F> const VCAph
    , DataPtr<F> const VCBph
    // -- VHHHA
    , DataPtr<F> const VhhhA
    , DataPtr<F> const VhhhB
    , DataPtr<F> const VhhhC
    // -- TA
    , DataPtr<F> const TAphh
    , DataPtr<F> const TBphh
    , DataPtr<F> const TCphh
    // -- TABIJ
    , DataPtr<F> const TABhh
    , DataPtr<F> const TAChh
    , DataPtr<F> const TBChh
    // -- TIJK
    // , DataPtr<F> Tijk_
    , DataFieldType<F>* Tijk_
    ) {

    const size_t a = abc[0], b = abc[1], c = abc[2]
              , NoNo = No*No, NoNv = No*Nv
              ;

    typename DataField<F>::type* Tijk = (typename DataField<F>::type*) Tijk_;

#if defined(ATRIP_USE_DGEMM)
#define _IJK_(i, j, k) i + j*No + k*NoNo
#define REORDER(__II, __JJ, __KK)                                   \
    WITH_CHRONO("doubles:reorder",                                  \
    for (size_t k = 0; k < No; k++)                                 \
    for (size_t j = 0; j < No; j++)                                 \
    for (size_t i = 0; i < No; i++) {                               \
      Tijk[_IJK_(i, j, k)] += _t_buffer_p[_IJK_(__II, __JJ, __KK)];   \
    }                                                               \
    )
#if defined(HAVE_CUDA)
#define __TO_DEVICEPTR(_v)                      \
    ((DataFieldType<F>*)                        \
     (CUdeviceptr)                              \
     thrust::raw_pointer_cast((_v)))
#define DGEMM_PARTICLES(__A, __B)                   \
  atrip::xgemm<F>("T",                              \
                  "N",                              \
                  (int const*)&NoNo,                \
                  (int const*)&No,                  \
                  (int const*)&Nv,                  \
                  &one,                             \
                  (DataFieldType<F>*)__A,           \
                  (int const*)&Nv,                  \
                  (DataFieldType<F>*)__B,           \
                  (int const*)&Nv,                  \
                  &zero,                            \
                  _t_buffer_p,                      \
                  (int const*)&NoNo);
#define DGEMM_HOLES(__A, __B, __TRANSB)             \
  atrip::xgemm<F>("N",                              \
                  __TRANSB,                         \
                  (int const*)&NoNo,                \
                  (int const*)&No,                  \
                  (int const*)&No,                  \
                  &m_one,                           \
                  __TO_DEVICEPTR(__A),              \
                  (int const*)&NoNo,                \
                  (DataFieldType<F>*)__B,           \
                  (int const*)&No,                  \
                  &zero,                            \
                  _t_buffer_p,                      \
                  (int const*)&NoNo                 \
                  );
#define MAYBE_CONJ(_conj, _buffer)                                      \
  for (size_t __i = 0; __i < NoNoNo; ++__i)                             \
    __TO_DEVICEPTR(_conj.data())[__i] =                                 \
      maybe_conjugate<DataFieldType<F>>(((DataFieldType<F>*)(_buffer))[__i]);
#else
#define __TO_DEVICEPTR(_v) (_v)
#define DGEMM_PARTICLES(__A, __B)               \
  atrip::xgemm<F>("T",                          \
                  "N",                          \
                  (int const*)&NoNo,            \
                  (int const*)&No,              \
                  (int const*)&Nv,              \
                  &one,                         \
                  __A,                          \
                  (int const*)&Nv,              \
                  __B,                          \
                  (int const*)&Nv,              \
                  &zero,                        \
                  _t_buffer_p,                  \
                  (int const*)&NoNo             \
                  );
#define DGEMM_HOLES(__A, __B, __TRANSB)         \
  atrip::xgemm<F>("N",                          \
                  __TRANSB,                     \
                  (int const*)&NoNo,            \
                  (int const*)&No,              \
                  (int const*)&No,              \
                  &m_one,                       \
                  __A,                          \
                  (int const*)&NoNo,            \
                  __B,                          \
                  (int const*)&No,              \
                  &zero,                        \
                  _t_buffer_p,                  \
                  (int const*)&NoNo             \
                  );
#define MAYBE_CONJ(_conj, _buffer)                \
  for (size_t __i = 0; __i < NoNoNo; ++__i)       \
    _conj[__i] = maybe_conjugate<F>(_buffer[__i]);
#endif

    const size_t NoNoNo = No*NoNo;
#ifdef HAVE_CUDA
    thrust::device_vector< DataFieldType<F> > _t_buffer;
#else
    std::vector<F> _t_buffer;
#endif
    // resize (not reserve): the gemm calls write through the raw
    // pointer below, so the buffer needs a well-defined size
    _t_buffer.resize(NoNoNo);
    DataFieldType<F>* _t_buffer_p = __TO_DEVICEPTR(_t_buffer.data());
    F one{1.0}, m_one{-1.0}, zero{0.0};

    WITH_CHRONO("double:reorder",
      for (size_t k = 0; k < NoNoNo; k++) {
        Tijk[k] = DataFieldType<F>{0.0};
       })

    // TOMERGE: replace chronos
    WITH_CHRONO("doubles:holes",
      { // Holes part %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

#ifdef HAVE_CUDA
        // construct with a size (not just a capacity), since
        // MAYBE_CONJ writes through a raw pointer into this buffer
        thrust::device_vector< DataFieldType<F> > _vhhh(NoNoNo);
#else
        std::vector<F> _vhhh(NoNoNo);
#endif

        // VhhhC[i + k*No + L*NoNo] * TABhh[L + j*No]; H1
        MAYBE_CONJ(_vhhh, VhhhC)
        WITH_CHRONO("doubles:holes:1",
          DGEMM_HOLES(_vhhh.data(), TABhh, "N")
          REORDER(i, k, j)
        )
        // VhhhC[j + k*No + L*NoNo] * TABhh[i + L*No]; H0
        WITH_CHRONO("doubles:holes:2",
          DGEMM_HOLES(_vhhh.data(), TABhh, "T")
          REORDER(j, k, i)
        )

        // VhhhB[i + j*No + L*NoNo] * TAChh[L + k*No]; H5
        MAYBE_CONJ(_vhhh, VhhhB)
        WITH_CHRONO("doubles:holes:3",
          DGEMM_HOLES(_vhhh.data(), TAChh, "N")
          REORDER(i, j, k)
        )
        // VhhhB[k + j*No + L*NoNo] * TAChh[i + L*No]; H3
        WITH_CHRONO("doubles:holes:4",
          DGEMM_HOLES(_vhhh.data(), TAChh, "T")
          REORDER(k, j, i)
        )

        // VhhhA[j + i*No + L*NoNo] * TBChh[L + k*No]; H1
        MAYBE_CONJ(_vhhh, VhhhA)
        WITH_CHRONO("doubles:holes:5",
          DGEMM_HOLES(_vhhh.data(), TBChh, "N")
          REORDER(j, i, k)
        )
        // VhhhA[k + i*No + L*NoNo] * TBChh[j + L*No]; H4
        WITH_CHRONO("doubles:holes:6",
          DGEMM_HOLES(_vhhh.data(), TBChh, "T")
          REORDER(k, i, j)
        )

      }
    )
  #undef MAYBE_CONJ

    WITH_CHRONO("doubles:particles",
      { // Particle part %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        // TAphh[E + i*Nv + j*NoNv] * VBCph[E + k*Nv]; P0
        WITH_CHRONO("doubles:particles:1",
          DGEMM_PARTICLES(TAphh, VBCph)
          REORDER(i, j, k)
        )
        // TAphh[E + i*Nv + k*NoNv] * VCBph[E + j*Nv]; P3
        WITH_CHRONO("doubles:particles:2",
          DGEMM_PARTICLES(TAphh, VCBph)
          REORDER(i, k, j)
        )
        // TCphh[E + k*Nv + i*NoNv] * VABph[E + j*Nv]; P5
        WITH_CHRONO("doubles:particles:3",
          DGEMM_PARTICLES(TCphh, VABph)
          REORDER(k, i, j)
        )
        // TCphh[E + k*Nv + j*NoNv] * VBAph[E + i*Nv]; P2
        WITH_CHRONO("doubles:particles:4",
          DGEMM_PARTICLES(TCphh, VBAph)
          REORDER(k, j, i)
        )
        // TBphh[E + j*Nv + i*NoNv] * VACph[E + k*Nv]; P1
        WITH_CHRONO("doubles:particles:5",
          DGEMM_PARTICLES(TBphh, VACph)
          REORDER(j, i, k)
        )
        // TBphh[E + j*Nv + k*NoNv] * VCAph[E + i*Nv]; P4
        WITH_CHRONO("doubles:particles:6",
          DGEMM_PARTICLES(TBphh, VCAph)
          REORDER(j, k, i)
        )
      }
    )

  #undef REORDER
  #undef DGEMM_HOLES
  #undef DGEMM_PARTICLES
  #undef _IJK_
  #else
    for (size_t k = 0; k < No; k++)
    for (size_t j = 0; j < No; j++)
    for (size_t i = 0; i < No; i++){
      const size_t ijk = i + j*No + k*NoNo
                ,  jk = j + k*No
                ;
      Tijk[ijk] = 0.0; // :important
      // HOLE DIAGRAMS: TABHH and VHHHA
      for (size_t L = 0; L < No; L++){
        // t[abLj] * V[Lcik]        H1
        // t[baLi] * V[Lcjk]        H0      TODO: conjugate T for complex
        Tijk[ijk] -= TABhh[L + j*No] * VhhhC[i + k*No + L*NoNo];
        Tijk[ijk] -= TABhh[i + L*No] * VhhhC[j + k*No + L*NoNo];

        // t[acLk] * V[Lbij]        H5
        // t[caLi] * V[Lbkj]        H3
        Tijk[ijk] -= TAChh[L + k*No] * VhhhB[i + j*No + L*NoNo];
        Tijk[ijk] -= TAChh[i + L*No] * VhhhB[k + j*No + L*NoNo];

        // t[bcLk] * V[Laji]        H2
        // t[cbLj] * V[Laki]        H4
        Tijk[ijk] -= TBChh[L + k*No] * VhhhA[j + i*No + L*NoNo];
        Tijk[ijk] -= TBChh[j + L*No] * VhhhA[k + i*No + L*NoNo];
      }
      // PARTICLE DIAGRAMS: TAPHH and VABPH
      for (size_t E = 0; E < Nv; E++) {
        // t[aEij] * V[bcEk]        P0
        // t[aEik] * V[cbEj]        P3 // TODO: CHECK THIS ONE, I DONT KNOW
        Tijk[ijk] += TAphh[E + i*Nv + j*NoNv] * VBCph[E + k*Nv];
        Tijk[ijk] += TAphh[E + i*Nv + k*NoNv] * VCBph[E + j*Nv];

        // t[cEki] * V[abEj]        P5
        // t[cEkj] * V[baEi]        P2
        Tijk[ijk] += TCphh[E + k*Nv + i*NoNv] * VABph[E + j*Nv];
        Tijk[ijk] += TCphh[E + k*Nv + j*NoNv] * VBAph[E + i*Nv];

        // t[bEji] * V[acEk]        P1
        // t[bEjk] * V[caEi]        P4
        Tijk[ijk] += TBphh[E + j*Nv + i*NoNv] * VACph[E + k*Nv];
        Tijk[ijk] += TBphh[E + j*Nv + k*NoNv] * VCAph[E + i*Nv];
      }

    }
#endif
  }



  // instantiate templates
#if defined(HAVE_CUDA)
  __device__
#endif
  template
  void doubles_contribution<double>
    ( const ABCTuple &abc
    , size_t const No
    , size_t const Nv
    // -- VABCI
    , DataPtr<double> const VABph
    , DataPtr<double> const VACph
    , DataPtr<double> const VBCph
    , DataPtr<double> const VBAph
    , DataPtr<double> const VCAph
    , DataPtr<double> const VCBph
    // -- VHHHA
    , DataPtr<double> const VhhhA
    , DataPtr<double> const VhhhB
    , DataPtr<double> const VhhhC
    // -- TA
    , DataPtr<double> const TAphh
    , DataPtr<double> const TBphh
    , DataPtr<double> const TCphh
    // -- TABIJ
    , DataPtr<double> const TABhh
    , DataPtr<double> const TAChh
    , DataPtr<double> const TBChh
    // -- TIJK
    , DataFieldType<double>* Tijk
    );

#if defined(HAVE_CUDA)
  __device__
#endif
  template
  void doubles_contribution<Complex>
    ( const ABCTuple &abc
    , size_t const No
    , size_t const Nv
    // -- VABCI
    , DataPtr<Complex> const VABph
    , DataPtr<Complex> const VACph
    , DataPtr<Complex> const VBCph
    , DataPtr<Complex> const VBAph
    , DataPtr<Complex> const VCAph
    , DataPtr<Complex> const VCBph
    // -- VHHHA
    , DataPtr<Complex> const VhhhA
    , DataPtr<Complex> const VhhhB
    , DataPtr<Complex> const VhhhC
    // -- TA
    , DataPtr<Complex> const TAphh
    , DataPtr<Complex> const TBphh
    , DataPtr<Complex> const TCphh
    // -- TABIJ
    , DataPtr<Complex> const TABhh
    , DataPtr<Complex> const TAChh
    , DataPtr<Complex> const TBChh
    // -- TIJK
    , DataFieldType<Complex>* Tijk
    );

Epilog

}
}

Blas

The matrix-matrix multiplications in this algorithm go through the BLAS GEMM functions, which we declare as extern since they are resolved only at link time.
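
For the real case this is the standard Fortran BLAS binding; a sketch of the extern declaration assumed to live in atrip/Blas.hpp:

extern "C"
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            double *alpha,
            const double *a, const int *lda,
            const double *b, const int *ldb,
            double *beta,
            double *c, const int *ldc);

zgemm_ is declared analogously for the complex case.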

#include <atrip/Blas.hpp>
#include <atrip/Atrip.hpp>

#if defined(HAVE_CUDA)
#  include <cstring>

  static inline
  cublasOperation_t char_to_cublasOperation(const char* trans) {
    if (strncmp("N", trans, 1) == 0)
      return CUBLAS_OP_N;
    else if (strncmp("T", trans, 1) == 0)
      return CUBLAS_OP_T;
    else
      return CUBLAS_OP_C;
  }

#endif

namespace atrip {


  template <>
  void xgemm<double>(const char *transa,
             const char *transb,
             const int *m,
             const int *n,
             const int *k,
             double *alpha,
             const typename DataField<double>::type *A,
             const int *lda,
             const typename DataField<double>::type *B,
             const int *ldb,
             double *beta,
             typename DataField<double>::type *C,
             const int *ldc) {
#if defined(HAVE_CUDA)
    cublasDgemm(Atrip::cuda.handle,
                char_to_cublasOperation(transa),
                char_to_cublasOperation(transb),
                *m, *n, *k,
                alpha, A, *lda,
                B, *ldb, beta,
                C, *ldc);
#else
    dgemm_(transa, transb,
           m, n, k,
           alpha, A, lda,
           B, ldb, beta,
           C, ldc);
#endif
  }

  template <>
  void xgemm<Complex>(const char *transa,
             const char *transb,
             const int *m,
             const int *n,
             const int *k,
             Complex *alpha,
             const typename DataField<Complex>::type *A,
             const int *lda,
             const typename DataField<Complex>::type *B,
             const int *ldb,
             Complex *beta,
             typename DataField<Complex>::type *C,
             const int *ldc) {
#if defined(HAVE_CUDA)
#pragma warning HAVE_CUDA
    cuDoubleComplex
      cu_alpha = {std::real(*alpha), std::imag(*alpha)},
      cu_beta = {std::real(*beta), std::imag(*beta)};
    cublasZgemm(Atrip::cuda.handle,
                char_to_cublasOperation(transa),
                char_to_cublasOperation(transb),
                *m, *n, *k,
                &cu_alpha,

                A, *lda,
                B, *ldb,
                &cu_beta,
                C, *ldc);
#else
    zgemm_(transa, transb,
           m, n, k,
           alpha, A, lda,
           B, ldb, beta,
           C, ldc);
#endif
  }

}

Atrip

Header

#pragma once
#include <sstream>
#include <string>
#include <map>
#include <mpi.h>
#include "config.h"

#if defined(HAVE_CUDA)
#include <cuda.h>
#define CUBLASAPI
#include <cublas_api.h>
#include <cublas_v2.h>
#endif

#include <atrip/Utils.hpp>
#include <atrip/Types.hpp>

#define ADD_ATTRIBUTE(_type, _name, _default)   \
  _type _name = _default;                       \
  Input& with_ ## _name(_type i) {              \
    _name = i;                                  \
    return *this;                               \
  }
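
For illustration, ADD_ATTRIBUTE(bool, barrier, false) expands inside Input to a defaulted member together with a fluent setter:

bool barrier = false;
Input& with_barrier(bool i) {
  barrier = i;
  return *this;
}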

namespace atrip {

  struct Atrip {

    static size_t rank;
    static size_t np;
    static MPI_Comm communicator;
    static Timings chrono;
#if defined(HAVE_CUDA)
    struct CudaContext {
      cublasStatus_t status;
      cublasHandle_t handle;
    };
    static CudaContext cuda;
#endif

    static void init(MPI_Comm);


    template <typename F=double>
    struct Input {
      CTF::Tensor<F> *ei = nullptr
                   , *ea = nullptr
                   , *Tph = nullptr
                   , *Tpphh = nullptr
                   , *Vpphh = nullptr
                   , *Vhhhp = nullptr
                   , *Vppph = nullptr
                   ;
      Input& with_epsilon_i(CTF::Tensor<F> * t) { ei = t; return *this; }
      Input& with_epsilon_a(CTF::Tensor<F> * t) { ea = t; return *this; }
      Input& with_Tai(CTF::Tensor<F> * t) { Tph = t; return *this; }
      Input& with_Tabij(CTF::Tensor<F> * t) { Tpphh = t; return *this; }
      Input& with_Vabij(CTF::Tensor<F> * t) { Vpphh = t; return *this; }
      Input& with_Vijka(CTF::Tensor<F> * t) { Vhhhp = t; return *this; }
      Input& with_Vabci(CTF::Tensor<F> * t) { Vppph = t; return *this; }

      enum TuplesDistribution {
        NAIVE,
        GROUP_AND_SORT,
      };

      ADD_ATTRIBUTE(bool, delete_Vppph, false)
      ADD_ATTRIBUTE(bool, rank_round_robin, false)
      ADD_ATTRIBUTE(bool, chrono, false)
      ADD_ATTRIBUTE(bool, barrier, false)
      ADD_ATTRIBUTE(int, max_iterations, 0)
      ADD_ATTRIBUTE(int, iteration_mod, -1)
      ADD_ATTRIBUTE(int, percentage_mod, -1)
      ADD_ATTRIBUTE(TuplesDistribution, tuples_distribution, NAIVE)
      ADD_ATTRIBUTE(std::string, checkpoint_path, "atrip-checkpoint.yaml")
      ADD_ATTRIBUTE(bool, read_checkpoint_if_exists, true)
      ADD_ATTRIBUTE(bool, writeCheckpoint, true)
      ADD_ATTRIBUTE(float, checkpoint_at_percentage, 10)
      ADD_ATTRIBUTE(size_t, checkpoint_at_every_iteration, 0)

    };

    struct Output {
      double energy;
    };
    template <typename F=double>
    static Output run(Input<F> const& in);
  };

}

#undef ADD_ATTRIBUTE
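
Putting the header together, a driver is expected to look roughly like this (a sketch; the CTF tensors ei, ea, tai, tabij, vabij, vijka and vabci are assumed to be set up by the host code):

// hypothetical driver code
Atrip::init(MPI_COMM_WORLD);
auto const in = Atrip::Input<double>()
    .with_epsilon_i(&ei)
    .with_epsilon_a(&ea)
    .with_Tai(&tai)
    .with_Tabij(&tabij)
    .with_Vabij(&vabij)
    .with_Vijka(&vijka)
    .with_Vabci(&vabci)
    .with_iteration_mod(100);
auto const out = Atrip::run<double>(in);
LOG(0, "Atrip") << "(T) energy: " << out.energy << "\n";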

Main

#include <iomanip>

#include <atrip/Atrip.hpp>
#include <atrip/Utils.hpp>
#include <atrip/Equations.hpp>
#include <atrip/SliceUnion.hpp>
#include <atrip/Unions.hpp>
#include <atrip/Checkpoint.hpp>

using namespace atrip;

template <typename F> bool RankMap<F>::RANK_ROUND_ROBIN;
template bool RankMap<double>::RANK_ROUND_ROBIN;
template bool RankMap<Complex>::RANK_ROUND_ROBIN;
size_t Atrip::rank;
size_t Atrip::np;
#if defined(HAVE_CUDA)
typename Atrip::CudaContext Atrip::cuda;
#endif
MPI_Comm Atrip::communicator;
Timings Atrip::chrono;

// user printing block
IterationDescriptor IterationDescription::descriptor;
void atrip::register_iteration_descriptor(IterationDescriptor d) {
  IterationDescription::descriptor = d;
}

void Atrip::init(MPI_Comm world) {
  Atrip::communicator = world;
  // MPI talks in int while Atrip stores size_t; convert explicitly
  // instead of casting the pointers, which would only write half of
  // the destination word
  int rank, np;
  MPI_Comm_rank(world, &rank);
  MPI_Comm_size(world, &np);
  Atrip::rank = (size_t)rank;
  Atrip::np = (size_t)np;

#if defined(HAVE_CUDA)
  Atrip::cuda.status = cublasCreate(&Atrip::cuda.handle);
#endif

}

template <typename F>
Atrip::Output Atrip::run(Atrip::Input<F> const& in) {

  const size_t np = Atrip::np;
  const size_t rank = Atrip::rank;
  MPI_Comm universe = Atrip::communicator;

  const size_t No = in.ei->lens[0];
  const size_t Nv = in.ea->lens[0];
  LOG(0,"Atrip") << "No: " << No << "\n";
  LOG(0,"Atrip") << "Nv: " << Nv << "\n";
  LOG(0,"Atrip") << "np: " << np << "\n";

  // allocate the three scratches, see piecuch
  // we need local copies of the following tensors on every
  // rank
  std::vector<F> _epsi(No)
               , _epsa(Nv)
               , _Tai(No * Nv)
               ;

  in.ei->read_all(_epsi.data());
  in.ea->read_all(_epsa.data());
  in.Tph->read_all(_Tai.data());

#if defined(HAVE_CUDA)
  DataPtr<F> Tai, epsi, epsa;
  cuMemAlloc(&Tai, sizeof(F) * _Tai.size());
  cuMemAlloc(&epsi, sizeof(F) * _epsi.size());
  cuMemAlloc(&epsa, sizeof(F) * _epsa.size());

  cuMemcpyHtoD(Tai, (void*)_Tai.data(), sizeof(F) * _Tai.size());
  cuMemcpyHtoD(epsi,(void*)_epsi.data(), sizeof(F) * _epsi.size());
  cuMemcpyHtoD(epsa, (void*)_epsa.data(), sizeof(F) * _epsa.size());

  DataPtr<F> Tijk, Zijk;
  cuMemAlloc(&Tijk, sizeof(F) * No * No * No);
  cuMemAlloc(&Zijk, sizeof(F) * No * No * No);
#else
  std::vector<F> &Tai = _Tai, &epsi = _epsi, &epsa = _epsa;
  std::vector<F> Tijk(No*No*No), Zijk(No*No*No);
#endif

  RankMap<F>::RANK_ROUND_ROBIN = in.rank_round_robin;
  if (RankMap<F>::RANK_ROUND_ROBIN) {
    LOG(0,"Atrip") << "Doing rank round robin slices distribution\n";
  } else {
    LOG(0,"Atrip")
      << "Doing node > local rank round robin slices distribution\n";
  }


  // COMMUNICATOR CONSTRUCTION ========================================={{{1
  //
  // Construct a new communicator living only on a single rank
  int child_size = 1
    , child_rank
    ;
  const
  int color = rank / child_size
    , crank = rank % child_size
    ;
  MPI_Comm child_comm;
  if (np == 1) {
    child_comm = universe;
  } else {
    MPI_Comm_split(universe, color, crank, &child_comm);
    MPI_Comm_rank(child_comm, &child_rank);
    MPI_Comm_size(child_comm, &child_size);
  }

  // BUILD SLICES PARAMETRIZED BY NV x NV =============================={{{1
  WITH_CHRONO("nv-nv-slices",
    LOG(0,"Atrip") << "BUILD NV x NV-SLICES\n";
    ABPH<F> abph(*in.Vppph, (size_t)No, (size_t)Nv, (size_t)np, child_comm, universe);
    ABHH<F> abhh(*in.Vpphh, (size_t)No, (size_t)Nv, (size_t)np, child_comm, universe);
    TABHH<F> tabhh(*in.Tpphh, (size_t)No, (size_t)Nv, (size_t)np, child_comm, universe);
  )

  // delete the Vppph so that we don't have a HWM situation for the NV slices
  if (in.delete_Vppph) {
    delete in.Vppph;
  }

  // BUILD SLICES PARAMETRIZED BY NV ==================================={{{1
  WITH_CHRONO("nv-slices",
    LOG(0,"Atrip") << "BUILD NV-SLICES\n";
    TAPHH<F> taphh(*in.Tpphh, (size_t)No, (size_t)Nv, (size_t)np, child_comm, universe);
    HHHA<F>  hhha(*in.Vhhhp, (size_t)No, (size_t)Nv, (size_t)np, child_comm, universe);
  )

  // all tensors
  std::vector< SliceUnion<F>* > unions = {&taphh, &hhha, &abph, &abhh, &tabhh};

  // get tuples for the current rank
  TuplesDistribution *distribution;

  if (in.tuples_distribution == Atrip::Input<F>::TuplesDistribution::NAIVE) {
    LOG(0,"Atrip") << "Using the naive distribution\n";
    distribution = new NaiveDistribution();
  } else {
    LOG(0,"Atrip") << "Using the group-and-sort distribution\n";
    distribution = new group_and_sort::Distribution();
  }

  LOG(0,"Atrip") << "BUILDING TUPLE LIST\n";
  WITH_CHRONO("tuples:build",
    auto const tuples_list = distribution->get_tuples(Nv, universe);
    )
  const size_t n_iterations = tuples_list.size();
  {
    LOG(0,"Atrip") << "#iterations: "
                  << n_iterations
                  << "/"
                  << n_iterations * np
                  << "\n";
  }

  const size_t
      // the max-guard avoids a modulo by zero below; the explicit
      // size_t cast avoids converting the negative default through
      // a double
      iteration_mod = (in.percentage_mod > 0)
                  ? std::max<size_t>(1, n_iterations * in.percentage_mod / 100)
                  : (size_t)in.iteration_mod

    , iteration1Percent = n_iterations * 0.01
    ;



  auto const is_fake_tuple
    = [&tuples_list, distribution](size_t const i) {
      return distribution->tuple_is_fake(tuples_list[i]);
    };


  using Database = typename Slice<F>::Database;
  auto communicate_database
    = [ &unions
      , np
      ] (ABCTuple const& abc, MPI_Comm const& c) -> Database {

        WITH_CHRONO("db:comm:type:do",
          auto MPI_LDB_ELEMENT = Slice<F>::mpi::local_database_element();
        )

        WITH_CHRONO("db:comm:ldb",
          typename Slice<F>::LocalDatabase ldb;
          for (auto const& tensor: unions) {
            auto const& tensor_db = tensor->build_local_database(abc);
            ldb.insert(ldb.end(), tensor_db.begin(), tensor_db.end());
          }
        )

        Database db(np * ldb.size(), ldb[0]);

        WITH_CHRONO("oneshot-db:comm:allgather",
        WITH_CHRONO("db:comm:allgather",
          MPI_Allgather( ldb.data()
                       , ldb.size()
                       , MPI_LDB_ELEMENT
                       , db.data()
                       , ldb.size()
                       , MPI_LDB_ELEMENT
                       , c);
        ))

        WITH_CHRONO("db:comm:type:free",
          MPI_Type_free(&MPI_LDB_ELEMENT);
        )

        return db;
      };

  auto do_io_phase
    = [&unions, &rank, &np, &universe] (Database const& db) {

    const size_t localDBLength = db.size() / np;

    size_t send_tag = 0
         , recv_tag = rank * localDBLength
         ;

    // RECEIVE PHASE ======================================================
    {
      // At this point, we have already sent to everyone that fits
      auto const& begin = &db[rank * localDBLength]
                , end   = begin + localDBLength
                ;
      for (auto it = begin; it != end; ++it) {
        recv_tag++;
        auto const& el = *it;
        auto& u = union_by_name(unions, el.name);

        WITH_DBG std::cout
          << rank << ":r"
          << "♯" << recv_tag << " =>"
          << " «n" << el.name
          << ", t" << el.info.type
          << ", s" << el.info.state
          << "»"
          << " ⊙ {" << rank << "⇐" << el.info.from.rank
                    << ", "
                    << el.info.from.source << "}"
          << " ∴ {" << el.info.tuple[0]
                    << ", "
                    << el.info.tuple[1]
                    << "}"
          << "\n"
          ;

        WITH_CHRONO("db:io:recv",
          u.receive(el.info, recv_tag);
        )

      } // recv
    }

    // SEND PHASE =========================================================
    for (size_t other_rank = 0; other_rank<np; other_rank++) {
      auto const& begin = &db[other_rank * localDBLength]
                , end = begin + localDBLength
                ;
      for (auto it = begin; it != end; ++it) {
        send_tag++;
        typename Slice<F>::LocalDatabaseElement const& el = *it;

        if (el.info.from.rank != rank) continue;

        auto& u = union_by_name(unions, el.name);
        WITH_DBG std::cout
          << rank << ":s"
          << "♯" << send_tag << " =>"
          << " «n" << el.name
          << ", t" << el.info.type
          << ", s" << el.info.state
          << "»"
          << " ⊙ {" << el.info.from.rank << "⇒" << other_rank
                    << ", "
                    << el.info.from.source << "}"
          << " ∴ {" << el.info.tuple[0]
                    << ", "
                    << el.info.tuple[1]
                    << "}"
          << "\n"
          ;

        WITH_CHRONO("db:io:send",
          u.send(other_rank, el, send_tag);
        )

      } // send phase

    } // other_rank


  };

#if defined(HAVE_OCD) || defined(ATRIP_PRINT_TUPLES)
  std::map<ABCTuple, double> tuple_energies;
#endif

  // GF (10^9 flops) of the doubles part of one iteration: six hole
  // gemms (2 No^3 No flops each) plus six particle gemms
  // (2 No^3 Nv flops each), doubled once more for complex arithmetic
  const double doubles_flops
    = double(No)
    * double(No)
    * double(No)
    * (double(No) + double(Nv))
    * 2.0
    * (traits::is_complex<F>() ? 2.0 : 1.0)
    * 6.0
    / 1e9
    ;

  // START MAIN LOOP ======================================================{{{1

  double energy(0.);
  size_t first_iteration = 0;
  Checkpoint c;
  const size_t checkpoint_mod
    = in.checkpoint_at_every_iteration != 0
    ? in.checkpoint_at_every_iteration
    // at least 1, to avoid a modulo by zero for very few iterations
    : std::max<size_t>(1, n_iterations * in.checkpoint_at_percentage / 100);
  if (in.read_checkpoint_if_exists) {
    std::ifstream fin(in.checkpoint_path);
    if (fin.is_open()) {
      LOG(0, "Atrip") <<  "Reading checkpoint from "
                      << in.checkpoint_path << "\n";
      c = read_checkpoint(fin);
      first_iteration = (size_t)c.iteration;
      if (first_iteration > n_iterations) {
        // TODO: throw an error here
        // first_iteration is bigger than n_iterations,
        // you probably started the program with a different number
        // of cores
      }
      if (No != c.no) {/* TODO: write warning */}
      if (Nv != c.nv) {/* TODO: write warning */}
      // TODO write warnings for nrank and so on
      if (Atrip::rank == 0) {
        // take the negative of the energy to correct for the
        // negativity of the equations, the energy in the checkpoint
        // should always be the correct physical one.
        energy = - (double)c.energy;
      }
      LOG(0, "Atrip") << "energy from checkpoint "
                      << energy << "\n";
      LOG(0, "Atrip") << "iteration from checkpoint "
                      << first_iteration << "\n";
    }
  }

  for ( size_t
          i = first_iteration,
          iteration = first_iteration + 1
      ; i < tuples_list.size()
      ; i++, iteration++
      ) {
    Atrip::chrono["iterations"].start();

    // check overhead from chrono over all iterations
    WITH_CHRONO("start:stop", {})

    // check overhead of doing a barrier at the beginning
    WITH_CHRONO("oneshot-mpi:barrier",
    WITH_CHRONO("mpi:barrier",
      if (in.barrier) MPI_Barrier(universe);
    ))

    // write checkpoints
    if (iteration % checkpoint_mod == 0) {
        double global_energy = 0;
        MPI_Reduce(&energy, &global_energy, 1, MPI_DOUBLE, MPI_SUM, 0, universe);
        Checkpoint out
          = {No,
             Nv,
             0, // TODO
             0, // TODO
             - global_energy,
             iteration - 1,
             in.rank_round_robin};
        LOG(0, "Atrip") << "Writing checkpoint\n";
        if (Atrip::rank == 0) write_checkpoint(out, in.checkpoint_path);
    }

    // write reporting
    if (iteration % iteration_mod == 0 || iteration == iteration1Percent) {

      if (IterationDescription::descriptor) {
        IterationDescription::descriptor({
          iteration,
          n_iterations,
          Atrip::chrono["iterations"].count()
        });
      }

      LOG(0,"Atrip")
        << "iteration " << iteration
        << " [" << 100 * iteration / n_iterations << "%]"
        << " (" << doubles_flops * iteration / Atrip::chrono["doubles"].count()
        << "GF)"
        << " (" << doubles_flops * iteration / Atrip::chrono["iterations"].count()
        << "GF)"
        << "\n";


      // PRINT TIMINGS
      if (in.chrono)
      for (auto const& pair: Atrip::chrono)
        LOG(1, " ") << pair.first << " :: "
                    << pair.second.count()
                    << std::endl;

    }

    const ABCTuple abc = is_fake_tuple(i)
                       ? tuples_list[tuples_list.size() - 1]
                       : tuples_list[i]
                 , *abc_next = i == (tuples_list.size() - 1)
                             ? nullptr
                             : &tuples_list[i + 1]
                 ;

    WITH_CHRONO("with_rank",
      WITH_RANK << " :it " << iteration
                << " :abc " << pretty_print(abc)
                << " :abcN "
                << (abc_next ? pretty_print(*abc_next) : "None")
                << "\n";
    )


    // COMM FIRST DATABASE ================================================{{{1
    if (i == first_iteration) {
      WITH_RANK << "__first__:first database ............ \n";
      const auto db = communicate_database(abc, universe);
      WITH_RANK << "__first__:first database communicated \n";
      WITH_RANK << "__first__:first database io phase \n";
      do_io_phase(db);
      WITH_RANK << "__first__:first database io phase DONE\n";
      WITH_RANK << "__first__::::Unwrapping all slices for first database\n";
      for (auto& u: unions) u->unwrap_all(abc);
      WITH_RANK << "__first__::::Unwrapping slices for first database DONE\n";
      MPI_Barrier(universe);
    }

    // COMM NEXT DATABASE ================================================={{{1
    if (abc_next) {
      WITH_RANK << "__comm__:" << iteration << "th communicating database\n";
      WITH_CHRONO("db:comm",
        const auto db = communicate_database(*abc_next, universe);
      )
      WITH_CHRONO("db:io",
        do_io_phase(db);
      )
      WITH_RANK << "__comm__:" <<  iteration << "th database io phase DONE\n";
    }

    // COMPUTE DOUBLES ===================================================={{{1
    OCD_Barrier(universe);
    if (!is_fake_tuple(i)) {
      WITH_RANK << iteration << "-th doubles\n";
      WITH_CHRONO("oneshot-unwrap",
      WITH_CHRONO("unwrap",
      WITH_CHRONO("unwrap:doubles",
        for (auto& u: decltype(unions){&abph, &hhha, &taphh, &tabhh}) {
          u->unwrap_all(abc);
        }
      )))
      WITH_CHRONO("oneshot-doubles",
      WITH_CHRONO("doubles",
        doubles_contribution<F>( abc, (size_t)No, (size_t)Nv
                              // -- VABCI
                              , abph.unwrap_slice(Slice<F>::AB, abc)
                              , abph.unwrap_slice(Slice<F>::AC, abc)
                              , abph.unwrap_slice(Slice<F>::BC, abc)
                              , abph.unwrap_slice(Slice<F>::BA, abc)
                              , abph.unwrap_slice(Slice<F>::CA, abc)
                              , abph.unwrap_slice(Slice<F>::CB, abc)
                              // -- VHHHA
                              , hhha.unwrap_slice(Slice<F>::A, abc)
                              , hhha.unwrap_slice(Slice<F>::B, abc)
                              , hhha.unwrap_slice(Slice<F>::C, abc)
                              // -- TA
                              , taphh.unwrap_slice(Slice<F>::A, abc)
                              , taphh.unwrap_slice(Slice<F>::B, abc)
                              , taphh.unwrap_slice(Slice<F>::C, abc)
                              // -- TABIJ
                              , tabhh.unwrap_slice(Slice<F>::AB, abc)
                              , tabhh.unwrap_slice(Slice<F>::AC, abc)
                              , tabhh.unwrap_slice(Slice<F>::BC, abc)
                              // -- TIJK
#if defined(HAVE_CUDA)
                              , (DataFieldType<F>*)Tijk
#else
                              , Tijk.data()
#endif
                              );
        WITH_RANK << iteration << "-th doubles done\n";
      ))
    }

    // COMPUTE SINGLES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% {{{1
    OCD_Barrier(universe);
    if (!is_fake_tuple(i)) {
      WITH_CHRONO("oneshot-unwrap",
      WITH_CHRONO("unwrap",
      WITH_CHRONO("unwrap:singles",
        abhh.unwrap_all(abc);
      )))
      WITH_CHRONO("reorder",
        #pragma acc parallel loop
        for (size_t I(0); I < No*No*No; I++)
          ((DataFieldType<F>*)Zijk)[I] = ((DataFieldType<F>*)Tijk)[I];
      )
      WITH_CHRONO("singles",
      singles_contribution<F>( No, Nv, abc
#if defined(HAVE_CUDA)
                            , (F*)Tai
#else
                            , Tai.data()
#endif
                            , (F*)abhh.unwrap_slice(Slice<F>::AB, abc)
                            , (F*)abhh.unwrap_slice(Slice<F>::AC, abc)
                            , (F*)abhh.unwrap_slice(Slice<F>::BC, abc)
#if defined(HAVE_CUDA)
                            , (F*)Zijk);
#else
                            , Zijk.data());
#endif
      )
    }


    // COMPUTE ENERGY %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% {{{1
    if (!is_fake_tuple(i)) {
      double tuple_energy(0.);

      int distinct(0);
      if (abc[0] == abc[1]) distinct++;
      if (abc[1] == abc[2]) distinct--;
      const F epsabc(_epsa[abc[0]] + _epsa[abc[1]] + _epsa[abc[2]]);

      WITH_CHRONO("energy",
        if ( distinct == 0)
          tuple_energy = get_energy_distinct<F>(epsabc, No, (F*)epsi, (F*)Tijk, (F*)Zijk);
        else
          tuple_energy = get_energy_same<F>(epsabc, No, (F*)epsi, (F*)Tijk, (F*)Zijk);
      )

#if defined(HAVE_OCD) || defined(ATRIP_PRINT_TUPLES)
      tuple_energies[abc] = tuple_energy;
#endif

      energy += tuple_energy;

    }

    // TODO: remove this
    if (is_fake_tuple(i)) {
      // fake iterations should also unwrap whatever they got
      WITH_RANK << iteration
                << "th unwrapping because of fake in "
                << i << "\n";
      for (auto& u: unions) u->unwrap_all(abc);
    }

#ifdef HAVE_OCD
    for (auto const& u: unions) {
      WITH_RANK << "__dups__:"
                << iteration
                << "-th n" << u->name << " checking duplicates\n";
      u->check_for_duplicates();
    }
#endif


    // CLEANUP UNIONS %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%{{{1
    OCD_Barrier(universe);
    if (abc_next) {
      WITH_RANK << "__gc__:" << iteration << "-th cleaning up.......\n";
      for (auto& u: unions) {

        u->unwrap_all(abc);
        WITH_RANK << "__gc__:n" << u->name  << " :it " << iteration
                  << " :abc " << pretty_print(abc)
                  << " :abcN " << pretty_print(*abc_next)
                  << "\n";
        // for (auto const& slice: u->slices)
        //   WITH_RANK << "__gc__:guts:" << slice.info << "\n";
        u->clear_unused_slices_for_next_tuple(*abc_next);

        WITH_RANK << "__gc__: checking validity\n";

#ifdef HAVE_OCD
        // check for validity of the slices
        for (auto type: u->slice_types) {
          auto tuple = Slice<F>::subtuple_by_slice(abc, type);
        for (auto& slice: u->slices) {
          if ( slice.info.type == type
             && slice.info.tuple == tuple
             && slice.is_directly_fetchable()
             ) {
            if (slice.info.state == Slice<F>::Dispatched)
              throw std::domain_error( "This slice should not be undispatched! "
                                     + pretty_print(slice.info));
          }
        }
        }
#endif


      }
    }

    WITH_RANK << iteration << "-th cleaning up....... DONE\n";

    Atrip::chrono["iterations"].stop();
    // ITERATION END %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%{{{1

  }
    // END OF MAIN LOOP

  MPI_Barrier(universe);

  // PRINT TUPLES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%{{{1
#if defined(HAVE_OCD) || defined(ATRIP_PRINT_TUPLES)
  LOG(0,"Atrip") << "tuple energies" << "\n";
  for (size_t i = 0; i < np; i++) {
    MPI_Barrier(universe);
    for (auto const& pair: tuple_energies) {
      if (i == rank)
        std::cout << pair.first[0]
                  << " " << pair.first[1]
                  << " " << pair.first[2]
                  << " tuple_energy: "
                  << std::setprecision(15) << std::setw(23)
                  << pair.second
                  << "\n"
                  ;
    }
  }
#endif

  // COMMUNICATE THE ENERGIES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%{{{1
  LOG(0,"Atrip") << "COMMUNICATING ENERGIES \n";
  double global_energy = 0;
  MPI_Reduce(&energy, &global_energy, 1, MPI_DOUBLE, MPI_SUM, 0, universe);

  WITH_RANK << "local energy " << energy << "\n";
  LOG(0, "Atrip") << "Energy: "
    << std::setprecision(15) << std::setw(23)
    << (- global_energy) << std::endl;

  // PRINT TIMINGS {{{1
  if (in.chrono)
  for (auto const& pair: Atrip::chrono)
    LOG(0,"atrip:chrono") << pair.first << " "
                          << pair.second.count() << std::endl;


  LOG(0, "atrip:flops(doubles)")
    << n_iterations * doubles_flops / Atrip::chrono["doubles"].count() << "\n";
  LOG(0, "atrip:flops(iterations)")
    << n_iterations * doubles_flops / Atrip::chrono["iterations"].count() << "\n";

  // TODO: change the sign in the get_energy routines
  return { - global_energy };

}
// instantiate
template Atrip::Output Atrip::run(Atrip::Input<double> const& in);
template Atrip::Output Atrip::run(Atrip::Input<Complex> const& in);

Debug and Logging

Macros

#pragma once
#include <functional>
#define ATRIP_BENCHMARK
//#define ATRIP_DONT_SLICE
#ifndef ATRIP_DEBUG
#  define ATRIP_DEBUG 1
#endif
//#define ATRIP_WORKLOAD_DUMP
#define ATRIP_USE_DGEMM
//#define ATRIP_PRINT_TUPLES

#if ATRIP_DEBUG == 4
#  pragma message("WARNING: You have OCD debugging ABC triples "    \
                  "expect GB of output and consult your therapist")
#  include <dbg.h>
#  define HAVE_OCD
#  define OCD_Barrier(com) MPI_Barrier(com)
#  define WITH_OCD
#  define WITH_ROOT if (atrip::Atrip::rank == 0)
#  define WITH_SPECIAL(r) if (atrip::Atrip::rank == r)
#  define WITH_RANK std::cout << atrip::Atrip::rank << ": "
#  define WITH_CRAZY_DEBUG
#  define WITH_DBG
#  define DBG(...) dbg(__VA_ARGS__)
#elif ATRIP_DEBUG == 3
#  pragma message("WARNING: You have crazy debugging ABC triples,"  \
                  " expect GB of output")
#  include <dbg.h>
#  define OCD_Barrier(com)
#  define WITH_OCD if (false)
#  define WITH_ROOT if (atrip::Atrip::rank == 0)
#  define WITH_SPECIAL(r) if (atrip::Atrip::rank == r)
#  define WITH_RANK std::cout << atrip::Atrip::rank << ": "
#  define WITH_CRAZY_DEBUG
#  define WITH_DBG
#  define DBG(...) dbg(__VA_ARGS__)
#elif ATRIP_DEBUG == 2
#  pragma message("WARNING: You have some debugging info for ABC triples")
#  define OCD_Barrier(com)
#  define WITH_OCD if (false)
#  define WITH_ROOT if (atrip::Atrip::rank == 0)
#  define WITH_SPECIAL(r) if (atrip::Atrip::rank == r)
#  define WITH_RANK std::cout << atrip::Atrip::rank << ": "
#  define WITH_CRAZY_DEBUG if (false)
#  define WITH_DBG
#  define DBG(...) dbg(__VA_ARGS__)
#else
#  define OCD_Barrier(com)
#  define WITH_OCD if (false)
#  define WITH_ROOT if (false)
#  define WITH_SPECIAL(r) if (false)
#  define WITH_RANK if (false) std::cout << atrip::Atrip::rank << ": "
#  define WITH_DBG if (false)
#  define WITH_CRAZY_DEBUG if (false)
#  define DBG(...)
#endif

Users of the library can redefine the LOG macro; if it is left undefined, it defaults to the following:

#ifndef LOG
#define LOG(level, name) if (Atrip::rank == 0) std::cout << name << ": "
#endif
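
As an illustrative sketch, a user could prepend the verbosity level to every message by defining LOG before including any atrip header; the prefix format here is our own choice, not part of the library:

#include <iostream>
// Hypothetical user code: keep the root-rank guard, but also print
// the verbosity level of every message.
#define LOG(level, name)                                \
  if (atrip::Atrip::rank == 0)                          \
    std::cout << "[" << level << "] " << name << ": "
#include <atrip.hpp>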

Furthermore, if you do not wish to see any output from ATRIP, simply define ATRIP_NO_OUTPUT:

#ifdef ATRIP_NO_OUTPUT
#  undef LOG
#  define LOG(level, name) if (false) std::cout << name << ": "
#endif
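
For example, a minimal sketch of a user translation unit:

// Hypothetical user code: silence all ATRIP logging; the macro must
// be defined before any atrip header is included (or be passed as a
// compiler flag).
#define ATRIP_NO_OUTPUT
#include <atrip.hpp>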

Iteration informer

In general, one will want to print some progress information at every iteration. To this end, a developer can register a callback: a function that receives an IterationDescription structure and returns nothing.

namespace atrip {

  struct IterationDescription;
  using IterationDescriptor = std::function<void(IterationDescription const&)>;
  struct IterationDescription {
    static IterationDescriptor descriptor;
    size_t current_iteration;
    size_t total_iterations;
    double current_elapsed_time;
  };

  void register_iteration_descriptor(IterationDescriptor);

}
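
As a usage sketch (the reporter body is illustrative), a simple progress printer can be registered before calling Atrip::run. Note that the descriptor is invoked on every rank, so the callback itself should restrict its output:

#include <iostream>
#include <atrip.hpp>

// Hypothetical user code: print the progress of the main loop
// from the root rank only.
void setup_progress_reporting() {
  atrip::register_iteration_descriptor(
    [](atrip::IterationDescription const& d) {
      if (atrip::Atrip::rank != 0) return;
      std::cout << "iteration " << d.current_iteration
                << " / " << d.total_iterations
                << " (elapsed: " << d.current_elapsed_time << ")"
                << std::endl;
    });
}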

Checkpoints and restarts

Prolog

#pragma once
#include <fstream>
#include <iomanip>

#include <atrip/Atrip.hpp>

namespace atrip {

Introduction

For very heavy workloads, and in the presence of possible bugs in the surrounding packages, it is often useful to be able to restart a calculation from a previously saved state.

An advantage of the atrip algorithm is that its state is essentially given by the following quantities:

No: number of occupied orbitals
Nv: number of virtual orbitals
Nranks: number of ranks
Nnodes: number of nodes
Energy: the current total energy of the iterations
Iteration: the iteration number
Distribution: the type of distribution
RankRoundRobin: whether the round robin is done through the ranks or
  through the nodes

We can encode this information in a simple struct:

// template <typename F>
struct Checkpoint {
  size_t no, nv;
  size_t nranks;
  size_t nnodes;
  double energy;
  size_t iteration;
  // TODO
  // Input<F>::TuplesDistribution distribution(GROUP_AND_SORT);
  bool rank_round_robin;
};

Input and output

In order to read and write the checkpoint information, we need to define a format. We choose a simple flat YAML format without any nesting, so that we can easily write a parser.

void write_checkpoint(Checkpoint const& c, std::string const& filepath) {
  std::ofstream out(filepath);
  out << "No: " << c.no
      << "\n"
      << "Nv: " << c.nv
      << "\n"
      << "Nranks: " << c.nranks
      << "\n"
      << "Nnodes: " << c.nnodes
      << "\n"
      << "Energy: " << std::setprecision(19) << c.energy
      << "\n"
      << "Iteration: " << c.iteration
      << "\n"
      << "RankRoundRobin: " << (c.rank_round_robin ? "true" : "false")
      << "\n";
}
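
A checkpoint file written by this routine looks as follows; the values shown are illustrative and taken from the test at the end of this section:

No: 154
Nv: 1500
Nranks: 480
Nnodes: 10
Energy: -1.493926352289995443
Iteration: 546
RankRoundRobin: true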


Checkpoint read_checkpoint(std::ifstream& in) {
  Checkpoint c;
  // trim the given characters from both ends of the string (avoids regexes)
  auto trim = [](std::string& s, std::string const& chars) {
    s.erase(0, s.find_first_not_of(chars));
    s.erase(s.find_last_not_of(chars) + 1);
    return s;
  };
  for (std::string header, value; std::getline(in, header, ':');) {
    std::getline(in, value, '\n');
    trim(value, " \t"); // trim all whitespaces
    trim(header, " \t");

    /**/ if (header == "No")        c.no = std::atoi(value.c_str());
    else if (header == "Nv")        c.nv = std::atoi(value.c_str());
    else if (header == "Nranks")    c.nranks = std::atoi(value.c_str());
    else if (header == "Nnodes")    c.nnodes = std::atoi(value.c_str());
    else if (header == "Energy")    c.energy = std::atof(value.c_str());
    else if (header == "Iteration") c.iteration = std::atoi(value.c_str());
    else if (header == "RankRoundRobin") c.rank_round_robin = (value[0] == 't');
  }
  return c;
}


Checkpoint read_checkpoint(std::string const& filepath) {
  std::ifstream in(filepath);
  return read_checkpoint(in);
}

Test

#include <atrip/Checkpoint.hpp>
using namespace atrip;
#define _CMP_CHECK(what)                              \
  std::cout << "\t Checking " << #what  << std::endl; \
  assert(in.what == what);                            \
  assert(out.what == what);

  TESTCASE("Testing checkpoint reader and writers",
           const std::string out_checkpoint = "/tmp/checkpoint.yaml";
           const double energy = -1.493926352289995443;
           const size_t no = 154, nv = 1500, nranks = 48*10, nnodes = 10;
           const size_t iteration = 546;
           std::cout << "\twriting to " << out_checkpoint << std::endl;

           for (bool rank_round_robin: {true, false}) {
             atrip::Checkpoint out = {no,
                                      nv,
                                      nranks,
                                      nnodes,
                                      energy,
                                      iteration,
                                      rank_round_robin}, in;


             write_checkpoint(out, out_checkpoint);
             in = read_checkpoint(out_checkpoint);

             _CMP_CHECK(no);
             _CMP_CHECK(nv);
             _CMP_CHECK(nranks);
             _CMP_CHECK(nnodes);
             _CMP_CHECK(iteration);
             _CMP_CHECK(rank_round_robin);
             _CMP_CHECK(energy);
           }


           )
#undef _CMP_CHECK

Epilog

}

Miscellaneous

Complex numbers

#pragma once

#include <complex>
#include <mpi.h>
#include "config.h"
#if defined(HAVE_CUDA)
#include <cuComplex.h>
#endif

namespace atrip {

  using Complex = std::complex<double>;

  template <typename F> F maybe_conjugate(const F);

#if defined(HAVE_CUDA)
  cuDoubleComplex& operator+=(cuDoubleComplex& lz, cuDoubleComplex const& rz);
#endif

  namespace traits {

    template <typename FF> bool is_complex();

    namespace mpi {
      template <typename F> MPI_Datatype datatype_of(void);
    }

  }

}
#include <atrip/Complex.hpp>

namespace atrip {

  template <> double maybe_conjugate(const double a) { return a; }
  template <> Complex maybe_conjugate(const Complex a) { return std::conj(a); }

#if defined(HAVE_CUDA)
  /*
  __device__
  template <> double2
  maybe_conjugate(const double2 a) {
    return {a.x, -a.y};
  }
  */
  template <>
  __device__ cuDoubleComplex
  maybe_conjugate(const cuDoubleComplex a) {
    return {a.x, -a.y};
  }
  /*
  __device__
  double2& operator+=(double2& lz, double2 const& rz) {
    lz.x += rz.x;
    lz.y += rz.y;
    return lz;
  }
  */
  __device__
  cuDoubleComplex& operator+=(cuDoubleComplex& lz, cuDoubleComplex const& rz) {
    lz.x += rz.x;
    lz.y += rz.y;
    return lz;
  }
#endif


  namespace traits {
    template <typename F> bool is_complex() { return false; }
    template <> bool is_complex<double>() { return false; }
    template <> bool is_complex<Complex>() { return true; }
  namespace mpi {
    template <> MPI_Datatype datatype_of<double>() { return MPI_DOUBLE; }
    template <> MPI_Datatype datatype_of<Complex>() { return MPI_DOUBLE_COMPLEX; }
  }
  }

}
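
As a brief sketch of how these traits serve generic code (the helper below is hypothetical, not part of the library), an MPI reduction over either double or Complex values can select its datatype at compile time:

#include <mpi.h>
#include <atrip/Complex.hpp>

// Hypothetical helper: sum a local scalar over all ranks, choosing
// the MPI datatype that matches F (double or Complex).
template <typename F>
F all_reduce_sum(F local, MPI_Comm comm) {
  F global{};
  MPI_Allreduce(&local, &global, 1,
                atrip::traits::mpi::datatype_of<F>(),
                MPI_SUM, comm);
  return global;
}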

Include header

#pragma once

#include <atrip/Atrip.hpp>