

Dr.Jit-Core — A tracing GPU/CPU Just-In-Time-Compiler


Introduction

This repository contains an efficient and self-contained just-in-time (JIT) compiler that can vectorize and parallelize computation. It was designed to accelerate differentiable Monte Carlo rendering, which requires dynamic compilation of large amounts of derivative code, though other types of embarrassingly parallel computation are likely to benefit as well.

This library exposes a C and C++ interface that can be used to trace computation, which means that the system internally builds a graph representation of all steps while postponing their evaluation for as long as possible. When the traced computation is finally evaluated, the system fuses all operations into an efficient kernel containing queued computation that is asynchronously evaluated on a desired device. On the CPU, this involves compilation of vectorized LLVM IR and parallel execution using a thread pool, while GPU compilation involves NVIDIA PTX, and either CUDA or OptiX depending on whether or not ray tracing operations are used.

This project can be used independently or as part of the larger Dr.Jit project, which furthermore provides support for automatic differentiation, multidimensional arrays/tensors, and a large library of mathematical functions.

The Dr.Jit-Core library has almost no dependencies: it can be compiled without CUDA, OptiX, or LLVM actually being present on the system (it will attempt to find them at runtime as needed). The library is implemented in C++14 but exposes all functionality through a C99-compatible interface.

Features

Dr.Jit has the following features:

  • Runs on Linux (X86_64), macOS (x86_64 & aarch64), and Windows (x86_64). Other platforms may work as well but have not been tested.

  • Targets

    1. NVIDIA GPUs via CUDA (compute capability 5.0 or newer), and

    2. CPUs via LLVM leveraging available vector instruction set extensions (e.g. Neon or AVX/AVX2/AVX512).

  • Captures and compiles pure arithmetic, side effects, and higher-level operations (loops and dynamic method dispatch) that are preserved 1:1 in generated kernels.

  • Performs several basic optimizations to reduce the amount of LLVM/PTX IR passed to the next compiler stage.

    • Dead code elimination
    • Constant propagation
    • Common subexpression elimination via local value numbering
  • Supports parallel kernel execution on multiple devices (JITing from several CPU threads, or running kernels on multiple GPUs).

  • Provides a fast caching memory allocator that operates in the execution stream of an asynchronous computation device. This addresses a common performance bottleneck.

  • Caches and reuses kernels when the same computation is encountered again. Caching is done both in memory and on disk (~/.drjit on Linux and macOS, ~/AppData/Local/Temp/drjit on Windows).

  • Provides a variety of parallel reductions for convenience.

An example (C++)

The header file drjit-core/array.h provides a convenient C++ wrapper with operator overloading that builds on the C-level API (drjit-core/jit.h). Here is a brief example of how it can be used:

#include <drjit-core/array.h>

using Bool   = CUDAArray<bool>;
using Float  = CUDAArray<float>;
using UInt32 = CUDAArray<uint32_t>;

The above snippet sets up a group of "capitalized" types that invoke the JIT compiler. Any arithmetic involving instances of such types, e.g.,

UInt32 c, a = /* .. */, b = /* .. */;

c = (a + b) * 5;

will conceptually expand to a parallel loop that processes the individual array elements, e.g.,

for (int i = 0; i < array_size; ++i) /* in parallel */ {
    uint32_t tmp0 = a[i] + b[i];
    c[i] = tmp0 * 5;
}

The evaluation of this loop is decoupled from the original program—in effect, running the program decides what the contents of this loop should be.

Let's look at a concrete example using the previously defined types:

// Create a floating point array with 101 linearly spaced entries
// [0, 0.01, 0.02, ..., 1]
Float x = linspace<Float>(0, 1, 101);

// [0, 2, 4, 6, ..., 98]
UInt32 index = arange<UInt32>(50) * 2;

// Equivalent to "y = x[index]"
Float y = gather(x, index);

// Comparisons produce mask arrays
Bool mask = x < .5f;

// Ternary operator
Float z = select(mask, sqrt(x), 1.f / x);

printf("Value is = %s\n", z.str());

Running this program will trigger two kernel launches. The first generates the x array (size 101) when it is accessed by the gather() operation, and the second generates z (size 50) when it is printed in the last line. Both correspond to points during the execution where evaluation could no longer be postponed, e.g., because of the cross-lane memory dependency in the former case.

Simply changing the first lines to

#include <drjit-core/llvm.h>

using Bool   = LLVMArray<bool>;
using Float  = LLVMArray<float>;
using UInt32 = LLVMArray<uint32_t>;

switches to the functionally equivalent LLVM backend. By default, the LLVM backend parallelizes execution via a built-in thread pool, enabling usage that is very similar to the CUDA variant: a single thread issues computation that is then processed in parallel by all cores of the system.

How it works

To understand a bit better how all of this works, we can pop one level down to the C-level interface. The first operation jit_init initializes Dr.Jit and searches for LLVM and/or CUDA as instructed by the user. Note that users don't need to install the CUDA SDK—just having an NVIDIA graphics driver is enough.

jit_init(JitBackendCUDA);

Let's calculate something: we will start by creating a single-precision floating point variable that is initialized with the value 0.5. This involves the function jit_var_f32, which creates a literal constant variable that depends on no other variables.

uint32_t v0 = jit_var_f32(JitBackendCUDA, .5f);

This is a scalar variable, which means that it will produce a single element if evaluated alone, but it can also occur in any computation involving larger arrays and will expand to the needed size.

Programs using Dr.Jit will normally create and destroy vast numbers of variables, and this operation is therefore highly optimized. The operation creates an entry in a very efficient hash table mapping the resulting variable index v0 to a record (backend, type, <operands>). Over time, this hash table will expand to the size that is needed to support the active computation, and from this point onward jit_var_..() operations will not involve any further dynamic memory allocation.

Let's do some computation with this variable: we can create a "counter", which is a Dr.Jit array containing an increasing sequence of integer elements, [0, 1, 2, ..., 9] in this case.

uint32_t v1 = jit_var_counter(/* backend = */ JitBackendCUDA,
                              /* size    = */ 10);

Counters always have the variable type VarTypeUInt32, which we next convert into a single-precision floating point variable.

uint32_t v2 = jit_var_cast(/* index       = */ v1,
                           /* target_type = */ VarTypeFloat32,
                           /* reinterpret = */ 0);

Finally, let's create a more interesting variable whose operands reference some of the previous results.

uint32_t v3 = jit_var_add(v0, v2);

Suppose that we don't plan to perform any further computation / accesses involving v0, v1, and v2. This must be indicated to Dr.Jit by reducing their reference count.

jit_var_dec_ref(v0);
jit_var_dec_ref(v1);
jit_var_dec_ref(v2);

They still have a nonzero internal reference count (i.e. by Dr.Jit itself) since the variable v3 depends on them, and this keeps them from being garbage-collected.

Note that no real computation has happened yet—so far, we were simply manipulating hash table entries. Let's finally observe the result of this calculation by printing the array contents:

printf("Result: %s\n", jit_var_str(v3));

This step internally invokes jit_var_eval(v3) to evaluate the variable, which creates a CUDA kernel containing all steps that are needed to compute the contents of v3 and write them into device-resident memory.

During this compilation step, the following happens: Dr.Jit first traverses the relevant parts of the variable hash table and concatenates all string templates (with appropriate substitutions) into a complete PTX representation. This step is highly optimized and takes on the order of a few microseconds.

Once the final PTX string is available, two things can happen: potentially we've never seen this particular sequence of steps before, and in that case the PTX code must be further compiled to machine code ("SASS", or streaming assembly). This step involves a full optimizing compiler embedded in the GPU driver, which tends to be very slow: usually it's a factor of 1000-10000× slower than the preceding steps within Dr.Jit.

However, once a kernel has been compiled, Dr.Jit will remember it using both an in-memory and an on-disk cache. In programs that perform the same sequence of steps over and over again (e.g. optimization), the slow PTX→SASS compilation step will only occur in the first iteration. Evaluation of v3 will turn the variable from a symbolic representation into a GPU-backed array, and further queued computation accessing it will simply index into that array instead of repeating the original computation.

At the end of the program, we must not forget to decrease the reference count associated with v3, which will release the array from memory. Finally, jit_shutdown() releases any remaining resources held by Dr.Jit.

jit_var_dec_ref(v3);
jit_shutdown(0);

Running this program on a Linux machine provides the following output:

jit_init(): creating directory "/home/wjakob/.drjit" ..
jit_init(): detecting devices ..
jit_cuda_init(): enabling CUDA backend (version 11.1)
 - Found CUDA device 0: "GeForce RTX 3090" (PCI ID 65:00.0, compute cap. 8.6, 82 SMs w/99 KiB shared mem., 23.7 GiB global mem.)
jit_eval(): launching 1 kernel.
  -> launching e93e70f12fcaea9c (n=10, in=0, out=1, ops=8, jit=2.4 us):
     cache miss, build: 33.417 ms, 2.98 KiB.
jit_eval(): done.
Result: [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]
jit_shutdown(): releasing 1 kernel ..
jit_shutdown(): releasing 1 thread state ..
jit_shutdown(): done
jit_cuda_shutdown()

These log messages show that Dr.Jit generated a single kernel within 2.4 μs. However, this kernel was never observed before, necessitating a compilation step by the CUDA driver, which took 33 ms.

Note the Result: [...] line, which is the expected output of the calculation 0.5 + [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]. The extra lines are debug messages whose verbosity can be raised or lowered via the log level. A log callback can also be provided to, e.g., route such messages to a file.

Let's increase the log level to see in more detail what is happening under the hood. This can be done by adding the following two lines at the beginning of the program:

jit_set_log_level_stderr(LogLevelDebug);
jit_set_flag(JitFlagPrintIR, 1);

This produces the following detailed output (there is also LogLevelTrace for the truly adventurous):

jit_init(): detecting devices ..
jit_cuda_init(): enabling CUDA backend (version 11.1)
 - Found CUDA device 0: "GeForce RTX 3090" (PCI ID 65:00.0, compute cap. 8.6, 82 SMs w/99 KiB shared mem., 23.7 GiB global mem.)
jit_var_new(float32 r1): mov.$t0 $r0, 0.5
jit_var_new(uint32 r2[10]): mov.u32 $r0, %r0
jit_var_new(float32 r3[10] <- r2): cvt.rn.$t0.$t1 $r0, $r1
jit_var_cast(float32 r3 <- uint32 r2)
jit_var_new(float32 r4[10] <- r1, r3): add.$t0 $r0, $r1, $r2
jit_eval(): launching 1 kernel.
  -> launching e93e70f12fcaea9c (n=10, in=0, out=1, ops=8, jit=2.9 us):
jit_eval(): launching 1 kernel.
.version 6.0
.target sm_60
.address_size 64

.entry drjit_e93e70f12fcaea9cecd06e2b4b9ab180(.param .align 8 .b8 params[16]) {
    .reg.b8   %b <8>; .reg.b16 %w<8>; .reg.b32 %r<8>;
    .reg.b64  %rd<8>; .reg.f32 %f<8>; .reg.f64 %d<8>;
    .reg.pred %p <8>;

    mov.u32 %r0, %ctaid.x;
    mov.u32 %r1, %ntid.x;
    mov.u32 %r2, %tid.x;
    mad.lo.u32 %r0, %r0, %r1, %r2;
    ld.param.u32 %r2, [params];
    setp.ge.u32 %p0, %r0, %r2;
    @%p0 bra done;

    mov.u32 %r3, %nctaid.x;
    mul.lo.u32 %r1, %r3, %r1;

body: // sm_75
    mov.f32 %f4, 0.5;
    mov.u32 %r5, %r0;
    cvt.rn.f32.u32 %f6, %r5;
    add.f32 %f7, %f4, %f6;
    ld.param.u64 %rd0, [params+8];
    mad.wide.u32 %rd0, %r0, 4, %rd0;
    st.global.cs.f32 [%rd0], %f7;

    add.u32 %r0, %r0, %r1;
    setp.ge.u32 %p0, %r0, %r2;
    @!%p0 bra body;

done:
    ret;
}
     cache hit, load: 69.195 us, 2.98 KiB.
jit_eval(): cleaning up..
jit_eval(): done.
jit_shutdown(): releasing 1 kernel ..
jit_shutdown(): releasing 1 thread state ..
jit_flush_malloc_cache(): freed
 - device memory: 64 B in 1 allocation
jit_shutdown(): done
jit_cuda_shutdown()
Result: [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]

Note in particular the PTX fragment that includes the lines

body: // Loop body
    mov.f32 %f4, 0.5;
    mov.u32 %r5, %r0;
    cvt.rn.f32.u32 %f6, %r5;
    add.f32 %f7, %f4, %f6;

These lines correspond exactly to the variables v0 to v3 that we had previously defined. The surrounding code establishes a grid-stride loop that processes all array elements. This time around, the kernel compilation was skipped, and Dr.Jit loaded the kernel from the on-disk cache file ~/.drjit/e93e70f12fcaea9cecd06e2b4b9ab180.cuda.bin containing a LZ4-compressed version of code and compilation output. The odd hexadecimal value is simply the XXH3 hash of the kernel source code.

When a kernel includes OptiX function calls (ray tracing operations), kernels are automatically launched through OptiX instead of the CUDA driver API.

LLVM backend

The preceding section provided a basic example of Dr.Jit in combination with CUDA. The LLVM backend works essentially the same way; the only change is that the backend argument must be set to JitBackendLLVM.

The LLVM backend operates on vectors matching the SIMD instruction set of the host processor such as AVX/AVX2/AVX512 or ARM NEON.

A kernel transforming fewer than a few thousand elements will be JIT-compiled and executed immediately on the current thread. For larger arrays, Dr.Jit automatically parallelizes evaluation via a thread pool; the repository includes the minimal nanothread library as a git submodule for this purpose. The size of the thread pool can also be set to zero, in which case all computation occurs on the current thread. Another form of parallelism then remains available: using Dr.Jit from multiple threads at once.