cutorch-rtc

Basic feature list:

  • cutorch.launchPTX function
  • apply kernels from cutorch

This package brings CUDA 7 runtime compilation to Torch. It requires Linux or OS X and a C++11 compiler. Installation:

luarocks install https://raw.githubusercontent.com/szagoruyko/cutorch-rtc/master/cutorch-rtc-scm-1.rockspec

After requiring cutorch-rtc you get the cutorch.launchPTX function, which runs PTX code generated with NVRTC, and the apply1, apply2 and apply3 methods on CudaTensor:

require 'cutorch-rtc'
t = torch.randn(8):cuda()
t:apply1'x = x < 0 ? 0 : x'

This is a simple ReLU implementation.

Documentation

cutorch.launchPTX

Runs compiled PTX.

function cutorch.launchPTX(ptx, kernel_name, arguments, gridDim, blockDim)

Arguments:

  • ptx - compiled PTX as a Lua string
  • kernel_name - name of the kernel to run from the given PTX
  • arguments - Lua table with CudaTensors as inputs and subtables of the form {'int', n} for scalar arguments
  • gridDim - Lua table with the grid size; it must contain at least one value, and missing dimensions are filled with ones
  • blockDim - Lua table with the block size; again it must contain at least one value, and missing dimensions are filled with ones

PTX can be generated at runtime with https://github.com/szagoruyko/nvrtc.torch

Short example:

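require 'cutorch-rtc'
-- assumes the nvrtc.torch package linked above is installed and returns its module table
local nvrtc = require 'nvrtc'

local kernel = [[
extern "C" __global__
void kernel(float *a, int n)
{
  int tx = blockIdx.x*blockDim.x + threadIdx.x;
  if(tx < n)
    a[tx] *= 2.f;
}
]]

local ptx = nvrtc.compileReturnPTX(kernel)
local a = torch.randn(32):cuda()
local b = a:clone()  -- keep the original values for comparison
-- one block of 32 threads, one thread per element
cutorch.launchPTX(ptx, 'kernel', {a, {'int', a:numel()}}, {1}, {32})
-- each element of a should now be twice the corresponding element of b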
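
The short example launches a single block of 32 threads, which is enough for 32 elements. For larger tensors the grid has to grow with the element count. A minimal sketch of that calculation, assuming a one-dimensional launch; grid_for, block_size and n are hypothetical names for illustration and are not part of cutorch-rtc:

```lua
-- Hypothetical helper (not part of cutorch-rtc): number of blocks needed so
-- that blocks * block_size threads cover n elements (ceiling division).
local function grid_for(n, block_size)
  return math.floor((n + block_size - 1) / block_size)
end

local block_size = 256
local n = 1000
local blocks = grid_for(n, block_size)  -- 4 blocks of 256 threads cover 1000 elements

-- With a CudaTensor a holding n elements and ptx compiled as in the short
-- example, the launch would then look like this:
-- cutorch.launchPTX(ptx, 'kernel', {a, {'int', n}}, {blocks}, {block_size})
```

The last block may contain threads past the end of the tensor, which is why the kernel checks tx < n before writing.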

apply1

Applies provided operator to a tensor:

function CudaTensor.apply1(self, op)

op has to be a Lua string that assigns a value to the variable 'x'. CUDA built-in device functions can be used; see the CUDA documentation for more information. Multi-statement ops are supported; the statements have to be separated with ';'. Both contiguous and non-contiguous tensors are valid. The first call to any apply operation takes about 0.5 s; the compiled code is then cached and subsequent calls are fast.
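
As a sketch of a multi-statement op, the following clamps values to the range [0, 6] using two statements joined with ';'. The op string is an illustration and is not taken from the package; the pcall guard is only there so that the snippet also loads on a machine without cutorch-rtc.

```lua
-- Two statements joined with ';', the separator apply1 expects.
local relu6 = table.concat({
  'x = x < 0 ? 0 : x',   -- clamp from below
  'x = x > 6 ? 6 : x',   -- clamp from above
}, '; ')

-- Guarded require: the GPU part runs only when cutorch-rtc is available.
local has_rtc = pcall(require, 'cutorch-rtc')
if has_rtc then
  local t = torch.randn(8):mul(5):cuda()
  t:apply1(relu6)   -- modifies t in place
  print(t)
end
```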

apply2

Applies provided operator using two tensors:

function CudaTensor.apply2(self, a, op)

op has to use 'x' and 'y', which refer to the self and a tensors. It can assign values to both tensors. See apply1 for the other properties.
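
A minimal sketch of an apply2 call that writes to both tensors, assuming equally sized tensors; the op and the names acc and src are illustrative only:

```lua
-- 'x' refers to self (acc below) and 'y' to the second tensor (src below).
local accumulate_and_clear = 'x = x + y; y = 0'

-- Guarded require: the GPU part runs only when cutorch-rtc is available.
local has_rtc = pcall(require, 'cutorch-rtc')
if has_rtc then
  local acc = torch.ones(4):cuda()
  local src = torch.ones(4):mul(2):cuda()
  -- intended effect: acc accumulates src, then src is zeroed
  acc:apply2(src, accumulate_and_clear)
  print(acc, src)
end
```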

apply3

Applies provided operator using three tensors:

function CudaTensor.apply3(self, a, b, op)

op has to use 'x', 'y' and 'z', which refer to the self, a and b tensors. It can assign values to all three tensors. See apply1 for the other properties.
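
A minimal sketch of an apply3 call that uses a CUDA built-in device function, here fmaf from the CUDA math API, to compute a fused multiply-add over three equally sized tensors. The op and the tensor names are illustrative and not taken from the package:

```lua
-- 'x' is self (out below), 'y' and 'z' are the two extra tensors.
-- fmaf(y, z, x) computes y * z + x in a single fused operation.
local fused_madd = 'x = fmaf(y, z, x)'

-- Guarded require: the GPU part runs only when cutorch-rtc is available.
local has_rtc = pcall(require, 'cutorch-rtc')
if has_rtc then
  local out = torch.zeros(4):cuda()
  local lhs = torch.randn(4):cuda()
  local rhs = torch.randn(4):cuda()
  out:apply3(lhs, rhs, fused_madd)   -- intended effect: out = lhs * rhs + out
  print(out)
end
```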