Skip to content
hadley edited this page Jun 11, 2013 · 13 revisions

R-C interface

Introduction

Reading the source code of R is an extremely powerful technique for improving your R programming. However, at some point you will hit a brick wall: many R functions are implemented in C. This guide gives you a basic introduction to C and R's internal C api, giving you the basic knowledge needed to read the internals of R that are written in C.

If you want to write new high-performance code, we do not recommend using C, but instead strongly recommend using Rcpp to connect to C++. The Rcpp API protects you from many of the historical idiosyncracies of the R API, takes care of memory management for you, and provides many useful helper methods

The contents of this chapter come from section 5 ("System and foreign language interfaces") of Writing R extensions, focussing on best practices and modern tools. This means it does not cover:

  • the .C interface
  • the old api defined in Rdefines.h
  • esoteric language features that are rarely used

To understand existing C code, it's useful to generate simple examples of your own that you can experiment with. To that end, all examples in this chapter use the inline package, which makes it extremely easy to get up and running with C code. Make sure you have it installed and loaded with the following code:

install.packages("inline")
library(inline)

You'll also (obviously) need a working C compiler. Windows users can use Rupert Murdoch's Rtools. Mac users will need the Xcode command line tools. Most linux distributions will come with the necessary compilers.

Differences between R and C

Even if you've never used C before, you should be able to read C code because the basic structure is similar to R. If you want to learn it more formally The C programming language by Kernigan and Ritchie is a classic.

Important differences from R include:

  • variables can only store specific types of object, and must be declared before use
  • objects are modified in place, unless you specifically copy the object
  • indices start at 0, not 1
  • you must use a semi-colon at end of each line
  • you must have an explicit return statement
  • assignment is done with =, not <-

Calling C functions from R

Generally, calling C functions from R involves two parts: a C function and an R function that uses .Call. The simple function below adds two numbers together and illustrates some of the important features of coding in C (creating new R vectors, coercing input arguments to the appropriate type and dealing with garbage collection).

# In C ----------------------------------------
#include <R.h>
#include <Rinternals.h>

SEXP add(SEXP a, SEXP b) {
  SEXP result;

  PROTECT(result = allocVector(REALSXP, 1));
  REAL(result)[0] = asReal(a) + asReal(b);
  UNPROTECT(1);

  return(result);
}
# In R ----------------------------------------
add <- function(a, b) {
  .Call("add", a, b)
}

In this chapter we'll produce these two pieces in one step by using the inline package. This allows us to write:

add <- cfunction(signature(a = "integer", b = "integer"), "
  SEXP result;

  PROTECT(result = allocVector(REALSXP, 1));
  REAL(result)[0] = asReal(a) + asReal(b);
  UNPROTECT(1);

  return(result);
")
add(1, 5)

The C functions and macros that R provides for us to modify R data structures are all defined in the header file Rinternals.h. It's easiest to find and display this file from within R:

rinternals <- file.path(R.home(), "include", "Rinternals.h")
file.show(rinternals)

Before we begin writing and reading C code, we need to know a little about the basic data structures.

Basic data structures

At the C-level, all R objects are stored in a common datatype, the SEXP. (Technically, this is a pointer to a structure with typedef SEXPREC). A SEXP is a variant type, with subtypes for all R's data structures. The most important types are:

  • REALSXP: numeric vectors
  • INTSXP: integer vectors
  • LGLSXP: logical vectors
  • STRSXP: character vectors
  • VECSXP: lists
  • CLOSXP: functions (closures)
  • ENVSXP: environments

Beware: At the C level, R's lists are VECSXPs not LISTSXPs. This is because early implementations of R used LISP-like linked lists (now known as "pairlists") before moving to the S-like generic vectors that we now know as lists.

There are also SEXPs for less common object types:

  • CPLXSXP: complex vectors
  • LISTSXP: "pair" lists. At the R level, you only need to care about the distinction lists and pairlists for function arguments, but internally they are used in many more places.
  • DOTSXP: '...'
  • SYMSXP: names/symbols
  • NILSXP: NULL

And SEXPs for internal objects, objects that are usually only created and used by C functions, not R functions:

  • LANGSXP: language constructs
  • CHARSXP: "scalar" strings (see below)
  • PROMSXP: promises, lazily evaluated function arguments
  • EXPRSXP: expressions

There's no built-in R function to easily access these names, but we can write one: (This is adapted from code in R's inspect.c)

sexp_type <- cfunction(c(x = "ANY"), '
  switch (TYPEOF(x)) {
    case NILSXP:      return mkString("NILSXP");
    case SYMSXP:      return mkString("SYMSXP");
    case LISTSXP:     return mkString("LISTSXP");
    case CLOSXP:      return mkString("CLOSXP");
    case ENVSXP:      return mkString("ENVSXP");
    case PROMSXP:     return mkString("PROMSXP");
    case LANGSXP:     return mkString("LANGSXP");
    case SPECIALSXP:  return mkString("SPECIALSXP");
    case BUILTINSXP:  return mkString("BUILTINSXP");
    case CHARSXP:     return mkString("CHARSXP");
    case LGLSXP:      return mkString("LGLSXP");
    case INTSXP:      return mkString("INTSXP");
    case REALSXP:     return mkString("REALSXP");
    case CPLXSXP:     return mkString("CPLXSXP");
    case STRSXP:      return mkString("STRSXP");
    case DOTSXP:      return mkString("DOTSXP");
    case ANYSXP:      return mkString("ANYSXP");
    case VECSXP:      return mkString("VECSXP");
    case EXPRSXP:     return mkString("EXPRSXP");
    case BCODESXP:    return mkString("BCODESXP");
    case EXTPTRSXP:   return mkString("EXTPTRSXP");
    case WEAKREFSXP:  return mkString("WEAKREFSXP");
    case S4SXP:       return mkString("S4SXP");
    case RAWSXP:      return mkString("RAWSXP");
    default:          return mkString("<unknown>");
}')
sexp_type(10)
sexp_type(10L)
sexp_type("a")
sexp_type(T)
sexp_type(list(a = 1))
sexp_type(pairlist(a = 1))

Character vectors

R character vectors are stored as STRSXPs, a vector type where every element is a CHARSXP. CHARSXPs are read-only objects and must never be modified. In particular, the C-style string contained in a CHARSXP should be treated as read-only; it's hard to do otherwise because the CHAR accessor function returns a const char*.

Strings have this more complicated design because individual CHARSXP's (elements of a character vector) can be shared between multiple strings. This is an optimisation to reduce memory usage, and can result in unexpected behaviour:

x <- "banana"
y <- rep(x, 1e6)
object.size(x)
# 32-bit: 64 bytes
# 64-bit: 96 bytes
object.size(y) / 1e6
# 32-bit: 4.000056 bytes
# 32-bit: 8.000088 bytes

In 32-bit R, factors occupy about the same amount of memory as strings: both pointers and integers are 4 bytes. In 64-bit R, pointers are 8 bytes, so factors take about twice as much memory as strings.

Coercion and object creation

At the heart of every C function will be a set of conversions between R data structures and C data structures. Inputs and output will always be R data structures (SEXPs) and you will need to convert them to C data structures in order to do any work. An additional complication is the garbage collector: if you don't claim every R object you create, the garbage collector will think they are unused and delete them.

Object creation and garbage collection

The simplest way to create an new R-level object is allocVector, which takes two arguments, the type of SEXP (or SEXPTYPE) to create, and the length of the vector. The following code code creates a three element list containing a logical vector, a numeric vector and an integer vector:

dummy <- cfunction(body = '
  SEXP vec, real, lgl, ints;

  PROTECT(real = allocVector(REALSXP, 2));
  PROTECT(lgl = allocVector(LGLSXP, 10));
  PROTECT(ints = allocVector(INTSXP, 10));

  PROTECT(vec = allocVector(VECSXP, 3));
  SET_VECTOR_ELT(vec, 0, real);
  SET_VECTOR_ELT(vec, 1, lgl);
  SET_VECTOR_ELT(vec, 2, ints);

  UNPROTECT(4);
  return(vec);
')
dummy()

You might wonder what all the PROTECT calls do. They tell R that we're currently using each object, and not to delete it if the garbage collector is activated. (We don't need to protect objects that R already knows about, like function arguments).

You also need to make sure that every protected object is unprotected. UNPROTECT takes a single integer argument, n, and unprotects the last n objects that were protected. If your calls don't match, R will warn about a "stack imbalance in .Call".

Other specialised forms of PROTECT and UNPROTECT are needed in some circumstances: UNPROTECT_PTR(s) unprotects the object pointed to by the SEXP s, PROTECT_WITH_INDEX saves an index of the protection location that can be used to replace the protected value using REPROTECT. Consult the R externals section on garbage collection for more details.

If you run dummy() a few times, you'll notice the output is basically random. This is because allocVector assigns memory to each output, but it doesn't clean it out first. For real functions, you'll want to loop through each element in the vector and zero it out. The most efficient way to do that is to use memset:

zeroes <- cfunction(c(n_ = "integer"), '
  int n = asInteger(n_);
  SEXP out;

  PROTECT(out = allocVector(INTSXP, n));
  memset(INTEGER(out), 0, n * sizeof(int));
  UNPROTECT(1);

  return out;
')
zeroes(10);

Allocation shortcuts

There are a few shortcuts for allocating matrices and 3d arrays:

allocMatrix(SEXPTYPE mode, int nrow, int ncol)
alloc3DArray(SEXPTYPE mode, int nrow, int ncol, int nface)

Beware allocList - it creates a pairlist, not a regular list.

The mkNamed function simplifies the creation of named vectors. The following code is equivalent to list(a = NULL, b = NULL, c = NULL):

const char *names[] = {"a", "b", "c", ""};
mkNamed(VECSXP, names);

Extracting C vectors

There is a helper function for each atomic vector (apart from character, see following) that allows you to index into a SEXP and access the C-level data structure that lives at its heart.

The following example shows how to use the helper function REAL to inspect and modify a numeric vector:

add_one <- cfunction(c(x = "numeric"), "
  SEXP out;
  int n = length(x);

  PROTECT(out = allocVector(REALSXP, n));
  for (int i = 0; i < n; i++) {
    REAL(out)[i] = REAL(x)[i] + 1;
  }
  UNPROTECT(1);

  return out;
")
add_one(as.numeric(1:10))

There are similar helpers for logical, LOGICAL(x), integer, INTEGER(x), complex COMPLEX(x) and raw vectors RAW(x).

If you're working with long vectors, there's a performance advantage to using the helper function once and saving the result in a pointer:

add_two <- cfunction(c(x = "numeric"), "
  SEXP out;
  int n = length(x);
  double *px, *pout;

  PROTECT(out = allocVector(REALSXP, n));

  px = REAL(x);
  pout = REAL(out);
  for (int i = 0; i < n; i++) {
    pout[i] = px[i] + 2;
  }
  UNPROTECT(1);

  return out;
")
add_two(as.numeric(1:10))

library(microbenchmark)
x <- as.numeric(1:1e6)
microbenchmark(
  add_one(x),
  add_two(x)
)

On my computer, add_two is about twice as fast as add_one for a million element vector. This is a common idiom in the R source code.

Strings and lists are more complicated because the individual elements are SEXPs not C-level data structures. You can use STRING_ELT(x, i) and VECTOR_ELT(x, i) to extract individual components of strings and lists respectively. To get a single C string from a element in a R character vector, use CHAR(STRING_ELT(x, i)). Set values in a list or character vector with SET_VECTOR_ELT and SET_STRING_ELT.

Modifying strings

String vectors are a little more complicated. As discussed earlier, a string vector is a vector made up of pointers to immutable CHARSXPs, and it's the CHARSXP that contains the C string (which can be extracted using CHAR). The following function shows how to create a vector of fixed values:

abc <- cfunction(NULL, '
  SEXP out;
  PROTECT(out = allocVector(STRSXP, 3));

  SET_STRING_ELT(out, 0, mkChar("a"));
  SET_STRING_ELT(out, 1, mkChar("b"));
  SET_STRING_ELT(out, 2, mkChar("c"));

  UNPROTECT(1);

  return out;
')
abc()

Things are a little harder if you want to modify the strings in the vector because you need to know a lot about string manipulation in C (which is hard, and harder to do right). For any problem that involves any kind of string modification, you're better off using Rcpp.

The following function just makes a copy of a string, so you can at least see how all the pieces work together.

copy <- cfunction(c(x = "character"), '
  SEXP out;
  int n = length(x);
  const char* letter;

  PROTECT(out = allocVector(STRSXP, n));
  for (int i = 0; i < n; i++) {
    letter = CHAR(STRING_ELT(x, i));
    SET_STRING_ELT(out, i, mkChar(letter));
  }
  UNPROTECT(1);
  
  return out;
')
copy(letters)

One last useful function for operating with strings: we can use STRING_PTR to get a pointer to the STRING_ELTs in a vector, such that we can access the STRING_ELTs by de-referencing the pointer. Occasionally, this can be easier to work with. We'll show how this can make a simple example of reversing a vector of strings easier.

reverse <- cfunction( signature(x="character"), '
  SEXP out;
  int len = length(x);
  PROTECT( out = allocVector(STRSXP, len) );
  SEXP* out_ptr = STRING_PTR(out);
  SEXP* x_ptr = STRING_PTR(x);
  for( int i=0; i < len; ++i ) {
    out_ptr[i] = x_ptr[len-i-1];
  }
  UNPROTECT(1);
  return out;
')

reverse(letters)

Coercing scalars

There also a few helper functions if you want to turn the first element of an R vector into a C scalar:

  • asLogical(x): INTSXP -> int
  • asInteger(x): INTSXP -> int
  • asReal(x): REALSXP -> double
  • CHAR(asChar(x)): STRSXP -> const char*

And similarly it's easy to turn a C scalar into a length-one R vector:

  • ScalarLogical(x): int -> LGLSXP
  • ScalarInteger(x): int -> INTSXP
  • ScalarReal(x): double -> REALSXP
  • mkString(x): const char* -> STRSXP

These all create R-level objects, so need to be PROTECTed.

Modifying objects

You must be very careful when modifying an object that the user has passed into the function. The following function has some very unexpected behaviour:

add_three <- cfunction(c(x = "numeric"), '
  REAL(x)[0] = REAL(x)[0] + 3;
  return(x);
')
x <- 1
y <- x
add_three(x)
x
y

Not only has it modified the value of x, but it has also modified y! This happens because of the way that R implements it's copy-on-modify philosophy. It does so lazily, so a complete copy only has to be made if you make a change: x and y point to the same object, and the object is only duplicated if you change either x or y.

To avoid problems like this, always duplicate() inputs before modifying them:

add_four <- cfunction(c(x = "numeric"), '
  SEXP x_copy;
  PROTECT(x_copy = duplicate(x));
  REAL(x_copy)[0] = REAL(x_copy)[0] + 4;
  UNPROTECT(1);
  return(x_copy);
')
x <- 1
y <- x
add_four(x)
x
y

Pairlists and symbols

R hides a few details of the underlying datastructures it uses. In some places in R code, it looks like you're working with a list (VECSXP), but behind the scenes R you're actually modifying a pairlist (LISTSXP). These include attributes, calls and ....

Pairlists differ from lists in the following ways:

  • Pairlists are linked lists, a data structure which does not have any easy way to get to an arbitrary element of the list

  • Pairlists have tags, not names, and tags are symbols, not strings.

Because you can't easily index into a specified location in a pairlist, R provides a set of helper functions to moved along the linked list. The basic helpers are CAR which extracts the first element of the first, and CDR which extracts the rest. These can be composed to get CAAR, CDAR, CADDR, CADDR, CADDDR. As well as the getters, R also provides SETCAR, SETCDR etc. (Or you can just use )

car <- cfunction(c(x = "ANY"), 'return(CAR(x));')
cdr <- cfunction(c(x = "ANY"), 'return(CDR(x));')
cadr <- cfunction(c(x = "ANY"), 'return(CADR(x));')

x <- quote(f(a = 1, b = 2))
car(x)
cdr(x)
car(cdr(x))
cadr(x)
```R

You can make new pairlists with `CONS` or `LCONS` (if you want a call object). A pairlist is always terminated with `R_NilValue`.

```R
new_call <- cfunction(NULL, '
  SEXP out;

  out = LCONS(install("+"), LCONS(
      ScalarReal(10), LCONS(
        ScalarReal(5), R_NilValue)));
  return out;
')
new_call();

Similarly, you can loop through all elements of a pairlist as follows:

count <- cfunction(c(x = "ANY"), '
  SEXP el, nxt;
  int i = 0;

  for(nxt = x; nxt != R_NilValue; el = CAR(nxt), nxt = CDR(nxt)) {
    i++;
  }
  return(ScalarInteger(i));
')
count(quote(f(a, b, c)))
count(quote(f()))
```R

`TAG` and `SET_TAG` allow you to get and set the tag (aka name) associated with an element of a pairlist. The tag should be a symbol. To create a symbol (the equivalent of `as.symbol` or `as.name` in R), use `install`. 

Attributes are also pairlists behind the scenes, but come with the helper functions `setAttrib` and `getAttrib` to make access a little easier:

```R
set_attr <- cfunction(c(obj = "ANY", attr = "character", value = "ANY"), '
  const char* attr_s = CHAR(asChar(attr));

  duplicate(obj);
  setAttrib(obj, install(attr_s), value);
  return(obj);
')
x <- 1:10
set_attr(x, "a", 1)

There are some (confusingly named) shortcuts for common setting operations: classgets, namesgets, dimgets and dimnamesgets are the internal versions of the default methods of class<-, names<-, dim<- and dimnames<-.

tags <- cfunction(c(x = "ANY"), '
  SEXP el, nxt, out;
  int i = 0;

  for(nxt = CDR(x); nxt != R_NilValue; nxt = CDR(nxt)) {
    i++;
  }

  PROTECT(out = allocVector(VECSXP, i));

  for(nxt = CDR(x), i = 0; nxt != R_NilValue; i++, nxt = CDR(nxt)) {
    SET_VECTOR_ELT(out, i, TAG(nxt));
  }

  UNPROTECT(1);

  return out;
')
tags(quote(f(a = 1, b = 2, c = 3)))
tags(quote(f()))

Missing and non-finite values

For floating point numbers, R's NA is a subtype of NaN so IEEE 754 arithmetic should handle it correctly. However, it is unwise to depend on such details, and is better to deal with missings explicitly:

  • In REALSXPs, use the ISNA macro, ISNAN, or R_FINITE macros to check for missing, NaN or non-finite values. Use the constants NA_REAL, R_NaN, R_PosInf and R_NegInf to set those values

  • In INTSXPs, compare/set values to NA_INTEGER

  • In LGLSXPs, compare/set values to NA_LOGICAL

  • In STRSXPs, compare/set CHAR(STRING_ELT(x, i)) values to NA_STRING.

For example, a primitive implementation of is.NA might look like

is_na <- cfunction(c(x = "ANY"), '
  SEXP out;
  int n = length(x);

  PROTECT(out = allocVector(LGLSXP, n));

  for (int i = 0; i < n; i++) {
    switch(TYPEOF(x)) {
      case LGLSXP:
        LOGICAL(out)[i] = (LOGICAL(x)[i] == NA_LOGICAL);
        break;
      case INTSXP:
        LOGICAL(out)[i] = (INTEGER(x)[i] == NA_INTEGER);
        break;
      case REALSXP:
        LOGICAL(out)[i] = ISNA(REAL(x)[i]);
        break;
      case STRSXP:
        LOGICAL(out)[i] = (STRING_ELT(x, i) == NA_STRING);
        break;
      default:
        LOGICAL(out)[i] = NA_LOGICAL;
    }
  }
  UNPROTECT(1);

  return out;
')
is_na(c(NA, 1L))
is_na(c(NA, 1))
is_na(c(NA, "a"))
is_na(c(NA, TRUE))

It's worth noting that R's base::is.na returns TRUE for both NA and NaNs in a numeric vector, as opposed to the C level ISNA macro, which returns TRUE only for NA_REALs.

There are a few other special values:

nil <- cfunction(NULL, 'return(R_NilValue);')
unbound <- cfunction(NULL, 'return(R_UnboundValue);')
missing_arg <- cfunction(NULL, 'return(R_MissingArg);')

x <- missing_arg()
x

Checking types in C

If the user provides different input to your function to what you're expecting (e.g. provides a list instead of a numeric vector), it's very easy to crash R. For this reason, it's a good idea to write a wrapper function that checks arguments are of the correct type, or coerces them if necessary. It's usually easier to do this at the R level. For example, going back to our first example of C code, we might rename it to add_ and then write a wrapper function to check the inputs are ok:

add_ <- cfunction(signature(a = "integer", b = "integer"), "
  SEXP result;

  PROTECT(result = allocVector(REALSXP, 1));
  REAL(result)[0] = asReal(a) + asReal(b);
  UNPROTECT(1);

  return(result);
")
add <- function(a, b) {
  stopifnot(is.numeric(a), is.numeric(b), length(a) == 1, length(b) == 1)
  add_(a, b)
}

Or if we wanted to be more accepting of diverse inputs:

add <- function(a, b) {
  a <- as.numeric(a)
  b <- as.numeric(b)

  if (length(a) > 1) warning("Only first element of a used")
  if (length(a) > 1) warning("Only first element of b used")
  
  add_(a, b)
}

To coerce objects at the C level, use PROTECT(new = coerceVector(old, SEXPTYPE)). This will return an error if the SEXP can not be converted to the desired type. Note that these coercion functions do not use S3 dispatch.

To check if an object is of a specified type, you can use TYPEOF, which returns a SEXPTYPE:

is_numeric <- cfunction(c("x" = "ANY"), "
  return(ScalarLogical(TYPEOF(x) == REALSXP));
")
is_numeric(7)
is_numeric("a")

Or you can use one of the many helper functions. They all return 0 for FALSE and 1 for TRUE:

  • For atomic vectors: isInteger, isReal, isComplex, isLogical, isString.

  • For combinations of atomic vectors: isNumeric (integer, logical, real), isNumber (integer, logical, real, complex), isVectorAtomic (logical, interger, numeric, complex, string, raw)

  • Matrices (isMatrix) and arrays (isArray)

  • For other more esoteric object: isEnvironment, isExpression, isList (a pair list), isNewList (a list), isSymbol, isNull, isObject (S4 objects), isVector (atomic vectors, lists, expressions)

Note that some of these functions behave differently to the R-level functions with similar names. For example isVector is true for any atomic vector type, lists and expression, where is.vector is returns TRUE only if its input has no attributes apart from names.

Finding the C source code for a function

In many R functions you'll find code like .Internal(mean(x)) or .Primitive("sum"). That means that most of the function is implemented at the C-level. There are two steps to finding the corresponding C source code:

  • First, open src/main/names.c and search for the name of the function. You'll find an entry that tells you the name of the function (which always starts with do_)

  • Next, search the R source code for the name of that function. To make it easier to find where it's defined (rather than everywhere it's used), you can add (SEXP. e.g. to find the source code for findInterval, search for do_findinterval(SEXP.

.External

An alternative to using .Call is to use .External. It is used almost identically, except that the C function will recieve a single arugment containing a LISTSXP, a pairlist from which the arguments can be extracted. This makes it possible to write functions that take a variable number of arguments.

inline does not currently support .External functions.

Using C code in a package

If you're putting your code in a package, it's generally a good idea to stop using inline and revert back to separate R and C functions. At a minimum, you'll need:

  • R files in a R/ directory
  • C files in src/ directory
  • A DESCRIPTION in the main directory
  • A NAMESPACE file containing useDynLib(packagename), which can be generated using a roxygen2 tag: @useDynLib packagename

Running load_all(path/to/package) will automatically compile and reload the code in your package.

Your C code will need these headers:

#include <R.h>
#include <Rinternals.h>
Clone this wiki locally