Skip to content

olivia76/cpp-sas7bdat

Repository files navigation

cpp-sas7bdat

CMake on Ubuntu-latest CMake on MacOS-latest CMake on Windows-2019 codecov CodeQL

Summary

This is a c++ 17 implementation of a SAS7BDAT file reader. This project also provides a python and R interfaces.

This is a toy project with cmake and C++ external polymorphism.

How to compile, tests, ...

make conan-install  # Download and install conan inside your python environment
make                # Compile with cmake and conan
make tests          # Invoke ctest
make tests-python   # Build python package and execute the tests
make tests-R        # Build R package and execute the tests
make conan-package  # Create and install locally the package for conan
make benchmark      # Execute the benchmark
make clean          # Clean the build directory

What is a SAS7BDAT file?

A SAS7BDAT file is a database storage file created by Statistical Analysis System (SAS) software. It contains a binary encoded dataset used for advanced analytics, business intelligence, data management, predictive analytics, and more. (from fileinfo.com)

Different projects already exists to read SAS7BDAT files:

This project is based on the following implementations:

C++ external polymorphism

This design pattern is very nicely explained in a talk of Klaus Iglberger - Breaking Dependencies: Type Erasure - A Design Analysis (https://www.youtube.com/watch?v=7GIz9SmRgyc)

The idea has been presented in September 1998 in "C++ Report" paper by Chris Cleeland and Douglas C. Schmidt introducing the external polymorphism pattern (https://www.dre.vanderbilt.edu/~schmidt/PDF/C++-EP.pdf).

This pattern allows classes that are not related by inheritance and/or have no virtual methods to be treated polymorphically. This pattern combines C++ language features with patterns like Adapter and Decorator to give the appearance of polymorphic behavior on otherwise unrelated classes.

Summary

A functionality Foo uses a concept X with different methods, i.e. foo(...). The concept X can be seen as an interface that a concrete implementation needs to fulfill. The concrete implementation doesn't have to derive from the interface. The link between the two is done via the model class. Please note that this model class is called adapter in the original paper.

One of the main advantage of this pattern is the complete isolation of each the concrete implementations as they are not linked to a base class, see for example the talk of Sean Parent inheritance is the base class of evil. They only have to expose a set of methods.

summary

This pattern is used at different levels within this package:

  • data source
  • dataset sink
  • selection/filtering of the dataset's columns
  • dataset's column formatters

Data source

A data source based on std::ifstream is provided in this package:

Dataset sink

3 simple dataset sinks are provided in this package:

The first one directly prints the content of the file (header and data) to the screen and the second one is a very basic csv writer (no field protection beside the double quotes, no encoding, ...).

Column filtering

The package provides several filtering options:

Dataset's columns

Each column has a specific type and conversion/format operators. The external polymorphism is used internally to store the exact operator, including endianness, 32/64 bits, ...

Supported types:

  • string (std::string)
  • integer (long)
  • number (double)
  • datetime (boost::posix_time::ptime)
  • date (boost::gregorian::date)
  • time (boost::posix_time::time_duration)

Each formatter class implements one or several getters as well as the to_string method.

Relationship schema

relationship

Usage

c++

Here is an example for converting a sas7bdat file to a csv file:

// See for example apps/cppsas7bdat-ci.cpp
#include <cppsas7bdat/reader.hpp>
#include <cppsas7bdat/source/ifstream.hpp>
#include <cppsas7bdat/sink/csv.hpp>

void sas7bdat_to_csv(const char* _filename_sas7bdat,
                     const char* _filename_csv)
{
	std::ofstream csv_os(_filename_csv);
	cppsas7bdat::Reader(
		cppsas7bdat::datasource::ifstream(_filename_sas7bdat), cppsas7bdat::datasink::csv(csv_os)).read_all();
}

It is possible to provide your own data source and sinks:

#include <cppsas7bdat/reader.hpp>

struct MyDataSource {
	MyDataSource(...) { /* ... */ }

	/// This method is called to check if there is any more data
	bool eof() { /* ... */ }
	/// This method is called to read data
	bool read_bytes(void* _p, const size_t _length) { /* ... */ }
};

struct MyDataSink {
	MyDataSink(...) { /* ... */ }

	/// This method is called once the header/metadata is read.
	void set_properties(const cppsas7bdat::Properties& _properties) { /* ... */ }
	/// This method is called for each new row.
	void push_row(const size_t _irow, cppsas7bdat::Column::PBUF _p) { /* ... */ }
	/// This method is called at the end of data
	void end_of_data() { /* ... */ }
};

void read_sas7bdat(...)
{
	cppsas7bdat::Reader reader(MyDataSource(...), MyDataSink(...));

	// Read row by row
	while(reader.read_row());

	// OR Read chunk by chunk
	while(reader.read_rows(chunk_size));

	// OR read the whole file
	reader.read_all();
}

Python

3 sinks -- SinkByRow(), SinkByChunk(chunk_size) and SinkWholeData() -- are provided by the pycppsas7bdat python package. They use pandas.DataFrame to store the data.

from pycppsas7bdat.read_sas import read_sas

sink = read_sas("filename.sas7bdat", include=[...], exclude=[...])
print(sink.properties)
print(sink.df)
from pycppsas7bdat import Reader
from pycppsas7bdat.sink import SinkByRow, SinkByChunk, SinkWholeData

sink = SinkByRow() # or SinkByChunk() or SinkWholeData()   
r = Reader("filename.sas7bdat", sink, include=[...], exclude=[...])
# Read row by row
while r.read_row(): pass

# Read chunk by chunk
while r.read_rows(chunk_size): pass

# OR read the whole file
r.read_all()

# export to pandas.DataFrame
print(sink.properties)
print(sink.df)

It is easy to write your own sinks:

class MySink(object):
    rows = []

    def set_properties(self, properties): # This method must be defined
		"""
		@brief: Called once after reading the header and metadata
		@param properties: A Properties object with the header, metadata and columns definition
		"""
        self.columns = [col.name for col in properties.metadata.columns]

    def push_row(self, irow, row): # This method must be defined
		"""
		@brief: Called for every row
		@param irow: Zero-based index of the row
		@param row: A list of value, one for each column.
		"""
        self.rows.append(row)

class MySinkChunk(object):
	chunks = []
	chunk_size = 10000   # This member must be present for a SinkChunk

	def set_properties(self, properties): # This method must be defined
        """
		@brief: Called once after reading the header and metadata
		@param properties: A Properties object with the header, metadata and columns definition
		"""
		self.columns = [col.name for col in properties.metadata.columns]

    def push_rows(self, istartrow, iendrow, rows): # This method must be defined
		"""
		@brief: Called for every read chunk of data
		@param istartrow: Zero-based index for the start row
		@param iendrow: Zero-based index for the end row (included)
		@param rows: A dict of list of values. The keys are the columns'names.
		"""
		chunks.append(rows)

R

The R package provides a function to directly read a sas7bdat file in a data.frame:

require(CPPSAS7BDAT)

sink <- CPPSAS7BDAT::read_sas("path/to/file.sas7bdat", include=c(...), exclude=c(...));

properties <- sink$properties;
df <- sink$df;

You can also provide your own sink with a R6 class:

require(CPPSAS7BDAT)
library(R6)

MySink <- R6Class("MySink",
     public=list(
        initialize = function() {
        },
        set_properties = function(properties) {
        },
        push_row = function(irow, row) {
        }
     )
);

MySinkChunk <- R6Class("MySinkChunk",
     public=list(
        chunk_size = NULL,
        initialize = function(chunk_size=10000) {
            self$chunk_size = chunk_size;
        },
        set_properties = function(properties) {
        },
        push_rows = function(istartrow, iendrow, rows) {
        }
     )
);

sink <- MySink$new(); # OR MySinkChunk$new(10000);
r <- CPPSAS7BDAT::sas_reader("path/to/file.sas7bdat", sink, include=c(...), exclude=c(...));
r$read_all(); OR r$read_row(); OR r$read_rows(chunk_size)

properties <- sink$properties;
df <- sink$df;

Performance

File cppsas7bdat -- native ¹ cppsas7bdat -- python ¹ cppsas7bdat -- R ² SASLib.jl ³ readstat ¹ pandas ¹ sas7bdat -- python ¹
data_AHS2013/topical.sas7bdat 0.080 s 0.45 s 0.30s 1.1 s 1.8 s 11s 28 s
data_misc/numeric_1000000_2.sas7bdat 0.013 s 0.21 s 0.02s 0.085 s 1.1 s 0.9s 5.5 s

¹ Measurements done with hyperfine
² Measurements done with rbenchmark
³ Measurements done with Julia/BenchmarkTools
13M, 84355 rows x 114 cols
16M, 1000000 rows x 2 cols

Unit tests

The unit tests use more than 170 files from different sources with different encodings, compressions and endianness.

cmake

Inspired from https://github.com/cpp-best-practices/cpp_starter_project

Dependencies

Conan

This install conan inside your python environment and download all the dependencies missing on your system:

pip install conan
conan install conanfile.py --build=missing

Alternatively, you can install the following dependencies manually:

boost-1.71

sudo apt-get install libboost1.71-all-dev

Catch2

git clone https://github.com/catchorg/Catch2 --branch v2.x
cd Catch2
mkdir build; cd build; cmake ..; make; sudo make install

fmt

git clone https://github.com/fmtlib/fmt.git
cd fmt
mkdir build; cd build; cmake ..; make; sudo make install

sdplog

git clone https://github.com/gabime/spdlog.git
cd spdlog
cmake -S . -B ./build -DSPDLOG_FMT_EXTERNAL=ON; cmake --build ./build; cd build; sudo make install

docopt

git clone [email protected]:docopt/docopt.cpp.git
cd docopt.cpp
mkdir build; cd build; cmake ..; make; sudo make install

json

git clone [email protected]:nlohmann/json.git
cd json
mkdir build; cd build; cmake ..; make; sudo make install