A library for reading SAS data (.sas7bdat) with Spark.
The latest jar can be downloaded from spark-packages.
Version | Scala Version | Spark Version |
---|---|---|
3.0.0-s_2.11 | 2.11.x | 2.4.x |
3.0.0-s_2.12 | 2.12.x | 3.0.x |
- This package allows reading SAS files from local and distributed filesystems, into Spark DataFrames.
- Schema is automatically inferred from metadata embedded in the SAS file (behaviour can be customised; see the parameters below).
- The SAS format is splittable when not file-system compressed, thus we are able to convert a 200GB (1.5Bn rows) .sas7bdat file to .csv files using 2000 executors in under 2 minutes.
- This library uses parso for parsing, as it is the only publicly available parser that handles both forms of SAS compression (CHAR and BINARY).
NOTE: this package does not support writing .sas7bdat files.
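For example, the inferred schema can be inspected immediately after a read. This is a minimal sketch, assuming an existing SparkSession `spark`; `cars.sas7bdat` is a placeholder path:

```scala
// Read a SAS dataset and print the schema inferred from its embedded metadata.
val cars = spark.read
  .format("com.github.saurfang.sas.spark")
  .load("cars.sas7bdat")

cars.printSchema() // column names and types come from the SAS file's metadata
```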
The reader accepts the following options:

- `extractLabel` (default: `false`) - Boolean: extract column labels as column comments for Parquet/Hive
- `forceLowercaseNames` (default: `false`) - Boolean: force column names to lower case
- `inferDecimal` (default: `false`) - Boolean: infer numeric columns with format width > 0 and format precision > 0 as Decimal(Width, Precision)
- `inferDecimalScale` (default: each column's format width) - Int: scale of inferred decimals
- `inferFloat` (default: `false`) - Boolean: infer numeric columns with <= 4 bytes as Float
- `inferInt` (default: `false`) - Boolean: infer numeric columns with <= 4 bytes, format width > 0 and format precision = 0 as Int
- `inferLong` (default: `false`) - Boolean: infer numeric columns with <= 8 bytes, format width > 0 and format precision = 0 as Long
- `inferShort` (default: `false`) - Boolean: infer numeric columns with <= 2 bytes, format width > 0 and format precision = 0 as Short
- `metadataTimeout` (default: `60`) - Int: number of seconds to allow for reading file metadata (stops corrupt files hanging)
- `minSplitSize` (default: `mapred.min.split.size`) - Long: minimum byte length of input splits (splits are always at least 1MB, to ensure correct reads)
- `maxSplitSize` (default: `mapred.max.split.size`) - Long: maximum byte length of input splits (can be decreased to force higher parallelism)
NOTE:
- the order of precedence for numeric type inference is: Long -> Int -> Short -> Decimal -> Float -> Double
- SAS doesn't have a concept of Long/Int/Short; instead, people typically use column formats with 0 precision
```scala
val df = {
  spark.read
    .format("com.github.saurfang.sas.spark")
    .option("forceLowercaseNames", true)
    .option("inferLong", true)
    .load("cars.sas7bdat")
}
df.write.format("csv").option("header", "true").save("newcars.csv")
```
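The other options are set the same way. The sketch below combines some of the inference and split-size options described above; the path and the option values are placeholders to tune for your own data:

```scala
// A hedged example; "large_table.sas7bdat" and the option values are placeholders.
val big = spark.read
  .format("com.github.saurfang.sas.spark")
  .option("extractLabel", true)       // keep SAS column labels as column comments
  .option("inferDecimal", true)       // width/precision formats become Decimal
  .option("inferLong", true)          // 0-precision formats become Long
  .option("metadataTimeout", "120")   // give slower storage more time for metadata
  .option("maxSplitSize", "67108864") // 64MB splits to force higher parallelism
  .load("large_table.sas7bdat")
```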
You can also use the implicit readers:
```scala
import com.github.saurfang.sas.spark._

// DataFrameReader
val df = spark.read.sas("cars.sas7bdat")
df.write.format("csv").option("header", "true").save("newcars.csv")

// SQLContext
val df2 = sqlContext.sasFile("cars.sas7bdat")
df2.write.format("csv").option("header", "true").save("newcars.csv")
```

(Note: you cannot use parameters like `inferLong` with the implicit readers.)
In Python:

```python
df = spark.read.format("com.github.saurfang.sas.spark").load("cars.sas7bdat", forceLowercaseNames=True, inferLong=True)
df.write.csv("newcars.csv", header=True)
```
In R:

```r
df <- read.df("cars.sas7bdat", source = "com.github.saurfang.sas.spark", forceLowercaseNames = TRUE, inferLong = TRUE)
write.df(df, path = "newcars.csv", source = "csv", header = TRUE)
```
SAS data can be queried in pure SQL by registering the data as a (temporary) table.
```sql
CREATE TEMPORARY VIEW cars
USING com.github.saurfang.sas.spark
OPTIONS (path="cars.sas7bdat")
```
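Once registered, the view can be queried like any other table. A small sketch (the column name is illustrative only and not part of the bundled example data):

```sql
SELECT make, COUNT(*) AS n
FROM cars
GROUP BY make
```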
We include a simple `SasExport` Spark program that converts .sas7bdat files to .csv or .parquet:

```sh
sbt "run input.sas7bdat output.csv"
sbt "run input.sas7bdat output.parquet"
```
To achieve more parallelism, use the `spark-submit` script to run it on a Spark cluster. If you don't have a Spark cluster, you can always run it in local mode and take advantage of multiple cores:
```sh
spark-shell --master local[4] --packages saurfang:spark-sas7bdat:3.0.0-s_2.12
```
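A cluster run might look roughly like the sketch below; the master, executor count, and jar path are placeholders for your own setup (pass `--class` as well if the jar's manifest does not name a main class):

```sh
spark-submit \
  --master yarn \
  --num-executors 200 \
  --packages saurfang:spark-sas7bdat:3.0.0-s_2.12 \
  path/to/your-sas-export-assembly.jar input.sas7bdat output.parquet
```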
`spark-csv` writes out `null` as the string "null" in CSV text output. This means that if you read it back into a string column, you may actually get the string "null" instead of a true `null`. The safest option is to export in Parquet format, where nulls are properly recorded. See databricks/spark-csv#147 for alternative solutions.
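For example, a Parquet export of the DataFrame from the earlier examples (a minimal sketch):

```scala
// Parquet records nulls as true nulls rather than the string "null".
df.write.parquet("newcars.parquet")
```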
This project would not be possible without parso's continued improvements and the generous contributions of @mulya, @thesuperzapper, and many others. We are honored to be a recipient of the 2020 WiseWithData ELEVATE Awards and appreciate their generous donations.