Currently, the Spark Connect client for Rust is highly experimental and should
not be used in any production setting. It is a "proof of concept" to identify how to
interact with a Spark cluster from Rust.
The spark-connect-rs crate aims to provide an entrypoint to Spark Connect and to offer similar DataFrame API interactions.
Project Layout
```
├── crates          <- crates for the implementation of the client side spark-connect bindings
│   └─ connect      <- crate for 'spark-connect-rs'
│      └─ protobuf  <- connect protobuf for apache/spark
├── examples        <- examples of using different aspects of the crate
├── datasets        <- sample files from the main spark repo
```
A future goal is to add crates that make it easier to create other language bindings.
Getting Started
This section explains how to run Spark Connect Rust locally, starting from scratch.
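The full local setup is not reproduced here, but as an orientation the sketch below shows roughly what a client program looks like once a Spark Connect server is reachable. It is a minimal sketch assuming a server listening on `sc://localhost:15002`, a Tokio runtime, and that `SparkSessionBuilder::remote`, `sql`, `filter`, and `show` behave as in the crate's examples; exact signatures may differ between versions.

```rust
use spark_connect_rs::{SparkSession, SparkSessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to an already-running Spark Connect server (endpoint is an assumption).
    let spark: SparkSession =
        SparkSessionBuilder::remote("sc://localhost:15002/;user_id=example_rs")
            .build()
            .await?;

    // Lazily build a DataFrame from a SQL statement.
    let df = spark
        .sql("SELECT * FROM json.`/opt/spark/examples/src/main/resources/employees.json`")
        .await?;

    // Trigger execution on the cluster and print the first rows.
    df.filter("salary >= 3500").show(Some(5), None, None).await?;

    Ok(())
}
```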
DataFrameWriter
Spark Connect should respect the format as long as your cluster supports the specified type and has the required jars; see the write sketch after the table.

| API | Comment |
|-----|---------|
| bucketBy | |
| csv | |
| format | |
| insertInto | |
| jdbc | |
| json | |
| mode | |
| option | |
| options | |
| orc | |
| parquet | |
| partitionBy | |
| save | |
| saveAsTable | |
| sortBy | |
| text | |
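As referenced above, a hedged sketch of writing a DataFrame with this builder, assuming a `df` created as in the Getting Started example. The method names come from the table; the output path, option values, and exact argument types are illustrative only.

```rust
// Write the DataFrame out as Parquet. Names follow the table above;
// the path and the compression option are placeholders.
df.write()
    .format("parquet")
    .option("compression", "snappy")
    .save("/tmp/employees_parquet")
    .await?;
```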
DataFrameWriterV2
| API | Comment |
|-----|---------|
| append | |
| create | |
| createOrReplace | |
| option | |
| options | |
| overwrite | |
| overwritePartitions | |
| partitionedBy | |
| replace | |
| tableProperty | |
| using | |
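A sketch of the V2 writer under the same assumptions; `writeTo` (listed in the DataFrame table below) is assumed to be the entry point, and the table identifier and format are placeholders. The actual Rust method casing and signatures may differ.

```rust
// Create or replace a catalog table from an existing DataFrame `df`
// using the V2 writer; names mirror the table above and are assumptions.
df.writeTo("spark_catalog.default.employees")
    .using("parquet")
    .createOrReplace()
    .await?;
```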
DataStreamReader
| API | Comment |
|-----|---------|
| csv | |
| format | |
| json | |
| load | |
| option | |
| options | |
| orc | |
| parquet | |
| schema | |
| table | |
| text | |
DataStreamWriter
Start a streaming job and return a StreamingQuery object to handle the stream operations; see the sketch after the table.

| API | Comment |
|-----|---------|
| foreach | |
| foreachBatch | |
| format | |
| option | |
| options | |
| outputMode | Uses an enum for OutputMode |
| partitionBy | |
| queryName | |
| start | |
| toTable | |
| trigger | Uses an enum for TriggerMode |
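A sketch of the streaming flow described above, wiring the built-in rate source to the console sink. Method names follow the DataStreamReader/DataStreamWriter tables; the `OutputMode` and `Trigger` enums, their module path, and all signatures are assumptions and may not match the crate exactly.

```rust
use spark_connect_rs::streaming::{OutputMode, Trigger};

// Read the built-in "rate" source as an unbounded DataFrame (assumed API).
let stream_df = spark
    .readStream()
    .format("rate")
    .option("rowsPerSecond", "5")
    .load(None)?;

// Start the query against the console sink and keep the StreamingQuery handle.
let query = stream_df
    .writeStream()
    .format("console")
    .queryName("rate_to_console")
    .outputMode(OutputMode::Append)
    .trigger(Trigger::ProcessingTimeInterval("3 seconds".to_string()))
    .start(None)
    .await?;

// The handle exposes the StreamingQuery methods listed in the next table.
query.stop().await?;
```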
StreamingQuery
| API | Comment |
|-----|---------|
| awaitTermination | |
| exception | |
| explain | |
| processAllAvailable | |
| stop | |
| id | |
| isActive | |
| lastProgress | |
| name | |
| recentProgress | |
| runId | |
| status | |
StreamingQueryManager
| API | Comment |
|-----|---------|
| awaitAnyTermination | |
| get | |
| resetTerminated | |
| active | |
StreamingQueryListener
| API | Comment |
|-----|---------|
| onQueryIdle | |
| onQueryProgress | |
| onQueryStarted | |
| onQueryTerminated | |
DataFrame
Spark DataFrame type object and its implemented traits.

| API | Comment |
|-----|---------|
| agg | |
| alias | |
| approxQuantile | |
| cache | |
| checkpoint | Not part of Spark Connect |
| coalesce | |
| colRegex | |
| collect | |
| columns | |
| corr | |
| count | |
| cov | |
| createGlobalTempView | |
| createOrReplaceGlobalTempView | |
| createOrReplaceTempView | |
| createTempView | |
| crossJoin | |
| crosstab | |
| cube | |
| describe | |
| distinct | |
| drop | |
| dropDuplicatesWithinWatermark | |
| drop_duplicates | |
| dropna | |
| dtypes | |
| exceptAll | |
| explain | |
| fillna | |
| filter | |
| first | |
| foreach | |
| foreachPartition | |
| freqItems | |
| groupBy | |
| head | |
| hint | |
| inputFiles | |
| intersect | |
| intersectAll | |
| isEmpty | |
| isLocal | |
| isStreaming | |
| join | |
| limit | |
| localCheckpoint | Not part of Spark Connect |
| mapInPandas | TBD on this exact implementation |
| mapInArrow | TBD on this exact implementation |
| melt | |
| na | |
| observe | |
| offset | |
| orderBy | |
| persist | |
| printSchema | |
| randomSplit | |
| registerTempTable | |
| repartition | |
| repartitionByRange | |
| replace | |
| rollup | |
| sameSemantics | |
| sample | |
| sampleBy | |
| schema | |
| select | |
| selectExpr | |
| semanticHash | |
| show | |
| sort | |
| sortWithinPartitions | |
| sparkSession | |
| stat | |
| storageLevel | |
| subtract | |
| summary | |
| tail | |
| take | |
| to | |
| toDF | |
| toJSON | Does not return an RDD but a long JSON-formatted String |
| toLocalIterator | |
| toPandas (to_polars & toPolars) | Convert to a polars::frame::DataFrame |
| to_datafusion & toDataFusion | New; convert to a datafusion::dataframe::DataFrame |
| transform | |
| union | |
| unionAll | |
| unionByName | |
| unpersist | |
| unpivot | |
| where | Use filter instead; where is a reserved keyword in Rust (see the sketch after this table) |
| withColumn | |
| withColumns | |
| withColumnRenamed | |
| withColumnsRenamed | |
| withMetadata | |
| withWatermark | |
| write | |
| writeStream | |
| writeTo | |
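As noted in the `where` row above, predicates go through `filter` because `where` is reserved in Rust. A minimal sketch, assuming a `spark` session as in the Getting Started example and that `collect` materializes Arrow data on the client; exact signatures are assumptions.

```rust
// `where` is a reserved keyword in Rust, so predicates are expressed with `filter`.
let df = spark
    .sql("SELECT * FROM range(100)")
    .await?
    .filter("id % 2 == 0");

// Materialize the result on the client (the crate returns Arrow record batches here).
let rows = df.collect().await?;
println!("{:?}", rows);
```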
Column
Spark Column type object and its implemented traits.

| API | Comment |
|-----|---------|
| alias | |
| asc | |
| asc_nulls_first | |
| asc_nulls_last | |
| astype | |
| between | |
| cast | |
| contains | |
| desc | |
| desc_nulls_first | |
| desc_nulls_last | |
| dropFields | |
| endswith | |
| eqNullSafe | |
| getField | This is deprecated but will need to be implemented |
| getItem | This is deprecated but will need to be implemented |
| ilike | |
| isNotNull | |
| isNull | |
| isin | |
| like | |
| name | |
| otherwise | |
| over | Refer to Window for creating window specifications |
| rlike | |
| startswith | |
| substr | |
| when | |
| withField | |
| eq (==) | Rust does not allow overloading == to return anything other than a bool, so column equality is currently written as col("name").eq(col("id")). Not ideal, but it works for now (see the sketch after this table) |
| addition (+) | |
| subtraction (-) | |
| multiplication (*) | |
| division (/) | |
| OR (\|) | |
| AND (&) | |
| XOR (^) | |
| negate (~) | |
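A sketch of the equality workaround and the operator overloads listed above. It assumes `col` and `lit` are exposed in the crate's functions module; everything else mirrors the table and may differ in detail.

```rust
use spark_connect_rs::functions::{col, lit};

// Equality uses the explicit .eq() method because Rust's == must return a bool.
let same_person = col("name").eq(col("id"));

// Arithmetic and boolean operators are overloaded as listed in the table.
let doubled_pay = (col("salary") + col("bonus")) * lit(2);
let flagged = col("active") | col("on_leave");
```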
Data Types
Data types are used for creating schemas and for casting columns to specific types; see the cast sketch after the table.

| Data Type | Comment |
|-----------|---------|
| ArrayType | |
| BinaryType | |
| BooleanType | |
| ByteType | |
| DateType | |
| DecimalType | |
| DoubleType | |
| FloatType | |
| IntegerType | |
| LongType | |
| MapType | |
| NullType | |
| ShortType | |
| StringType | |
| CharType | |
| VarcharType | |
| StructField | |
| StructType | |
| TimestampType | |
| TimestampNTZType | |
| DayTimeIntervalType | |
| YearMonthIntervalType | |
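A small sketch of casting a column to one of the types above; it assumes `cast` accepts a Spark type name string (as in the PySpark API) and that `withColumn` takes a name plus a column expression. Whether the listed DataType values can be passed directly is not covered here.

```rust
use spark_connect_rs::functions::col;

// Cast a column to double before further use; the type name string is illustrative.
let df = df.withColumn("salary_double", col("salary").cast("double"));
```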
Literal Types
Create Spark literal types from these Rust types, e.g. lit(1_i64) would become a LongType() in the schema.
An array can be created like lit([1_i16, 2_i16, 3_i16]), which results in an ArrayType(Short) since all the values of the slice can be translated into a literal type.
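The two examples from the paragraph above written out, assuming `lit` is exposed in the crate's functions module.

```rust
use spark_connect_rs::functions::lit;

// A single i64 literal maps to a Spark LongType.
let long_lit = lit(1_i64);

// A slice of i16 values maps to an ArrayType(Short).
let short_array_lit = lit([1_i16, 2_i16, 3_i16]);
```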
| Spark Literal Type | Rust Type |
|--------------------|-----------|
| Null | |
| Binary | &[u8] |
| Boolean | bool |
| Byte | |
| Short | i16 |
| Integer | i32 |
| Long | i64 |
| Float | f32 |
| Double | f64 |
| Decimal | |
| String | &str / String |
| Date | chrono::NaiveDate |
| Timestamp | chrono::DateTime<Tz> |
| TimestampNtz | chrono::NaiveDateTime |
| CalendarInterval | |
| YearMonthInterval | |
| DayTimeInterval | |
| Array | slice / Vec |
| Map | Create with the function create_map |
| Struct | Create with the function struct_col or named_struct |
Window & WindowSpec
For ease of use, it's recommended to use Window to create the WindowSpec; see the sketch after the table.

| API | Comment |
|-----|---------|
| currentRow | |
| orderBy | |
| partitionBy | |
| rangeBetween | |
| rowsBetween | |
| unboundedFollowing | |
| unboundedPreceding | |
| WindowSpec.orderBy | |
| WindowSpec.partitionBy | |
| WindowSpec.rangeBetween | |
| WindowSpec.rowsBetween | |
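A sketch of building a WindowSpec through Window, as recommended above. The builder-style calls mirror the table; the `Window::new()` constructor, the module path, and the argument types are assumptions.

```rust
use spark_connect_rs::functions::col;
use spark_connect_rs::window::Window;

// Build a window specification: partition by department, order by salary.
let window = Window::new()
    .partitionBy(col("department"))
    .orderBy(col("salary"));
```

The resulting spec would then be attached to an expression through Column::over, listed in the Column table above.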
Functions
Only a few of the functions are covered by unit tests. Functions involving closures or lambdas are not feasible.
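A short usage sketch for the functions module; `col` and `lit` are assumed to be among the covered functions, and the DataFrame methods follow the tables above, so exact signatures may differ.

```rust
use spark_connect_rs::functions::{col, lit};

// Derive a new column from an existing one using function-built expressions.
let df = spark
    .sql("SELECT * FROM range(10)")
    .await?
    .withColumn("id_plus_one", col("id") + lit(1));

df.show(Some(10), None, None).await?;
```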