change UUID <-> Arrow mapping to (de)serialize to/from 16-byte FixedSizeBinary (#103)

* change UUID <-> Arrow mapping to (de)serialize to/from 16-byte FixedSizeBinary

* fix tests

* optimize UInt128 <-> NTuple{16,UInt8} casting

Co-authored-by: SimonDanisch <[email protected]>

jrevels and SimonDanisch authored Jan 11, 2021
1 parent cbf4456 commit 31476c9
Showing 3 changed files with 34 additions and 13 deletions.
6 changes: 4 additions & 2 deletions docs/src/manual.md
@@ -53,10 +53,12 @@ Apart from letting other packages have all the fun, an `Arrow.Table` itself can

In the arrow data format, specific logical types are supported, a list of which can be found [here](https://arrow.apache.org/docs/status.html#data-types). These include booleans, integers of various bit widths, floats, decimals, time types, and binary/string. While most of these map naturally to types builtin to Julia itself, there are a few cases where the definitions are slightly different, and in these cases, by default, they are converted to more "friendly" Julia types (this auto conversion can be avoided by passing `convert=false` to `Arrow.Table`, like `Arrow.Table(file; convert=false)`). Examples of arrow to julia type mappings include:

-* `Date`, `Time`, `Timestamp`, and `Duration` all have natural Julia defintions in `Dates.Date`, `Dates.Time`, `TimeZones.ZonedDateTime`, and `Dates.Period` subtypes, respectively.
+* `Date`, `Time`, `Timestamp`, and `Duration` all have natural Julia defintions in `Dates.Date`, `Dates.Time`, `TimeZones.ZonedDateTime`, and `Dates.Period` subtypes, respectively.
* `Char` and `Symbol` Julia types are mapped to arrow string types, with additional metadata of the original Julia type; this allows deserializing directly to `Char` and `Symbol` in Julia, while other language implementations will see these columns as just strings
+* Similarly to the above, the `UUID` Julia type is mapped to a 128-bit `FixedSizeBinary` arrow type.
* `Decimal128` and `Decimal256` have no corresponding builtin Julia types, so they're deserialized using a compatible type definition in Arrow.jl itself: `Arrow.Decimal`

+
Note that when `convert=false` is passed, data will be returned in Arrow.jl-defined types that exactly match the arrow definitions of those types; the authoritative source for how each type represents its data can be found in the arrow [`Schema.fbs`](https://github.com/apache/arrow/blob/master/format/Schema.fbs) file.

#### Custom types
@@ -118,7 +120,7 @@ With `Arrow.write`, you provide either an `io::IO` argument or `file::String` to
What are some examples of Tables.jl-compatible sources? A few examples include:
* `Arrow.write(io, df::DataFrame)`: A `DataFrame` is a collection of indexable columns
* `Arrow.write(io, CSV.File(file))`: read data from a csv file and write out to arrow format
-* `Arrow.write(io, DBInterface.execute(db, sql_query))`: Execute an SQL query against a database via the [`DBInterface.jl`](https://github.com/JuliaDatabases/DBInterface.jl) interface, and write the query resultset out directly in the arrow format. Packages that implement DBInterface include [SQLite.jl](https://juliadatabases.github.io/SQLite.jl/stable/), [MySQL.jl](https://juliadatabases.github.io/MySQL.jl/dev/), and [ODBC.jl](http://juliadatabases.github.io/ODBC.jl/latest/).
+* `Arrow.write(io, DBInterface.execute(db, sql_query))`: Execute an SQL query against a database via the [`DBInterface.jl`](https://github.com/JuliaDatabases/DBInterface.jl) interface, and write the query resultset out directly in the arrow format. Packages that implement DBInterface include [SQLite.jl](https://juliadatabases.github.io/SQLite.jl/stable/), [MySQL.jl](https://juliadatabases.github.io/MySQL.jl/dev/), and [ODBC.jl](http://juliadatabases.github.io/ODBC.jl/latest/).
* `df |> @map(...) |> Arrow.write(io)`: Write the results of a [Query.jl](https://www.queryverse.org/Query.jl/stable/) chain of operations directly out as arrow data
* `jsontable(json) |> Arrow.write(io)`: Treat a json array of objects or object of arrays as a "table" and write it out as arrow data using the [JSONTables.jl](https://github.com/JuliaData/JSONTables.jl) package
* `Arrow.write(io, (col1=data1, col2=data2, ...))`: a `NamedTuple` of `AbstractVector`s or an `AbstractVector` of `NamedTuple`s are both considered tables by default, so they can be quickly constructed for easy writing of arrow data if you already have columns of data
36 changes: 26 additions & 10 deletions src/arrowtypes.jl
@@ -44,14 +44,6 @@ struct PrimitiveType <: ArrowType end
ArrowType(::Type{<:Integer}) = PrimitiveType()
ArrowType(::Type{<:AbstractFloat}) = PrimitiveType()

-arrowconvert(::Type{UInt128}, u::UUID) = UInt128(u)
-arrowconvert(::Type{UUID}, u::UInt128) = UUID(u)
-
-# This method is included as a deprecation path to allow reading Arrow files that may have
-# been written before Arrow.jl defined its own UUID <-> UInt128 mapping (in which case
-# a struct-based fallback `JuliaLang.UUID` extension type may have been utilized)
-arrowconvert(::Type{UUID}, u::NamedTuple{(:value,),Tuple{UInt128}}) = UUID(u.value)
-
struct BoolType <: ArrowType end
ArrowType(::Type{Bool}) = BoolType()

@@ -77,6 +69,30 @@ ArrowType(::Type{NTuple{N, T}}) where {N, T} = FixedSizeListType()
gettype(::Type{NTuple{N, T}}) where {N, T} = T
getsize(::Type{NTuple{N, T}}) where {N, T} = N

+ArrowType(::Type{UUID}) = FixedSizeListType()
+gettype(::Type{UUID}) = UInt8
+getsize(::Type{UUID}) = 16
+
+function _unsafe_cast(::Type{B}, a::A)::B where {B,A}
+    a = Ref(a)
+    b = Ref{B}()
+    GC.@preserve a b begin
+        ptra = Base.unsafe_convert(Ptr{A}, a)
+        ptrb = Base.unsafe_convert(Ptr{B}, b)
+        unsafe_copyto!(Ptr{A}(ptrb), ptra, 1)
+    end
+    return b[]
+end
+
+arrowconvert(::Type{NTuple{16,UInt8}}, u::UUID) = _unsafe_cast(NTuple{16,UInt8}, u.value)
+arrowconvert(::Type{UUID}, u::NTuple{16,UInt8}) = UUID(_unsafe_cast(UInt128, u))
+
+# These methods are included as deprecation paths to allow reading Arrow files that may have
+# been written before Arrow.jl's current UUID <-> NTuple{16,UInt8} mapping existed (in which case
+# a struct-based fallback `JuliaLang.UUID` extension type may have been utilized)
+arrowconvert(::Type{UUID}, u::NamedTuple{(:value,),Tuple{UInt128}}) = UUID(u.value)
+arrowconvert(::Type{UUID}, u::UInt128) = UUID(u)
+
struct StructType <: ArrowType end

ArrowType(::Type{<:NamedTuple}) = StructType()
@@ -125,7 +141,7 @@ default(::Type{NamedTuple{names, types}}) where {names, types} = NamedTuple{name
const JULIA_TO_ARROW_TYPE_MAPPING = Dict{Type, Tuple{String, Type}}(
    Char => ("JuliaLang.Char", UInt32),
    Symbol => ("JuliaLang.Symbol", String),
-    UUID => ("JuliaLang.UUID", UInt128),
+    UUID => ("JuliaLang.UUID", NTuple{16,UInt8}),
)

istyperegistered(::Type{T}) where {T} = haskey(JULIA_TO_ARROW_TYPE_MAPPING, T)
@@ -140,7 +156,7 @@ end
const ARROW_TO_JULIA_TYPE_MAPPING = Dict{String, Tuple{Type, Type}}(
    "JuliaLang.Char" => (Char, UInt32),
    "JuliaLang.Symbol" => (Symbol, String),
-    "JuliaLang.UUID" => (UUID, UInt128),
+    "JuliaLang.UUID" => (UUID, NTuple{16,UInt8}),
)

function extensiontype(f, meta)
5 changes: 4 additions & 1 deletion test/runtests.jl
@@ -194,9 +194,12 @@ tt = Arrow.Table(io)
@test length(tt) == length(t)
@test all(isequal.(values(t), values(tt)))

-# 89 - test deprecation path for old UUID autoconversion
+# 89 etc. - test deprecation paths for old UUID autoconversion + UUID FixedSizeListType overloads
u = 0x6036fcbd20664bd8a65cdfa25434513f
@test Arrow.ArrowTypes.arrowconvert(UUID, (value=u,)) === UUID(u)
+@test Arrow.ArrowTypes.arrowconvert(UUID, u) === UUID(u)
+@test Arrow.ArrowTypes.gettype(UUID) == UInt8
+@test Arrow.ArrowTypes.getsize(UUID) == 16

# 98
t = (a = [Nanosecond(0), Nanosecond(1)], b = [uuid4(), uuid4()], c = [missing, Nanosecond(1)])
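
The `UUID` <-> `NTuple{16,UInt8}` cast added in `src/arrowtypes.jl` can be approximated without raw pointers. The sketch below is an illustration under stated assumptions, not the code in this commit: `uuid_to_bytes` and `bytes_to_uuid` are hypothetical names, and it uses `reinterpret` on one-element arrays in place of `_unsafe_cast`. Both approaches reuse the in-memory bytes of the `UInt128`, so the resulting byte order should be the machine's native one.

```julia
using UUIDs

# Hypothetical helpers, not part of Arrow.jl. Each wraps the value in a
# one-element array so that `reinterpret` can view the same 16 bytes as
# another isbits type of identical size.
uuid_to_bytes(u::UUID) = reinterpret(NTuple{16,UInt8}, [u.value])[1]
bytes_to_uuid(b::NTuple{16,UInt8}) = UUID(reinterpret(UInt128, [b])[1])

# Same UUID value as the one used in test/runtests.jl above.
u = UUID(0x6036fcbd20664bd8a65cdfa25434513f)
b = uuid_to_bytes(u)

# The tuple always holds 16 bytes, and the cast round-trips losslessly.
@assert b isa NTuple{16,UInt8}
@assert bytes_to_uuid(b) === u
```

The array-based form allocates, which is presumably why the commit's "optimize UInt128 <-> NTuple{16,UInt8} casting" step uses `Ref`s and `unsafe_copyto!` instead.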