Support AbstractPath where file paths are used (#255)
* Support AbstractPath where file paths are used
* Set package to version 2.2.0

Co-authored-by: Jarrett Revels <[email protected]>
omus and jrevels authored Oct 29, 2021
1 parent 56f8f93 commit a3eec89
Showing 5 changed files with 47 additions and 13 deletions.
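
As a reading aid for the `src/table.jl` and `src/write.jl` hunks, here is a minimal standard-library sketch of the pattern the commit relies on: any argument that is not already bytes or an `IO` is handed to `open`, so a path-like type only has to provide `Base.open(path, mode)`. The names `demo_tobytes`, `demo_write`, and `DemoPath` are hypothetical stand-ins invented for this sketch; they are not part of Arrow.jl or FilePathsBase, and the sketch writes raw bytes rather than arrow data.

```julia
using Mmap

# Read side: same dispatch shape as the new `tobytes` methods in src/table.jl,
# renamed with a `demo_` prefix so nothing clashes with Arrow's internals.
demo_tobytes(bytes::Vector{UInt8}) = bytes
demo_tobytes(io::IO) = read(io)
demo_tobytes(io::IOStream) = Mmap.mmap(io)      # local file handles are memory-mapped
demo_tobytes(file_path) = open(demo_tobytes, file_path, "r")

# Write side: same shape as the new `write(file_path, tbl; ...)` method,
# reduced to writing a byte vector.
function demo_write(file_path, bytes::Vector{UInt8})
    open(file_path, "w") do io
        write(io, bytes)
    end
    return file_path
end

# Hypothetical path-like type: it is not a String and only forwards `Base.open`.
struct DemoPath
    path::String
end
Base.open(p::DemoPath, mode::AbstractString) = open(p.path, mode)

payload = UInt8[0x41, 0x52, 0x52, 0x4f, 0x57]

roundtrip = mktempdir() do dir
    p = DemoPath(joinpath(dir, "demo.bin"))
    demo_write(p, payload)
    # Copy the bytes out so the result does not depend on the mapping
    # once the temporary directory is removed.
    copy(demo_tobytes(p))
end

# A missing file now fails inside `open` rather than in an explicit `isfile` check.
missing_error = try
    demo_tobytes("file_that_doesnt_exist")
    nothing
catch err
    err
end
```

Because the old `isfile` check is gone, the error for a missing file comes from `open` itself, which is consistent with the test suite switching from `ArgumentError` to `SystemError` in this commit.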
6 changes: 4 additions & 2 deletions Project.toml
@@ -1,7 +1,7 @@
name = "Arrow"
uuid = "69666777-d1a9-59fb-9406-91d4454c9d45"
authors = ["quinnj <[email protected]>"]
version = "2.1.0"
version = "2.2.0"

[deps]
ArrowTypes = "31f734f8-188a-4ce0-8406-c8a06bd891cd"
@@ -23,6 +23,7 @@ BitIntegers = "0.2"
CodecLz4 = "0.4"
CodecZstd = "0.7"
DataAPI = "1"
FilePathsBase = "0.9"
PooledArrays = "0.5, 1.0"
SentinelArrays = "1"
Tables = "1.1"
@@ -31,10 +32,11 @@ julia = "1.3"

[extras]
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
FilePathsBase = "48062228-2e41-5def-b9a4-89aafe57970f"
JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
StructTypes = "856f2bd8-1eba-4b0a-8007-ebc267875bd4"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test", "Random", "JSON3", "StructTypes", "CategoricalArrays"]
test = ["CategoricalArrays", "FilePathsBase", "JSON3", "Random", "StructTypes", "Test"]
12 changes: 11 additions & 1 deletion docs/src/manual.md
@@ -6,6 +6,16 @@ The best place to learn about the Apache arrow project is [the website itself](h

The [Arrow.jl](https://github.com/JuliaData/Arrow.jl) Julia package is another implementation, allowing the ability to both read and write data in the arrow format. As a data format, arrow specifies an exact memory layout to be used for columnar table data, and as such, "reading" involves custom Julia objects ([`Arrow.Table`](@ref) and [`Arrow.Stream`](@ref)), which read the *metadata* of an "arrow memory blob", then *wrap* the array data contained therein, having learned the type and size, amongst other properties, from the metadata. Let's take a closer look at what this "reading" of arrow memory really means/looks like.

## Support for generic path-like types

Arrow.jl attempts to support any path-like type wherever a function takes a path as an argument. The Arrow.jl API should generically work as long as the type supports:

- `Base.open(path, mode)::I where I <: IO`

When a custom `IO` subtype is returned (`I`) then the following methods also need to be defined:

- `Base.read(io::I, ::Type{UInt8})` or `Base.read(io::I)`
- `Base.write(io::I, x)`

## Reading arrow data

@@ -173,7 +183,7 @@ Ok, so that's a pretty good rundown of *reading* arrow data, but how do you *pro

### `Arrow.write`

With `Arrow.write`, you provide either an `io::IO` argument or `file::String` to write the arrow data to, as well as a Tables.jl-compatible source that contains the data to be written.
With `Arrow.write`, you provide either an `io::IO` argument or a [`file_path`](#support-for-generic-path-like-types) to write the arrow data to, as well as a Tables.jl-compatible source that contains the data to be written.

What are some examples of Tables.jl-compatible sources? A few examples include:
* `Arrow.write(io, df::DataFrame)`: A `DataFrame` is a collection of indexable columns
7 changes: 2 additions & 5 deletions src/table.jl
@@ -24,11 +24,8 @@ ArrowBlob(bytes::Vector{UInt8}, pos::Int, len::Nothing) = ArrowBlob(bytes, pos,

tobytes(bytes::Vector{UInt8}) = bytes
tobytes(io::IO) = Base.read(io)
function tobytes(str)
    f = string(str)
    isfile(f) || throw(ArgumentError("$f is not a file"))
    return Mmap.mmap(f)
end
tobytes(io::IOStream) = Mmap.mmap(io)
tobytes(file_path) = open(tobytes, file_path, "r")

struct BatchIterator
bytes::Vector{UInt8}
6 changes: 3 additions & 3 deletions src/write.jl
@@ -53,11 +53,11 @@ function write end

write(io_or_file; kw...) = x -> write(io_or_file, x; kw...)

function write(filename::String, tbl; metadata=getmetadata(tbl), colmetadata=nothing, largelists::Bool=false, compress::Union{Nothing, Symbol, LZ4FrameCompressor, ZstdCompressor}=nothing, denseunions::Bool=true, dictencode::Bool=false, dictencodenested::Bool=false, alignment::Int=8, maxdepth::Int=DEFAULT_MAX_DEPTH, ntasks=Inf, file::Bool=true)
    open(filename, "w") do io
function write(file_path, tbl; metadata=getmetadata(tbl), colmetadata=nothing, largelists::Bool=false, compress::Union{Nothing, Symbol, LZ4FrameCompressor, ZstdCompressor}=nothing, denseunions::Bool=true, dictencode::Bool=false, dictencodenested::Bool=false, alignment::Int=8, maxdepth::Int=DEFAULT_MAX_DEPTH, ntasks=Inf, file::Bool=true)
    open(file_path, "w") do io
        write(io, tbl, file, largelists, compress, denseunions, dictencode, dictencodenested, alignment, maxdepth, ntasks, metadata, colmetadata)
    end
    return filename
    return file_path
end

function write(io::IO, tbl; metadata=getmetadata(tbl), colmetadata=nothing, largelists::Bool=false, compress::Union{Nothing, Symbol, LZ4FrameCompressor, ZstdCompressor}=nothing, denseunions::Bool=true, dictencode::Bool=false, dictencodenested::Bool=false, alignment::Int=8, maxdepth::Int=DEFAULT_MAX_DEPTH, ntasks=Inf, file::Bool=false)
29 changes: 27 additions & 2 deletions test/runtests.jl
@@ -15,7 +15,7 @@
# limitations under the License.

using Test, Arrow, ArrowTypes, Tables, Dates, PooledArrays, TimeZones, UUIDs,
    CategoricalArrays, DataAPI
    CategoricalArrays, DataAPI, FilePathsBase
using Random: randstring

include(joinpath(dirname(pathof(ArrowTypes)), "../test/tests.jl"))
@@ -71,6 +71,30 @@ end

end # @testset "arrow json integration tests"

@testset "abstract path" begin
    # Make a custom path type that simulates how AWSS3.jl's S3Path works
    struct CustomPath <: AbstractPath
        path::PosixPath
    end

    Base.read(p::CustomPath) = read(p.path)

    io = Arrow.tobuffer((col=[0],))
    tt = Arrow.Table(io)

    mktempdir() do dir
        p = Path(joinpath(dir, "test.arrow"))
        Arrow.write(p, tt)
        @test isfile(p)

        tt2 = Arrow.Table(p)
        @test values(tt) == values(tt2)

        tt3 = Arrow.Table(CustomPath(p))
        @test values(tt) == values(tt3)
    end
end # @testset "abstract path"

@testset "misc" begin

# multiple record batches
@@ -167,7 +191,8 @@ tt = Arrow.Table(Arrow.tobuffer(t))
@test tt.a == ["aaaaaaaaaa", "aaaaaaaaaa"]

# 49
@test_throws ArgumentError Arrow.Table("file_that_doesnt_exist")
@test_throws SystemError Arrow.Table("file_that_doesnt_exist")
@test_throws SystemError Arrow.Table(p"file_that_doesnt_exist")

# 52
t = (a=Arrow.DictEncode(string.(1:129)),)

2 comments on commit a3eec89

@omus omus commented on a3eec89 Oct 29, 2021


@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/47749

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v2.2.0 -m "<description of version>" a3eec89b51f712d916e7c6c8de78153de3430417
git push origin v2.2.0
