# What is JDF.jl?
JDF is a serialization format for `DataFrame`s with the following goals:
* Fast save and load times
* Compressed storage on disk
* Enable disk-based data manipulation (not yet achieved)
* Supports machine learning workloads, e.g. mini-batch, sampling (not yet achieved)
JDF.jl is the Julia package for all things related to JDF.
JDF stores a `DataFrame` in a folder with each column stored as a separate file.
There is also a `metadata.jls` file that stores metadata about the original
`DataFrame`. Collectively, the column files, the metadata file, and the enclosing folder
are called a JDF "file".
`JDF.jl` is a pure-Julia solution, which makes it possible to do nifty things
like compression and encapsulating the underlying structure of the arrays that
are hard to do in R and Python. E.g. Python's numpy arrays are C objects, whereas all
the vector types used in JDF are native Julia data types.
## Please note
The next major version of JDF will contain breaking changes, but don't worry: I am fully committed to providing an automatic upgrade path. This means that you can safely use JDF.jl to save your data without worrying that the impending breaking changes will break all your existing JDF files.
## Example: Quick Start
```julia
using RDatasets, JDF, DataFrames
a = dataset("datasets", "iris");
first(a, 2)
```
### *Saving* and *Loading* data
By default JDF loads and saves `DataFrame`s using multiple threads starting from
Julia 1.3. For Julia < 1.3, it saves and loads using one thread only.
```julia
@time jdffile = JDF.save("iris.jdf", a)
@time a2 = DataFrame(JDF.load("iris.jdf"))
```
Simple checks for correctness
```julia
all(names(a2) .== names(a)) # true
all(skipmissing([all(a2[!,name] .== Array(a[!,name])) for name in names(a2)])) #true
```
### Loading only certain columns
You can load only a few columns from the dataset by specifying `cols =
[:column1, :column2]`. For example
```julia
a2_selected = DataFrame(JDF.load("iris.jdf", cols = [:Species, :SepalLength, :PetalWidth]))
```
Compared with loading the whole dataset and then subsetting the columns,
this saves time, as only the selected columns are read from disk.
### Some `DataFrame`-like convenience syntax/functions
To take advantage of some of these convenience functions, you need to create a variable of type `JDFFile` that points to the JDF file on disk. For example
```julia
jdf"path/to/JDF.jdf"
```
or
```julia
path_to_JDF = "path/to/JDF.jdf"
JDFFile(path_to_JDF)
```
#### Using `df[col::Symbol]` syntax
You can load arbitrary `col` using the `df[col]` syntax. However, some of these operations are not
yet optimized and hence may not be efficient.
```julia
afile = JDFFile("iris.jdf")
afile[:Species] # load Species column
```
#### `JDFFile` is Tables.jl column-accessible
```julia
using Tables
ajdf = JDFFile("iris.jdf")
Tables.columnaccess(ajdf)
```
```julia
Tables.columns(ajdf)
```
```julia
Tables.schema(ajdf)
```
```julia
getproperty(Tables.columns(ajdf), :Species)
```
#### Load each column from disk
You can load each column of a JDF file from disk using iterations
```julia
jdffile = jdf"iris.jdf"
for col in eachcol(jdffile)
    # do something to col
    # where `col` is the content of one column of iris.jdf
end
```
To iterate through the column names together with the columns
```julia
jdffile = jdf"iris.jdf"
for (name, col) in zip(names(jdffile), eachcol(jdffile))
    # `name::Symbol` is the name of the column
    # `col` is the content of one column of iris.jdf
end
```
#### Metadata Names & Size from disk
You can obtain the column names (`names`) and the number of columns (`ncol`) of a JDF
file without loading it, for example:
```julia
using JDF, DataFrames
df = DataFrame(a = 1:3, b = 1:3)
JDF.save("plsdel.jdf", df)
names(jdf"plsdel.jdf") # [:a, :b]
# clean up
rm("plsdel.jdf", force = true, recursive = true)
```
### Additional functionality: In memory `DataFrame` compression
`DataFrame`s can take up far more RAM than their on-disk size suggests: a 2GB CSV file can easily take up 10GB in
RAM. One can use the function `type_compress!(df)` to compress any
`df::DataFrame`. E.g.
```julia
type_compress!(df)
```
The function looks at `Int*` columns and checks whether they can be safely "downgraded" to
an `Int*` type with a smaller bit size. It will also convert `Float64` to
`Float32` if `compress_float = true`. E.g.
```julia
type_compress!(df, compress_float = true)
```
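The "downgrade" check can be sketched as follows. This is only an illustration of the idea; `smallest_int_type` is a hypothetical helper, not part of JDF.jl's API:

```julia
# Illustrative sketch of a safe integer "downgrade" check (hypothetical
# helper, not JDF.jl's actual implementation): find the narrowest Int*
# type whose range covers every value in the column.
function smallest_int_type(v::AbstractVector{<:Integer})
    lo, hi = extrema(v)
    for T in (Int8, Int16, Int32)
        typemin(T) <= lo && hi <= typemax(T) && return T
    end
    return Int64
end

smallest_int_type([1, 2, 300])  # Int16, since 300 exceeds typemax(Int8) == 127
```

Converting each eligible column to its smallest safe type is what shrinks the in-memory footprint without losing information.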
`String` compression is _planned_ and will likely employ categorical encoding
combined with RLE encoding.
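As a rough illustration of that planned approach (not JDF.jl's actual implementation), categorical encoding replaces repeated strings with small integer codes into a pool of unique values, and those codes then compress well:

```julia
# Hypothetical sketch of categorical encoding; names are illustrative.
function categorical_encode(v::Vector{String})
    pool = String[]                 # unique values, in order of first appearance
    index = Dict{String,Int}()      # value => code
    codes = Vector{Int}(undef, length(v))
    for (i, s) in enumerate(v)
        codes[i] = get!(index, s) do
            push!(pool, s)
            length(pool)
        end
    end
    codes, pool
end

codes, pool = categorical_encode(["low", "low", "high", "low"])
# codes == [1, 1, 2, 1], pool == ["low", "high"]
```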
## Benchmarks
Here are some benchmarks using the [Fannie Mae Mortgage
Data](https://docs.rapids.ai/datasets/mortgage-data). Please note that a reading
of zero means that the method has failed to read or write.
JDF is a decent performer on both read and write and can achieve comparable
performance to [R's {fst}](https://www.fstpackage.org/), once compiled. The JDF
format also results in much smaller file size vs Feather.jl in this particular
example (probably due to Feather.jl's inefficient storage of `Union{String,
Missing}`).
![](benchmarks/results/fannie-mae-read-Performance_2004Q3.txt.png)
![](benchmarks/results/fannie-mae-write-Performance_2004Q3.txt.png)
![](benchmarks/results/fannie-mae-filesize-Performance_2004Q3.txt.png)
Please note that the benchmarks were obtained on Julia 1.3+. On earlier versions
of Julia, where multi-threading isn't available, JDF is roughly 2x slower than
shown in the benchmarks.
## Supported data types
I believe that restricting the types that JDF supports is vital for simplicity and maintainability.
There is support for
* `WeakRefStrings.StringVector`
* `Vector{T}`, `Vector{Union{Missing, T}}`, `Vector{Union{Nothing, T}}`
* `CategoricalArrays.CategoricalVector{T}` and `PooledArrays.PooledVector`
where `T` can be `String`, `Bool`, `Symbol`, `Char`, `SubString{String}`, `TimeZones.ZonedDateTime` (experimental), and `isbits` types, i.e. `UInt*`, `Int*`,
`Float*`, and `Date*` types, etc.
`RLEVectors` support will be considered in the future when `missing` support
arrives for `RLEVectors.jl`.
## Resources
[@bkamins](https://github.com/bkamins/)'s excellent [DataFrames.jl tutorial](https://github.com/bkamins/Julia-DataFrames-Tutorial/blob/master/04_loadsave.ipynb) contains a section on using JDF.jl.
## How does JDF work?
When saving a JDF, each vector is Blosc compressed (using the default settings)
if possible; this includes all `T` and `Union{Missing, T}` types where `T` is
`isbits`. For `String` vectors, they are first converted to a Run Length
Encoding (RLE) representation, and the lengths component in the RLE are `Blosc`
compressed.
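To illustrate the RLE idea (a minimal sketch, not JDF.jl's actual internals), a vector is split into a list of values and the lengths of their consecutive runs:

```julia
# Minimal run-length encoding (RLE) sketch: collapse consecutive runs
# of equal values into (values, lengths) pairs.
function rle_encode(v::AbstractVector)
    values = eltype(v)[]
    lens = Int[]
    for x in v
        if !isempty(values) && values[end] == x
            lens[end] += 1          # extend the current run
        else
            push!(values, x)        # start a new run
            push!(lens, 1)
        end
    end
    values, lens
end

# Inverse: expand each run back into the original vector.
rle_decode(values, lens) = reduce(vcat, [fill(v, l) for (v, l) in zip(values, lens)])

vals, lens = rle_encode(["a", "a", "b", "b", "b", "c"])
# vals == ["a", "b", "c"], lens == [2, 3, 1]
```

Long runs of repeated strings become a short pool of values plus a highly compressible integer vector of lengths.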
## Development Plans
I fully intend to develop JDF.jl into a language-neutral format by version v0.4. However, I have other OSS commitments, including [R's
{disk.frame}](https://diskframe.com), and hence new features might be slow to arrive. But I am fully committed to making JDF files created using JDF.jl v0.2 or higher loadable in all future JDF.jl versions.
## Notes
* Parallel read and write support is only available from Julia 1.3.
* The design of JDF was inspired by [fst](https://www.fstpackage.org/) in its use of compression and random access to columns