-
Notifications
You must be signed in to change notification settings - Fork 0
/
12_Working_with_Files.qmd
528 lines (425 loc) · 21.6 KB
/
12_Working_with_Files.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
---
title: "Working with Files"
---
## Reading in Data
There are many, many different libraries to manipulate various formats of text and binary files. Here we shall look only at the built-in Julia methods, and the most popular package for reading in large amounts of data - CSV.jl.
For cases where you may want to store binary data (such as storing variables to file), the JLD2[^1] package is very convenient.
[^1]: The data is stored in a format compatible with the HDF5 standard and the package and read most files created by other programs in HDF5 format. HDF, or Hierarchical Data Format is a standard created by the US National Center for Supercomputing Applications.
For processing the data, there are again a near-infinite number of options. The most popular, and extremely powerful, choice is to put the data in a DataFrame object. We shall look at DataFrames in more detail in the next section.
### Using Files in Julia - Open, Read and Write
You can do simple file access through the base Julia commands `open`, `read/readline(s)` and `write`:
To open a file in write mode:
``` julia
f = open("filename.txt", "w")
# IOStream(<file filename.txt>)
write(f, "Hello world.\n")
# 13
close(f)
```
::: callout-note
Note that, unlike when printing to the console, there is no print() and println() versions that do or do not add a new line. When writing to a file, you explicitly add the newline (`\n`) in the string you are writing.
:::
To open a file in read mode:
``` julia
f = open("filename.txt", "r")
# IOStream(<file filename.txt>)
s = readlines(f)
# 1-element Vector{String}:
# "Hello world."
```
To open a file in append mode:
``` julia
f = open("filename.txt", "a")
# IOStream(<file filename.txt>)
write(f, "Hello back.\n")
# 12
close(f)
f = open("filename.txt", "r") # or just f = open("filename.txt")
# IOStream(<file filename.txt>)
s = readlines(f)
2-element Vector{String}:
# "Hello world."
# "Hello back."
```
Opening a file in read, write and append mode is fairly straight-forward. The object returned is an `IOStream`. There are several ways to interact with this object.
- `readline()`: Reads the next line in a file and return it as a `String`
- `readlines()`: This reads the entire file, interprets the contents as `Strings` and returns an array with each line a separate entry.
- `read()`: This reads the entire file, interprets the contents as data (UInt8, single byte values), e.g.
``` julia
ss = read(f)
# 25-element Vector{UInt8}:
# 0x48
# 0x65
# 0x6c
# 0x6c
# 0x6f
# 0x20
# 0x77
# 0x6f
# 0x72
# 0x6c
# 0x64
# 0x2e
# 0x0a
# 0x48
# 0x65
# 0x6c
# 0x6c
# 0x6f
# 0x20
# 0x62
# 0x61
# 0x63
# 0x6b
# 0x2e
# 0x0a
Char.(ss) # Convert the UInt8 data to Char to get better display in Julie
# 25-element Vector{Char}:
# 'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)
# 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
# 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)
# 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)
# 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'w': ASCII/Unicode U+0077 (category Ll: Letter, lowercase)
# 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
# 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
# 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)
# 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
# '.': ASCII/Unicode U+002E (category Po: Punctuation, other)
# '\n': ASCII/Unicode U+000A (category Cc: Other, control)
# 'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)
# 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
# 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)
# 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)
# 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
# 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
# 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
# 'k': ASCII/Unicode U+006B (category Ll: Letter, lowercase)
# '.': ASCII/Unicode U+002E (category Po: Punctuation, other)
# '\n': ASCII/Unicode U+000A (category Cc: Other, control)
```
- `write()`: Write a string to the file.
::: callout-important
A reminder again, that you need to explicitly add new lines characters, `\n`. A call to `write(f, "Line 1", "Line 1")` will write the text "Line 1Line 2" to a single line in the file.
:::
The use of `open()` is optional for reading and overwriting a file (not appending). You could simply supply the filename to `read()/readline()/readlines()` and `write()`:
``` julia
write("filename.txt", "This is some text.\n")
# 19
readlines("filename.txt")
# 1-element Vector{String}:
# "This is some text."
```
Where `open()` comes in handy, is when you want to manipulate the contents of the file with a function. Combining `open()` with a `do` block is the most common way of doing this:
``` julia
as = collect(1:10)
# 10-element Vector{Int64}:
# 1
# 2
# 3
# 4
# 5
# 6
# 7
# 8
# 9
# 10
open("data.txt", "w") do f
for a in as
write(f, string(a)*"/n")
end
end
b = Int64[]
# Int64[]
open("data.txt", "r") do f
for l in eachline(f)
push!(b, parse(Int64, l))
end
end
b
# 10-element Vector{Int64}:
# 1
# 2
# 3
# 4
# 5
# 6
# 7
# 8
# 9
# 10
```
Some new things to note in this example: `eachline()` returns an iterator that contains the lines of the file. We can loop through the lines using this and the lines only get read when needed - very useful for extremely large files.
`parse()`, interprets a string as the type that is passed as the first parameter - `Int64` in this case. It will error if the interpretation is not possible:
``` julia
parse(Float64, "this is not a numeric value")
# ERROR: ArgumentError: cannot parse "this is not a numeric value" as Float64
```
For more information, see the [manual](https://docs.julialang.org/en/v1/base/io-network/)
### The DelimitedFiles Standard Library
While using the built-in file I/O functions are useful for simple text files, we more often work with larger files containing data. The [`DelimitedFiles`](http://delimitedfiles.juliadata.org/dev/) package is useful for working with small to medium (in the Mb range, not Gb) files that contain rows and columns of data. This package was a Julia standard library up to v1.9.0, when it was spun out as a separate package. The intent is to do this with more of the standard libraries to allow them to be developed faster and updated in-between Julia versions.
The benefit of `DelimitedFiles` over alternatives, like `CSV` is that it is lightweight. It does not have the functionality of CSV, nor the speed with larger files. When you only want to read in a small file, however, the additional compile time for `CSV` is more of a burden than a blessing. This is where `DelimitedFiles` shines.
There are only two function in the package:
- `readdlm()`: Read a delimited file
- `writedlm()`: Write a delimited file
In order to accommodate a large number of optional parameters, the package declares several versions of `readdlm`:
``` julia
readdlm(source, delim::AbstractChar, T::Type, eol::AbstractChar; header=false, skipstart=0, skipblanks=true, use_mmap, quotes=true, dims, comments=false, comment_char='#')
readdlm(source, delim::AbstractChar, eol::AbstractChar; options...)
readdlm(source, delim::AbstractChar, T::Type; options...)
readdlm(source, delim::AbstractChar; options...)
readdlm(source, T::Type; options...)
readdlm(source; options...)
```
These parameters do the following:
- `source`: The source filename as a string, or a stream object.
- `delim`: The character used as a delimiter, such as `','`, or `'\t'` (a tab character). Note the single quotes indicating a `Char`, not a `String`.
- `T`: The type of the data. If not specified, the function will interpret the data to identify the type and may return a heterogeneous array. If `T` is a numeric type, non-numeric entries will be interpreted as `NaN` for floating point types, or zero.
- `eol`: The end-of-line character, typically `'\n'`.
- `header`: If `true`, the first row is read as column headings and the function returns a tuple `(data_cells, header_cells)`, rather than just `data_cells`
- `skipstart`: An integer value, indicating the number of lines to skip at the start
- `skipblanks`: If `true`, skip blank lines
- `use_mmap`: Use a memory map to access the file. This could speed up large file access, but must be used with caution on Windows - only when reading once and never when writing to the file.
- `quotes`: If true, column entries that are enclosed in double quotes may contain end-of-line and delimiter characters. Double quote characters inside the quote must be escaped with another double quote(`""`)
- `dims`: A tuple, `(rows, columns)`, that estimated the size of the data. This can speed up things for large files as sufficient memory is allocated in a single block.
- `comments` and `comment_char`: If `comments` is true, lines starting with `comment_char` and text after a `comment_char` in a line are ignored.
The write option is a lot simpler and only takes the file to write to, the data, the delimiter and then the keyword arguments from `readdlm()`:
``` julia
writedlm(f, A, delim='\t'; opts)
```
Here, the only option that is currently used, is `quotes` to indicate that quoted strings can contain end-of-line and delimiter characters.
Some examples:
``` julia
a = collect(1:5)
# 5-element Vector{Int64}:
# 1
# 2
# 3
# 4
# 5
b = collect(2:2:10)
# 5-element Vector{Int64}:
# 2
# 4
# 6
# 8
# 10
writedlm("data.csv", [a b], ',')
readlines("data.csv")
# 5-element Vector{String}:
# "1,2"
# "2,4"
# "3,6"
# "4,8"
# "5,10"
data = readdlm("data.csv", ',')
# 5×2 Matrix{Float64}:
# 1.0 2.0
# 2.0 4.0
# 3.0 6.0
# 4.0 8.0
# 5.0 10.0
data = readdlm("data.csv", ',', Int64)
# 5×2 Matrix{Int64}:
# 1 2
# 2 4
# 3 6
# 4 8
# 5 10
```
### FileIO.jl and JLD2.jl
The `FileIO` package is a common framework for reading and writing files that is used by many other packages, such as `JLD2`.
`FileIO` supplies `load` and `save` and will identify the file's type from the extension. The actual code for a given file type is implemented by the package that uses `FileIO`.
There is a long list of file types and the packages that implement `load` and `save` for them in the [`FileIO` documentation](https://juliaio.github.io/FileIO.jl/stable/registry/). You can simply use `FileIO` and the package will call the correct package to save or load your data or file. That package must of course also be installed in your project.
One of these packages is `JLD2`. It implements `save` and `load` from `FileIO` for generic Julia variables. `JLD2` replaces the original `JLD` and is often hugely faster. `JLD` is still around, but you probably don't want to use it.
#### Using JLD2
For consistency over many file types, we shall look at the `FileIO` interface implemented by JLD2. You can either just install and use `JLD2` or you can install both `FileIO` and `JLD2`, then just use `FileIO`. If you are only going to deal one or two file types, then you may prefer only installing the specific packages, rather than deal with FileIO. Each package will support `load` and `save` functions in addition to their internal functions.
##### JLD2 with `load` and `save`
The `FileIO` specification requires you to supply a name for each variable you save. This can either by via creating a `Dict`[^2], or by passing the names and variables sequentially as parameters:
[^2]: A dictionary, or a collection of name and value pairs.
``` julia
using JLD2
struct MyData
x
y
end
data = MyData(rand(5), rand(5))
# MyData([0.41915922256751215, 0.36513861729204666, 0.922892254146376, 0.12902554672750943, 0.2285766214336168], [0.8030027668439638, 0.2007295612353277, 0.6996873161379902, 0.7449547510169909, 0.5305104381235525])
v = rand(10)
# 10-element Vector{Float64}:
# 0.7689384578959101
# 0.7408163205271128
# 0.9655957120143325
# 0.3581479242990463
# 0.28719219030844134
# 0.6645105539839383
# 0.8936175723328116
# 0.22088721590210036
# 0.3338736118785931
# 0.6492950330159202
save("data.jld2", Dict("data" => data, "vector" => v))
save("data2.jld2", "data", data, "vector", v)
```
The last two statements are equivalent.
To read the file, you can either read the whole dictionary, or specify the entries you want (using the name you specified when saving):
``` julia
load("data.jld2")
# Dict{String, Any} with 2 entries:
# "vector" => [0.768938, 0.740816, 0.965596, 0.358148, 0.287192, 0.664511, 0.893618, 0.220887, 0.333874, 0.649295]
# "data" => MyData([0.419159, 0.365139, 0.922892, 0.129026, 0.228577], [0.803003, 0.20073, 0.699687, 0.744955, 0.5305…
dat = load("data.jld2", "data")
MyData([0.41915922256751215, 0.36513861729204666, 0.922892254146376, 0.12902554672750943, 0.2285766214336168], [0.8030027668439638, 0.2007295612353277, 0.6996873161379902, 0.7449547510169909, 0.5305104381235525])
data.x == dat.x #Check against the original object created earlier
# true
data.y == dat.y
# true
```
Here the `struct`, `MyData` is already defined, but if you read a data file in a fresh instance of Julia without defining it, you will see that `JLD2` reconstructs the custom type for you:
``` julia
using JLD2
data = load("data.jld2", "data")
# ┌ Warning: type Main.MyData does not exist in workspace; reconstructing
# └ @ JLD2 C:\Users\Braam\.julia\packages\JLD2\ryhNR\src\data\reconstructing_datatypes.jl:495
# JLD2.ReconstructedTypes.var"##Main.MyData#292"([0.41915922256751215, 0.36513861729204666, 0.922892254146376, 0.12902554672750943, 0.2285766214336168], [0.8030027668439638, 0.2007295612353277, 0.6996873161379902, 0.7449547510169909, 0.5305104381235525])
typeof(data)
# JLD2.ReconstructedTypes.var"##Main.MyData#292"
```
You won't be able to create new objects of the type, however, as the constructors are not also recreated. You will however be able to access the data.
``` julia
data.x
# 5-element Vector{Float64}:
# 0.41915922256751215
# 0.36513861729204666
# 0.922892254146376
# 0.12902554672750943
# 0.2285766214336168
data.y
# 5-element Vector{Float64}:
# 0.8030027668439638
# 0.2007295612353277
# 0.6996873161379902
# 0.7449547510169909
# 0.5305104381235525
```
### CSV.jl
CSV, or comma separated values files are a very common way of storing data. You can save an Excel worksheet as a CSV file, and then process that further in Julia.
The `CSV` package is one of the fastest (often the fastest, but things can change with new versions of other packages) ways of reading large (VERY large) CSV files in any language[^3].
[^3]: See https://www.zdnet.com/article/programming-languages-julia-touts-its-speed-edge-over-python-and-r/ as an example. Here Julia and CSV was up to 22x faster than R's `fread` and both R and Julia were faster than Pandas (Python)
The `CSV` package has a multitude of features. We are only going to look at the most commonly used ones here. You would however spend your time well in reading the full [documentation](https://csv.juliadata.org/stable/index.html) to see other options, like reading data directly from a zip or g-zipped file.
We are also going to assume the most common use case, that your data is read into or written from a DataFrame object. There is a separate section on DataFrames.
#### Writing a CSV file
Writing a DataFrame to a CSV file is simple. You just call `CSV.write(filename, dataframe, keyword options)`
``` julia
using CSV, DataFrames
df = DataFrame(a = rand(10), b = rand(10))
# 10×2 DataFrame
# Row │ a b
# │ Float64 Float64
# ─────┼──────────────────────
# 1 │ 0.48043 0.94456
# 2 │ 0.0665074 0.677552
# 3 │ 0.789794 0.396974
# 4 │ 0.0412975 0.987218
# 5 │ 0.456003 0.789401
# 6 │ 0.295094 0.985048
# 7 │ 0.837373 0.654643
# 8 │ 0.378567 0.632108
# 9 │ 0.890707 0.700569
# 10 │ 0.00709744 0.637061
CSV.write("data.csv", df)
# "data.csv"
```
There are many options that can be passed as keyword arguments. We shall look at only the more commonly used ones:
- `delim`: A character (or string) that specifies the delimiter character. Default to a comma.
- `quotechar`: A character that specifies what quote character should be used to wrap strings that contain end-of-line, delimiting characters.
- `missingstring`: A string that will be written in the place of `missing` values.
- `dateformat`: A date format string for `Date` and `DateTime` values.
- `append`: If `true`, will append to an existing file. Defaults to `false`.
- `header`: A list of column names to replace those of the input table or DataFrame
- `decimal`: The character to use for decimals, Defaults to `'.'`.
There are several more in the [documentation](https://csv.juliadata.org/stable/writing.html#CSV.write).
#### Reading a CSV file
The easiest way to read a CSV file is via `CSV.read()`. This function allows you to specify a sink - the type the data should be cast into.
``` julia
using CSV, DataFrames
df = CSV.read("data.csv", DataFrame)
# 10×2 DataFrame
# Row │ a b
# │ Float64 Float64
# ─────┼──────────────────────
# 1 │ 0.48043 0.94456
# 2 │ 0.0665074 0.677552
# 3 │ 0.789794 0.396974
# 4 │ 0.0412975 0.987218
# 5 │ 0.456003 0.789401
# 6 │ 0.295094 0.985048
# 7 │ 0.837373 0.654643
# 8 │ 0.378567 0.632108
# 9 │ 0.890707 0.700569
# 10 │ 0.00709744 0.637061
```
You can also pass keyword options, like for `CSV.write()`:
- `header`:
- When passed an integer, this is the number of the line that contains the column names. Lines before this are considered comments.
- If a vector of integers are passed, these rows will be concatenated to determine the column names. - A vector of names (either strings or symbols) can also be passed to specify the names. Don't do this if there are names in the file, unless you skip that line with `skipto`
- `header` can be set to either zero or `false` to auto-generate column names (Column1, Column2...).
- If commented or empty rows are present, counting starts at the first non-commented/non-empty row.
- `normalizenames`:
- When set to `true`, this will replace spaces in names with underscores and any other processing that is needed to generate valid Julia identifiers.
- Defaults to `false`.
- `skipto`:
- Jump to the specified line (an integer) and start reading there.
- Can be used to skip the column names and replace them with names specified in `header`.
- Note that commented and empty rows (if `ignoreemptyrows` is specified) are **not** counted.
- `footerskip`:
- Skip the specified number of lines at the end of the file.
- Commented rows do not count, nor empty rows if `skipemptyrows` is specified.
- `transpose`:
- Transpose the file - rows become columns etc.
- `comment`:
- A string that specifies which rows are commented in the file. Any row beginning with this is considered a comment.
- `ignoreemptyrows`:
- If `true`, empty lines will be skipped.
- Note that this can influence the count in `skipto` and `header`.
- Defaults to `true`.
- `select`:
- Pass a vector of integers, symbols, strings or `Bool`s to indicate which **columns** to read.
- Can also pass a predicate function (i, name) -\> keep::Bool Only functions for which the function returns `true` are kept.
- `drop`:
- The inverse of `select`. Indicate which columns to skip.
- `limit`:
- The maximum number of rows to read.
- Combine with `skipto` to only read a part of the file.
- `missingstring`:
- Specifies a string that indicates `missing` values. Often this will be `NA` when the data file was generated by R.
- `delim`:
- A character used to separate the columns.
- Defaults to `','`.
- `ignorerepeated`:
- If `true`, consecutive delimiters are treated as a single one.
- Use with caution, as consecutive delimiters can also be used to show a missing value from a column. Some files, however, use fixed column widths and pad with delimiters, such as spaces.
- `quoted`:
- Indicate whether quoted strings are present
- `quotechar`:
- Indicate the character used as quotation mark.
- Quoted strings can include end-of-line and delimiter characters.
- `dateformat`:
- A date format string for Date and DateTime columns
- See Dates.DateFormat in the Julia [documentation](https://docs.julialang.org/en/v1/stdlib/Dates/#Dates.DateFormat).
- `decimal`:
- Indicate the decimal character
- Defaults to `'.'`.
- `truestrings` / `falsestrings`
- Vectors that specify strings that indicate `true` and `false` values, like "true", "TRUE", "T", "1", etc.
- `skipwhitespace`:
- If `true`, skip leading and trailing white space from values and column names
- `types`:
- A single type, vector or Dict of types to specify the types of each column, when you want to override the automatic detection of types.
- The Dict can link a column name (as string or symbol) or index to a type, e.g. `Dict(1 => Int64)`.
- Consider using `validate` with `types`.
- `validate`:
- Check that the data and specified types match up.
Other parameters specify the number of parallel thread to read in large files and how many lines should be processed to determine the types of each column, etc. Clearly `CSV` is a complex package with huge flexibility. Unfortunately, this usually means a bit of a learning curve for the users.