-
-
Notifications
You must be signed in to change notification settings - Fork 16
/
09-geojoins.qmd
258 lines (189 loc) · 8.71 KB
/
09-geojoins.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
# Geospatial joins
```{julia}
#| echo: false
#| output: false
import Pkg
Pkg.activate(".")
using GeoStats
import CairoMakie as Mke
```
Another important tool in geospatial data science is the **geospatial join**.
We will introduce the concept with a practical example, and will explain how it
is related to the standard [join](https://en.wikipedia.org/wiki/Join_(SQL)) of
two tables.
## Motivation
The split-apply-combine pattern that we learned in the previous chapter requires a
**single** geotable with all the relevant information in it. However, many questions
in geospatial data science can only be answered with information that is spread across
**multiple** geotables. Hence, the need to **join** these geotables before attempting
to `@groupby`, `@transform` and `@combine` the information.
Let's consider a simple example where we are given two geotables, one containing people
who shared their latitude and longitude coordinates:
```{julia}
table = (
NAME=["John", "Mary", "Paul", "Anne", "Kate"],
AGE=[34.0, 12.0, 23.0, 39.0, 28.0]u"yr",
HEIGHT=[1.78, 1.56, 1.70, 1.80, 1.72]u"m",
LATITUDE=[-22.96710361241228, 37.42773662442142, -27.486220858775997, 39.90358408375064, -3.847311538763359],
LONGITUDE=[-43.17891118844475, -122.17007072663823, 153.04380578036657, 116.40764745941036, -32.411372812211226]
)
people = georef(table, (:LONGITUDE, :LATITUDE)) |> Proj(PlateCarree)
```
And another containing countries in the world according to
the [Natural Earth](https://www.naturalearthdata.com) project:
```{julia}
using GeoIO
countries = GeoIO.load("data/countries.geojson", numbertype = Float64) |> Proj(PlateCarree)
```
::: {.callout-note}
The `georef` function has a method that accepts the names of
columns with geospatial coordinates. In this example, the `table`
already has the `LATITUDE` and `LONGITUDE` coordinates, which we
project to a `PlateCarree` CRS as discussed in the chapter
[Map projections](06-projections.qmd).
:::
::: {.callout-note}
The `numbertype` of the coordinates is specified to avoid the default
floating point type used by the GeoJSON backend, which is `Float32`.
:::
Let's visualize the geometries of these two geotables:
```{julia}
fig = Mke.Figure()
ax = Mke.Axis(fig[1,1], title = "People and countries",
xlabel = "Easting [m]", ylabel = "Northing [m]")
viz!(countries.geometry)
viz!(people.geometry, color = "teal", pointsize = 10)
fig
```
Our goal in this example is to attach the "COUNTRY" and "REGION" information
to each individual based on their location, represented as a point. In other
words, we want to find the `countries` that **contain the location**, and then
copy the columns to the `people`.
## Joining geotables
The `geojoin` function can be used to join two geotables using a **geometric predicate**
function:
```{julia}
geojoin(people, countries, pred = ∈)
```
In the example above, we used the `∈` predicate to check if a `Point` in the `people`
geotable is in a `MultiPolygon` in the `countries` geotable. The default predicate
in `geojoin` is the `intersects` function that we covered in previous chapters.
Notice that Kate's "COUNTRY" and "REGION" are `missing`. That is because Kate lives
in the [Fernando de Noronha](https://en.wikipedia.org/wiki/Fernando_de_Noronha) island,
which is not present in the `countries` geotable. To retain just those people for which
the geometric predicate evaluates `true`, we can use a different `kind` of `geojoin`
known as `:inner`:
```{julia}
geojoin(people, countries, kind = :inner)
```
By default, the `:left` kind is used. Like in standard join, the `geojoin` is not
commutative. If we swap the order of the geotables, the result will be different:
```{julia}
geojoin(countries, people, kind = :inner)
```
To learn about the different `kind`s of join, check DataFrames.jl's
[documentation on joins](https://dataframes.juliadata.org/stable/man/joins).
::: {.callout-note}
In database-style join, the predicate function `isequal` is applied to arbitrary
columns of tables. In `geojoin`, **geometric predicate** functions are applied to
the `geometry` column of geotables.
The `tablejoin` function can be used for database-style join between a geotable
and a simple table (e.g., `DataFrame`). The result will be a new geotable over a
subset of geometries from the first argument. Please check the documentation for
more details.
:::
::: {.callout-note}
## Tip for all users
Besides `geojoin`, the framework also provides vertical and horizontal concatenation
of geotables. Vertical concatenation is achieved with `vcat` as follows:
```{julia}
gt1 = georef((a=[1,2,3], b=[4,5,6]))
gt2 = georef((b=[4,5,6], c=[7,8,9]))
vcat(gt1, gt2)
```
All columns are preserved by default, which corresponds to the option `kind = :union`.
Absent columns are filled with `missing` values. The option `kind = :intersect` can be
used to retain only the columns that are present in all geotables:
```{julia}
vcat(gt1, gt2, kind = :intersect)
```
Horizontal concatenation is achieved with `hcat` as follows:
```{julia}
hcat(gt1, gt2)
```
Unique column names are produced with underscore suffixes whenever a column appears in
multiple geotables. Julia provides convenient syntax for `vcat` and `hcat`:
```{julia}
[gt1
gt2]
```
```{julia}
[gt1 gt2]
```
:::
## Common predicates
In geospatial data science, the most common geometric predicates used in `geojoin`
are `∈`, `⊆`, `==`, `≈` and `intersects` as illustrated in @tbl-predicate. Specific
applications may require custom predicates, which can be easily defined in pure Julia
with the Meshes.jl module. For example, it is sometimes convenient to define geometric
predicates in terms of a distance and a threshold:
```{julia}
pred(g1, g2) = evaluate(Euclidean(), centroid(g1), centroid(g2)) ≤ 1500u"km"
geojoin(people, countries, kind = :inner, pred = pred)
```
| PREDICATE | EXAMPLE |
|:----------:|:----------------------------:|
| ∈ | ![in](images/in.png) |
| ⊆ | ![in](images/subseteq.png) |
| == | ![in](images/isequal.png) |
| ≈ | ![in](images/isapprox.png) |
| intersects | ![in](images/intersects.png) |
: Common geometric predicates in geospatial join {#tbl-predicate}
## Multiple matches
Sometimes the predicate function will evaluate `true` for multiple rows of the geotables.
In this case, we need to decide which value will be copied to the resulting geotable. The
`geojoin` accepts reduction functions for each column. For example, suppose that Kate
decided to move to the continent:
```{julia}
table = (
NAME=["John", "Mary", "Paul", "Anne", "Kate"],
AGE=[34.0, 12.0, 23.0, 39.0, 28.0]u"yr",
HEIGHT=[1.78, 1.56, 1.70, 1.80, 1.72]u"m",
LATITUDE=[-22.96710361241228, 37.42773662442142, -27.486220858775997, 39.90358408375064, -9.66628224039543],
LONGITUDE=[-43.17891118844475, -122.17007072663823, 153.04380578036657, 116.40764745941036, -35.71261407423411]
)
people = georef(table, (:LONGITUDE, :LATITUDE)) |> Proj(PlateCarree)
```
Now, both John and Kate live in Brazil according the `countries` geotable:
```{julia}
geojoin(people, countries)
```
If we swap the order of the geotables in the `geojoin`, we need to decide how to reduce
the multiple values of "NAME", "AGE" and "HEIGHT" in the resulting geotable. By default,
the `mean` function is used to reduce `Continuous` variables, and the `first` function
is used otherwise:
```{julia}
geojoin(countries, people, kind = :inner)
```
We can specify custom reduction functions using the following syntax:
```{julia}
geojoin(countries, people, "AGE" => maximum, "HEIGHT" => mean, kind = :inner)
```
::: {.callout-note}
The keyword arguments `kind` and `pred` must appear at the end of the `geojoin`:
```julia
geojoin(gt1, gt2, var1 => red1, ..., varn => redn; kind = :left, pred = intersects)
```
:::
## Congratulations!
Congratulations on finishing **Part III** of the book. Let's quickly review what we learned so far:
- In order to answer geoscientific questions with `@groupby`, `@transform` and `@combine`,
we need a single geotable with all the relevant information. This single geotable is often
the result of a `geojoin` with multiple geotables from different sources of information.
- There are various `kind`s of `geojoin` such as `inner` and `left`, which match the behavior
of standard join in databases. We recommend the DataFrames.jl documentation to learn more.
- The `geojoin` produces matches with **geometric predicate** functions. The most common are
illustrated in @tbl-predicate, but custom predicates can be easily defined in pure Julia
using functions defined in the Meshes.jl module.
In the next part of the book, you will learn a final important tool that is missing in our toolkit
for advanced geospatial data science. The concepts in the following chapters are extremely important.