forked from idiv-biodiversity/2019_big_data_biogeography
-
Notifications
You must be signed in to change notification settings - Fork 0
/
tue_download_GBIF_data.Rmd
136 lines (99 loc) · 7.22 KB
/
tue_download_GBIF_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
---
title: "Downloading occurrences from GBIF"
output: html_document
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, eval = FALSE,
echo=TRUE, warning=FALSE, message=FALSE,
tidy = TRUE, collapse = TRUE,
results = 'hold')
```
## Background
The public availability of species distribution data has increased substantially in the last 10 years: occurrence information, mostly in the form of geographic coordinate records for species across the tree of life, representing hundreds of years of biological collection effort are now available. The global Biodiversity Information Facility (www.gbif.org) is one of the largest data providers, hosting more than one billion records (Sept 2018) from a large variety of sources.
## Outcomes
After this exercise you will be able to retrieve species occurrence information from GBIF from within R. You will be equipt with example data from your group of interest for the follow upcoming exercises. See https://ropensci.org/tutorials/rgbif_tutorial.html for a more exhaustive tutorial on the rgbif package.
## Exercise
We will use the rgbif package to obtain occurrence records from GBIF. You can find the relevant functions for each task in the parentheses. You can get help on each function by typing `?FUNCTIONNAME`.
1. Familiarize yourself with the `rgbif` package and download the occurrence data for one of the species of your choice. (`name_suggest`, `occ_Search`)
2. Look at the data and a quick plot (`head`, `plot`).
3. Now find out how many occurrences for flowering plants are availble from Brazil. (`name_suggest`, `occ_count`)
4. Download all records for flowering plants from Rio Grande do Norte (more or less). Make sure to keep an eye on the `limit` argument. (`occ_search`)
5. Save the downloaded data as .txt or .csv to the working directory. (`write_csv`, `write_delim`)
## Library setup
In this exercise we will use the rgbif library for communication with GBIF and the tidyverse library for data management.
```{r setup, include=FALSE}
library(rgbif)
library(tidyverse)
library(rgdal)
library(rgeos)
```
# Tutorial
In the following tutorial, we will go through the questions one-by-one. The suggested answers are by no means the only correct ones.
## 1. Download data for a single species we saw in the filedand save them in a data.frame
GBIF hosts a large number of records and downloading all records might take some time (also the download limit using `occ_search` is 250,000), so it is worth checking first how many records are available. We do this using the `return` argument of the `occ_search` function, which will only return meta-data on the record. Chose a species from your project taxon, for demonstration will download records for the Malvaceae family. We'll first download data for a single, wide-spread species, _Ceiba pentandra_:
```{r occ_count}
#Search occurrence records
dat <- occ_search(scientificName = "Wittmackia patentissima",
return = "data", limit = 1000)
```
## 2. Explore the downloaded data
```{r}
nrow(dat) # Check the number of records
head(dat) # Check the data
plot(dat$decimalLatitude ~ dat$decimalLongitude) # Look at the georeferenced records
```
So luckily there are a good number of records available. An as the quick visualization shows, a lot of the have geographic coordinates. See exercise eight for more detailed plotting. In the next exercise we will see how to reduce the amount of information and quality check the data. But let’s first download more relevant data for the project.
## 3. How many records are available for your group of interest?
For your project, we are interested not only in one species, but a larger taxonomic group. You can search for higher rank taxa using GBIF's taxonKey. The taxonKey is a unique identifier for each taxon; we can obtain it from the taxon name via the `name _suggest` function. Since higher taxa might have a lot of records and downloading might take a lot of time, we will first check how many records are available. Here we will look at the entire genus *Ceiba*.
```{r}
#Use the name_suggest function to get the gbif taxon key
tax_key <- name_suggest(q = "Magnoliopsida", rank = "Class")
#Sometimes groups have multiple taxon keys, in this case three, so we will check how many records are available for them
lapply(tax_key$key, "occ_count")
#Here the firsrt one is relevant, check for your group!
tax_key <- tax_key$key[1]
```
```{r}
occ_count(tax_key, country = "DE")
```
There are more than five million records available from Brazil. This is too much for this exercise and also `occ_Search` is limited to 200000 records. Hence we will further limit the geographic extent. To do this you can use the Well-known-text format (WKT) to specify an area. Here we use a very simple rectangle, feel free to experiment. The download may take some minutes.
4. Download all data for your group from Brazil, or if the group is very large from part of that area.
You can now download the records for Brazil. Since we want to work with coordinates, we will only download those records that do have coordinates.
```{r}
dat <- occ_search(taxonKey = tax_key, return = "data",
country = "BR", hasCoordinate = T, limit = 1000)
```
That leaves us with records. If you are satisfied for your group you can go to the next step and save the data to the working directory. The limit for record searching using rgbif is 250,000 records, if your group has more records you may limit the geographic area to the north east of Brazil. To do this you can use the Well-known-text format (WKT) to specify an area. Here we use a very simple rectangle, feel free to experiment
```{r, eval=FALSE}
study_a <- "POLYGON((-35 -4.5, -38.5 -4.5, -38.5 -7, -35 -7, -35 -4.5))"
dat_ne <- occ_search(taxonKey = tax_key, return = "data", hasCoordinate = T,
geometry = study_a, limit = 1000)
```
If you have a .kml or.shp file for which you want to download records you can import this into R using the `readOGR` function of the `rgdal` library and convert it into WKT format using `writeWKT` from the `rgeos` package.
```{r, eval=FALSE}
amz <- readOGR("inst/Amazonia.kml")
#or for shape files:
#amz <- readOGR("inst", layer = "Amazonia")
rgeos::writeWKT(amz)
# Or, best use the extent of the shape, since it is simple:
ex <- raster::extent(amz)
ex <- as(ex, 'SpatialPolygons')
ex <- rgeos::writeWKT(ex)
```
Alternatively, you can download data for a list of taxa.
```{r, eval=FALSE}
gen_list <- c("Ceiba", "Eriotheca")
tax_key <- lapply(gen_list, function(k){name_suggest(q = k, rank = "Genus")})
tax_key <- unlist(lapply(tax_key, "[[", "key"))
unlist(lapply(tax_key, "occ_count"))
dat_ne <- occ_search(taxonKey = tax_key, return = "data", hasCoordinate = T,
limit = 1000, country = "BR")
dat_ne <- lapply(dat_ne, "as.data.frame")
dat_ne <- bind_rows(dat_ne)
```
5. Save the downloaded data as .csv to the working directory.
You can save the results as text file to the working directory. To identify your working directory use `getwd()`.
```{r}
write_csv(dat_ne, path = "inst/gbif_occurrences.csv")
```
If you want to use records from GBIF for publication, please make sure you cite them properly, using a DOI, you can get a DOI by using `occ_download`.