README.qmd

---
title: "README.md"
author: Sam A. Welch
date: 2024.05.09
format: 
  gfm: 
    toc: true
    code-fold: true
    code-summary: "Show code"
editor_options: 
  chunk_output_type: inline
---

This README.md file is a rendering of the .qmd document of the same name, written as part of the coding task for a postdoctoral position at NIVA. All code is my own work. The (indirect) assistance of the authors of the below packages in completing this project is gratefully acknowledged. This project was created using R version 4.4.0, and may not run properly under other versions.

## Setup

```{r packages, warning=FALSE, message=FALSE}
library(tidyverse)  # tidyverse packages for data cleaning, graphs, etc. 
library(webchem)    # access to chemical database APIs
library(readxl)     # read Excel files
library(sf)         # GIS functions
library(giscoR)     # Eurostat maps
library(cowplot)    # plots in grids
library(ggthemes)   # colour palettes

options(knitr.kable.NA = '-')
knitr::opts_chunk$set(dev = "ragg_png")
```

### Data

Data provided as part of the assignment is loaded and processed below. In addition, a .csv of abbreviated stressor names (author's own work, and not based on any standardised practice) created for this assignment is loaded in.

Code chunks and tables have been placed in collapsible elements to improve document readability. Click the "▶ Show code" (or similar) button to show the section.

```{r data, echo=FALSE, warning=FALSE}
exposure <- read_excel(path = "data/Exposure_data_AEP.xlsx", sheet = 2)

sites <- read_excel(path = "data/Exposure_data_AEP.xlsx", sheet = 3) |>
  # No samples included in the dataset were taken from Hotranelva.
  filter(SITE_NAME != "Hotranelva") |>
  # Order sites left to right
  arrange(LONGITUDE) |>
  mutate(SITE_NAME = fct_inorder(SITE_NAME))

all_data <- left_join(exposure, sites, by = "SITE_CODE")

mcpa_exposure <- all_data %>% filter(STRESSOR_ID == 21)

all_stressors <- all_data %>%
  select(STRESSOR_ID, STRESSOR_NAME, CAS, INCHIKEY) %>%
  unique() |>
  # Abbreviate chemical names for neater graphs
  left_join(y = read_csv("data/stressor_acronyms.csv", show_col_types = FALSE), by = "STRESSOR_NAME") |>
  select(STRESSOR_NAME, STRESSOR_ACRONYM)

# Verify/validate data
# Is the same unit used throughout?
if (length(all_data$MEASURED_UNIT %>% unique()) != 1) {
  print("Multiple units.")
}

# Set ggplot theme
theme_set(new = theme_few())
```

Next, maps of Europe from the `giscoR` package are loaded and cropped to the relevant area, and the provided site data is converted to an `sf` object for compatibility with `sf`'s GIS functions.

```{r gis_setup, warning=FALSE}
# Get a map of Europe
gisco_Europe <- gisco_get_countries(epsg = 4326, region = "Europe") 

sites_sf <- st_as_sf(sites, coords = c("LONGITUDE", "LATITUDE"), crs = 4326) |> 
  select(-COORDINATE_SYSTEM, -SITE_CODE)
```

## Task A: Import and Visualise Spatiotemporal Exposure Data

Measured concentrations for 40 stressors across 5 sites were imported. Concentrations of 2-methyl-4-chlorophenoxyacetic acid (MCPA), a phenoyx herbicide, were measured over a 5-month period in 2019, 1-8 times, site depending. Data were plotted as a scatter plot/line graph to show spatial (site) and temporal (sampling date) variation. 

```{r task_a, fig.width = 10, warning=FALSE, message=FALSE}

# Sampling Map
plot_a0 <- ggplot(gisco_Europe) +
  geom_sf() +
  geom_sf(data = sites_sf) +
  geom_sf_text(aes(label = NAME_ENGL), colour = "darkgrey") +
      geom_sf(data = sites_sf, aes(size = 6, colour = SITE_NAME)) +
  geom_sf_text(data = sites_sf, aes(label = c("T", "V", "H", "S", "M"))) +
  theme(legend.position = "none", axis.title = element_blank(), axis.ticks = element_blank(), axis.text = element_blank()) +
      scale_color_few() +
  coord_sf(xlim = c(5, 13), ylim = c(57, 61))

plot_a1 <- mcpa_exposure %>%
  ggplot(mapping = aes(x = SAMPLE_DATE, y = MEASURED_VALUE, colour = SITE_NAME)) +
  geom_line(size = 1.2) +
  geom_point(size = 7) +
  geom_text(aes(label = SITE_NAME |> substr(start = 1, stop = 1)), colour = "black") +
  scale_y_log10() +
  scale_color_few() +
  labs(
    x = "Sampling Date (2019)",
    y = "MCPA Concentration (μg/L)",
    colour = "Sampling Site",
    shape = "Sampling Site"
  ) +
  theme(legend.position = "none") +
   annotation_logticks(sides = "l")

task_a_plot <- ggdraw(plot = plot_a1) +
  draw_plot(plot_a0, x = 0.77, y = 0.6, width = 0.2, height = 0.3, scale = 1.5)
task_a_plot

```

Figure 1: Line graph/scatter plot of measured concentrations of 2-methyl-4-chlorophenoxyacetic acid (MCPA) in 5 freshwater sampling sites in Norway, between May 5th and October 14th, 2019. Each point represents a single sample. Inset map shows locations of sampling sites (T)imebekken, (V)asshaglona, (H)eiabekken, (S)kuterudbekken, and (M)ørdrebekken.

## Task B: Predict tissue concentrations of chemicals in fish

Summary statistics (mean, standard deviation, maximum, minimum, and 5th/95th percentiles) were calculated for each combination of site and stressor across the sampling period.

```{r task_b, warning=FALSE, message=FALSE}
all_data_summary <- 
all_data %>% 
  group_by(SITE_NAME, STRESSOR_NAME) %>% 
  summarise(n_MEASURED_VALUE = n(),
            Mean_MEASURED_VALUE = mean(MEASURED_VALUE),
            SD_MEASURED_VALUE = sd(MEASURED_VALUE),
            Max_MEASURED_VALUE = max(MEASURED_VALUE),
            Min_MEASURED_VALUE = min(MEASURED_VALUE),
            perc_5_MEASURED_VALUE = quantile(MEASURED_VALUE, probs = 0.05),
            perc_95_MEASURED_VALUE = quantile(MEASURED_VALUE, probs = 0.95)) |> 
  transmute(SITE_NAME,
            STRESSOR_NAME,
            n_MEASURED_VALUE,
            Mean_SD = case_when(is.na(SD_MEASURED_VALUE) ~ 
                                  paste0(Mean_MEASURED_VALUE |> round(2), "*"),
                              TRUE ~ 
                                  paste0(Mean_MEASURED_VALUE |> round(2), " ± ", SD_MEASURED_VALUE |> round(2))
                             ),
            Max_MEASURED_VALUE,
            Min_MEASURED_VALUE,
            perc_5_MEASURED_VALUE,
            perc_95_MEASURED_VALUE
         )

# Make a table using knitr
task_b_table <- knitr::kable(all_data_summary, digits = 2,
             col.names = c("Site", "Stressor", "n", "Mean ± SD", "Min.", "Max.", "5th Percentile", "95th Percentile"))

```

```{=html}
<details>
<summary>Show Table 1: Summary statistics of measured stressor concentrations in water</summary>
```
```{r task_b_table, echo=FALSE}
task_b_table
```

</details>

Table 1: Table of mean, standard deviation, maximum, minimum and percentile values of measured concentrations of 48 chemical stressors across 6 freshwater sampling sites in Norway, May 6th to October 28th 2019. *: <i>n</i> too small to calculate standard deviation. All values in μg/L, rounded to 2 d.p.


Tissue concentrations of stressors in fish were calculated as follows. First, the R package `webchem` was used to assign PubChem IDs to the stressors based on InChIKeys. These IDs were subsequently used to look up LogKOW (a.k.a XLogP) values on PubChem, which were available as predicted values for 39 substances, but not for spinosad. These values were then used to predict tissue concentrations following the equation:

$$C_f = C_w \times 10^{0.76 \times logP - 0.23}$$

Tissue concentration data is subsequently saved to `/data/Cf_summary_table.csv` and displayed below.

```{r concentration_fish, warning=FALSE, message=FALSE}
# Try and import the (already downloaded and saved) chem properties, try importing if it doesn't work
try_import_webchem <- try(webchem_chemicals <- read_csv(file = "data/webchem_chemical_data.csv"))
if (inherits(x = try_import_webchem, what = "try-error")) {
  print("Chemical data not found, importing from Pubchem via Webchem.")
  # Get the relevant CIDs from InChiKeys, then look up LogKOW/XLogP on Pubchem
  webchem_chemicals <- all_stressors |> 
  mutate(CID = get_cid(INCHIKEY, from = "inchikey", match = "first")$cid,
         XLogP = pc_prop(CID, properties = "XLogP")$XLogP)
  # Save to data to avoid unecessary API calls
  write_csv(x = webchem_chemicals, file = "data/webchem_chemical_data.csv")
}

# Predict tissue concentration in fish for all chemicals and sites
Cf_all_stressors <- all_data |> 
  left_join(webchem_chemicals |> select(-STRESSOR_ID, -INCHIKEY, -CAS), by = "STRESSOR_NAME") |> 
  mutate(FISH_CONC_uGKG = MEASURED_VALUE * 10 ^ (0.76 * XLogP - 0.23))

# Make a table summarising mean concentrations in fish by site and stressor
Cf_summary <- Cf_all_stressors |> 
    group_by(SITE_NAME, STRESSOR_NAME) %>% 
  summarise(Mean_MEASURED_VALUE = mean(MEASURED_VALUE),
            SD_MEASURED_VALUE = sd(MEASURED_VALUE)) |> 
  transmute(SITE_NAME,
            STRESSOR_NAME,
            Mean_SD = case_when(is.na(SD_MEASURED_VALUE) ~ 
                                  paste0(Mean_MEASURED_VALUE |> round(2), "*"),
                              TRUE ~ 
                                  paste0(Mean_MEASURED_VALUE |> round(2), " ± ", SD_MEASURED_VALUE |> round(2))
                             )
         ) |> 
  pivot_wider(values_from = Mean_SD, names_from = SITE_NAME)

# Make a pretty table using knitr
Cf_summary_table <- knitr::kable(Cf_summary, digits = 2,
             col.names = c("Stressor", "Heiabekken", "Mørdrebekken", "Skuterudbekken", "Timebekken", "Vasshaglona"))

write_csv(x = Cf_summary, file = "data/Cf_summary_table.csv")
```

```{=html}
<details>
<summary>Show Table 1: Summary statistics of predicted stressor concentrations in fish tissue</summary>
```
```{r Cf_summary_table, echo=FALSE}
Cf_summary_table
```

</details>

Table 2: Table of mean and standard deviation of predicted concentrations in fish tissues of 39 chemical stressors across 6 freshwater sampling sites in Norway, May 6th to October 28th 2019. *: n too small to calculate standard deviation. All values in μg/kg, calculated from measured water concentrations and predicted LogKOW, following the equation $C_f = C_w \times 10^{0.76 \times logP - 0.23}$.

## Task C: Visualize the values of Cw and Cf on a map

A purely map-based approach to this task was considered. However, given the number of stressors assessed, and the inability (with available data) of determining which concentrations were most environmentally relevant, it was decided that box plots with an accompanying map of sampling sites was a more appropriate approach.

```{r task_c, warning=FALSE, message=FALSE, fig.width = 10, fig.height=12}
#| fig-cap: 
#|   - "Speed and Stopping Distances of Cars"

task_c_data <- Cf_all_stressors |>
  left_join(sites_sf, by = "SITE_NAME") |>
  left_join(all_stressors, by = "STRESSOR_NAME") |>
  pivot_longer(cols = c(MEASURED_VALUE, FISH_CONC_uGKG), names_to = "Media", values_to = "Stressor_ug") |>
  mutate(Media = case_when(
    Media == "MEASURED_VALUE" ~ "Water (μg/L)",
    TRUE ~ "Fish (μg/kg)"
  ) |> factor(levels = c("Water (μg/L)", "Fish (μg/kg)"))
  )

task_c_map <-
  ggplot(data = gisco_Europe) +
  geom_sf() +
  geom_sf(data = sites_sf) +
  geom_sf_text(aes(label = NAME_ENGL), colour = "darkgrey") +
  geom_sf(data = sites_sf, aes(size = 7, colour = SITE_NAME)) +
  geom_sf_text(data = sites_sf, aes(label = c("T", "V", "H", "S", "M"))) +
  theme(legend.position = "none", axis.title = element_blank(), axis.ticks = element_blank(), axis.text = element_blank()) +
  coord_sf(xlim = c(5, 12), ylim = c(58, 64)) +
      scale_color_few()

task_c_worldmap <- ggplot(data = gisco_countries) +
  geom_sf() +
  geom_rect(aes(xmin = 5, xmax = 12, ymin = 58, ymax = 64), color = "red", fill = NA) +
    coord_sf(expand = FALSE) +
    theme(axis.text = element_blank(), axis.ticks = element_blank())

task_c_boxplots <- ggplot(data = task_c_data, mapping = aes(x = Stressor_ug, y = STRESSOR_ACRONYM, fill = SITE_NAME)) +
      geom_boxplot() +
      scale_x_log10() +
  scale_fill_few() +
      labs(
        x = "Measured Concentration",
        y = "Stressor (Abbreviated)") +
      annotation_logticks(sides = "b") +
      theme(legend.position = "none") +
  facet_grid(SITE_NAME ~ Media, scales = "free", space = "free", drop = TRUE, switch = "y") +
  theme(plot.margin = unit(c(0, 5, 0, 0), "cm"))

task_c_plot <- ggdraw(plot = task_c_boxplots) +
  draw_plot(task_c_map, x = 0.82, y = 0.715, width = 0.17, height = 0.3, scale = 1) +
    draw_plot(task_c_worldmap, x = 0.82, y = 0.55, width = 0.17, height = 0.3, scale = 1)

task_c_plot

```

Figure 2: Box plot of concentrations, in water (measured, μg/L) and fish tissue (predicted from water and LogKow, μg/kg), for 40 chemical stressors sampled April - October 2019 across 5 freshwater sites in southern Norway. Inset map shows locations of (T)imebekken, (V)asshaglona, (H)eiabekken, (S)kuterudbekken, and (M)ørdrebekken, as well as rough area of the world. Acronyms: p,p'-DDE: 1,1'-(2,2-Dichloro-1,1-ethenediyl)bis(4-chlorobenzene), BAM: 2,6-dichlorobenzamide (BAM), MCPA: 2-methyl-4-chlorophenoxyacetic acid (MCPA), ACNF: aclonifen, AXSB: azoxystrobin, BTZN: bentazone, BXFN: bixafen, BSCL: boscalid, CBDZ: carbendazim, CFVP: chlorfenvinphos, CLMZ: clomazone, CPRL: clopyralid, CZFM: cyazofamid, CPDN: cyprodinil, DCHP: dichlorprop, DFNC: diflufenican, FNMD: fenamidone, FPPM: fenpropimorph, FLSL: florasulam, FDXN: fludioxonil, FLXP: fluroxypyr, IMDC: imidacloprid, MDPM: mandipropamid, MCPP: mecoprop, MTLX: metalaxyl, MTMT: metamitron, METZ: metribuzin, PCCR: pencycuron, PNXD: pinoxaden, PPMC: propamocarb, PCAZ: propiconazole, PPCZ: propoxycarbazone, PCZ-D: prothioconazole-desthio, PCSB: pyraclostrobin, PRDF: pyridafol, PRSL: pyroxsulam, SPNS: spinosad, TBDZ: thiabendazole, THCP: thiacloprid, TOSB: trifloxystrobin.