---
title: "Target Markdown"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

Target Markdown is a powerful R Markdown interface for reproducible
analysis pipelines, and the chapter at
<https://books.ropensci.org/targets/markdown.html> walks through it in
detail.

# Packages

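The project requires several R packages. Most are installed from CRAN,
ukbAid is installed from GitHub using the remotes package, and
`renv::restore()` installs the package versions recorded in the
project's lockfile.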

```{r install-dependencies, eval=FALSE}
# These only need to be run outside of the UKBRAP
# install.packages("remotes")
# install.packages("renv")
# install.packages("targets")
# remotes::install_github("steno-aarhus/ukbAid")

# Everything should already be set up in the UKBRAP if the ukbAid
# instructions were followed.
renv::restore()
```

# Setup

Near the top of the document, you may also wish to remove the
`_targets_r` directory previously written by non-interactive runs of the
report. Otherwise, your pipeline may contain superfluous targets.

```{r}
library(targets)
tar_unscript()

# Use this function to delete the entire targets cache. This is a hard reset.
# tar_destroy()
```

# Globals

We first define some global options/functions common to all targets.

```{targets target-globals, tar_globals = TRUE}
set.seed(65)
options(tidyverse.quiet = TRUE)
library(tidyverse)
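# Get the packages that the pipeline depends on (the Imports listed in
# DESCRIPTION) and set them as the packages loaded for every target.
pkgs_to_load <- desc::desc_get_deps() %>%
    filter(type == "Imports") %>%
    pull(package)
targets::tar_option_set(packages = pkgs_to_load)
# Source all the project's functions from the R/ folder.
here::here("R") %>%
    fs::dir_ls(glob = "*.R") %>%
    walk(source)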
```
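
As a minimal sketch of how the package list above is derived, the example
below writes a hypothetical `DESCRIPTION` file to a temporary directory
and keeps only its `Imports` entries. The file contents and the package
names in it are assumptions made for illustration; the pipeline itself
reads the project's own `DESCRIPTION`.

```r
library(dplyr)

# Hypothetical DESCRIPTION file, written to a temporary directory.
description_path <- file.path(tempdir(), "DESCRIPTION")
writeLines(
    c(
        "Package: examplepkg",
        "Version: 0.0.1",
        "Imports:",
        "    dplyr,",
        "    targets",
        "Suggests:",
        "    testthat"
    ),
    description_path
)

# desc_get_deps() returns one row per dependency, with its type
# (e.g. "Imports" or "Suggests"), package name, and version requirement.
pkgs_to_load <- desc::desc_get_deps(file = description_path) %>%
    filter(type == "Imports") %>%
    pull(package)

pkgs_to_load
```

Packages listed under `Suggests` are dropped, so only the packages the
pipeline actually needs at run time are passed to `tar_option_set()`.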

# Targets

## Project setup

```{targets download_project_data}
tar_target(
    name = download_project_data,
    command = ukbAid::download_project_data(),
    format = "file"
)
```

## Pre-processing outside of targets

Initial processing of the UK Biobank data to get it into the state
needed for this project. The end goal is to save the dataset at the
storage location.

```{r, eval=FALSE}

# Then run the wrangling function:
# wrangled_data <- ukb_wrangle_and_save(ukb_data_raw)

# But really, this function is used to save the dataset to its location:
# ukb_wrangle_and_save(ukb_data_raw, .save = TRUE)
```

## Processing within targets

Next we need to process the dataset for exact use in this project and
load it into the environment. We don't want to save this dataset, since
it is already in its remote location.

```{targets data_file_path}
tar_target(
    name = data_file_path,
    command = check_if_data_exists(),
    format = "file"
)
```
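
With `format = "file"`, targets tracks the file on disk rather than an R
object, so the command has to return the path to that file. The helper
`check_if_data_exists()` is defined in `R/` and is not shown here; the
sketch below uses a hypothetical function, `return_path_if_exists()`, to
illustrate the general shape such a helper could take.

```r
# Hypothetical helper: stop early if the file is missing, otherwise
# return its path so that targets can track changes to the file.
return_path_if_exists <- function(path) {
    if (!file.exists(path)) {
        stop("Can't find the data file at: ", path, call. = FALSE)
    }
    path
}

# Usage with a temporary file standing in for the real dataset.
example_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3), example_path, row.names = FALSE)
return_path_if_exists(example_path)
```

Failing loudly when the file is missing keeps downstream targets from
running on data that was never downloaded.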

Let's import the data from the storage location, without applying the
exclusion criteria yet.

```{targets project-data-pre-exclusions}
tar_target(
    name = project_data_pre_exclusions, 
    command = ukb_import_project_data(data_file_path)
)
```

Now we can remove participants based on the exclusion criteria and other
removals that we found during exploration.

```{targets project-data}
tar_target(
    name = project_data,
    command = ukb_remove_exclusions(project_data_pre_exclusions)
)
```

## Basic descriptive characteristics

Since we want to create a CONSORT diagram of who we removed from the
original UK Biobank dataset, we send the consort data into the plot
function.

TODO: Exclude also those with T2DM and HbA1c above 48?

```{targets consort-diagram}
tar_target(
    name = consort_diagram,
    command = project_data_pre_exclusions %>%
        ukb_remove_exclusions(for_consort_diagram = TRUE) %>%
        plot_consort_diagram(save_plot = TRUE),
    format = "file"
)
```

We'll run some descriptive statistics on the full data as well as by
sex. This will be saved into the `data/` folder.

```{targets descriptive-statistics}
tar_target(
    name = descriptive_statistics,
    command = analysis_descriptive_statistics(project_data),
    format = "file"
)
```
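
The function `analysis_descriptive_statistics()` lives in `R/` and is
not shown here. As a rough sketch of the pattern described above
(summarise overall and by sex, save the result, return the file path),
the example below uses a hypothetical function and made-up data; the
variable `height` and the CSV output format are assumptions.

```r
library(dplyr)

# Hypothetical function: summarise a variable overall and by sex, save
# the result as a CSV file, and return the path for `format = "file"`.
save_descriptive_statistics_sketch <- function(data, path) {
    summarise_height <- function(x) {
        summarise(x, n = n(), mean_height = mean(height), sd_height = sd(height))
    }
    overall <- data %>%
        summarise_height() %>%
        mutate(sex = "All")
    by_sex <- data %>%
        group_by(sex) %>%
        summarise_height()
    bind_rows(overall, by_sex) %>%
        relocate(sex) %>%
        write.csv(path, row.names = FALSE)
    path
}

# Made-up data, not from the UK Biobank.
toy_data <- tibble(
    sex = c("Female", "Female", "Male", "Male"),
    height = c(160, 170, 175, 185)
)
stats_path <- save_descriptive_statistics_sketch(
    toy_data,
    tempfile(fileext = ".csv")
)
read.csv(stats_path)
```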

TODO: Plots of exposures and outcomes, total and by gender

```{r}
plot_data_prep <- project_data %>% 
    select(contains("height"), contains("leg"), sex, mtb_glycated_haemoglobin_hba1c) %>% 
    mutate(leg_height_ratio = leg_height_ratio * 100) %>% 
    pivot_longer(-sex) %>% 
    filter(name == "leg_height_ratio")

# Maybe use https://www.rdocumentation.org/packages/ggcorrplot/versions/0.1.1/topics/ggcorrplot

ggplot(plot_data_prep, aes(x = value, y = after_stat(count))) +
    geom_histogram(data = select(plot_data_prep, -sex), aes(x = value, fill = "all"), bins = 20) +
    geom_histogram(data = plot_data_prep, aes(x = value, fill = sex), bins = 20) +
  # ggridges::geom_density_line(
  #   data = select(plot_data_prep, -sex), aes(fill = "all"),
  #   color = "transparent", adjust = 1/2
  # ) + 
  # ggridges::geom_density_line(aes(fill = sex), bw = 2, adjust = 1/2, color = "transparent") +
  # scale_x_continuous(limits = c(0, 75), name = "passenger age (years)", expand = c(0, 0)) +
  # scale_y_continuous(limits = c(0, 26), name = "scaled density", expand = c(0, 0)) +
  scale_fill_manual(
    values = c("#b3b3b3a0", "#D55E00", "#0072B2"),
    breaks = c("all", "Male", "Female"),
    labels = c("all", "Male  ", "Female"),
    name = NULL,
    guide = guide_legend(direction = "horizontal")
  ) +
  coord_cartesian(clip = "off") +
  facet_grid(cols = vars(sex)) +
    theme_minimal() +
  theme(
    axis.line.x = element_blank(),
    strip.text = element_text(size = 14, margin = margin(0, 0, 0.2, 0, "cm")),
    legend.position = "bottom",
    legend.justification = "right",
    legend.margin = margin(4.5, 0, 1.5, 0, "pt"),
    legend.spacing.x = grid::unit(4.5, "pt"),
    legend.spacing.y = grid::unit(0, "pt"),
    legend.box.spacing = grid::unit(0, "cm")
  )

             # labeller = labeller(sex = function(sex) paste(sex, "passengers")))

```

TODO: Plot of all metabolic variables (distributions?) (Use function
from above)

```{r}
project_data %>% 
    select(starts_with("mtb"), -mtb_glycated_haemoglobin_hba1c)
```

TODO: Correlation heatmap of metabolic variables

```{r}

# Compute pairwise correlations; returns a list of data frames (r, n, P).
cors <- function(data) {
    correlation_matrix <- Hmisc::rcorr(as.matrix(data))
    map(correlation_matrix, ~ data.frame(.x))
}


formatted_cors <- function(df) {
    cors(df) %>%
        map( ~ rownames_to_column(.x, var = "measure1")) %>%
        map( ~ pivot_longer(.x, -measure1, names_to = "measure2")) %>%
        bind_rows(.id = "id") %>%
        pivot_wider(names_from = id, values_from = value) %>%
        mutate(
            sig_p = P < .05,
            p_if_sig = ifelse(P < .05, P, NA),
            r_if_sig = ifelse(P < .05, r, NA)
        )
}

formatted_cors(mtcars) %>%
    ggplot(aes(
        measure1,
        measure2,
        fill = r,
        label = round(r_if_sig, 2)
    )) +
    geom_tile() +
    labs(
        x = NULL,
        y = NULL,
        fill = "Pearson's\nCorrelation",
        title = "Correlations in Mtcars",
        subtitle = "Only significant Pearson's correlation coefficients shown"
    ) +
    scale_fill_gradient2(
        mid = "#FBFEF9",
        low = "#0C6291",
        high = "#A63446",
        limits = c(-1, 1)
    ) +
    geom_text() +
    theme_classic() +
    scale_x_discrete(expand = c(0, 0)) +
    scale_y_discrete(expand = c(0, 0)) +
    theme(text = element_text(family = "Roboto"))

corr_results <- project_data %>% 
    select(starts_with("mtb"), -mtb_glycated_haemoglobin_hba1c) %>% 
    correlation::correlation(redundant = TRUE, p_adjust = "fdr")

# corr_results %>% 
#     correlation::cor_lower() %>% 
    
```

## Processing for NetCoupler

We need to process a few things for NetCoupler to work effectively. Some
variables are also largely missing, so we remove those columns. Then we
do an initial split to have a training set and an (eventual) testing
set.

```{targets project-data-for-nc}
tar_target(
    name = project_data_for_nc, 
    command = ukb_wrangle_for_nc(project_data)
)
```

While we do an initial split of the dataset, we also want to create some
cross-validation splits to 1) make it easier for NetCoupler to run on
the full dataset and 2) aggregate the results across the splits,
hopefully giving a better estimate of the network and the model results.

```{targets project-data-for-nc-cv}
tar_target(
    name = project_data_for_nc_cv, 
    command = project_data_for_nc %>% 
        rsample::training() %>% 
        create_cv_splits()
)
```
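
The functions `ukb_wrangle_for_nc()` and `create_cv_splits()` are
defined in `R/` and are not shown here. The sketch below illustrates the
general idea with the rsample package on made-up data. It assumes an
80/20 initial split and 5-fold cross-validation via `vfold_cv()`, which
may differ from what the project functions actually do.

```r
library(rsample)

set.seed(65)
# Made-up data standing in for the wrangled project data.
toy_data <- data.frame(id = 1:100, value = rnorm(100))

# Initial split: most rows for training, the rest held out for testing.
initial <- initial_split(toy_data, prop = 0.8)
train <- training(initial)
test <- testing(initial)

# Cross-validation folds created from the training set only.
cv_splits <- vfold_cv(train, v = 5)

# Each element of `splits` holds one fold: analysis() extracts the rows
# used for fitting and assessment() the rows held out in that fold.
first_fold_fit <- analysis(cv_splits$splits[[1]])
first_fold_check <- assessment(cv_splits$splits[[1]])
```

Creating the folds from the training set alone keeps the testing set
untouched until the final evaluation.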

```{r playground}
library(tidyverse)

testing_sampling <- project_data_for_nc %>% 
    rsample::training() %>% 
    create_cv_splits()
testing_network <- testing_sampling %>% 
    generate_network_results()
str(testing_network)

```

Next is to set up the

TODO: Make DAG of underlying causal relationships (in relation to
confounders)

-   Update confounders based on Christina's paper
    -   Smoking? Physical activity? Ethnic background, education,
        income, anthropometrics?
    -   Make DAG.

## NetCoupler analysis

```{targets network-results}
tar_target(
    name = network_results,
    # Generate network results as rda file in data/
    command = generate_network_results(project_data_for_nc_cv$splits),
    format = "file"
)
```

TODO:

-   Network results by sex/gender
-   Exposure and outcome link results by sex/gender
-   (In manuscript Rmd instead of targets?) Plot of network results by
    sex/gender as well as total.
-   (In manuscript Rmd instead of targets?) Plot of linkage results by
    sex/gender as well as total.
-   List of variables and their metadata (from UK Biobank)

And eventually make the manuscript:

```{r}
# tar_render(report, "report.Rmd")
```

# Pipeline

If you ran all the `{targets}` chunks in non-interactive mode, then your
R scripts are set up to run the pipeline.

```{r}
tar_make()
# We don't need or want the project data in the targets cache.
tar_delete(starts_with("project_data"))
```

# Output

You can retrieve results from the `_targets/` data store using
`tar_read()` or `tar_load()`.

```{r, message = FALSE}
# Targets with `format = "file"` return the path to the saved file.
tar_read(descriptive_statistics)
```

```{r}
tar_read(consort_diagram)
```
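
To illustrate the difference between `tar_read()` and `tar_load()`, here
is a minimal sketch that runs a toy pipeline in a temporary directory.
The target `toy_value` and the function `run_toy_pipeline()` are
hypothetical and exist only for this example.

```r
library(targets)

# Run a toy pipeline in a temporary directory so that the project's own
# `_targets/` data store is left untouched.
run_toy_pipeline <- function() {
    toy_dir <- tempfile()
    dir.create(toy_dir)
    old_dir <- setwd(toy_dir)
    on.exit(setwd(old_dir))

    # `toy_value` is a hypothetical target used only for this example.
    tar_script(
        {
            library(targets)
            list(tar_target(toy_value, 1 + 1))
        },
        ask = FALSE
    )
    tar_make()

    # tar_read() returns the value of the target ...
    read_value <- tar_read(toy_value)
    # ... while tar_load() assigns it to an object named after the target.
    tar_load(toy_value)
    list(read = read_value, loaded = toy_value)
}

toy_results <- run_toy_pipeline()
```

`tar_read()` is convenient for printing a single result inline, while
`tar_load()` is handy when several targets are needed as objects in the
current environment.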

The `targets` dependency graph helps your readers understand the steps
of your pipeline at a high level.

```{r}
tar_visnetwork()
```

At this point, you can go back and run `{targets}` chunks in interactive
mode without interfering with the code or data of the non-interactive
pipeline.