R is a free software environment for statistical computing and graphics, and can be downloaded here. RStudio is an integrated development environment for R, which in short makes R easier to use. Rstudio can be downloaded here.
For this session, you will need both R and RStudio.
R is an object orientated programming language. You can assign values to an object with the name of your choosing, for example my_color <- "red"
. When doing so, the object is stored in the computers memory. The object my_color
will then hold the value "red"
, and can be called by typing: my_color
.
In RStudio, one can create a project to store all scripts and notes from one project in one folder. After starting RStudio, go to file
and New project
. In the window, click New directory
, New project
, and type in the name of the project folder, for example Ggtree tutorial
. Choose which directory to create the project in, and click create project
.
Now, RStudio will restart, and the working directory of R will be in the folder you specified above. The advantage of dividing your work into projects is that all your scripts and files can be easily accessed in the project, or subfolders of the project, making it easier to have control of your code. The project folders may also be under version control with git.
Now we are ready to install some packages!
NOTE: In the code boxes below you will see comments starting with a
#
. This is not code, but helpful comments that explain the arguments and functions found in the box.
In R, the fundamental unit of shareable code is a package. There is a myriad of packages available for download, and the ones published on the Comprehensive R Archive Network CRAN are trustworthy and of high quality. Packages can be created by anyone, and can also be hosted on GitHub. Some packages, specialized for biological analyses, are hosted by Bioconductor, and have their own installation method. Note that all installed libraries are available for activation regardless of which project folder you are in.
To install packages from CRAN, use the following command:
install.packages("package_name")
For this session, you will need the following packages
ape
: website, manualcluster
: documentationtibble
: vignettes (vignettes are nearly equivalent to manual pages.)dplyr
: websiteggtree
: vignettes. We have to install this package from GitHub (see below)distanceR
Also need to be installed from GitHub: Haukons' GitHub
To be able to install packages from GitHub, you will need the following package:
devtools
: manual hosted on GitHub
To install packages from github, you need to use the following functions:
# Loading the packages we will need in the R environment with the "library" function
# NB: loading a package makes the functions defined in the package available in R
# This only has to be done once per session
library(devtools)
install_github("GuangchuangYu/ggtree")
install_github("hkaspersen/distanceR")
The data files used in this session can be downloaded below (left click "download as..." on the link). Copy or download directly into the project folder you created earlier.
NB: if you are working on a Linux machine, you can download those files from the command line with: wget <url_file>
Based on the allele data in the cgMLST file, one can calculate the percentage of similar "labels" in a pairwise fashion. The labels in the cgMLST data represent the loci variant, based on the nucleotide sequence (same as regular 7 gene MLST).
In other words, the first sample in the data is compared to the second, and the percentage of similar labels is calculated (excluding the loci if one label is missing). The first sample is then compared to the next sample, until all sample pairs are compared.
The result of this calculation is represented in a symmetrical dissimilarity matrix, which is N x N big. Diagonals of the matrix represent each isolate compared to iself, and is therefore set as 0, as there are no dissimilarities in the comparison. The rest of the comparisons result in values between 1 and 0, where values closer to zero represent a more similar comparison.
This data can then be clustered hierachically (here by the UPGMA method), and a tree object can be created from the resulting clusters.
To do this in R, one has to import and clean the cgMLST data first:
# loading the packages we will need in the R environment
library(tibble)
library(dplyr)
cgMLST_data <- read.table("cgMLST.tsv", # The cgMLST data file name
sep = "\t", # The separator in the file, here tab
colClasses = "factor", # Set all columns in the data to factor
header = TRUE) %>% # Does the file contain column names?
na_if("0") %>% # change all "0" to NA
column_to_rownames("FILE") # Change rownames to the values in the "FILE" column
Here you see the %>%
operator. This operator is a pipe, and sends the output from one function to the next, similar to |
in bash.
Note that a specific data structure is needed to calculate distances correctly. If an allele is missing, it should be represented by NA
, which is why we change all 0
to NA
. Also note that the row names need to have the sample ID's, which can be found in the "FILE" column. A factor is a categorical variable.
To calculate distances from the data and create a tree:
library(cluster)
library(ape)
# calculate distances
distances <- daisy(cgMLST_data, metric = "gower")
# cluster the samples from the dissimilarity matrix
clust_dist <- hclust(distances, method = "average")
# create tree object
tree <- as.phylo(clust_dist)
# to do all three calculations at once you can nest your commands within each other:
tree <- as.phylo(hclust(daisy(cgMLST_data, metric = "gower"), method = "average"))
Check out the function information with ?daisy
and ?hclust
to see details on the metric and method arguments
To then visualize this tree, we can either use the plot()
function from base R, or use ggtree()
function from ggtree package.
NB: base R is a set of packages that are loaded automatically when you start R
What are the advantages of using ggtree?
- using base plot() function allows us to have a fast view at our tree object
- ggtree has advanced anotation functions that allow us to add layers, and follows a specific syntax that allows to do advanced graphics "automatically and easy". This special syntax is called "grammar of graphics" which is the gg from the ggtree or ggplot2 package.
What are layers?:
- Imagine that I made a basic graphic on a white A4 paper. We can add labels, annotations, colors, title with layers: think of a layer as a transparent sheet where you only draw one type of information (ex: leaves labels) and you put this transparent sheet on top of your basic tree. You can now see your tree annotated with the leaves labels.
library(ggtree)
plot(tree)
ggtree(tree)
In base R, the tree will have sample name labels on the tips. In ggtree, nothing was specified, so only the tree was plotted. Ggtree have multiple layouts to choose from, i.e. "circular", "rectangular", "fan", "equal_angle" etc. See the vignette on ggtree above for more information. To add more information to the tree with ggtree, we can use additional layers of information, like this:
ggtree(tree,
layout = "rectangular") +
geom_tiplab()
Now, the isolate names have been added to the tree tip points (leaves labels). However, if you want to combine metadata such as sequence types, animal species etc. to the tree and visualize it, one first has to connect the tree and the metadata:
# Import metadata
metadata <- read.table("tree_metadata.txt", sep = "\t", header = TRUE)
# Plot tree
# The '%<+%' operator links the data to the tree
ggtree(tree,
layout = "rectangular") %<+% metadata +
geom_tiplab(aes(label = ST))
Now, the ST column in the metadata is plotted in the tree instead of the isolate names. Important: For this to work, one has to make sure that the first column in the metadata file has exact matches to the tip labels in the tree object. In other words, if one sample is named "2014-01-2355" in the tree, and the same isolate is named "68-2014-01-2355" in the metadata, the information will not be plotted for that isolate. Therefore, to check the tip labels on the tree, one can type tree$tip.labels
to see what they look like. Note that the order of the samples doesn't matter in the metadata file, just the names of the samples.
Notice that the aes()
function was added to the geom_tiplab()
function. This links the column "ST" in the metadata to the corresponding samples in the tree, plotting the ST where the specific sample is placed in the tree. Anything written outside the aes()
function isn't linked to the metadata in any way. Thus, to adjust the size of the labels, one can type in geom_tiplab(aes(label = ST), size = 3)
.
Now, with the metadata file in hand, one can add more information to the tree, for example colored tip nodes:
# Plot tree
ggtree(tree,
layout = "rectangular") %<+% metadata + # links your metadata to the graph
geom_tiplab(aes(label = ST)) + # add a labels layer (for the leaves), labels are taken from ST column
geom_tippoint(aes(color = species)) # add a points layer, each species represented by its own color
ggtree()
will here plot the animal species in the metadata as colored nodes on the tree. Since no specific palette is specified, it will use default ggplot2 coloring. If you want to specify colors, a specific palette need to be created:
library(ggplot2)
# create palette
palette <- c("Broiler" = "#4575b4",
"Pig" = "#74add1",
"Red fox" = "#f46d43")
# Plot tree and assign to object
my_tree <- ggtree(tree,
layout = "rectangular") %<+% metadata +
geom_tiplab(aes(label = ST)) +
geom_tippoint(aes(color = species)) +
scale_color_manual(values = palette) # here we specify the palette
# To look at the tree, type the name of the object:
my_tree
This will then plot the tree with the specified colors. Further adjustments to the look of the tree can be made by adding arguments like 'size', 'alpha' (transparency), 'offset' (distance of labels from the tree) and others, on each layer. Make sure to put these arguments outside the aes()
function (unless you want it to represent something in your metadata).
Adding a legend to the tree if it does not show-up automatically: look at ?theme
for all options.
Ex: add the following line to commands specifying my_tree
: + theme(legend.position = "bottom")
NB: If you did not save your script, you can run R_tree_course_script to catch up fast up to this point
We have now created an annotated tree connected to our metadata. However, one can also add a heatmap to the outside of the tree, representing for example presence/absence of genes. To do this, we will use the function gheatmap
from ggtree. NOTE: The heatmap data need to have the rownames as the sample names, unlike the metadata file, which had the first column as the sample names. We fix this when importing the table into R below.
NB: a heatmap is a plot where colors are associated to values in your data (ex: white will be high value, red will be low value). If you plot a heatmap of a matrix, you will see low and high values as a gradient of colors.
# Import heatmap data
heatmap_data <- read.table("tree_heatmap_data.txt",
sep = "\t",
header = TRUE,
stringsAsFactors = FALSE) %>%
mutate_at(vars(-id),
funs(as.character)) %>% # change columns to character (-id from dataframe representation, generates a named list)
column_to_rownames("id") # set row names as the values in the column "id" (passes the named list to rownames)
gheatmap(my_tree, # your tree
heatmap_data, # the heatmap data
offset = 0.07, # distance of heatmap from tree tips
width = 0.5, # the width of the heatmap
font.size = 3) # the font size of the column headers
Take a look at the tree, and see that the column headers looks like they are all jumbled up. We can fix this by changing the direction of the column headers and placing them on top of the plot:
gheatmap(my_tree,
heatmap_data,
offset = 0.07,
width = 0.5,
font.size = 3,
colnames_position = "top",
colnames_angle = 90)
Similar to the tree annotation above, one can use a specific color palette of your choosing to represent the data.
NB: type colors()
to see an example of colors you can use
palette2 <- c("0" = "grey95",
"1" = "steelblue")
complete_tree <- gheatmap(my_tree,
heatmap_data,
offset = 0.07,
width = 0.5,
font.size = 3,
colnames_position = "top",
colnames_angle = 90) +
scale_fill_manual(values = palette2)
complete_tree
Now you can save the tree with high resolution with the ggsave
function.
ggsave("complete_tree.tiff", # filename of your choosing
complete_tree, # your tree
device = "tiff", # what kind of file should it be saved as?
dpi = 300, # resolution of the tree
units = "cm", # units to use in the height and width arguments
height = 20, # height of the image
width = 20) # width of the image
The image file can be found in your project folder.
Available file formats you can be accessed with ?Devices
, ex: bmp, jpg, pdf ...
Finally, there is a lot of functionality in ggtree and other functions that isn't discussed here. use ?ggtree
and ?gheatmap
in R to see additional arguments and possibilities for different visualization of trees.
NB: missing graphical device? you can add some libraries that will extend the possibilities of the ggsave
function. Search for format and R, ex: for
vectorized format: package svglite
, package devEMF
for enhanced metafile (emf+/emf) graphics (libreoffice and microsoft office) ....
The distanceR package make it easier to calculate trees, annotate and add heatmaps without the use of all the functions above. The package is available on Håkon's github page. In short, the functions can be used as follows:
library(distanceR)
# calculate tree
tree <- calc_tree("cgMLST.tsv",
metric = "gower",
method = "average")
# note that in the function above, the file in the
# folder is specified, not an object in R. That is
# because the function imports the data in the correct
# format and then runs distance calculation.
# Annotate tree with metadata
annotated_tree <- annotate_tree(tree,
"tree_metadata.txt",
layout = "rectangular",
label_variable = "ST", # The labels on the tips ("ST": column name in the metadata)
color_variable = "species") # The colored nodes
# Similar to above, the file in the folder is specified as the metadata.
# Additional settings can be added, such as point size, font size etc.
# See `?annotate_tree` for details.
# Add heatmap to the tree
add_heatmap(annot_tree,
"testdata/heatmap_data.txt",
layout = "rectangular",
colnames_position = "top",
colnames_angle = 90,
colnames_offset = 0.5,
font_size = 2,
heatmap_offset = 0.1)
Some usefull functions to get further with annotations (for plotting all kind of trees - independent of distance based method to reconstruct tree):
We assume that you want to add more layers to your graphical representation (my_tree object) of the tree that we build above
- adding internal nodes labels (ie can be used for branch support) with
geom_text
andgeom_text2
functions. Ex:
# add label for each internal node (number) - ease for selection afterwards
# with adjustements of label position (horizontal, vertical and size)
my_tree + geom_text2(aes(subset=!isTip, label=node), hjust=1.2, vjust =-1, size = 1)
# an idea for the bootstrap values (if name of variable containing bootstraps is bottstrap)
my_tree + geom_text(aes(label=bootstrap))
It is possible to import trees that were created which clustering/phylogenetic softwares.
The treeio
package, that has been developped from parsing different trees format into R.
And have been designed to import both tree and metadata associated with the tree.
As written in data Integration, Manipulation and Visualisation of phylogenetic trees in chapter: linking ananotation data to tree with tidytree
Treeio supports importation of:
Newick
,Nexus
,New Hampshire eXtended format (NHX)
,jplace
andPhylip
- outputs from programs:
BEAST
,EPA
,HyPhy
,MrBayes
,PAML
,PHYLDOG
,pplacer
,r8s
,RAxML
andRevBayes
.
Example import:
mytree <- read.newick("file_name") # import the tree in R and assign it to an object of class phylo, with name mytree
plot(mytree) # an very simple way to check that your import succeded (but do not expect it to look nice)
NB: Treeio also offers usefull functions such as merging trees from different sources and exporting trees Example export:
write.beast(tree_object_in_R, file ="export_file.nex")
For tidytree , treeio and ggtree -> many more functions are described in data Integration, Manipulation and Visualisation of phylogenetic trees
Some usefull examples of trees manipulation with ggtree
ggplot2 has some additional functions that can be used to annotate trees (note that: ggtree package is based on ggplo2)
ape
has also many more functions to manipulate trees. See: Analysis of Phylogenetics and Evolution with R.
... and there are many more ...
CRAN Task view R packages sorted by themes
There is no overview of all the R-packages hosted on GitHub, so we need some googling
to find them.
A good introduction for learning how to use trees data in R: data Integration, Manipulation and Visualisation of phylogenetic trees. (We used it a lot to make this page.)