In today's session, we will work on finding specific genes in our genomes. In this case, we have a set of genes that we're interested in, and we want to see if these are present in our genomes. If so, we might be interested in whether there are mutations in them. Example situations are to find the genes that are used for Multi Locus Sequence Typing (MLST), antibiotic resistance and virulence.
There are two types of tools that to this kind of thing:
- assemble genome, search for reference sequences in genomes
- align reads to reference sequences, and then extract the genes and variants from the results
As is evident, both approaches require a set of reference sequences that we compare to. This also means that these tools can not find sequences that are not included in the set of sequences that we compare to. This means that having updated sequences is important.
We will in this session use the program that is called ARIBA to do both MLST and AMR detection. Virulence finding is also possible with this program, but we will leave that as an exercise to the participants due to time constraits.
ARIBA uses the second of the two approaches above. ARIBA is far from the only program that does these things. However, we have landed on this program for two reasons:
- It gives very detailed results
- It enables us to use several databases, and to download the at any time current version of the database requested.
- Student presentation
- ARIBA paper
- ARIBA manual
- PUBMLST - MLST database
- CARD AMR database
- ResFinder AMR database
- Phandango site
ARIBA uses three different programs to do its work, these are cd-hit
,
mummer
and bowtie2
. To run ARIBA, we to activate them by using module load
.
ARIBA is a python
program. To ensure that it runs correctly on abel, we have
had to package it in a specific python
environment to avoid interactions with
other things on abel.
- Log in on abel, and go to your project home area and in to your course area.
- Make a directory called
specific_genes
next to the annotation and assembly directories, and go into it. - Start the
screen
program, and ask for a qlogin session for 3 hours and 4 cpus. - Module load these modules in this specific order:
cd-hit
,mummer
,bowtie2
. - Type in
source activate ariba211
. Note, this step might take a minute or two! - Explore the options to ariba
- ariba -h
- ariba getref -h
- ariba pubmlstget -h
- ariba run -h
- ariba summary -h
We will use two of the homework genomes for this session. We will work two and two, where each person in the pair will take one genome.
- Person A: pick F27
- Person B: pick S19
We are first going to do sequence typing for each of our genomes.
We will first explore the website and the MLST schemes a bit.
-
Go to the PUBMLST website mentioned above.
-
Click on
Download MLST definitions
. -
Scroll down, and find the two Escherichia coli schemes.
-
Question: Can you figure out how many genes are shared between the two schemes?
-
Click on
profiles
under one of the schemes. This is what a MLST scheme looks like. -
Now go back to the previous page, and click on one of the genes in the #1 scheme. You will now get a long list of fasta sequences on your screen. As you can see, each of them have a number behind them. These numbers correspond to the numbering in the MLST scheme.
-
OPTIONAL: download this file by right-clicking on the window and selecting
Save as...
. -
From within the VM, go to the menu, and click
Applications - Bioinformatics - Jalview
. -
Use the
File
menu and load in the file you just downloaded viaInput alignment - From file
. Remember, yousaved it as text, so you have to ask it to showAll files
. -
Go to the
Color
menu and selectPercentage identity
. You will now see the differences between the sequences. -
Close
Jalview
.
- Go read the ARIBA manual and also have a look at the presentation made by your
fellow students. Try to figure out the order the following commands have
to be run in, and what you have to fill in. Note. Our reads are E. colis,
and you should use the first schema.
- ariba pubmlstget "YOU NEED TO WRITE SOMETHING HERE" mlstdb
- ariba pubmlstspecies
- ariba run --threads 4 mlstdb/ref_db READ1.FQ READ2.FQ OUTPUTDIRNAME
- Figure out where the reads for your isolate is.
- Run the commands in the order they should be in.
- When you are done, you have a new directory with results in it.
The main files are
report.tsv
,mlst_report.tsv
andmlst_report.details.tsv
. Open each of them withless
and have a look at them. Note, you can find explanations of the files on the ARIBA wiki website. - Question: are your isolates from the same or different sequence types?
- Note, the results found in
mlst_report.tsv
can, when aggregated, be used as input for tools such asPhyloviz
, which is a java vizualisation program that we won't go into here.
It is important to know that different databases contain different things. Considering how specific gene finding works, we can only find things if those things are in the database that we're using for our finding. Thus, it is imperative for you to know that the things you are interested in are in the database that you are using.
ARIBA allows us to use several databases for AMR finding. Go have a look at the ARIBA wiki page to see which databases are available.
- Go to the ResFinder webpage
- Scroll down, and click on
The ResFinder database download site
. - Find the ResFinder database in the list, and click on
notes.txt
. This will download a text file to your computer. Click on it. - This file should now open in a text editor window. Have a look at it. If you choose to search with the ResFinder database, these are the genes that you will be searching for.
- Have a look at the Quinolone resistance section. Can you find the gyrA gene there?
- Go to the CARD database webpage
- Click on
Browse
. Then, look underAntibiotic Resistance Ontology
, click onARO Index
. - Set the
Show
to show 50 entries, and writefluoroquinolone
in the search box. - As you see, gyrA is among the search results.
Thus, be aware of the contents in the data base that you are using. If what you're looking for isn't in there, you cannot find it!
As was evident from looking at the results earlier, ARIBA can work with several databases. It can also download the current contents of the database. This means we can get up-to-date results.
However, we should be aware that ARIBA requires that the data it searches with is in a specific format, this is something that all available tools do. This means that after downloading data, it will format the contents. Throughout this process, it might then eliminate some genes that come wrongly formatted from the database. Thus, this is a second step where we need to check what we're searching with.
- Go read the ARIBA manual and have a look at the presentation made by your
fellow students. Try to figure out the order the following commands have
to be run in:
- ariba getref card carddb
- ariba run --threads 4 cardprepref READ1.FQ READ2.FQ OUTPUTDIRNAME
- ariba prepareref -f carddb.fa -m carddb.tsv cardprepref
- Figure out where the reads for your isolate is.
- Run the first two steps, where we download and prepare a database.
It might complain that it can't find spades. This is a bug/feature with this
version, which will be fixed.
Have a look at the log file for the prepareref step. This will tell you
about genes and variants that were removed. Also have a look in the
02.cdhit.clusters.tsv
file, this one will let you know about the clusters that we're actually including in our search. - Run the remaining
run
step. - Go into the result directory. Have a look at the
report.tsv
file. Answer the following questions. Note. the information found for therun
andprepareref
commands can be useful for finding answers.- Which AMR genes were found?
- Are these presence/absence AMR genes or mutation AMR genes?
- If there are any mutation AMR genes, which mutations do you have?
- Copy your partner's result directory into your own directory. Note,
copy -r
will let you copy a directory with its contents. You should now have the two directories side by side. - Run
ariba summary comparison TSV_REPORT_F27 TSV_REPORT_S19
. - Copy the three resulting files to your computer, and load them with the Phandango web browser.
Run ARIBA mlst and amr detection on the remaining four read sets.
Put up a table that describes your results. This should contain the MLST type for each genome, and their resistance profile.