Note: I have been using Adjutant to analyze an open dataset of COVID-19 papers. I've made this analysis available online in a separate repository: https://github.com/amcrisan/covid-19-papers. Please feel free to contact me if you'd like additional information on this analysis.
Note: There appear to have been some changes to some of Adjutant's dependencies that introduce NCBI connection errors. I am looking into the problem - thanks to those who brought it to my attention.
Adjutant is an open-source, interactive, and R-based application to support mining PubMed for a systematic or a literature review. Given a PubMed-compatible search query, Adjutant downloads the relevant articles and allows the user to perform an unsupervised clustering analysis to identify data-driven topic clusters. Users can also sample documents using different strategies to obtain a more manageable dataset for further analysis. Adjutant makes explicit trade-offs between speed and accuracy, which are modifiable by the user, such that a complete analysis of several thousand documents can take a few minutes. All analytic datasets generated by Adjutant are saved, allowing users to easily conduct other downstream analyses that Adjutant does not explicitly support.
We've also provided a detailed video to help you install Adjutant and to give you a tour of its functionality: https://vimeo.com/259442489
If you use Adjutant, please cite:
Crisan A, Munzner T, Gardy JL. Adjutant: an R-based tool to support topic discovery for systematic and literature reviews. Bioinformatics (Oxford); doi: 10.1093/bioinformatics/bty722
In addition to the publication, we have provided extensive documentation of Adjutant's inner workings, from the specific R packages used in the implementation to the quality of the clustering results on both real and synthetic data. A PDF document of these details has been made available online.
An R notebook is also available to download and run the analysis. Due to the large size of the analysis files, the additional data and notebook are not shipped with Adjutant, but they can be downloaded as a compressed file.
Download the latest development code of Adjutant from GitHub using devtools:
devtools::install_github("amcrisan/Adjutant")
If you run into any download problems, or spot some bugs, please log an issue in the GitHub repo.
Maybe you want to use Adjutant (you read about it, you saw it somewhere, you're my mom), but you don't really know what R is, so you're not sure where to start. Here's how to get going:
- You need to download R onto your computer
- You need to download RStudio
You may run into trouble if you work in a place that prevents you from downloading and installing applications on your computer - this means you might need to use a home computer (but also talk to your IT team about why R is the best thing ever).
Once you have installed R and RStudio, open RStudio to install Adjutant and some additional packages. You can also view the video tutorial on how to install Adjutant here: https://vimeo.com/259442489
- You need to install the devtools package.
Within the R or RStudio console window (bottom left), type the following:
install.packages("devtools")
Now we can install Adjutant! So type the following:
devtools::install_github("amcrisan/Adjutant")
NCBI limits the number of requests its servers handle from a single IP address; when using Adjutant, this often manifests as a connection error. I've added some functionality to Adjutant that limits how often it queries PubMed each second, but if you register for an API key you may encounter fewer connection issues.
It is pretty straightforward to get an NCBI API key; you can follow the online instructions.
You can use your NCBI API key with Adjutant when using the Shiny application or when using the command-line functions. Adjutant will automatically append the NCBI API key to your search. You can use Adjutant without an API key, but you may run into connection problems.
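On the command line, the key is passed to processSearch via its ncbi_key parameter (the key value below is a placeholder, not a real key):

```r
library(adjutant)

# Placeholder value - substitute your own NCBI API key here
ncbi_key <- "YOUR_NCBI_API_KEY"

# processSearch accepts the key through its ncbi_key parameter
df <- processSearch("zika virus", ncbi_key = ncbi_key, retmax = 500)
```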
See a history of all the news in NEWS.txt.
2020/14/01 : New Beta Features. Some new features that have been sitting in the dev branch have now been merged into the main branch. These features are:
- the ability to search Semantic Scholar: process_ScholarSearch
- the ability to use Adjutant's commands with an existing data frame or text file: processsSingleFile
- faster tidying of the data corpus
- the ability to add custom stop words during tidy corpus cleaning
All of these beta features are available on the command line only; there are no changes to the UI yet.
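As a rough sketch of the Semantic Scholar beta feature - the exact signature of process_ScholarSearch is not documented here, so the single query argument below is an assumption:

```r
library(adjutant)

# Hypothetical usage of the beta Semantic Scholar search;
# the argument shown (a free-text query) is an assumption
scholar_df <- process_ScholarSearch("antimicrobial resistance genomics")
```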
Some things in the pipeline that dedicated Adjutant users might have fun with: I am exploring replacing t-SNE with UMAP (which should speed things up), and I am also looking to update the UI, but am still working out the right direction.
Adjutant can be used through its attendant Shiny application or within your own R script, although Adjutant's design is geared more towards the Shiny app. Adjutant can be used just for looking up and downloading articles from PubMed to your computer, or for performing a topic clustering analysis and/or sophisticated document sampling.
Video coming soon
With Adjutant installed, you just need to type two commands to get it going.
library(adjutant) #this gets R ready to run Adjutant
runAdjutant() #this will launch Adjutant user interface
That's it! Have a lot of fun exploring!
It is also possible to use Adjutant within your own code, bypassing the Shiny application altogether.
Step 0: Load Adjutant
To run this demo, the first thing you need to do is load Adjutant (once you've installed it) along with some additional packages for analysis.
library(adjutant)
library(dplyr)
library(ggplot2)
library(tidytext) #for stop words
#also set a seed - there is some randomness in the analysis.
set.seed(416)
Step 1: Downloading data from PubMed to your computer
processSearch is Adjutant's PubMed search function; it is effectively a wrapper around RISmed that formats RISmed's output into a clean data frame with additional PubMed metadata (PubMed Central citation count, language, article type, etc.). You can pass RISmed's EUtilsSummary parameters to Adjutant's processSearch function.
Please note that Adjutant's downstream methods expect a dataframe with the column names that are produced by the processSearch method.
Depending upon the size of the query, it can take a few minutes to download all of the data.
df<-processSearch("(outbreak OR epidemic OR pandemic) AND genom*",retmax=2000)
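Because processSearch passes parameters through to RISmed's EUtilsSummary, you can also restrict the search, for example by date range. mindate and maxdate are EUtilsSummary parameters; that processSearch forwards them unchanged is an assumption based on the description above:

```r
library(adjutant)

# mindate/maxdate are RISmed EUtilsSummary parameters; passing them
# through processSearch is assumed here rather than confirmed
df_recent <- processSearch("(outbreak OR epidemic OR pandemic) AND genom*",
                           retmax = 1000,
                           mindate = 2015,
                           maxdate = 2018)
```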
If you have an NCBI API key, you should include it in processSearch; see the 'NCBI API Key' section of this README for more details. You can also choose not to retrieve missing abstracts (which are infrequent), which also speeds up the search process.
ncbi_key<-"A_key_I_made_up"
#ncbi_key is a parameter that takes your ncbi_api key value
#forceGet is a parameter that indicates whether to try to retrieve missing abstracts (default: TRUE) or not (FALSE).
df<-processSearch("(outbreak OR epidemic OR pandemic) AND genom*",ncbi_key = ncbi_key, forceGet=FALSE,retmax=2000)
Step 2: Generating a tidy text corpus
Adjutant next constructs a per-article "bag of words" arranged in a tidy text format. This step is necessary for further analysis in Adjutant, but the result can also be an interesting dataset for other kinds of analysis not supported by Adjutant.
tidy_df<-tidyCorpus(corpus = df)
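As a quick sanity check, you can inspect the most frequent terms in the tidy corpus. The column name "word" below is an assumption following tidytext conventions; check colnames(tidy_df) for the actual token column in your output:

```r
library(dplyr)

# Tally the most frequent terms in the tidy corpus;
# the token column name ("word") is an assumption
tidy_df %>%
  count(word, sort = TRUE) %>%
  head(10)
```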
Step 3: Performing a dimensionality reduction using t-SNE
To learn more about t-SNE, please consult this excellent article in Distill Pub. Generally, Adjutant's topic clustering works best when there is a large number of diverse articles. It is still possible to get reasonable results with a smaller number of articles (fewer than 1,000, for example) or with a very homogeneous dataset (for example, just recent zika virus articles), but more general searches with a large number of documents work best.
tsneObj<-runTSNE(tidy_df,check_duplicates=FALSE)
#add t-SNE co-ordinates to df object
df<-inner_join(df,tsneObj$Y,by="PMID")
# plot the t-SNE results
ggplot(df,aes(x=tsneComp1,y=tsneComp2))+
geom_point(alpha=0.2)+
theme_bw()
Below is an initial plot of our documents! Each point is one article, and darker points indicate many overlapping articles. You'll notice that some articles already form visible groups (clusters); in the next step we'll suss out which clusters are reasonably well defined (according to the algorithm).
Step 4: Perform an unsupervised clustering using hdbscan
If you use hdbscan through Adjutant, the result will use the Adjutant-optimized hdbscan minPts parameter. You can also run hdbscan directly on the t-SNE dimensionally-reduced data and work out the best hdbscan minPts parameter yourself.
#run HDBSCAN and select the optimal cluster parameters automatically
optClusters <- optimalParam(df)
#add the new cluster IDs to the running dataset
df<-inner_join(df,optClusters$retItems,by="PMID") %>%
mutate(tsneClusterStatus = ifelse(tsneCluster == 0, "not-clustered","clustered"))
# plot the HDBSCAN clusters (no names yet)
clusterNames <- df %>%
dplyr::group_by(tsneCluster) %>%
dplyr::summarise(medX = median(tsneComp1),
medY = median(tsneComp2)) %>%
dplyr::filter(tsneCluster != 0)
ggplot(df,aes(x=tsneComp1,y=tsneComp2,group=tsneCluster))+
geom_point(aes(colour = tsneClusterStatus),alpha=0.2)+
geom_label(data=clusterNames,aes(x=medX,y=medY,label=tsneCluster),size=2,colour="red")+
stat_ellipse(aes(alpha=tsneClusterStatus))+
scale_colour_manual(values=c("black","blue"),name="cluster status")+
scale_alpha_manual(values=c(1,0),name="cluster status")+ #remove the cluster for noise
theme_bw()
Below is the same plot as before, but now we've got some clusters that Adjutant thinks are reasonable.
Step 5: Naming the clusters
Adjutant has a function called getTopTerms that will automatically name a cluster according to its top two most commonly occurring terms. If there are ties, it will return all of them.
clustNames<-df %>%
group_by(tsneCluster)%>%
mutate(tsneClusterNames = getTopTerms(clustPMID = PMID,clustValue=tsneCluster,topNVal = 2,tidyCorpus=tidy_df)) %>%
select(PMID,tsneClusterNames) %>%
ungroup()
#update document corpus with cluster names
df<-inner_join(df,clustNames,by=c("PMID","tsneCluster"))
#re-plot the clusters
clusterNames <- df %>%
dplyr::group_by(tsneClusterNames) %>%
dplyr::summarise(medX = median(tsneComp1),
medY = median(tsneComp2)) %>%
dplyr::filter(tsneClusterNames != "Not-Clustered")
ggplot(df,aes(x=tsneComp1,y=tsneComp2,group=tsneClusterNames))+
geom_point(aes(colour = tsneClusterStatus),alpha=0.2)+
stat_ellipse(aes(alpha=tsneClusterStatus))+
geom_label(data=clusterNames,aes(x=medX,y=medY,label=tsneClusterNames),size=3,colour="red")+
scale_colour_manual(values=c("black","blue"),name="cluster status")+
scale_alpha_manual(values=c(1,0),name="cluster status")+ #remove the cluster for noise
theme_bw()
Finally, here's the same plot again, but now with the names of the clusters. You'll notice that some clusters have very specific names (zika-viru; ebola-viru) and some have more generic names (sequenc-isol-outbreak). My hypothesis for what is happening is a classic signal issue: pathogens with only a few articles seem to cluster together into these more generic clusters, whereas pathogens with many publications (and hence a strong signal in the data) tend to form their own clusters.
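One quick way to probe this hypothesis is to look at how many articles fall into each named cluster (tsneClusterNames was added to df in Step 5):

```r
library(dplyr)

# Count articles per named cluster, largest first; generically named
# clusters should pool many small-signal topics
df %>%
  count(tsneClusterNames, sort = TRUE)
```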
Step 6: Go forth and analyze some more!
You can use df, tidy_df, or any other object produced by Adjutant in your own analysis!
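For example, you could save the document corpus and tidy corpus to disk for downstream work outside of R (the file names below are arbitrary examples):

```r
# Write the analysis datasets out as CSV files for downstream use;
# file names are arbitrary examples
write.csv(df, "adjutant_corpus.csv", row.names = FALSE)
write.csv(tidy_df, "adjutant_tidy_corpus.csv", row.names = FALSE)
```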