You will need Maven 3 and Java 8 to build the project. The other dependencies, such as UIMA, uimaFIT, Stanford NLP and OpenNLP, are handled by Maven. You can find the full list of dependencies in the pom.xml.
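You can check which versions are installed before building:
mvn -version
java -version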
- Clone the repository.
git clone https://github.com/daimrod/csa.git
- Retrieve the PubMed Open Access Corpus and CSV file list (more information); see the download example below.
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.A-B.tar.gz
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.C-H.tar.gz
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.I-N.tar.gz
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.O-Z.tar.gz
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv
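For example, you can download everything with wget (any FTP client works just as well):
for part in A-B C-H I-N O-Z; do
    wget "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.$part.tar.gz"
done
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv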
- Extract the corpus in the directory of your choice; we will assume it is in ~/corpus (see the example below).
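Assuming the archives were downloaded to the current directory, the extraction could look like this:
mkdir -p ~/corpus
for part in A-B C-H I-N O-Z; do
    tar -xzf "articles.$part.tar.gz" -C ~/corpus
done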
- Go to the project directory and build it using Maven.
mvn -Ddev=true package
- Create a file annotator.conf with the following information:
  - inputDirectory: the directory containing the PubMed corpus
  - outputDirectory: the directory used to store the results
  - listArticlesFilename: a file containing the names of the articles to read (see the example after this list)
  - mappingFilename: a file describing the patterns used (more on this later)
  - windowSize: the size of the citation context window
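The expected format of listArticlesFilename is not detailed here; assuming it is one article file name per line, a simple way to build it from the extracted corpus could be (the .nxml extension and the use of full paths are assumptions, adjust them to what the Annotator actually expects):
find ~/corpus -name '*.nxml' > ~/workspace/mylist.txt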
- Run the Annotator.
java -cp target/csa-1.0-SNAPSHOT.jar jgreg.internship.nii.WF.AnnotatorWF -config annotator.conf
- Create a file statistic.conf with the following information:
  - inputDirectory: the directory containing the results previously computed
  - mappingFilename: same as before
  - outputFile: the file containing the extracted statistics
  - infoFile: the file containing some additional information
- Run the Statistic module.
java -cp target/csa-1.0-SNAPSHOT.jar jgreg.internship.nii.WF.StatisticsWF -config statistic.conf
Here is an example of the annotator.conf file:
inputDirectory = ~/corpus/
outputDirectory = ~/workspace/output/
listArticlesFilename = ~/workspace/mylist.txt
mappingFilename = ~/workspace/hs-mapping.lst
windowSize = 1
The statistic.conf file has the exact same syntax:
inputDirectory = ~/workspace/output/
mappingFilename = ~/workspace/hs-mapping.lst
outputFile = ~/workspace/output/all-out.dat
infoFile = ~/workspace/output/info.dat
The mapping file describes the order of the annotations in the results and where to find the cue phrases for each annotation.
order = negative neutral positive
# Sentiment cue phrases
negative = ~/workspace/negative.pat
neutral = ~/workspace/neutral.pat
positive = ~/workspace/positive.pat
The Annotator module uses the Stanford NLP Token Sequence Matcher to match cue phrases. You can find a description of the accepted syntax here.
The pattern files must contain one pattern per line; here are some examples of accepted patterns:
good
/state-of-the-art/
{ tag:"NN" } achieve
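A few more illustrative patterns following the same forms (assumptions for illustration only; see the syntax description linked above for the full grammar):
/significant(ly)?/ improvement
{ tag:"JJ" } method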
You can distribute the processing across N processes by splitting the list of articles into N chunks (e.g. with the split(1) command) and running them with the GNU Parallel tool.
For example, to use 20 cores:
split -n l/20 path/to/listArticlesFilename list-
ls list-* | parallel --halt 2 \
java -cp target/csa-1.0-SNAPSHOT.jar \
jgreg.internship.nii.WF.AnnotatorWF \
-config annotator.conf \
-listArticlesFilename {}
The split(1) command splits the file listArticlesFilename into 20 files prefixed with "list-". We then use the parallel(1) command to run one Java process per input file, overriding the listArticlesFilename parameter from the configuration file with a command-line parameter.