Skip to content

koeppl/phoni

Repository files navigation

PHONI

(Practical Heuristic ON Incremental matching statistics computation)

This framework supports the currently memory-friendliest way to compute the matching statistics of a pattern on highly-repetitive texts, given that the input text is precomputed with the MONI index (more precisely, we need all ingredients of MONI except the thresholds).

We require the pattern and the text to be available in form of sequences stored in the .fa (FASTA) format. To use our solution, you need to have recent cmake, g++, zsh, and python 3 installed.

Preparations

We need the following python 3 packages for extracting and concatenating .fa files:

	pip3 install biopython
	pip3 install fastaparser
	pip3 install psutil
git clone --branch phoni https://github.com/koeppl/phoni

Compile

mkdir build
cd build; cmake ..
make

Building the index

To build the index we use the command phoni build from the build directory.

phoni build \
-r <filename of the reference> \
-t <number of threads> \
-g <grammar format> \
-f <input file is a fasta file> \

For example, to build the phoni index for the file yeast.fasta using 4 threads and the plain grammar we run from the build folder:

python3 phoni build -r ../data/yeast.fasta -f -t 4 -g plain

This command will produce yeast.fasta.phoni and yeast.fasta.plain.slp in the data folder, which represent the phoni index.

Querying the index

To query the index we use the command phoni query from the build directory.

phoni ms \
-r <filename of the reference> \
-p <number of threads> \
-g <grammar format> \

For example, to query the phoni index for the file yeast.fasta using the plain grammar with the pattern samples.fastq we run from the build folder:

python3 phoni build -r ../data/yeast.fasta -p ../data/samples.fa -g plain

This command will produce samples.fa.positions and samples.fa.lengths in the data folder, which represent the matching staistics positions and lengths of samples.fa against yeast.fasta, respectively.

Compatibility queries

To perform the queries using the old query command we first have to split the samples.fa file using the python tool splitpattern.py:

mkdir data/samples.fa.dir
python3 splitpattern.py data/samples.fa data/samples.fa.dir

Then we can query samples.fa against the yeast.fasta index with the plain grammar using the following command:

./build/test/src/phoni_compatibility data/yeast.fasta -p data/samples.fa -g plain

python Tools

  • prefixpattern.py: takes the x% prefix of each pattern stored in a .fa file and outputs a new .fa file to stdout
  • splitpattern.py: splits a .fa file into individual sequences stored in a directory given as program parameter (we need this for msfast as it does not support reading .fa files)

Benchmarks

We provide a script and benchmark files to evaluate PHONI in the setting as described in the paper

Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, Massimiliano Rossi: PHONI: Streamed Matching Statistics with Multi-Genome References, In proceedings of 2021 Data Compression Conference, pp. 193-202, 2021.

Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, Massimiliano Rossi: PHONI: Streamed Matching Statistics with Multi-Genome References, arXiv:2011.05610, 11 Nov 2020.

In our experiments we used the file

We have a shell script benchmark.sh for an automatic benchmark. For this to work, some variables in it has to be set, as this project does not ship with the other matching statistic algorithms, namely

meaning it is necessary to download and compile those projects individually, and the set the corresponding variables in benchmark.sh manually (more precisely: in the switch-case statement for the hostname in the beginning). Finally, the output of benchmark.sh can be processed by sqlplots to generate the plots shown in the paper.

To compute the naive PHONI variant evaluated in the paper, simple exchange lceToRBounded with lceToR_NaiveBounded in the file include/ms/phoni.hpp.