Hidden Markov Models for Genome Analysis
The goal of this project is a basic implementation of the pair hidden Markov model (pair HMM) forward algorithm for genomic sequence analysis (described in [1]), parallelized with OpenMP.
Further details are provided in these articles ([2], [3]) and in the book Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (page 88, §4.2).
The main files that make up the project are:
- Sequence.h: class that represents a sequence of nucleotides. It holds the string of characters that make up the sequence and its SequenceGenerator.
- SequenceGenerator.h: class that defines a random emission probability distribution for a sequence of nucleotides. Currently, an instance of the class Sequence is randomly generated according to its SequenceGenerator.
- ProbabilityMatrix.h: class that represents a generic matrix of floating-point values, from which the classes DynamicMatrix and StateTransitionMatrix inherit common attributes and methods. DynamicMatrix adds the ability to append rows and columns dynamically, while StateTransitionMatrix provides a set of states and a mapping between them and the indexes of the matrix.
- PairHMM.h: class that implements the pair HMM forward algorithm. It holds 2 instances of the class Sequence (one for the read sequence and one for the haplotype sequence), 1 instance of the class StateTransitionMatrix (for matrix T), and 3 instances of the class DynamicMatrix (for matrices M, I and D).
- main.cpp: the entry point of the program. It creates an instance of the class PairHMM and calls its method that runs the pair HMM forward algorithm.
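To make the role of the matrices M, I and D more concrete, the following is a minimal sketch of the forward recurrences in the textbook formulation (Biological Sequence Analysis, §4.2). It is not the project's PairHMM class: the struct `PairHmmParams`, the function `pairHmmForward`, the simplified match/mismatch emission model and the reading of I and D as "base of x against a gap" and "base of y against a gap" are all assumptions made for illustration. Cells on the same anti-diagonal (i + j constant) depend only on earlier anti-diagonals, so the inner loop is one place where an OpenMP `parallel for` could be applied; whether the project parallelizes in this way is likewise an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical parameter bundle; names and values are illustrative only.
struct PairHmmParams {
    double delta;      // gap-open probability (M -> I or M -> D)
    double epsilon;    // gap-extend probability (I -> I or D -> D)
    double tau;        // probability of moving to the end state
    double pMatch;     // joint emission probability of two identical bases
    double pMismatch;  // joint emission probability of two different bases
    double q;          // emission probability of a base aligned to a gap
};

// Returns the total probability of x and y summed over all alignments.
double pairHmmForward(const std::string& x, const std::string& y,
                      const PairHmmParams& p) {
    assert(2.0 * p.delta + p.tau <= 1.0 && p.epsilon + p.tau <= 1.0);
    const long n = static_cast<long>(x.size());
    const long m = static_cast<long>(y.size());

    using Matrix = std::vector<std::vector<double>>;
    Matrix M(n + 1, std::vector<double>(m + 1, 0.0));
    Matrix I = M, D = M;
    M[0][0] = 1.0;  // the begin state is treated like a match state

    const double fromMatch = 1.0 - 2.0 * p.delta - p.tau;  // M -> M
    const double fromGap = 1.0 - p.epsilon - p.tau;        // I -> M, D -> M

    // Sweep anti-diagonals d = i + j; cells within one diagonal are independent.
    for (long d = 1; d <= n + m; ++d) {
        const long iLo = std::max(0L, d - m);
        const long iHi = std::min(n, d);
        #pragma omp parallel for
        for (long i = iLo; i <= iHi; ++i) {
            const long j = d - i;
            if (i > 0 && j > 0) {
                const double e = (x[i - 1] == y[j - 1]) ? p.pMatch : p.pMismatch;
                M[i][j] = e * (fromMatch * M[i - 1][j - 1] +
                               fromGap * (I[i - 1][j - 1] + D[i - 1][j - 1]));
            }
            if (i > 0)  // x[i-1] emitted against a gap
                I[i][j] = p.q * (p.delta * M[i - 1][j] + p.epsilon * I[i - 1][j]);
            if (j > 0)  // y[j-1] emitted against a gap
                D[i][j] = p.q * (p.delta * M[i][j - 1] + p.epsilon * D[i][j - 1]);
        }
    }
    return p.tau * (M[n][m] + I[n][m] + D[n][m]);
}
```

Without OpenMP support enabled at compile time the pragma is ignored and the sketch runs sequentially with the same result. For long sequences the raw probabilities can underflow, so a practical version would typically work in log space or rescale the values.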
The code is written entirely in C++ and relies on the following headers and APIs (other standard headers are omitted):
- random: used for the random generation of sequences and the random definition of state transition probabilities
- algorithm: used for shuffling sequences for randomization purposes
- OpenMP: used in PairHMM.cpp to introduce thread-level parallelism into the algorithm
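As an illustration of how the random and algorithm headers can be combined for this purpose, here is a minimal sketch that draws a nucleotide sequence from a weighted distribution and then shuffles it. The function `randomSequence` and its parameters are hypothetical and are not taken from the project's SequenceGenerator class.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <string>

// Hypothetical helper: draws `length` nucleotides using the given weights
// (in the order A, C, G, T), then permutes the result.
std::string randomSequence(std::size_t length,
                           const std::array<double, 4>& weights,
                           std::mt19937& rng) {
    // std::discrete_distribution requires a positive total weight.
    assert(std::accumulate(weights.begin(), weights.end(), 0.0) > 0.0);

    static const char alphabet[] = {'A', 'C', 'G', 'T'};
    std::discrete_distribution<int> pick(weights.begin(), weights.end());

    std::string seq;
    seq.reserve(length);
    for (std::size_t k = 0; k < length; ++k)
        seq.push_back(alphabet[pick(rng)]);

    // Permute the bases; the base composition is unchanged.
    std::shuffle(seq.begin(), seq.end(), rng);
    return seq;
}
```

Passing an engine built from a fixed seed, for example `std::mt19937 rng(42);`, makes the output reproducible. Seeding from `std::random_device` instead, as in `std::mt19937 rng(std::random_device{}());`, gives a different sequence on each run on toolchains where `std::random_device` is non-deterministic.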
To build and run the project on Windows:
- Install MinGW-w64 version > 9.2 (otherwise the randomly generated sequences will be identical on every execution, as reported here and here)
- Install CMake
- Create a folder for building the project
mkdir build
cd build
- Generate the makefiles
cmake -G "MinGW Makefiles" ..
- Build the project
cmake --build .
- Run the program
./HMM4GA.exe