- NumPy
- Pandas
- Matplotlib
- Cython (optional, for speeding up algorithms)
P.S.: Cython requires a C/C++ compiler. On Windows 8/8.1/10, download the VC++ 2017 Build Tools through the Visual Studio 2017 Installer, then locate vcvarsall.bat on your computer and add its directory to the system PATH environment variable.
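To check whether the PATH change took effect, a minimal sketch like the one below can help. It only uses the standard-library `shutil.which`; the helper name `find_vcvarsall` is made up for this illustration and is not part of this project.

```python
import shutil


def find_vcvarsall():
    """Return the full path of vcvarsall.bat if it is reachable via PATH, else None."""
    # shutil.which searches the directories listed in the PATH environment variable.
    return shutil.which("vcvarsall.bat")


location = find_vcvarsall()
if location is None:
    print("vcvarsall.bat was not found on PATH")
else:
    print("vcvarsall.bat found at", location)
```

Run it in a new terminal, since environment variable changes usually do not reach shells that were already open.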
Use downloader/dsspGet.py to download the DSSP files.
See the configuration in downloader/dsspPKU.json.
cd downloader && python dsspGet.py dsspPKU.json
from preprocess.BioParser import bio_parse
dataframe = bio_parse('./dssp/sources/1a00.dssp')
AA = dataframe.AA # amino acid
Structure = dataframe.STRUCTURE # secondary structure
import numpy as np
from research.n_gram import make_gram
from research.specific_regular import specific_report
cases = np.array([AA, Structure]).T
res = [each_gram.T.flatten() for each_gram in make_gram(cases, 5)]
frequency = specific_report(res, {1:'A', 2:'A'})
print(frequency.values())
Explanation:
c1 :
('V', # 1st amino acid.
'A', # 2nd amino acid.
'D', # ...
'A', # ...
'L', # ...
'H X S+ ', # 1st secondary structure
'H X S+ ', # ...
'H X S+ ', # ...
'H X S+ ', # ...
'H X S+ ' # 5th secondary structure
),
c2 :
...
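The tuple layout shown above (five amino acids followed by five secondary structures) can be reproduced with a small self-contained sketch. It assumes that `make_gram` yields sliding windows of consecutive rows; that is an assumption about its behaviour, not a confirmed description. `sliding_grams` and the toy data are hypothetical stand-ins for illustration only.

```python
import numpy as np


def sliding_grams(cases, n):
    """Yield every window of n consecutive rows from a 2-D array.

    Hypothetical stand-in for make_gram; the real function may differ.
    """
    for start in range(len(cases) - n + 1):
        yield cases[start:start + n]


# Toy data standing in for the AA and STRUCTURE columns (made up for illustration).
amino_acids = ['V', 'A', 'D', 'A', 'L', 'K']
structures = ['H', 'H', 'H', 'H', 'H', 'T']

# One row per residue: column 0 is the amino acid, column 1 is the structure.
cases = np.array([amino_acids, structures]).T

# Transposing each window puts the n amino acids first, then the n structures.
grams = [tuple(gram.T.flatten().tolist()) for gram in sliding_grams(cases, 5)]

print(grams[0])
# → ('V', 'A', 'D', 'A', 'L', 'H', 'H', 'H', 'H', 'H')
```

Six residues with a window of 5 give two grams; the second one starts at the second residue.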
sources = ['./dssp/sources/1a00.dssp',
'./dssp/sources/1a0a.dssp',
'./dssp/sources/1a0b.dssp',
'./dssp/sources/1a0c.dssp',
'./dssp/sources/1a0d.dssp']
from research.datasets_report import DatasetsReport
from research.plot import plot_frequency
whole = DatasetsReport(*sources).analyze(filtf=lambda probability, std, mean: probability>0.4)
number_of_dist = len(whole)
for test_some_case_dist in list(whole.keys())[:20]:
    plot_frequency(whole[test_some_case_dist])
Running the plotting loop produces many figures; only a few of them are shown here: