Merge pull request #23 from 3D-e-Chem/similarity
Renamed distance to similarity
sverhoeven authored Jul 14, 2016
2 parents b8666b3 + 8cb43e3 commit d83e766
Showing 22 changed files with 372 additions and 371 deletions.
56 changes: 31 additions & 25 deletions CHANGELOG.md
@@ -1,82 +1,88 @@
# Change log
All notable changes to this project will be documented in this file.
This project adheres to [Semantic Versioning](http://semver.org/).
Formatted as described on http://keepachangelog.com/.

## Unreleased

## [2.0.0] - 2016-07-14

### Changed

* Flag to ignore upper triangle when calculating distances, instead of always ignoring it (#20)
- Renamed distance to similarity (#21)
- Flag to ignore upper triangle when calculating distances, instead of always ignoring it (#20)

## 1.4.2 - 3 June 2016
## [1.4.2] - 2016-06-03

### Changed

* Lower webservice cutoff to 0.45 (#18)
- Lower webservice cutoff to 0.45 (#18)

## 1.4.1 - 31 May 2016
## [1.4.1] - 2016-05-31

### Added

* Webservice online at http://3d-e-chem.vu-compmedchem.nl/kripodb/ui/
* Ignore_upper triangle option in distance import sub command
- Webservice online at http://3d-e-chem.vu-compmedchem.nl/kripodb/ui/
- Ignore_upper triangle option in distance import sub command

## 1.4.0 - 3 May 2016
## [1.4.0] - 2016-05-03

### Changed

* Using nested sub-commands instead of long sub-command. For example `kripodb distmatrix_import` now is `kripodb distances import`
- Using nested sub-commands instead of long sub-command. For example `kripodb distmatrix_import` now is `kripodb distances import`

### Added

* Faster distance matrix storage format
* Python3 support (#12)
* Automated build to docker hub.
- Faster distance matrix storage format
- Python3 support (#12)
- Automated build to docker hub.

### Removed

* CLI argument `--precision`
- CLI argument `--precision`

## 1.3.0 - 23 Apr 2016
## [1.3.0] - 2016-04-23

### Added

* webservice server/client for distance matrix (#16). The CLI and canned commands can now take a local file or a url.
- webservice server/client for distance matrix (#16). The CLI and canned commands can now take a local file or a url.

### Fixed

* het_seq_nr contains non-numbers (#15)
- het_seq_nr contains non-numbers (#15)

## 1.2.5 - 24 Mar 2016
## [1.2.5] - 2016-03-24

### Fixed

* fpneigh2tsv not available as sub command
- fpneigh2tsv not available as sub command

## 1.2.4 - 24 Mar 2016
## [1.2.4] - 2016-03-24

### Added

* Sub command to convert fpneigh distance file to tsv.
- Sub command to convert fpneigh distance file to tsv.

## 1.2.3 - 1 Mar 2016
## [1.2.3] - 2016-03-01

### Changed

* Converting distances matrix will load id2label lookup into memory to speed up conversion
- Converting distances matrix will load id2label lookup into memory to speed up conversion

## 1.2.2 - 22 Feb 2016
## [1.2.2] - 2016-02-22

### Added

- Added sub command to read fpneigh formatted distance matrix file (#14)

## 1.2.1 - 12 Feb 2016
## [1.2.1] - 2016-02-12

### Added

- Added sub commands to read/write distance matrix in tab delimited format (#13)
- Created repo for Knime example and plugin at https://github.com/3D-e-Chem/knime-kripodb (#8)

## 1.2.0 - 11 Feb 2016
## [1.2.0] - 2016-02-11

### Added

@@ -89,7 +95,7 @@
- Merging of distance matrix files more robust (#10)
- Tanimoto coefficient is rounded up (#7)

## 1.0.0 - 5 Feb 2016
## [1.0.0] - 2016-02-05

### Added

48 changes: 24 additions & 24 deletions README.md
@@ -17,14 +17,14 @@ KRIPO stands for Key Representation of Interaction in POckets, see [reference](h
* Subpocket, part of the protein pocket which binds with the fragment
* Fingerprint, fingerprint of structure-based pharmacophore of subpocket
* Similarity matrix, similarities between all fingerprint pairs calculated using the modified tanimoto similarity index
* Kripo identifier, used as identifier for fragment, subpocket and fingerprint
* Kripo fragment identifier, used as identifier for fragment, subpocket and fingerprint
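
The glossary above mentions a modified Tanimoto similarity index but does not define it. As a rough orientation only, here is a minimal Python 3 sketch of the plain (unmodified) Tanimoto coefficient on fingerprints represented as sets of on-bit positions. The function name `tanimoto` and the sample fingerprints are illustrative assumptions, not part of kripodb, and the modified index used by KRIPO may weight terms differently.

```python
def tanimoto(bits_a, bits_b):
    """Plain Tanimoto coefficient of two fingerprints.

    Each fingerprint is given as an iterable of on-bit positions.
    This is NOT the modified index used by kripodb, only the textbook form.
    """
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / union


# Hypothetical fingerprints: 2 shared bits out of 5 distinct bits.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
print(tanimoto(fp1, fp2))  # 0.4
```

A score of 1.0 means identical bit sets and 0.0 means no shared on-bits, which matches the 0 to 1 range of the scores shown elsewhere in this README.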

# Install

Requirements:

* rdkit, http://rdkit.org, to read SDF files and generate SMILES strings from molecules
* libhdf5 headers, to read/write distance matrix in hdf5 format
* libhdf5 headers, to read/write similarity matrix in hdf5 format

```
pip install -U setuptools
@@ -48,42 +48,42 @@ kripodb fragments sdf fragment??.sdf fragments.sqlite
kripodb fragments pdb fragments.sqlite
kripodb fingerprints import 01.fp 01.fp.db
kripodb fingerprints import 02.fp 02.fp.db
kripodb fingerprints distances --fragmentsdbfn fragments.sqlite --ignore_upper_triangle 01.fp.db 01.fp.db dist_01_01.h5
kripodb fingerprints distances --fragmentsdbfn fragments.sqlite --ignore_upper_triangle 02.fp.db 02.fp.db dist_02_02.h5
kripodb fingerprints distances --fragmentsdbfn fragments.sqlite 01.fp.db 02.fp.db dist_01_02.h5
kripodb distances merge dist_*_*.h5 dist_all.h5
kripodb distances freeze dist_all.h5 dist_all.frozen.h5
# Make frozen distance matrix smaller, by using slower compression
ptrepack --complevel 6 --complib blosc:zlib dist_all.frozen.h5 dist_all.packedfrozen.h5
rm dist_all.frozen.h5
kripodb distances serve dist_all.packedfrozen.h5
kripodb fingerprints similarities --fragmentsdbfn fragments.sqlite --ignore_upper_triangle 01.fp.db 01.fp.db sim_01_01.h5
kripodb fingerprints similarities --fragmentsdbfn fragments.sqlite --ignore_upper_triangle 02.fp.db 02.fp.db sim_02_02.h5
kripodb fingerprints similarities --fragmentsdbfn fragments.sqlite 01.fp.db 02.fp.db sim_01_02.h5
kripodb similarities merge sim_*_*.h5 sim_all.h5
kripodb similarities freeze sim_all.h5 sim_all.frozen.h5
# Make frozen similarity matrix smaller, by using slower compression
ptrepack --complevel 6 --complib blosc:zlib sim_all.frozen.h5 sim_all.packedfrozen.h5
rm sim_all.frozen.h5
kripodb similarities serve sim_all.packedfrozen.h5
```

## Search for most similar fragments

Command to find fragments most similar to `3kxm_K74_frag1` fragment.
```
kripodb similar dist_all.h5 3kxm_K74_frag1 --cutoff 0.45
kripodb similar sim_all.h5 3kxm_K74_frag1 --cutoff 0.45
```

## Create distance matrix from text files
## Create similarity matrix from text files

Input files `dist_??_??.txt.gz` look like:
Input files `sim_??_??.txt.gz` look like:
```
Compounds similar to 2xry_FAD_frag4:
2xry_FAD_frag4 1.0000
3cvv_FAD_frag3 0.5600
```
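
To make the text layout above concrete, here is a minimal Python sketch that reads it into `(query, hit, score)` tuples. It is not the importer behind `kripodb similarities import`. The function name `parse_similar_text` is hypothetical, and the format rules (a `Compounds similar to <id>:` header followed by whitespace-separated `<id> <score>` lines) are inferred only from the three-line sample shown here.

```python
import re

# Header line that starts a new block of hits for one query fragment.
HEADER = re.compile(r'^Compounds similar to (\S+):$')


def parse_similar_text(lines):
    """Yield (query_id, hit_id, score) tuples from 'Compounds similar to' text."""
    query = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = HEADER.match(line)
        if match:
            query = match.group(1)
            continue
        if query is None:
            raise ValueError('score line before any header: %r' % line)
        hit_id, score = line.split()
        yield query, hit_id, float(score)


sample = """Compounds similar to 2xry_FAD_frag4:
2xry_FAD_frag4 1.0000
3cvv_FAD_frag3 0.5600
""".splitlines()

pairs = list(parse_similar_text(sample))
print(pairs)
# [('2xry_FAD_frag4', '2xry_FAD_frag4', 1.0), ('2xry_FAD_frag4', '3cvv_FAD_frag3', 0.56)]
```

For a gzipped file, the same generator could be fed from `gzip.open(path, 'rt')`, since it only needs an iterable of text lines.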

To create a single distance matrix from multiple text files:
To create a single similarity matrix from multiple text files:
```
gunzip -c dist_01_01.txt.gz | kripodb distances import --ignore_upper_triangle - fragments.sqlite dist_01_01.h5
gunzip -c dist_01_02.txt.gz | kripodb distances import - fragments.sqlite dist_01_02.h5
gunzip -c dist_02_02.txt.gz | kripodb distances import --ignore_upper_triangle - fragments.sqlite dist_02_02.h5
kripodb distances merge dist_??_??.h5 dist_all.h5
gunzip -c sim_01_01.txt.gz | kripodb similarities import --ignore_upper_triangle - fragments.sqlite sim_01_01.h5
gunzip -c sim_01_02.txt.gz | kripodb similarities import - fragments.sqlite sim_01_02.h5
gunzip -c sim_02_02.txt.gz | kripodb similarities import --ignore_upper_triangle - fragments.sqlite sim_02_02.h5
kripodb similarities merge sim_??_??.h5 sim_all.h5
```

The `--ignore_upper_triangle` flag is used to prevent score corruption when freezing the distance matrix.
The `--ignore_upper_triangle` flag is used to prevent score corruption when freezing the similarity matrix.

# Data sets

@@ -96,7 +96,7 @@ An example data set included in the [data/](data/) directory of this repo. See [
All fragments based on GPCR proteins compared with all proteins in PDB.

* kripo.gpcrandhits.sqlite - Fragments sqlite database
* kripo.gpcr.h5 - HDF5 file with distance matrix
* kripo.gpcr.h5 - HDF5 file with similarity matrix

The data set has been published at [![DOI](https://zenodo.org/badge/doi/10.5281/zenodo.50835.svg)](http://dx.doi.org/10.5281/zenodo.50835)

@@ -106,8 +106,8 @@ All fragments from all protein-ligand complexes in PDB compared with all.
Data set contains PDB entries that were available on 23 December 2015.

* kripo.sqlite - Fragments sqlite database
* Distance matrix is too big to ship with VM so use http://3d-e-chem.vu-compmedchem.nl/kripodb webservice url to query.
* kripo_fingerprint_2015_*.fp.gz - Fragment fingerprints, see [here](#create-distance-matrix-from-text-files) for instructions how to convert to a distance matrix.
* Similarity matrix is too big to ship with VM so use http://3d-e-chem.vu-compmedchem.nl/kripodb webservice url to query.
* kripo_fingerprint_2015_*.fp.gz - Fragment fingerprints, see [here](#create-similarity-matrix-from-text-files) for instructions how to convert to a similarity matrix.

The data set has been published at [![DOI](https://zenodo.org/badge/doi/10.5281/zenodo.55254.svg)](http://dx.doi.org/10.5281/zenodo.55254)

@@ -152,7 +152,7 @@ The Kripo data files can be queried using a web service.

Start webservice with:
```
kripodb serve --port 8084 data/distances.h5
kripodb serve --port 8084 data/similarities.h5
```
It will print the urls for the swagger spec and UI.

9 changes: 5 additions & 4 deletions data/README.md
@@ -2,7 +2,7 @@

* fragments.sqlite - Fragments sqlite database containing a small number of fragments with their smiles string and molblock.
* fingerprints.sqlite - Fingerprints sqlite database with fingerprint stored as [fastdumped intbitset](http://intbitset.readthedocs.org/en/latest/index.html#intbitset.intbitset.fastdump)
* distances.h5 - HDF5 file with distance matrix of fingerprints using modified tanimoto coefficient
* similarities.h5 - HDF5 file with similarities matrix of fingerprints using modified tanimoto similarity index

## Creating tiny data set

@@ -23,8 +23,9 @@ EOF
```

3. Create distance matrix
3. Create similarity matrix

```
kripodb fingerprints distances --fragmentsdbfn fragments.sqlite fingerprints.sqlite fingerprints.sqlite distances.h5
```
kripodb fingerprints similarities --fragmentsdbfn fragments.sqlite fingerprints.sqlite fingerprints.sqlite similarities.h5
```

File renamed without changes.
File renamed without changes.
38 changes: 16 additions & 22 deletions kripodb/canned.py
@@ -13,28 +13,28 @@
# limitations under the License.
"""Module with functions which use pandas DataFrame as input and output.
For using Kripo data files inside Knime (http://www.knime.org)
For using Kripo data files inside KNIME (http://www.knime.org)
"""

from __future__ import absolute_import

import tables

import pandas as pd
from kripodb.frozen import FrozenDistanceMatrix
from kripodb.frozen import FrozenSimilarityMatrix
from .db import FragmentsDb
from .hdf5 import DistanceMatrix
from .pairs import similar
from .hdf5 import SimilarityMatrix
from .pairs import similar, open_similarity_matrix
from .webservice.client import WebserviceClient


def similarities(queries, distance_matrix_filename_or_url, cutoff, limit=1000):
    """Find similar fragments to queries based on distance matrix.
def similarities(queries, similarity_matrix_filename_or_url, cutoff, limit=1000):
    """Find similar fragments to queries based on similarity matrix.
    Args:
        queries (List[str]): Query fragment identifiers
        distance_matrix_filename_or_url (str): Filename of distance matrix file or base url of kripodb webservice
        cutoff (float): Cutoff, distance scores below cutoff are discarded.
        similarity_matrix_filename_or_url (str): Filename of similarity matrix file or base url of kripodb webservice
        cutoff (float): Cutoff, similarity scores below cutoff are discarded.
        limit (int): Maximum number of hits for each query.
            Default is 1000. Use None for no limit.
@@ -44,12 +44,12 @@ def similarities(queries, distance_matrix_filename_or_url, cutoff, limit=1000):
        >>> import pandas as pd
        >>> from kripodb.canned import similarities
        >>> queries = pd.Series(['3j7u_NDP_frag24'])
        >>> hits = similarities(queries, 'data/distances.h5', 0.55)
        >>> hits = similarities(queries, 'data/similarities.h5', 0.55)
        >>> len(hits)
        11
        Retrieved from web service instead of local distance matrix file.
        Make sure the web service is running, for example by `kripodb serve data/distances.h5`.
        Retrieved from web service instead of local similarity matrix file.
        Make sure the web service is running, for example by `kripodb serve data/similarities.h5`.
        >>> hits = similarities(queries, 'http://localhost:8084/kripo', 0.55)
        >>> len(hits)
@@ -59,28 +59,22 @@ def similarities(queries, distance_matrix_filename_or_url, cutoff, limit=1000):
        pandas.DataFrame: Data frame with query_frag_id, hit_frag_id and score columns
    """
    hits = []
    if distance_matrix_filename_or_url.startswith('http'):
        client = WebserviceClient(distance_matrix_filename_or_url)
    if similarity_matrix_filename_or_url.startswith('http'):
        client = WebserviceClient(similarity_matrix_filename_or_url)
        for query in queries:
            qhits = client.similar_fragments(query, cutoff, limit)
            hits.extend(qhits)
    else:
        f = tables.open_file(distance_matrix_filename_or_url, 'r')
        is_frozen = 'scores' in f.root
        f.close()
        if is_frozen:
            distance_matrix = FrozenDistanceMatrix(distance_matrix_filename_or_url)
        else:
            distance_matrix = DistanceMatrix(distance_matrix_filename_or_url)
        similarity_matrix = open_similarity_matrix(similarity_matrix_filename_or_url)
        for query in queries:
            for query_id, hit_id, score in similar(query, distance_matrix, cutoff, limit):
            for query_id, hit_id, score in similar(query, similarity_matrix, cutoff, limit):
                hit = {'query_frag_id': query_id,
                       'hit_frag_id': hit_id,
                       'score': score,
                       }
                hits.append(hit)

        distance_matrix.close()
        similarity_matrix.close()

    return pd.DataFrame(hits)
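
The `cutoff` and `limit` arguments described in the docstring can be illustrated without any kripodb data files. The sketch below is a stand-in, not the `similar` function from `kripodb.pairs`: the name `top_hits`, the in-memory `scores` dict, and the fragment ids are all hypothetical. It also assumes that hits are ordered by descending score before the limit is applied, which the diff does not confirm.

```python
def top_hits(query, scores, cutoff, limit=1000):
    """Collect hits for one query from an in-memory pair -> score mapping.

    scores maps (frag_a, frag_b) tuples to a similarity score; each pair
    is stored once, so the query may appear on either side.
    """
    hits = []
    for (frag_a, frag_b), score in scores.items():
        if score < cutoff:
            continue  # scores below the cutoff are discarded
        if frag_a == query:
            other = frag_b
        elif frag_b == query:
            other = frag_a
        else:
            continue  # pair does not involve the query
        hits.append({'query_frag_id': query,
                     'hit_frag_id': other,
                     'score': score})
    # Assumption of this sketch: best hits first, then truncate.
    hits.sort(key=lambda hit: hit['score'], reverse=True)
    return hits if limit is None else hits[:limit]


toy_scores = {
    ('f1', 'f2'): 0.8,
    ('f3', 'f1'): 0.6,
    ('f1', 'f4'): 0.4,
    ('f2', 'f3'): 0.9,
}
print([h['hit_frag_id'] for h in top_hits('f1', toy_scores, 0.45)])
# ['f2', 'f3']
```

The returned list of dicts has the same keys as the `hit` dicts built in `similarities`, so it could be passed to `pandas.DataFrame` in the same way.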
