
Reorganising cluster data #144

Open · ialarmedalien wants to merge 4 commits into develop

Conversation

ialarmedalien (Collaborator)

Reorganises cluster data into a single field containing an array of clusters in the form <cluster_name>:<cluster_id>. The current set of clusters has been renamed from 'cluster_i2' to 'markov_i2', as they were created using Markov clustering with inflation set to 2.
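For illustration, a minimal sketch of what this looks like on a node document; the 'clusters' field name and the cluster id are assumptions, and only the <cluster_name>:<cluster_id> string format comes from this PR:

# old layout: one field per clustering run
node_before = {'_key': 'AT1G01010', 'cluster_i2': 5}

# new layout: a single field holding an array of '<cluster_name>:<cluster_id>' strings
node_after = {'_key': 'AT1G01010', 'clusters': ['markov_i2:5']}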

Comment on lines +4 to +6
docker-compose down
docker-compose run spec sh /app/test/run_tests.sh
docker-compose down
ialarmedalien (Collaborator, Author)

Make sure that any docker containers hanging around don't accidentally contaminate test runs.

Comment on lines +1 to +2
# prefix: markov_i2
# title: Markov clustering, inflation = 2
ialarmedalien (Collaborator, Author)

I'll formalise this with the Jacobson group some time in the coming week or two. Right now, the cluster prefix is hardcoded in the parser, but ideally I'd get this information from the cluster file or from a metadata file in the clusters directory.
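As a sketch of the cluster-file option, assuming the metadata stays in '# key: value' header comments like the two lines above (the function name is hypothetical):

def read_cluster_metadata(path):
    # collect '# key: value' comment lines from the top of a cluster file,
    # e.g. {'prefix': 'markov_i2', 'title': 'Markov clustering, inflation = 2'}
    meta = {}
    with open(path) as fd:
        for line in fd:
            if not line.startswith('#'):
                break
            key, sep, value = line.lstrip('#').partition(':')
            if sep:
                meta[key.strip()] = value.strip()
    return meta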

Comment on lines -121 to -129
def test_fetch_phenotypes_no_results(self):

    resp = self.submit_query('djornl_fetch_phenotypes', {
        # gene node
        "keys": ["AT1G01010"],
    })
    self.assertEqual(resp['results'][0], self.no_results)


ialarmedalien (Collaborator, Author)

Combined the tests for fetch calls with no results into the tests for calls with results.
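For reference, a hypothetical shape for the combined test; the phenotype key and the self.phenotype_results fixture are assumptions, not taken from the actual test file:

def test_fetch_phenotypes(self):
    # each case pairs query params with the expected first result,
    # covering both no-hit and hit queries in one test
    cases = [
        ({"keys": ["AT1G01010"]}, self.no_results),            # gene key: no phenotype hits
        ({"keys": ["some_phenotype_key"]}, self.phenotype_results),  # expected hits
    ]
    for params, expected in cases:
        resp = self.submit_query('djornl_fetch_phenotypes', params)
        self.assertEqual(resp['results'][0], expected)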


@unittest.skip('This test is disabled until automated view loading is possible')
ialarmedalien (Collaborator, Author)

Enabled this test.

jayrbolton (Contributor)

Looks good, only had one comment above.

New commits:

…es that make up the data, plus code to validate the manifest.
Created manifests for all test files and updated tests accordingly
Added djornl data source (github repo)
Comment on lines +8 to +20
"oneOf": [
{
"properties": {
"data_type": { "enum": ["cluster"] }
},
"required": [ "prefix" ]
},
{
"properties": {
"data_type": { "enum": [ "node", "edge" ] }
}
}
],
ialarmedalien (Collaborator, Author)

If the data type is "cluster", there must be a "prefix" field, which dictates the cluster label.
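A quick illustration of how this oneOf rule behaves, using the jsonschema package (the manifest entries below are made up):

from jsonschema import ValidationError, validate

schema = {
    "oneOf": [
        {
            "properties": {"data_type": {"enum": ["cluster"]}},
            "required": ["prefix"],
        },
        {
            "properties": {"data_type": {"enum": ["node", "edge"]}},
        },
    ]
}

validate({"data_type": "cluster", "prefix": "markov_i2"}, schema)  # passes
validate({"data_type": "node"}, schema)                            # passes
try:
    validate({"data_type": "cluster"}, schema)  # no prefix: matches neither branch
except ValidationError:
    print("cluster entries must supply a prefix")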

lgtm-com bot commented Jul 20, 2020

This pull request introduces 4 alerts when merging 93205a2 into 53eecfe - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Module is imported more than once
  • 1 for Unused local variable

Comment on lines +291 to +311
def check_deltas(self, edge_data={}, node_metadata={}, cluster_data={}):

    edge_nodes = set([e['_key'] for e in edge_data['nodes']])
    node_metadata_nodes = set([e['_key'] for e in node_metadata['nodes']])
    cluster_nodes = set([e['_key'] for e in cluster_data['nodes']])
    all_nodes = edge_nodes.union(node_metadata_nodes).union(cluster_nodes)

    # check all nodes in cluster_data have node_metadata
    clstr_no_node_md_set = cluster_nodes.difference(node_metadata_nodes)
    if clstr_no_node_md_set:
        print({'clusters with no node metadata': clstr_no_node_md_set})

    # check all nodes in the edge_data have node_metadata
    edge_no_node_md_set = edge_nodes.difference(node_metadata_nodes)
    if edge_no_node_md_set:
        print({'edges with no node metadata': edge_no_node_md_set})

    # count all edges
    print("Dataset contains " + str(len(edge_data['edges'])) + " edges")
    # count all nodes
    print("Dataset contains " + str(len(all_nodes)) + " nodes")
ialarmedalien (Collaborator, Author)

Some basic sanity checks on the parsed data, plus a very high-level overview of what is being added.
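For context, a hypothetical call, matching how the code above reads its arguments (each dict needs a 'nodes' list, and edge_data an 'edges' list; the keys are illustrative):

self.check_deltas(
    edge_data={'nodes': [{'_key': 'AT1G01010'}], 'edges': [{'_from': 'AT1G01010', '_to': 'AT1G01030'}]},
    node_metadata={'nodes': [{'_key': 'AT1G01010'}]},
    cluster_data={'nodes': [{'_key': 'AT1G01030'}]},  # reported: no node metadata
)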

# self.json_data[query][primary_param][distance_param]
# if primary_param is an array, join the array entities with "__"
# self.json_data[query_name][param_name][param_value]["distance"][distance_param]
# e.g. for fetch_clusters data:
ialarmedalien (Collaborator, Author)

Reorganised the results data, so retrieving the appropriate results has changed further down in this file.
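Concretely, a lookup under the new layout would read something like this; the specific keys and the 'cluster_ids' param name are illustrative, with array params joined with "__" as noted above:

expected = self.json_data['fetch_clusters']['cluster_ids']['markov_i2:5__markov_i2:6']['distance']['1']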

jayrbolton (Contributor) left a comment

I had one comment; otherwise it is looking good.

        'user_notes': cols[19],
    }
    nodes.append(doc)
headers = []
jayrbolton (Contributor) commented Jul 21, 2020

Since there is a bunch of repeated parsing code between the node and edge loaders, we could have a generator function for reuse:

import csv

def iterate_rows(path):
    # yield stripped rows from a tab-delimited file, skipping comment /
    # metadata lines; the first data row fixes the expected column count
    expected_col_count = None
    with open(path) as fd:
        csv_reader = csv.reader(fd, delimiter='\t')
        for line_no, row in enumerate(csv_reader, start=1):
            if len(row) <= 1 or row[0].startswith('#'):
                # comment / metadata
                continue
            cols = [c.strip() for c in row]
            if expected_col_count is None:
                expected_col_count = len(cols)
            elif len(cols) != expected_col_count:
                raise RuntimeError(
                    f"{path} line {line_no}: expected {expected_col_count} cols, found {len(cols)}"
                )
            yield cols

I just wrote this out without testing, but it should have the general idea.

ialarmedalien (Collaborator, Author)

Yeah, I may as well make the separator also configurable/guessable from info in the manifest file.
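That might look something like this at the call site, assuming iterate_rows grows a delimiter parameter and the manifest entry carries a (hypothetical) 'separator' key:

delimiter = manifest_entry.get('separator', '\t')
for cols in iterate_rows(node_file_path, delimiter=delimiter):
    process_row(cols)  # process_row is a stand-in for the existing loader logic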
