
Reorganising cluster data #144

Open · ialarmedalien wants to merge 4 commits into develop

Conversation

ialarmedalien (Collaborator)

Reorganises cluster data into a single field containing an array of clusters in the form <cluster_name>:<cluster_id>. The current set of clusters has been renamed from 'cluster_i2' to 'markov_i2', as they were created using Markov clustering with inflation set to 2.
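For illustration, a minimal sketch of what this looks like on a node document; the 'clusters' field name and the cluster id are assumptions, and only the <cluster_name>:<cluster_id> string format comes from this PR:

# old layout: one field per clustering run
node_before = {'_key': 'AT1G01010', 'cluster_i2': 5}

# new layout: a single field holding an array of '<cluster_name>:<cluster_id>' strings
node_after = {'_key': 'AT1G01010', 'clusters': ['markov_i2:5']}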

Comment on lines +4 to +6
docker-compose down
docker-compose run spec sh /app/test/run_tests.sh
docker-compose down
ialarmedalien (Collaborator, Author)

Make sure that any docker containers hanging around don't accidentally contaminate test runs.

Comment on lines +1 to +2
# prefix: markov_i2
# title: Markov clustering, inflation = 2
ialarmedalien (Collaborator, Author)

I'll formalise this with the Jacobson group some time in the coming week or two. Right now, the cluster prefix is hardcoded in the parser, but ideally I'd get this information from the cluster file or from a metadata file in the clusters directory.
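As a sketch of the cluster-file option, assuming the metadata stays in '# key: value' header comments like the two lines above (the function name is hypothetical):

def read_cluster_metadata(path):
    # collect '# key: value' comment lines from the top of a cluster file,
    # e.g. {'prefix': 'markov_i2', 'title': 'Markov clustering, inflation = 2'}
    meta = {}
    with open(path) as fd:
        for line in fd:
            if not line.startswith('#'):
                break
            key, sep, value = line.lstrip('#').partition(':')
            if sep:
                meta[key.strip()] = value.strip()
    return meta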

Comment on lines -121 to -129
def test_fetch_phenotypes_no_results(self):

    resp = self.submit_query('djornl_fetch_phenotypes', {
        # gene node
        "keys": ["AT1G01010"],
    })
    self.assertEqual(resp['results'][0], self.no_results)


ialarmedalien (Collaborator, Author)

Combined the tests for fetch calls with no results into the tests for calls with results.
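For reference, a hypothetical shape for the combined test; the phenotype key and the self.phenotype_results fixture are assumptions, not taken from the actual test file:

def test_fetch_phenotypes(self):
    # each case pairs query params with the expected first result,
    # covering both no-hit and hit queries in one test
    cases = [
        ({"keys": ["AT1G01010"]}, self.no_results),            # gene key: no phenotype hits
        ({"keys": ["some_phenotype_key"]}, self.phenotype_results),  # expected hits
    ]
    for params, expected in cases:
        resp = self.submit_query('djornl_fetch_phenotypes', params)
        self.assertEqual(resp['results'][0], expected)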


@unittest.skip('This test is disabled until automated view loading is possible')
ialarmedalien (Collaborator, Author)

Enabled this test.

jayrbolton (Contributor)

Looks good, only had one comment above.

New commits:

…es that make up the data, plus code to validate the manifest.
Created manifests for all test files and updated tests accordingly
Added djornl data source (github repo)
Comment on lines +8 to +20
"oneOf": [
{
"properties": {
"data_type": { "enum": ["cluster"] }
},
"required": [ "prefix" ]
},
{
"properties": {
"data_type": { "enum": [ "node", "edge" ] }
}
}
],
ialarmedalien (Collaborator, Author)

If the data type is "cluster", there must be a "prefix" field, which dictates the cluster label.
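A quick illustration of how this oneOf rule behaves, using the jsonschema package (the manifest entries below are made up):

from jsonschema import ValidationError, validate

schema = {
    "oneOf": [
        {
            "properties": {"data_type": {"enum": ["cluster"]}},
            "required": ["prefix"],
        },
        {
            "properties": {"data_type": {"enum": ["node", "edge"]}},
        },
    ]
}

validate({"data_type": "cluster", "prefix": "markov_i2"}, schema)  # passes
validate({"data_type": "node"}, schema)                            # passes
try:
    validate({"data_type": "cluster"}, schema)  # no prefix: matches neither branch
except ValidationError:
    print("cluster entries must supply a prefix")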

lgtm-com bot commented Jul 20, 2020

This pull request introduces 4 alerts when merging 93205a2 into 53eecfe - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Module is imported more than once
  • 1 for Unused local variable

Comment on lines +291 to +311
def check_deltas(self, edge_data={}, node_metadata={}, cluster_data={}):

    edge_nodes = set([e['_key'] for e in edge_data['nodes']])
    node_metadata_nodes = set([e['_key'] for e in node_metadata['nodes']])
    cluster_nodes = set([e['_key'] for e in cluster_data['nodes']])
    all_nodes = edge_nodes.union(node_metadata_nodes).union(cluster_nodes)

    # check all nodes in cluster_data have node_metadata
    clstr_no_node_md_set = cluster_nodes.difference(node_metadata_nodes)
    if clstr_no_node_md_set:
        print({'clusters with no node metadata': clstr_no_node_md_set})

    # check all nodes in the edge_data have node_metadata
    edge_no_node_md_set = edge_nodes.difference(node_metadata_nodes)
    if edge_no_node_md_set:
        print({'edges with no node metadata': edge_no_node_md_set})

    # count all edges
    print("Dataset contains " + str(len(edge_data['edges'])) + " edges")
    # count all nodes
    print("Dataset contains " + str(len(all_nodes)) + " nodes")
ialarmedalien (Collaborator, Author)

Some basic sanity checks on the parsed data, plus a very high-level overview of what is being added.
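For context, a hypothetical call, matching how the code above reads its arguments (each dict needs a 'nodes' list, and edge_data an 'edges' list; the keys are illustrative):

self.check_deltas(
    edge_data={'nodes': [{'_key': 'AT1G01010'}], 'edges': [{'_from': 'AT1G01010', '_to': 'AT1G01030'}]},
    node_metadata={'nodes': [{'_key': 'AT1G01010'}]},
    cluster_data={'nodes': [{'_key': 'AT1G01030'}]},  # reported: no node metadata
)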

# self.json_data[query][primary_param][distance_param]
# if primary_param is an array, join the array entities with "__"
# self.json_data[query_name][param_name][param_value]["distance"][distance_param]
# e.g. for fetch_clusters data:
ialarmedalien (Collaborator, Author)

Reorganised the results data, so retrieving the appropriate results has changed further down in this file.
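Concretely, a lookup under the new layout would read something like this; the specific keys and the 'cluster_ids' param name are illustrative, with array params joined with "__" as noted above:

expected = self.json_data['fetch_clusters']['cluster_ids']['markov_i2:5__markov_i2:6']['distance']['1']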

jayrbolton (Contributor) left a comment

I had one comment; otherwise it is looking good.

        'user_notes': cols[19],
    }
    nodes.append(doc)
headers = []
jayrbolton (Contributor) commented Jul 21, 2020

Since there is a bunch of repeated parsing code between the node and edge loaders, we could have a generator function for reuse:

import csv

def iterate_rows(path):
    # yield stripped rows from a tab-delimited file, skipping comment /
    # metadata lines; the first data row fixes the expected column count
    expected_col_count = None
    with open(path) as fd:
        csv_reader = csv.reader(fd, delimiter='\t')
        for line_no, row in enumerate(csv_reader, start=1):
            if len(row) <= 1 or row[0].startswith('#'):
                # comment / metadata
                continue
            cols = [c.strip() for c in row]
            if expected_col_count is None:
                expected_col_count = len(cols)
            elif len(cols) != expected_col_count:
                raise RuntimeError(
                    f"{path} line {line_no}: expected {expected_col_count} cols, found {len(cols)}"
                )
            yield cols

I just wrote this out without testing, but it should have the general idea.

ialarmedalien (Collaborator, Author)

Yeah, I may as well make the separator also configurable/guessable from info in the manifest file.
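That might look something like this at the call site, assuming iterate_rows grows a delimiter parameter and the manifest entry carries a (hypothetical) 'separator' key:

delimiter = manifest_entry.get('separator', '\t')
for cols in iterate_rows(node_file_path, delimiter=delimiter):
    process_row(cols)  # process_row is a stand-in for the existing loader logic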
