-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Abstract core algorithm in to C library #96
Comments
The branch code_algo_in_c contains the beginnings of this. Commit 32c1ed6 contains first attempt at the abstraction problem. Currently I've tried to remove any refference to BAM objects from the class ReadClusterer, which this commit renames UMIClusterer. Read specifics are now delt with by the new ReadClusterer, which basically calls UMIClusterer and applys the results to the current bundle. Currently the 'UMIClusterer' if of the format:
Two problems with this so far
I feel like the ideal call signature would be: But this would loose too much infomation to be able to calculate stats. @TomSmithCGAT opinions? |
Hi @IanSudbery. This is a great idea. I'm very much on board! Regarding stats, as I see it the issue only relates to collecting the 'further stats' which provide information on the topology of the networks. Essentially this was added to dedup to generate statistics for the publication and I don't see it as a core functionality of dedup. It was implemented in a 'least hassle' manner without any real thought to maintaining the code long term. Personally, I'm happy to to drop the topology stats entirely. However, if we did want to keep this option, I think we could retain it quite more easily as shown below. As you point out, ReadClusterer now returns different objects depending on the UMI-tools command: list of umis (dedup) or groups of umis (group). This would be easier if the ReadClusterer was doing less of the work. Right now it's making the adjacency list, building the network and then either getting the groups or identifying the best read. One of the issues I had implementing the group command was making sure the groups were exactly the same as the theoretical groups from dedup although these are never determined. Perhaps we should always explicitly determine the groups first regardless and only then identify the top read? I'm not sure whether the following would make the C transcoding easier or harder though? networks = UMINetworker(counts, threshold) # C function
# If we retain further stats
if options.further_stats():
for network in networks:
topology_stats.update(summarise_toplogy(network)) # Undetermined method to update stats
for network in networks:
groups = getGroups(network, options.method) # C function
if not deduplicate: # if running group command write out and move on
write_groups... # write out groups
continue
best_read = GetBestRead(group) # else write out the best read
write_read... # write read
|
Hi Tom, I'd be happy to drop the further stats. I don't know if anyone actaully uses them. As for the seperating out the steps ... I thought about this, but I would like users to be able to use the library without really understanding what is going on under the hood. So I would like everything to happen in one step ideally. Of course, I can just about imagine a scenario where one command is run to select the groups and another to choose the best read from the group. Trouble is the two are connected by the method, and so you have to understand the methods(?) to use it properly. |
Yeah, let's drop it then. This will of course cause a minor backwards compatibility issue if anyone's using the option in a pipeline etc but I doubt anyone is using it either. We can always return to this if required. Each group contains a set of unique reads/UMIs (for a given location). If the C library function took UMI counts as input and returned ordered groups as output, the top read would just be group[0]. I think this would be a reasonable output and would ensure the outputs of dedup and group are compatible (e.g one could derive the dedup output from the group output). |
Is that the case irrespective of the method? |
Yup. Any one read should only be assigned to a single group and the read with the highest counts is always considered the 'best' read. |
Not true. Your group directional assumes that nodes in a cluster are ordered by count, but _get_connected_components_adjacency does not guarantee this. This may also be why the recursive search and the breadth first search return different things. |
Umm...that's a concern! I'll have a look into this later. Thanks! |
If I insert a sort statement in _group_directional: def _group_directional(self, clusters, adj_list, counts):
''' return groups for directional method'''
observed = set()
groups = []
for cluster in clusters:
if len(cluster) == 1:
groups.append(list(cluster))
observed.update(cluster)
else:
cluster = sorted(cluster, key = lambda x: counts[x], # <--- NEW SORT HERE
reverse = True)
# need to remove any node which has already been observed
temp_cluster = []
for node in cluster:
if node not in observed:
temp_cluster.append(node)
observed.add(node)
groups.append(temp_cluster)
return groups then |
And, with the above change, enabling the recursive search now gives the same result. |
Yes, just came to the same conclusion myself. Thanks for spotting this |
Ah, brilliant! |
Do you want me to patch the sort into the master? |
Yes please! Checking back on the previous issue regarding the recursive search (#69), I was getting a different result for dedup with directional with a high depth sample (not covered by tests) so I suspect this is still the case given the code for dedup isn't affected by this patch? |
Okay, lastest push (see f48e99d) contains the fully abstracted code as we discussed. I've also reimplemented group for adjacency to make it quicker as it initially doubled the run time. This has the benefit of making the code more compact. I've tested against the nosetests, and a couple of full size files (directional and adjacency). Things are the same up to a sort order. I've also created a simple unit test of the new abstracted code, based on figure 1 of the paper for development purposes. |
Great. So we're ready to try out a C library for the networks. Do you want me to go through and remove the further stats option and documentation - This will fail currently if the --further-stats options is supplied though I guess that's not a problem on this branch? |
In order to allow wider adoption of the UMI-tools deduplication methods beyond simple position + UMI based deduping we would like to create a C library that implemented the core algos.
As much as possible we would like to the abstracted C calls to take a list of UMIs and their counts, and return either a list of deduplicated UMIs or grouped UMIs according to the specified method choice.
The first step to this is create a python call that does this, abstracting it away from implementation details and references to the specific use here with BAM formatted reads.
We will then develop a set of unit test level test data to allow development of a C replacement that should give the same output as the python call.
Next an optimisation dataset will be used to optimise the C calls to be at least as efficient as the pyton call.
Test against a broad range of data
Develop ways for easy deployment of the C library, both alongside UMI-Tools and as a standalone that can be used by other tools for tasks that are related, but outside the core remit of UMI-Tools.
The text was updated successfully, but these errors were encountered: