Ac refactor #122

Acribbs · 2024-04-01T14:15:54Z

No description provided.


Acribbs · 2024-04-01T15:15:11Z

@IanSudbery @sebastian-luna-valero @david-cgat @AndreasHeger @jscaber and everyone: I realised today that pandas and numpy have deprecated features that caused our bam2bam to fail (plus other scripts). While fixing these issues I realised how bloated cgat-apps is and have made an effort to reduce some code (starting with scripts) so it can be better maintained. I'm going to merge, but please take a look at the scripts that I have removed; if there are any that you think should be returned, let me know and I can revert some commits.

IanSudbery · 2024-04-01T21:34:21Z

I'm not against retiring old, unused scripts, but I'm not sure just pulling them without any process or warning is a good way forwards.

I suggest we instigate some sort of life cycle policy. Mark tools with a deprecation warning some period before they are removed. We should probably also leave a stub in that says the tool has been removed when called, rather than just throwing an error.

I can't speak to all the things removed here, but I know that we use bam2peakshape, which is useful for viewing any continuous signal over discrete regions. We've used it with ATAC-seq, TT-seq, mNET-seq just off the top of my head. It's also the example that we use in the paper to demonstrate the capabilities of the tool set.

I'd also like to speak up for combine_tables, which I've used a lot in the past. I wouldn't assume that pandas is more efficient, as pandas is often very inefficient when it comes to memory usage. I also believe that in --cat mode (as in how it is/was used in P.concatenateAndLoad), there is no reason why combine_tables shouldn't use effectively zero memory. The final advantage of these sorts of tools is that, while it might be true that there are now good ways of achieving the same thing in python (or R), sometimes you just want something you can throw in a bash statement without writing a new python script.

That said, I don't know how much anything is used these days, as I personally don't do much actual coding, and people in my lab find discovery a problem (whereas most of this stuff is just in my head).
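To illustrate the point about --cat mode needing almost no memory, here is a minimal sketch of streaming concatenation using only the standard library. It is not the combine_tables implementation: the function name `stream_concat`, the `(label, handle)` input shape and the `track` column name are assumptions made for illustration. Each input is read one line at a time, a label column is prepended, and only the first header is kept.

```python
import io
import sys
from typing import Iterable, TextIO, Tuple


def stream_concat(inputs: Iterable[Tuple[str, TextIO]],
                  out: TextIO = sys.stdout,
                  cat_column: str = "track") -> None:
    """Concatenate tab-separated tables row by row.

    Hypothetical helper, not part of cgat-apps. Only one line is held in
    memory at a time, so memory use does not grow with table size.
    Headers are assumed to match across inputs; this is not checked.
    """
    header_written = False
    for label, handle in inputs:
        header = handle.readline().rstrip("\n")
        if not header:
            continue  # skip empty inputs
        if not header_written:
            # Keep the first header only, with the label column in front.
            out.write(f"{cat_column}\t{header}\n")
            header_written = True
        for line in handle:
            row = line.rstrip("\n")
            if row:  # ignore blank lines
                out.write(f"{label}\t{row}\n")


if __name__ == "__main__":
    a = io.StringIO("gene\tcount\ng1\t1\n")
    b = io.StringIO("gene\tcount\ng2\t5\n")
    stream_concat([("a", a), ("b", b)])
```

The same function works on open file handles, so it could sit behind a small command-line wrapper and be used from a bash statement in the way described above.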

Acribbs · 2024-04-01T21:47:48Z

Happy to re-introduce anything people find useful. I just don't have much time to fix the updates that come from maintaining all the code now (although I still use cgat-apps a lot), so reducing as much as possible would be best. It took a while to get the code up to standard to support python >3.9 today, with lots of fixes related to updates in numpy and pandas.

The reason for removing bam2peakshape was that I thought it was more ChIP-seq pipeline specific, but I will revert as I wasn't sure if it's used generally.

Regarding P.concatenateAndLoad, this should now be standalone code in cgat-core and completely independent of cgat-apps. I think it should be called P.concatenate_and_load now.

Acribbs · 2024-04-01T21:51:22Z

The worrying thing was that we have a python >3.8 build on conda and it is likely broken because the code hasn't been maintained for a while. It was only today that I decided to give later versions of python a try (I usually pin to py 3.8) and found them broken. Then I went down a rabbit hole of trying to fix everything.

IanSudbery · 2024-04-02T10:53:34Z

I completely get where you are coming from! I'd love to say that I would put effort into maintaining them, but I know I'd never follow through.

And I am sure there is lots of deadwood in there.

I just feel like we should have some sort of defined process for removing things that are no longer used by anyone.

When you say broken, do you mean they error, or that they no longer pass the regression tests?

Acribbs · 2024-04-04T07:26:01Z

OK, I see your point. From now on I can initiate a deprecation cycle for scripts.

Scripts I removed that you think are useful can easily be restored, as I made sure to make one commit per script removal. Or, if you have pipelines that break, pin to a previous conda release for the time being.

The code was broken for any python version >3.8 because of changes to the pandas and numpy libraries. We only tested on GitHub Actions up to py3.8, so none of this was caught (I now test up to py3.10, but think we should probably build and test up to py3.12 too). Bioconda had a build for python 3.10 released for our old code, and it would break if used. The tests on bioconda are basic, so this needs to be picked up in our own testing. Thankfully, bioconda blacklisted cgat-apps anyway because the builds were consuming too much RAM, owing to lots of dependencies. I did a code review of the other bits yesterday and realised that there is so much code that just does nothing. I suspect that if I don't clean up the code we will get blacklisted again, and it's a pain to keep manually releasing the cgat-apps recipe.

Acribbs · 2024-04-04T08:30:06Z

Example deprecation warning given in #124, this ok?
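For readers without #124 to hand, a deprecation notice of this kind can be sketched with the standard `warnings` module. This is an illustration only, not the code from #124; the helper name `warn_deprecated`, the message wording and the example arguments are assumptions.

```python
import warnings


def warn_deprecated(script: str, removal_version: str) -> None:
    """Emit a deprecation notice for a script scheduled for removal.

    Hypothetical helper; the name and message are illustrative.
    """
    warnings.warn(
        f"{script} is deprecated and will be removed in version "
        f"{removal_version}. Please open an issue if you still rely on it.",
        category=FutureWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )


if __name__ == "__main__":
    # Placeholder script name and version, not real cgat-apps data.
    warn_deprecated("old_script", "X.Y")
```

`FutureWarning` is used here rather than `DeprecationWarning` because Python's default filters hide `DeprecationWarning` unless it is triggered from code in `__main__`, whereas `FutureWarning` is intended for end users and is shown by default.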

IanSudbery · 2024-04-04T10:29:52Z

I think the deprecation messages are fine. I would roll a release before the deprecation goes ahead. Kind of a "this is the last release that will contain these functions".

We should probably keep a table of what was deprecated and when.

When we do the deprecation, we should probably replace the removed scripts/code with stubs that just print a deprecation message and then raise an error (your warnings could be replaced with errors, for example, and the code itself just replaced with pass), for, say, a year.
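The stub idea, together with the table of what was deprecated and when, can be sketched as follows. Everything here is hypothetical: the `REMOVED` table, its single placeholder entry, the `ScriptRemovedError` class and the `removed_stub` name are assumptions for illustration, not real cgat-apps code or history.

```python
import sys

# Hypothetical record of removed scripts: name -> (last release, suggestion).
# The entry below is a placeholder, not real cgat-apps history.
REMOVED = {
    "old_script": ("X.Y", "pin the last release that shipped it"),
}


class ScriptRemovedError(RuntimeError):
    """Raised when a removed script is invoked."""


def removed_stub(name: str) -> None:
    """Print a removal notice to stderr, then raise an error."""
    last_release, suggestion = REMOVED[name]
    message = (f"{name} has been removed from this package; "
               f"it was last shipped in release {last_release}. "
               f"To keep using it, {suggestion}.")
    print(message, file=sys.stderr)
    raise ScriptRemovedError(message)
```

Keeping the table in one place means the stubs and any human-readable deprecation list can be generated from the same data, and a stub can be deleted once its grace period has passed.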

We definitely don't want to be blacklisted on bioconda, so we need to do what it takes for that not to happen.

What do you mean by "does nothing"?

Do you mean functions that are not called elsewhere in cgat-apps or in cgat-flow, or functions that literally just return their parameters unaltered, etc.?
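One rough way to check the first reading of "does nothing" (functions never called elsewhere) is a static scan with the standard `ast` module. This is a sketch, not a tool used in this PR: `unused_functions` is a hypothetical name, matching is by bare name only, and dynamic dispatch or use from external pipelines is invisible to it, so its output is a list of candidates to review rather than proof.

```python
import ast
from typing import Dict, Set


def unused_functions(sources: Dict[str, str]) -> Set[str]:
    """Return names of functions defined in `sources` but never called.

    `sources` maps a file name to its source text. Hypothetical helper:
    names are compared without regard to module or class, so both false
    positives and false negatives are possible.
    """
    defined: Set[str] = set()
    called: Set[str] = set()
    for filename, text in sources.items():
        for node in ast.walk(ast.parse(text, filename=filename)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined.add(node.name)
            elif isinstance(node, ast.Call):
                func = node.func
                if isinstance(func, ast.Name):
                    called.add(func.id)    # e.g. helper()
                elif isinstance(func, ast.Attribute):
                    called.add(func.attr)  # e.g. module.helper()
    return defined - called


if __name__ == "__main__":
    example = {
        "a.py": "def used():\n    return 1\n\ndef unused():\n    return 2\n",
        "b.py": "import a\nprint(a.used())\n",
    }
    print(sorted(unused_functions(example)))
```

Because cgat-apps is also imported as an API, a function that this scan flags may still be in use outside the repository, which is why the result should feed a deprecation cycle rather than an immediate removal.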

Acribbs · 2024-04-04T12:25:38Z

Good idea regarding the table; I can create one in the repo.
I'm going to work on the code changes now and see if I can remove some dependencies. I've already found a few, such as futures, jinja2 and six.

Some code such as

cgat-apps/cgat/CBioPortal.py, line 1 in c931a5d:
'''CBioPortal.py - Interface with the Sloan-Kettering cBioPortal webservice

cgat-apps/cgat/IGV.py, line 1 in c931a5d:
'''

cgat-apps/cgat/SVGdraw.py, line 1 in c931a5d:
'''

is all completely redundant, and the last one hasn't had a major update in 11 years. It's code that isn't used anywhere and doesn't seem to have any useful general functionality.

There are other bits of code that I am unsure are useful, like this:

cgat-apps/cgat/MEME.py, line 4 in c931a5d:
class MemeMotif:

You last updated it in 2020, so I assume it's used in some capacity, but it doesn't seem to be used in any scripts.

Acribbs · 2024-04-04T12:28:08Z

Also, various files such as install.sh (broken anyway), PKG-INFO and COPYING (not needed as we have a license) seem to be redundant. And then there is the documentation, which is a whole other story of messiness that would take a while to fix. The latest Read the Docs build is broken, so we would need to fix that first, or host the docs on GitHub.

IanSudbery · 2024-04-04T13:56:10Z

Yes, I think it depends how you use cgat-apps. For me personally, I always used cgat-apps as a programming API as well as a collection of scripts. I probably import modules from cgat more often than I run the scripts (not that I do either much these days) - the GTF and Bed modules in particular are absolutely gold, and remarkably provide functionality I'm not aware of existing anywhere else.

Some of these things (like CBioportal and MEME) could conceivably be standalone packages - they are not in cgat-apps because they are used in either the scripts or the pipelines (although I think MEME is in the motifs pipeline, if that still exists?), but because they provide functionality to be used externally in their own right (We've used it in several pipelines). The other two I'm not sure about, but I suspect the story is similar. IGV.py I'm pretty sure was used in one of Andreas' original project pipelines - back in the day when "cgat" was just a directory on /ifs where everyone put all their code, and was included in cgat-apps because, well, someone might find it useful.

You are right about the documentation, and this is a real barrier to anyone who doesn't know the code intimately from using cgat-apps. I'm pretty sure that those two modules above are entirely undocumented, so there is not much point in them existing (except that I still remember they exist).

Acribbs added 14 commits April 1, 2024 14:59

removed bam2libtype as its not core code and probably more pipeline specific

747027a

have removed bam2peakshape because it is chip-seq specific and probably lies in a pipeline workflow rather than generic script

93e6839

removed bam2UniquePairs because it doesnt seem like its used that frequently. Aiming to reduce code so that maintenance is reduced to core functionality

1c4cd8a

removed bed2annotator asit doesnt seem to be used

fb37636

removed bed2plot

34bf56c

removed cat_tables

7af4647

forgot to remove cat_tables and split_files tests

e558df3

removed randomize_lines

5d82c95

removed chain2pasl

fa7ca17

removed csv scripts and combine tables as this can all be done more efficiently using pandas

8c51aba

removed medip as this is better served in pipelines rather than here

0fdc171

removed rna as this is better placed in a pipeline

3707558

removed transfac script as its better placed in a pipeline

aaea0bc

removed cgat_fastq2cDNA as it seems redundant

c3a894a

Acribbs merged commit a4038f0 into master Apr 1, 2024
8 checks passed

Acribbs deleted the AC-refactor branch April 1, 2024 15:18
