-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ac refactor #122
Ac refactor #122
Conversation
…ly lies in a pipeline workflow rather than generic script
…quently. Aiming to reduce code so that maintenance is reduced to core functionality
…fficiently using pandas
@IanSudbery @sebastian-luna-valero @david-cgat @AndreasHeger @jscaber and everyone.. I realised today that pandas and numpy has features deprecated that caused our bam2bam to fail (plus other scripts). While fixing these issues I realised how bloated the cgat-apps is and have made an effort to reduce some code (Starting with scripts) so it can be better maintained. Im going to merge but please take a look at scripts that I have removed and if there are any that you think should be returned then let me know and I can revert some commits. |
I'm not against retiring old, unused scripts, but I'm not sure just pulling them without any process or warning is a good way forwards. I suggest we instigate some sort of life cycle policy. Mark tools with a deprecation warning some period before they are removed. We should probably also leave a stub in that says the tool has been removed when called, rather than just throwing an error. I can't speak to all the things removed here, but I know that we use bam2peakshape, which is useful for viewing any continuous signal over discrete regions. We've used it with ATAC-seq, TT-seq, mNET-seq just off the top of my head. Its also the example that we use in the paper to demonstrate the capabilities of the tool set. I'd also like to speak up for combine_tables, which I've used a lot in the past. I wouldn't assume that pandas is more efficient, as pandas is often very inefficient when it comes to memory usage. I also believe that in That said, I don't know how much anything is used these days, as I personally don't do much actual coding, and people in my lab find discovery a problem (where as most of this stuff is just in my head). |
Happy to re-introduce anything people find useful, I just dont have too much time to fix the updates that come from maintaining all the code now (Although I still use cgat-apps a lot), so reducing as much as possible would be best. It took a while to get the code up to standard to support python >3.9 today, lots of fixes related to updates in numpy and pandas. The reason for removing the bam2peakshape was because I thought that it was more ChIP-seq pipeline specific, but will revert as I wasn't sure if its used generally. Regarding |
The worrying thing was that we have a python >3.8 build on conda and it is likely broken because the code hasnt been maintained for a while. It was only today that I decided to give later versions of python (I usually pin to py 3.8) a try and found them broken. Then I went down a rabbit hole of trying to fix everything. |
I completely get where you are coming from! I'd love to say that I would put effort into maintaining them, but I know I'd never follow though. And I am sure there is lots of deadwood in there. I just feel like we should have some sort of defined process for removing things that are no longer used by anyone. When you say broken, do you mean they error, or that they no longer pass the regession tests? |
Ok see your point, from now on for scripts I can initiate a deprecation cycle. For scripts that you think are useful that I removed these can easily be reversed as i made sure i made one commit for each script removal. Or if you have pipelines that break, use a previous pin to conda for time being. The code was broken for any python version >3.8 because of changes to pandas and numpy libs. We only tested on github actions up to py3.8 so none of this was caught (I now test up to py10, but think we should probably build and test for up to py3.12 too). bioconda had a build for python 3.10 released for our old code and would break if used. The tests on bioconda are basic so this needs to be picked up in our testing. Thanksfully, bioconda blacklisted cgat-apps anyway because the builds were consuming too much RAM, because of lots of high dependancies. I did a code review of the other bits yesterday and have realised that there is so much code that just does nothing. I suspect that if I dont clean that up the code then we will get blacklisted again and its a pain to manually keep releasing cgat-apps recipe. |
Example deprecation warning given in #124, this ok? |
I think the deprecation messages are fine. I would roll a release before the deprecation goes ahead. Kind of a "this is the last release that will contain these functions". We should probably keep a table of what was deprecated and when. When we do the deprecation, we should probably replace the removed scripts/code with stubs that just print a deprecation message that then raise an error (your warnings could be replaced with errors for e.g., and the code itself just replaced with pass), for, say, a year. We definately don't want to be black listed on bioconda, so we need to do what it takes for that not to happen. What do you mean by "does nothing"? You mean functions that are not called elsewhere in cgat-apps or in the cgat-flow, or functions that literally just return their parameters unalter etc? |
Good idea regarding the table and can create one in the repo. Some code such as Line 1 in c931a5d
Line 1 in c931a5d
Line 1 in c931a5d
There are other bits of code that I am unsure if are useful, like this: Line 4 in c931a5d
|
Also various files such as install.sh (broken anyway), PKG-INFO, COPYING (not needed as have license) seem to be redundant. And then there is the documentation, which is a whole other story of messiness which would take a while to fix, the latest read the docs is broken so would need to fix that first, or host them on github. |
Yes, I think it depends how you use cgat-apps.
For me personally, I always used cgat-apps as a programming API as well as
a collection of scripts. I probably import modules from cgat more often
than I run the scripts (not that I do either much these days) - the GTF and
Bed modules in particular are absolutely gold, and remarkably provide
functionality I'm not aware of existing anywhere else.
Some of these things (like CBioportal and MEME) could conceivably be
standalone packages - they are not in cgat-apps because they are used in
either the scripts or the pipelines (although I think MEME is in the motifs
pipeline, if that still exists?), but because they provide functionality to
be used externally in their own right (We've used it in several pipelines).
The other two I'm not sure about, but I suspect the story is similar.
IGV.py I'm pretty sure was used in one of Andreas' original project
pipelines - back in the day when "cgat" was just a directory on /ifs where
everyone put all their code, and was included in cgat-apps because, well,
someone might find it useful.
You are right about the documentation, and this is a real barrier to anyone
who doesn't know the code intimately from using cgat-apps. I'm pretty sure
that those two modules above are entirely undocumented, so there is not
much point in them existing (except that I still remember they exist).
|
No description provided.