Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ac refactor #122

Merged
merged 14 commits into from
Apr 1, 2024
Merged

Ac refactor #122

merged 14 commits into from
Apr 1, 2024

Conversation

Acribbs
Copy link
Contributor

@Acribbs Acribbs commented Apr 1, 2024

No description provided.

@Acribbs
Copy link
Contributor Author

Acribbs commented Apr 1, 2024

@IanSudbery @sebastian-luna-valero @david-cgat @AndreasHeger @jscaber and everyone.. I realised today that pandas and numpy has features deprecated that caused our bam2bam to fail (plus other scripts). While fixing these issues I realised how bloated the cgat-apps is and have made an effort to reduce some code (Starting with scripts) so it can be better maintained. Im going to merge but please take a look at scripts that I have removed and if there are any that you think should be returned then let me know and I can revert some commits.

@Acribbs Acribbs merged commit a4038f0 into master Apr 1, 2024
8 checks passed
@Acribbs Acribbs deleted the AC-refactor branch April 1, 2024 15:18
@IanSudbery
Copy link
Contributor

I'm not against retiring old, unused scripts, but I'm not sure just pulling them without any process or warning is a good way forwards.

I suggest we instigate some sort of life cycle policy. Mark tools with a deprecation warning some period before they are removed. We should probably also leave a stub in that says the tool has been removed when called, rather than just throwing an error.

I can't speak to all the things removed here, but I know that we use bam2peakshape, which is useful for viewing any continuous signal over discrete regions. We've used it with ATAC-seq, TT-seq, mNET-seq just off the top of my head. Its also the example that we use in the paper to demonstrate the capabilities of the tool set.

I'd also like to speak up for combine_tables, which I've used a lot in the past. I wouldn't assume that pandas is more efficient, as pandas is often very inefficient when it comes to memory usage. I also believe that in --cat mode (in in how it is/was used in P.concatenateAndLoad), there is no reason why combine_tables shouldn't use effectively zero memory. The final advantage of these sorts of things, is while it might be true that there are now good ways of achieving the same thing in python (or R), sometimes you just want something you can throw in a bash statement without writing a new python script.

That said, I don't know how much anything is used these days, as I personally don't do much actual coding, and people in my lab find discovery a problem (where as most of this stuff is just in my head).

@Acribbs
Copy link
Contributor Author

Acribbs commented Apr 1, 2024

Happy to re-introduce anything people find useful, I just dont have too much time to fix the updates that come from maintaining all the code now (Although I still use cgat-apps a lot), so reducing as much as possible would be best. It took a while to get the code up to standard to support python >3.9 today, lots of fixes related to updates in numpy and pandas.

The reason for removing the bam2peakshape was because I thought that it was more ChIP-seq pipeline specific, but will revert as I wasn't sure if its used generally.

Regarding P.concatenateAndLoad, this should be standalone code now in cgat-core and it should be completely independent of cgat-apps, should be called P.concatenate_and_load now I think.

@Acribbs
Copy link
Contributor Author

Acribbs commented Apr 1, 2024

The worrying thing was that we have a python >3.8 build on conda and it is likely broken because the code hasnt been maintained for a while. It was only today that I decided to give later versions of python (I usually pin to py 3.8) a try and found them broken. Then I went down a rabbit hole of trying to fix everything.

@IanSudbery
Copy link
Contributor

I completely get where you are coming from! I'd love to say that I would put effort into maintaining them, but I know I'd never follow though.

And I am sure there is lots of deadwood in there.

I just feel like we should have some sort of defined process for removing things that are no longer used by anyone.

When you say broken, do you mean they error, or that they no longer pass the regession tests?

@Acribbs
Copy link
Contributor Author

Acribbs commented Apr 4, 2024

Ok see your point, from now on for scripts I can initiate a deprecation cycle.

For scripts that you think are useful that I removed these can easily be reversed as i made sure i made one commit for each script removal. Or if you have pipelines that break, use a previous pin to conda for time being.

The code was broken for any python version >3.8 because of changes to pandas and numpy libs. We only tested on github actions up to py3.8 so none of this was caught (I now test up to py10, but think we should probably build and test for up to py3.12 too). bioconda had a build for python 3.10 released for our old code and would break if used. The tests on bioconda are basic so this needs to be picked up in our testing. Thanksfully, bioconda blacklisted cgat-apps anyway because the builds were consuming too much RAM, because of lots of high dependancies. I did a code review of the other bits yesterday and have realised that there is so much code that just does nothing. I suspect that if I dont clean that up the code then we will get blacklisted again and its a pain to manually keep releasing cgat-apps recipe.

@Acribbs
Copy link
Contributor Author

Acribbs commented Apr 4, 2024

Example deprecation warning given in #124, this ok?

@IanSudbery
Copy link
Contributor

I think the deprecation messages are fine. I would roll a release before the deprecation goes ahead. Kind of a "this is the last release that will contain these functions".

We should probably keep a table of what was deprecated and when.

When we do the deprecation, we should probably replace the removed scripts/code with stubs that just print a deprecation message that then raise an error (your warnings could be replaced with errors for e.g., and the code itself just replaced with pass), for, say, a year.

We definately don't want to be black listed on bioconda, so we need to do what it takes for that not to happen.

What do you mean by "does nothing"?

You mean functions that are not called elsewhere in cgat-apps or in the cgat-flow, or functions that literally just return their parameters unalter etc?

@Acribbs
Copy link
Contributor Author

Acribbs commented Apr 4, 2024

Good idea regarding the table and can create one in the repo.
Going to work on the code changes now and see if I can remove some dependancies. already found a few such as futures, jinja2 and six.

Some code such as

'''CBioPortal.py - Interface with the Sloan-Kettering cBioPortal webservice
, , all completely redundant and the last one hasnt had a major update since 11 years ago. Its code that isn't used anywhere and doesnt seem to have any useful general functionality.

There are other bits of code that I am unsure if are useful, like this:

class MemeMotif:
. You updated last in 2020 so I assume its used in some capacity, but doesnt seems to be used in any scripts.

@Acribbs
Copy link
Contributor Author

Acribbs commented Apr 4, 2024

Also various files such as install.sh (broken anyway), PKG-INFO, COPYING (not needed as have license) seem to be redundant. And then there is the documentation, which is a whole other story of messiness which would take a while to fix, the latest read the docs is broken so would need to fix that first, or host them on github.

@IanSudbery
Copy link
Contributor

IanSudbery commented Apr 4, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants