Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GitHub Action to build and upload Conda package whenever a new release is made #67

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

matthewwiese
Copy link
Member

@matthewwiese matthewwiese commented Dec 8, 2023

Description

This branch adds an additional GitHub Action that is triggered whenever a release is made, which in our case is done by the other Action in this repository.

It uses a repository secret, CONDA_TOKEN, generated for the maize-genetics Anaconda group. Its scope is limited to API read and write access.

The existing action for building a release is modified to update two new repository variables, PHG2_VERSION and PHG2_VERSION_MD5. These are used in the Conda build process to pull the most recent release (that which triggers this action).

This will allow us to keep the Conda package consistent with the newest GitHub release without the need to manually build and upload each time. It is particularly helpful if there is a new version that fixes some bug - this way you don't have to download and extract a tarball to update, simply doing conda update phg2 instead to retrieve the latest version.

Type of change

What type of changes does your code introduce? Put an x in boxes that apply.

  • CHANGE (fix or feature that would cause existing functionality to not work as expected)
  • FEATURE (non-breaking change which adds functionality)
  • BUGFIX (non-breaking change which fixes an issue)
  • ENHANCEMENT (non-breaking change which improves existing functionality)
  • NONE (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Checklist:

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated relevant documentation

Changelog entry

Please add a one-line changelog entry below. This will be copied to the changelog file during the release process.

GitHub Action to build and upload Conda package whenever a new release is made

@matthewwiese matthewwiese changed the title Do not merge this! Testing Actions for automatic Conda build + upload GitHub Action to build and upload Conda package whenever a new release is made Dec 8, 2023
conda/meta.yaml Outdated Show resolved Hide resolved
uses: conda-incubator/setup-miniconda@v3
with:
auto-update-conda: true
python-version: 3.11
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to use python 3.11 versus something newer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newest version is 3.12 (3.13 is still prerelease) so we are pretty "bleeding edge" here. The Python version doesn't really matter all that much for this anyway, I simply included it so that it's explicit in case it breaks in the future.

Hugging Face uses 3.8 in their v2 usage. If you go to the setup-miniconda repo their examples include a variety of versions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake/ I was thinking this was 2.11, not 3.11. Ignore my comment/

with:
name: 'PHG2_VERSION'
value: '${{ env.VERSION }}'
token: ${{secrets.PHGV2CD}}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize "secrets.<>" is used in existing code. Where are the secrets values stored (ie what file is accessed?). Do we have these created for maizegenetics.net, or is this per project?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind - I see from your slack posting this was generated for maize-genetics

PHG2_VERSION_MD5: ${{ vars.PHG2_VERSION_MD5 }}

jobs:
build-upload-conda:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will you update the README.md installation section with information on how to install this? Brandon is writing the detailed documentation, but individually we are updating the simple usage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure can do!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know what you think.

We can add more to the Quick Start in future - I wasn't sure how much we wanted included since stuff is still in flux.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my concern: If they pull the apps into the base conda environment, then the programs are seen via ProcessBuilder(). I think this is what you originally tested. I did not understand if this worked when our Clikt commands precede the ProcessBuilder() commands with a conda environment setting. Do all environments inherit what is in the base environment? if yes, then this was ok.

But I think we agreed we don't want to encourage adding to the base environment.

If users create a new conda environment and load into that this will not work with the ProcessBuilder() commands unless the environment is one that we know. So either the user should always create a conda environment named phgv2-conda , or they will have to pass a new paramter with the conda environment name to every class that has a ProcessBuilder call.

let me know if I'm missing something here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see here for a response - I wanted to make sure everybody was on the same page and understood the discussion from this morning.

@matthewwiese
Copy link
Member Author

@lynnjo @zrm22 @aberthel @btmonier @pjbradbury @tcasstevens

For everybody's benefit and to clear up any confusion, I want to respond to Lynn's question from here in a more visible way so that we are all on the same page.

If you look in the source, the ProcessBuilder commands follow this form:

conda run -n phgv2-conda agc

The above runs the given program (in this case agc) within the Conda environment specified by the -n parameter, in this case the phgv2-conda environment that is created by phg2 setup-environment. This allows you to run any software installed in a given environment without having to be in the environment itself. Please refer to the conda run docs here.

From a system with a brand new Conda installation, issue the following commands:

conda create --name my_env --channel conda-forge --channel maize-genetics phg2
conda activate my_env
phg2 setup-environment
conda run -n phgv2-conda anchorwave --help

You should be greeted with the familiar AnchorWave help info. Furthermore, we can confirm the behavior I describe via:

conda run -n phgv2-conda conda list

Which will run Conda within the phgv2-conda environment, listing all packages installed into said environment (all without having to be in the environment itself). The first line output should look something like the following depending on your OS:

# packages in environment at /home/matt/miniconda3/envs/phgv2-conda:

What I was trying to describe in Programmer Meeting is that if the dependencies are a part of the package installation itself, we wouldn't have to rely on a bespoke phgv2-conda environment. ProcessBuilder would find the software installed alongside PHG2, whether that be in the base environment or a user's own custom environment.

One might argue that users of the plain tarball application will no longer have access to the supplied dependencies, and to that I say: it's not a problem! For example, AnchorWave doesn't install minimap2 for you despite depending on it. We ought to provide a singular "blessed" installation method (Conda) while also providing a generic release (the tarballs) for advanced users. If somebody has little technical or bioinformatics knowledge, we direct them to installing the package via Conda, which will manage the dependencies for them and allow for a straightforward upgrade path. Advanced users can install the tarball and add its executable to their PATH if they wish - such users ought to be expected to have the requisite knowledge of downloading/building from source the other dependencies that we require.

Additionally, there are engineering concerns that I didn't have the time to explain at the meeting. Let's say a bug is discovered in one of our dependencies included in the phgv2-conda environment which requires an update to some latest version - existing users will have to manually delete and recreate the phgv2-conda environment, or enter it and update the problematic package themselves.

We should be using our tools/dependencies how they were designed to be used, doing it any other way will inevitably lead to bugs and breakage, which in a small lab is particularly harmful due to everybody's limits on time. Much like the unit and integration tests, thoughtful design and engineering will ultimately save us time and lead to better science.


```
conda update phg2
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on these changes, the documentation indicates the user can only execute phg from the conda setup (the original ./phg commands have been removed).

Did we decide Conda is an option, or did we decide conda is the only option for installing phg?

If the latter, the instructions in the PHGv2- Building and Loading have an inconsistency. Is our program named "phg" or "phg2" ? (we need to make a collective decision)

It looks like we'll have 2 environments - one that contains the phg(2) executable, and one that is created by phg(2) to run agc, tiledbvcf, etc. Is that correct?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I'm not opposed to loading phg only via conda. I just want to ensure our documentation has consistent examples and that we're all on the same page.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we decide Conda is an option, or did we decide conda is the only option for installing phg?

I don't think we decided anything, this is a question for @zrm22 or the group, same as the naming question.

It looks like we'll have 2 environments - one that contains the phg(2) executable, and one that is created by phg(2) to run agc, tiledbvcf, etc. Is that correct?

There is only a single environment, phgv2-conda. The other environment containing the actual program (the Conda package) isn't really relevant as I wrote in my long comment above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now I would prefer explaining both the conda install and tarball installation instructions.

I agree that long term if the user installs through conda we should only have one environment(no more need for SetupEnvironment) with all the dependencies, but I think it gets tricky as every time we make a call to anchorwave, tileDB, bgzip or bcftools we need to wrap the command that ProcessBuilder executes within a conda run command where we need to specify the environment name.

The way we have this setup its consistent as we give the environment a name. If the user needs to create their environment and then just add the PHG conda env to it, we no longer have this control.

To fix either we need to have Processbuilder first figure out what envs are there and try to determine which one is the PHG one(what happens if you have more than 1?) or we have the user submit the name as a param(this would need to basically be in every command). We may be able to have the user create an Environment variable that gets picked up by Clikt automatically, but it is hard to say.

@@ -1,4 +1,7 @@
# PHG version 2
> [!TIP]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this tip under the badges.

@@ -27,56 +54,3 @@ The redesign leverages the powerful TileDB-VCF database, which is widely used in
Composite Reference Haplotypes

More information on terminology can be found [here](docs/terminology.md).

# Example usage
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These need to be brought back in. We are making an effort so the user does not need to jump through 3-4 pages just to get basic information to run the software. Obviously this is not all the documentation, but having the 10-15 commands that need to be run at the main page of the Repo is a good idea for what we are trying to do here.

Copy link
Collaborator

@zrm22 zrm22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to have the example usage section brought back into the README. We are trying to show how simple it is to run through the pipeline and in my opinion the best way to do that is to show the full basic pipeline at the homepage of the repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants