Galaxy and datatype management #13

Open
bgruening opened this issue May 31, 2015 · 7 comments

Comments

@bgruening
Member

I would like to discuss the management of datatypes in Galaxy in more detail.

Since the last GCC we integrated a lot of new tools, with a lot of new datatypes. I don't feel comfortable with the current situation of datatype handling and I don't think we have communicated it very well to other tool developers.

The aim of this discussion is to come up with a 'best-practice' guide and put it in https://github.com/galaxy-iuc/standards. I haven't started the discussion as a GitHub issue, as I think this problem is more complicated and a video conference would fit better.

As it stands:

  • You can define your own datatypes and deploy them via the Tool Shed (TS).

    Simple example: https://toolshed.g2.bx.psu.edu/view/devteam/emboss_datatypes

    More complicated example: https://toolshed.g2.bx.psu.edu/view/iuc/molecule_datatypes
  • TS datatypes are currently versioned, but the version is not used in
    Galaxy. Multiple versions can result in conflicts.
  • Datatypes do not have namespaces. The above-mentioned molecule_datatypes are also defined in one other package. This results in conflicts.
  • At some point we agreed to disable versioning of datatypes, as is done for repositories of type "tool_dependency_definition". I don't think this ever happened.
  • We also agreed at some point to only put one datatype in one repository and communicate to our users to only use this one datatype (with one version only). I started this here:
    https://github.com/bgruening/galaxytools/tree/master/datatypes/emboss_datatypes but it was blocking the one-version-one-datatype fix in the TS.
  • Because namespaces are hard to implement in the current Galaxy we tried to consolidate all our datatypes into main Galaxy. We moved all SNPEff datatypes into core, as well as Peptide-Shaker specific datatypes and so on. I'm keen to move molecule_datatypes as well into core.
  • With "Add emboss datatypes [WIP]" (galaxyproject/galaxy#148) it doesn't seem so easy (as expected, but I really appreciate this discussion).
  • We had an idea to add more meta-data to datatypes, which can be set by a tool and can guide other tools. This will have a huge impact on upload data and history import/export. Overall this was considered a hack.
  • Datatypes can have huge dependencies, like biopython or openbabel ... it would be nice to make use of these libraries to improve sniffing (BAM --> samtools)
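
To illustrate the namespace problem described in the list above, here is a minimal sketch, assuming a registry keyed only by file extension. `FlatRegistry`, `DatatypeConflict`, `try_register`, and the repository name `other_owner/other_datatypes` are hypothetical and are not Galaxy's actual API; `mol2` is used only as an example extension.

```python
class DatatypeConflict(Exception):
    """Raised when two sources claim the same extension (hypothetical)."""


class FlatRegistry:
    """Hypothetical registry keyed only by extension: no namespace, no version."""

    def __init__(self):
        self._by_ext = {}

    def register(self, ext, source_repo, version):
        existing = self._by_ext.get(ext)
        # Without a namespace, a second definition of the same extension
        # from a different repository or version cannot be told apart.
        if existing is not None and existing != (source_repo, version):
            raise DatatypeConflict(
                f"{ext!r} already registered by {existing}, "
                f"cannot also register {(source_repo, version)}"
            )
        self._by_ext[ext] = (source_repo, version)


def try_register(registry, ext, source_repo, version):
    """Return True if registration succeeded, False on a conflict."""
    try:
        registry.register(ext, source_repo, version)
        return True
    except DatatypeConflict:
        return False
```

In this sketch, registering `mol2` from `iuc/molecule_datatypes` succeeds, while a later `mol2` from a second repository, or from a second version of the same repository, is rejected. That is roughly the situation a single-revision datatype repository would avoid.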

Agenda items:

  • Highlight current weaknesses
  • EDAM ontology and the way forward https://trello.com/c/TRgSWfT1
  • Discuss potential fixes/workarounds
  • Implications for planemo and testing
  • Formulate datatype best-practice guide
  • Go through the EMBOSS list and decide how to go further
  • Datatype hackathon at GCC?

Overall, I think datatypes are a great feature of Galaxy, especially if we can enhance/fix parallelism in many more datatypes, use EDAM everywhere etc ... but I really think we need to define a clear way forward.

As soon as we all agree on a time for a hangout I will update this Issue and invite everyone who wants to join.

Ciao,
Bjoern

@hexylena
Member

However datatypes happen, I want them under test!

There are datatypes in Galaxy that currently won't ever match a dataset, and no one really knows this because they've never had any test data. (ppm/pbm/pgm: all three get detected as the same type by PIL.)
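
As a rough illustration of what a testable sniffer for these three formats could look like, here is a minimal sketch that tells them apart by the Netpbm magic number (`P1`/`P4` for PBM, `P2`/`P5` for PGM, `P3`/`P6` for PPM) instead of relying on PIL. The function name `sniff_netpbm` and the table `NETPBM_MAGIC` are hypothetical; this is not Galaxy's sniffer code.

```python
# Netpbm magic numbers: plain and raw variants of each format.
NETPBM_MAGIC = {
    b"P1": "pbm", b"P4": "pbm",
    b"P2": "pgm", b"P5": "pgm",
    b"P3": "ppm", b"P6": "ppm",
}


def sniff_netpbm(header: bytes):
    """Return 'pbm', 'pgm', 'ppm', or None, based on the leading bytes.

    Hypothetical helper: the two-byte magic number must be followed by
    whitespace, as in the Netpbm formats.
    """
    if len(header) < 3 or not header[2:3].isspace():
        return None
    return NETPBM_MAGIC.get(header[:2])
```

With a helper like this, each datatype can get one tiny test file and one assertion, so a sniffer that never matches would show up as a failing test.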

@bgruening
Member Author

According to our votes, we will have our meeting on Thursday, 11.06.2015, at 16:00 UTC.

If all works as expected this will be our hangout: https://plus.google.com/hangouts/_/calendar/YmpvZXJuLmdydWVuaW5nQGdtYWlsLmNvbQ.j69j1vmi47lni193vr8l7mvo58?authuser=0

Everyone is invited to discuss with us the handling of datatypes in Galaxy.
I will update the description above with more links and examples, so please have a look at it beforehand.

Looking forward to a fruitful discussion!

@bgruening
Member Author

Just a short reminder that this meeting will start in about 4 hours.

@jj-umn

jj-umn commented Jun 11, 2015

Have an appointment at 15:30, will try to join as soon as I am able.

@hexylena
Member

Meeting participants: @bgruening @jj-umn @davebx @nekrut @jmchilton @blankenberg @erasche (Sorry if I missed someone :/)

Meeting outcomes:

  • @blankenberg tentatively offered to add a new datatype repository type to the TS which only has a single installable revision
  • @jmchilton suggested a single git repo for datatypes, to which anyone could contribute datatypes, similar to the Homebrew repositories.
  • @blankenberg raised the issue that it's not a nice UX for users of (e.g.) a climate data Galaxy to see all of the biology datatypes.
  • @bgruening recommended that we activate datatypes on demand as a compromise. They'd always be loaded, just activated the first time they're seen.
  • Dan
    • least amount of work to move to core, fixes load order, central authority is good.
    • He believes that it wouldn't be too horrible to add TS dependency specifications to core datatypes.
    • "Not going to -1 a datatype if it's blocking [new useful tools], but [he] doesn't believe it's the right way to fix it"
  • John argues that improving the TS datatype handling isn't worth the effort.
  • James raised the issue of TS datatypes having sync issues between the Main Tool Shed (MTS) and the Test Tool Shed (TTS)
  • John recommends we keep sticking things in core?
  • Bjoern brought up emboss datatypes requiring Biopython; Dan immediately shot down a Biopython egg, apparently having missed the reply to the public galaxy-dev mailing-list post/IRC discussion.
  • John brought up that the TS is terrible at ecosystem discovery. It should be a single place where shopping is a good experience and you can find all of the relevant viz plugins for the tools that you're installing.
  • John mentioned having a new dependency resolver of some sort, wherein Galaxy will have a list of canonical dependencies that need to be installed and available to converters/sniffers/etc. (@jmchilton, can you double check this point?)
  • Bjoern mentioned his emboss PR, Dan replied that he wasn't -1ing.

(at this point I lost my connection, and missed any further proceedings)

@jmchilton
Contributor

Action Items:

  • @jmchilton agreed to implement a mapping file (probably distributed in core but extensible) - that would map generic dependencies (like samtools X) to canonical repositories (e.g. iuc/package_samtools_X).
    • Additionally provide a small interface for viewing and installing these.
    • Possible project for the hackathon.
  • @jmchilton agreed to implement the ability for datatypes and viz components to resolve these dependencies during command-line string generation.
  • devteam agreed to merge emboss datatypes.
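
The mapping file from the first action item could be pictured roughly as follows. This is only a sketch under stated assumptions: the in-memory format, `CORE_MAPPING`, and `resolve_canonical_repo` are hypothetical, and the version placeholder `X` is kept exactly as written above rather than filled in.

```python
# Hypothetical core mapping: (generic dependency, version) -> canonical repository.
# "X" is the placeholder used in the discussion, not a real version.
CORE_MAPPING = {
    ("samtools", "X"): "iuc/package_samtools_X",
}


def resolve_canonical_repo(name, version, extra_mappings=()):
    """Look up the canonical repository for a generic dependency.

    The core mapping is loaded first; each mapping in `extra_mappings`
    (assumed to be admin- or plugin-supplied) can extend or override it,
    which is one way to read "distributed in core but extensible".
    Returns None when no mapping is known.
    """
    merged = dict(CORE_MAPPING)
    for mapping in extra_mappings:
        merged.update(mapping)
    return merged.get((name, version))
```

Under these assumptions, `resolve_canonical_repo("samtools", "X")` yields `iuc/package_samtools_X`, and a datatype or viz component could call such a lookup before its command line is generated.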

@jmchilton
Contributor

@blankenberg didn't shoot down biopython; he brought up that it might be tough to get it in, but he is definitely not a -1 if it is useful. I am working on the egg now. I think we should just proceed as if it were going to be merged, if it is useful, until someone speaks up and says it won't be merged.


4 participants