Data candy #29

Open
nevrome opened this issue Aug 27, 2021 · 3 comments
Labels
enhancement New feature or request

Comments

@nevrome
Member

nevrome commented Aug 27, 2021

With the growing data collection in this repository and the tools we wrote to access it, we could relatively easily prepare some automatic pipelines to construct useful derived data products. One way to set this up would be to create a GitHub repo that is updated automatically by a GitHub Action whenever the master branch of published_data changes.

Some ideas:

  • Pairwise-distance matrices with multiple distance measures. This is especially important, given that many individuals are represented multiple times in this dataset. So far we do not offer a workflow to remove duplicates (or biologically related individuals).
  • An MDS with all ancient individuals.
  • Various data-quality and data-completeness indices.
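
To make the first idea concrete, here is a minimal sketch of a pairwise-distance matrix that tolerates missing genotypes and flags near-identical pairs as candidate duplicates. Everything in it is an assumption for illustration: the 0/1/2 genotype coding with `None` for missing calls, the mean-absolute-difference measure, the `0.05` threshold, and the individual IDs are hypothetical and not taken from the tools in this repository.

```python
from itertools import combinations

def pairwise_distance(a, b):
    """Mean absolute genotype difference over SNPs observed in both individuals."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None  # no overlapping SNPs, so the distance is undefined
    return sum(abs(x - y) for x, y in shared) / len(shared)

def distance_matrix(genotypes):
    """Symmetric distance matrix as a dict of dicts, keyed by individual ID."""
    ids = list(genotypes)
    dist = {i: {j: 0.0 for j in ids} for i in ids}  # diagonal stays at 0.0
    for i, j in combinations(ids, 2):
        d = pairwise_distance(genotypes[i], genotypes[j])
        dist[i][j] = dist[j][i] = d
    return dist

def candidate_duplicates(dist, threshold=0.05):
    """Pairs whose distance is defined and at or below the (assumed) threshold."""
    return [
        (i, j)
        for i, j in combinations(list(dist), 2)
        if dist[i][j] is not None and dist[i][j] <= threshold
    ]

# Hypothetical toy data: genotypes coded 0/1/2, None marks a missing call.
toy = {
    "ind_A": [0, 1, 2, None],
    "ind_A_dup": [0, 1, 2, 0],
    "ind_B": [2, 1, 0, 0],
}
dist = distance_matrix(toy)
duplicates = candidate_duplicates(dist)
```

A real pipeline would also need a rule for which member of each flagged pair to keep, for example the one with more covered SNPs, and a different approach for related but non-identical individuals.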

Theoretically we could also produce figures and interactive toys, and then the sky is the limit. I would suggest sticking to the basic necessities, though.

@stschiff
Member

Love it! Of course, things like all-pairwise distances likely require high-performance computing environments, so I'd be curious how we can set this up in practice. But the general idea of keeping a separate GitHub repo with such results is super nice!

@nevrome
Member Author

nevrome commented Sep 20, 2021

I'm pretty optimistic here. Pruning and pairwise distance calculation for 3000 individuals takes about 2 (!!) seconds on the MPI-EVA cluster. Calculating distances for 12000 individuals is about 16 times more work (four times as many individuals, and the number of pairs grows quadratically), and GitHub Actions only provide limited computing power, but even so this could still be possible.
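
As a rough check on that factor of 16, the number of unordered pairs among n individuals is n(n-1)/2. The sketch below scales the reported 2 seconds by the pair count. It assumes that runtime is dominated by the pairwise step and that the hardware is the same, neither of which is established in this thread (the 2 seconds also covered pruning, and a GitHub Actions runner is presumably slower than the cluster). The function name is illustrative.

```python
def n_pairs(n):
    """Number of unordered pairs among n individuals."""
    return n * (n - 1) // 2

small = n_pairs(3000)    # pairs in the timed run
large = n_pairs(12000)   # pairs for the hypothetical larger dataset

ratio = large / small             # close to (12000 / 3000) ** 2
estimated_seconds = 2.0 * ratio   # cluster-time estimate under the assumptions above

print(f"{small} pairs -> {large} pairs, factor {ratio:.2f}")
print(f"estimated cluster runtime: about {estimated_seconds:.0f} s")
```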

@stschiff
Member

Yes indeed. This would be very nice.

@stschiff stschiff added the enhancement New feature or request label Dec 1, 2021

2 participants