Demonstrating usefulness with an example #55

Open
theosanderson opened this issue Jul 9, 2024 · 0 comments

I have been asked to review this manuscript for JOSS. openjournals/joss-reviews#6579

The JOSS guidelines require me to consider:

Whether the software is sufficiently useful that it is likely to be cited by your peer group.

I would echo concerns from other reviewers that the concrete tasks which aPhyloGeo aims to solve, and how it does so, are not clearly described in the manuscript.

Having read the group's prior paper about aPhyloGeo (Koshkarov et al., 2022), I believe I now have a better idea of the aims of the software. I am surprised that this prior paper is not cited here. Indeed, it would be useful for the authors to set out what distinguishes this new contribution from that previous work, since the previous paper's abstract describes its contributions as introducing aPhyloGeo and providing an example of its use. As far as I can see, this previous paper introducing the tool has not yet been cited, though of course this does not rule out future citations of the paper under consideration.

As I understand the aim of the software, its creators hypothesise that the genetic relationships among sequences in a population might be driven by climatic factors. For example, we might find that particular genetic sublineages of a population are favoured in hot countries, and others are favoured in cold countries. The authors propose to discover these relationships by making trees of genetic sequences, making trees of the climatic conditions in the locations those sequences "represent", and assessing the similarities between the genetic and climate trees. Sometimes the genetic trees are constructed from small windows of the genome rather than the full genome.
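To make my reading of the approach concrete, here is a minimal sketch of what "assessing the similarity" between a genetic tree and a climate tree could look like. This is not aPhyloGeo's API: the nested-tuple tree representation, the function names, the leaf labels, and the choice of a rooted Robinson-Foulds-style clade count are all assumptions made purely for illustration.

```python
def clades(tree):
    """Return (leaves of this subtree, set of clades found in it).

    A tree is a leaf name (str) or a tuple of subtrees.
    Single-leaf clades are not recorded.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def rooted_rf(tree_a, tree_b):
    """Count clades present in exactly one of two rooted trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must be built on the same set of leaves")
    return len(clades_a ^ clades_b)


# Hypothetical five-sequence example; labels A..E stand for sequences.
genetic_tree = ((("A", "B"), "C"), ("D", "E"))
climate_tree = ((("A", "C"), "B"), ("D", "E"))

print(rooted_rf(genetic_tree, climate_tree))  # 2: clades {A,B} and {A,C} differ
print(rooted_rf(genetic_tree, genetic_tree))  # 0: identical trees
```

A low count means the two trees group the samples similarly. On its own, it says nothing about why they do so, which is the concern I raise below.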

To my mind, to demonstrate that this tool is useful enough to be cited, the authors would need to show that it is able to detect such effects.

My gut instinct is that I would be surprised if such effects were readily detectable by this approach. Climatic conditions can explain at most a small part of the dynamics that we see in lineage prevalence. For example, the major variants of concern of SARS-CoV-2 (the authors' examples, to my knowledge, focus on this organism) spread across much of the world, including countries with very different climatic conditions. The detection of this small signal will be made much more difficult by the fact that it needs to be disentangled from a much larger signal: geographic relationships. If a new lineage, perhaps with increased fitness, arises in a particular country and spreads well there, it will also spread into neighbouring countries, due to movement of infected people. It will be less likely to reach very distant countries. Neighbouring countries tend to share similar climatic conditions. So to demonstrate a climatic causality, the authors would need to show that the relationship with climate significantly exceeds that with geography.
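The confounding described above can be illustrated with a small toy simulation. Everything in it is an assumption chosen for illustration (sites placed along a single latitude axis, temperature falling with latitude, genetic distance generated from geographic distance alone), and it is not based on the authors' data or code. By construction climate has no effect on genetics here, yet the naive correlation between genetic and climatic distance should still come out strongly positive, while a partial correlation that controls for geographic distance should be close to zero.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(x, y, z):
    """Correlation of x and y after linearly controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(0)
n_sites = 30

# Assumed toy geography: sites along one axis, colder at higher latitude.
latitude = [rng.uniform(0, 60) for _ in range(n_sites)]
temperature = [30 - 0.5 * lat + rng.gauss(0, 2) for lat in latitude]

geo, clim, gen = [], [], []
for i in range(n_sites):
    for j in range(i + 1, n_sites):
        d_geo = abs(latitude[i] - latitude[j])
        geo.append(d_geo)
        clim.append(abs(temperature[i] - temperature[j]))
        # Genetic distance depends ONLY on geography (plus noise), not on climate.
        gen.append(d_geo * rng.uniform(0.7, 1.3))

naive = pearson(gen, clim)
adjusted = partial_corr(gen, clim, geo)
print(f"naive genetic~climate correlation: {naive:.2f}")
print(f"partial correlation given geography: {adjusted:.2f}")
```

Note that pairwise distances are not independent observations, so a real analysis would need a permutation-based test (for example a partial Mantel test) rather than a plain correlation; the sketch only shows the direction of the problem.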

My expectation is that disentangling these effects would require both sophisticated statistical approaches and very large datasets. The example in the tutorial uses just 5 sequences, a subset of the 37 sequences used in the authors' previous paper. Among the 37 sequences, the genetic groups include the "Q" lineages, which are concentrated in Europe because they descend from B.1.1.7, which arose in the UK, and the "P" lineages, which are concentrated in South America because they descend from a Brazilian lineage, B.1.1.28. Unfortunately, I can't see how these trends could be disentangled, especially for a dataset of this small size.

These concerns unfortunately continue to apply even when the authors are slicing the genome into small windows and considering each in turn. Indeed, such an approach presents an additional need to control for multiple comparisons.
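As a sketch of what controlling for multiple comparisons across windows could involve, the snippet below applies a Bonferroni correction and the Benjamini-Hochberg procedure to a list of per-window p-values. The p-values are made up for illustration, and I am assuming a setup in which each window yields one p-value; I do not know whether aPhyloGeo currently produces p-values at all.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per genome window.
window_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

uncorrected = [p <= 0.05 for p in window_p]
print(sum(uncorrected))                   # 5 windows look "significant" uncorrected
print(sum(bonferroni(window_p)))          # 1 survives Bonferroni
print(sum(benjamini_hochberg(window_p)))  # 2 survive Benjamini-Hochberg
```

With ten windows, five appear significant at 0.05 before correction but only one or two remain afterwards, and a real genome scan would typically involve far more windows than this.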

To demonstrate usefulness, in my eyes the authors would need to provide an example showing that the tool can discover true relationships between climate and genetics. There are probably a number of climate-associated adaptations that have been described in the literature. Some brief searching identified this paper: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001375, which identifies such trends in human populations (and does discuss controls for confounding with geographic proximity). Can aPhyloGeo rediscover these trends, or other previously known effects? Alternatively (less optimally), one could imagine analysing a simulated dataset, with plausible parameters, and showing that aPhyloGeo is able to uncover such relationships.

I note that the reviewer I am replacing had not yet ticked the following boxes, which may relate to similar concerns to those I outline here:

  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines
  • Functionality: Have the functional claims of the software been confirmed?
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).

I am sorry if this issue comes across as negative. I commend the authors for releasing their code openly, and for engaging with the JOSS process.
