Missing Countries in four packages #96

Open
stschiff opened this issue Oct 1, 2022 · 13 comments
@stschiff
Member

stschiff commented Oct 1, 2022

The following four packages have missing Country entries:

trident list --individuals -d . -j Country --raw | awk '$4 == "n/a"' | cut -f1 | sort | uniq -c
  4 2014_RaghavanScience
  20 2020_Nakatsuka_SouthPatagonia
  40 2021_Kilinc_northeastAsia
 383 2021_Wang_EastAsia
   4 Reference_Genomes

Obviously, the last one should have n/a, but the others should have proper Countries. Should be easy to fix by checking the original papers. @dhananjaya93 (@93Boy) perhaps you could get to that. Thanks.
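
For anyone who prefers to reproduce the tally without awk, here is a minimal Python sketch of the same count. It assumes the raw output is tab-separated with the package name in the first column and Country in the fourth, which is only inferred from the command above. The individual IDs, group names and the helper name `count_missing_countries` are hypothetical.

```python
from collections import Counter


def count_missing_countries(raw_tsv: str, package_col: int = 0, country_col: int = 3) -> Counter:
    """Count rows whose Country field is "n/a", grouped by package.

    The column positions are assumptions taken from the awk/cut call above.
    """
    counts = Counter()
    for line in raw_tsv.splitlines():
        fields = line.split("\t")
        # Skip lines that are too short to hold both columns.
        if len(fields) <= max(package_col, country_col):
            continue
        if fields[country_col].strip() == "n/a":
            counts[fields[package_col].strip()] += 1
    return counts


# Hypothetical rows: Package, Individual, Group, Country.
sample = "\n".join([
    "2021_Kilinc_northeastAsia\tind_a\tGroupA\tn/a",
    "2021_Kilinc_northeastAsia\tind_b\tGroupA\tn/a",
    "2021_Wang_EastAsia\tind_c\tGroupB\tChina",
    "Reference_Genomes\tind_d\tGroupC\tn/a",
])

missing = count_missing_countries(sample)
for package, n in sorted(missing.items()):
    print(n, package)
```

In practice the input would be the captured output of the trident command rather than the hand-written sample.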

@93Boy
Contributor

93Boy commented Oct 6, 2022

I will look into this

@93Boy
Contributor

93Boy commented Oct 12, 2022

2014_RaghavanScience was updated through #99 and 2020_Nakatsuka_SouthPatagonia was updated through #100

@93Boy
Contributor

93Boy commented Nov 2, 2022

The 383 individuals of 2021_Wang_EastAsia with a missing Country are not listed in the paper's supplementary documents. The supplement covers only 169 newly reported ancient samples, whereas the Poseidon package already has 191 samples with complete information in addition to the 383 mentioned above. @AyGhal, can you give me a hint on this?

@nevrome
Member

nevrome commented Jul 26, 2023

@AyGhal and I looked into this.

2021_Kilinc_northeastAsia is a bare-bones package with almost no information in the .janno file, so we should add information well beyond just the Country. This information could be extracted either from the paper supplement or from the AADR.

The same is true for the modern samples in 2021_Wang_EastAsia. Information for these modern ones can be found in the HO version of the AADR dataset here.

@93Boy
Contributor

93Boy commented Nov 16, 2023

I have gone through the AADR dataset mentioned above. "2021_Kilinc_northeastAsia" has only 2 entries in AADR, and of those only "N2a" has a matching Poseidon_ID. For 2021_Wang_EastAsia, however, AADR has data for almost all the modern samples. I will upload the data.

@AyGhal
Contributor

AyGhal commented Nov 16, 2023

All the individuals for "2021_Kilinc_northeastAsia" should be in AADR. Try looking for the publication "KilincSciAdv2021". They have added "_noUDG.SG" to the IDs.
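
A rough sketch of how the IDs could be lined up, assuming the only difference is the "_noUDG.SG" suffix mentioned above. The function names and the unmatched ID `sample_x` are hypothetical; `brn008` and `N2a` are IDs that appear elsewhere in this thread.

```python
def strip_aadr_suffix(genetic_id: str, suffix: str = "_noUDG.SG") -> str:
    """Remove the AADR suffix from an ID if it is present."""
    if genetic_id.endswith(suffix):
        return genetic_id[: -len(suffix)]
    return genetic_id


def match_ids(poseidon_ids, aadr_ids, suffix: str = "_noUDG.SG") -> dict:
    """Map each Poseidon_ID to the AADR ID it corresponds to, or None."""
    # Index AADR IDs by their suffix-free form for direct lookup.
    lookup = {strip_aadr_suffix(a, suffix): a for a in aadr_ids}
    return {p: lookup.get(p) for p in poseidon_ids}


matches = match_ids(
    ["brn008", "N2a", "sample_x"],      # Poseidon_IDs (sample_x is made up)
    ["brn008_noUDG.SG", "N2a"],         # AADR genetic IDs
)
for poseidon_id, aadr_id in matches.items():
    print(poseidon_id, "->", aadr_id)
```

IDs that come back as None would still need a manual look, since AADR may use other suffixes as well.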

@93Boy
Contributor

93Boy commented Nov 16, 2023

Got the information. I missed those entries because they are categorized under 2018 data instead of 2021 in AADR.

@93Boy
Contributor

93Boy commented Nov 17, 2023

I have added part of the KilincSciAdv2021 information via PR #147, but I ran into some confusing points while curating the AADR data. I hope you can help me clear them up:

  • AADR has two Y haplogroup columns: Y haplogroup (manual curation in terminal mutation format) and Y haplogroup (manual curation in ISOGG format). From what I found on Google, the latter seems to be more reliable. Which one should I use?
  • The method for determining the date is given as Direct: IntCal20, but the mean and the SD of the date look suspicious.
  • There are numerous library types in a single entry, e.g. ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus. Is this normal?

@stschiff
Member Author

Thanks, @93Boy. Some replies:

  • Re Y-haplogroups. This is what the schema says: "please follow syntax with main branch + most terminal derived Y-SNP (e.g. R1b-P312)". Can someone advise whether that is actually ISOGG format? @AyGhal @TCLamnidis ?
  • I don't understand your Date question. I think that simply means the date type should be "direct", right @nevrome ?
  • Re libraries. Poseidon Schema to the rescue. As you can see here, Library_Built is a list-field and allows multiple entries, which should be consistent with Nr_Libraries. Please separate the entries with semicolons.
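
To illustrate the library point, here is a small sketch of writing a semicolon-separated list and checking it against Nr_Libraries. It reads "consistent" as "the same number of entries", which is an assumption, and both helper names are hypothetical.

```python
def to_janno_list(values) -> str:
    """Join individual entries into one semicolon-separated field."""
    return ";".join(v.strip() for v in values)


def libraries_consistent(library_built: str, nr_libraries: int) -> bool:
    """Check that the list field has exactly nr_libraries entries.

    Equating consistency with equal length is an assumption of this sketch.
    """
    entries = [e for e in library_built.split(";") if e.strip()]
    return len(entries) == nr_libraries


field = to_janno_list(["ds", "ds", "ds"])
print(field, libraries_consistent(field, 3))
```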

@AyGhal
Contributor

AyGhal commented Nov 21, 2023

If you take the janno info from AADR_v54_1_p1_1240K_BeyondAncient-0.1.2, @nevrome has already converted it to our format. AADR_Y_Haplogroup_ISOGG is there, and so are Library_Built and Nr_Libraries, as well as the original AADR_Library_Type.

@nevrome
Member

nevrome commented Nov 21, 2023

What @AyGhal says.

The aadr-archive should already have everything according to my decisions with the code available here. Please note the .csv file I compiled with a summary of the anno2janno mapping here.

So to answer the concrete questions:

  • Y haplogroup (manual curation in terminal mutation format) is the one that fits to our requirements. To my understanding this is not the ISOGG format.
  • IntCal20 is the calibration curve. We don't have a column for this information. See the old age string parser script here for how I extract the age information from the AADR.
  • What the AADR summarises as library types is split across two columns in the .janno file: UDG and Library_Built. See the code to pull the information apart here.

If all of this makes sense to you and you do not see any mistake in my code, then you can probably just copy the info from the respective aadr-archive packages, @93Boy.
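
On the date question, the following is only a rough illustration of pulling the uncalibrated age, its error and the lab number out of an AADR date string. It is not the parser script linked above. It assumes the string has the form "2624-2350 calBCE (3990±40 BP, Ua-35016)", and the example value, the `C14Date` type and the function name are all illustrative.

```python
import re
from typing import NamedTuple, Optional


class C14Date(NamedTuple):
    uncal_bp: int       # uncalibrated radiocarbon age in years BP
    uncal_bp_err: int   # reported standard deviation
    lab_nr: str         # laboratory number


# Matches "(3990±40 BP, Ua-35016)"; the string layout is an assumption.
_C14_PATTERN = re.compile(r"\((\d+)\s*±\s*(\d+)\s*BP,\s*([^)]+)\)")


def parse_c14(full_date: str) -> Optional[C14Date]:
    """Return the uncalibrated date parts, or None for contextual dates."""
    match = _C14_PATTERN.search(full_date)
    if match is None:
        return None
    return C14Date(int(match.group(1)), int(match.group(2)), match.group(3).strip())


print(parse_c14("2624-2350 calBCE (3990±40 BP, Ua-35016)"))
print(parse_c14("2500-1700 BCE"))
```

Entries with several measurements or other layouts would need more handling than this shows.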

@93Boy
Contributor

93Boy commented Nov 23, 2023

In AADR, Y haplogroup (manual curation in terminal mutation format) is almost empty or has no meaningful data, whereas the ISOGG-format column does have values. May I use those instead?
My concern about the UDG and library type data is that a single genetic ID carries information on multiple libraries, e.g. brn008_noUDG.SG has ds.plus,ds.plus,ds.plus,ds.minus,ds.minus. I have not seen this pattern in previous Poseidon data.
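
To make the pattern concrete, here is a sketch of how the brn008 string could be pulled apart into a library-built part and a UDG part, following the two-column split described earlier in the thread. It assumes each token has the form "built.udg". Collapsing differing values to a "mixed" label is a hypothetical rule of this sketch, not something confirmed here, and the function names are made up.

```python
def split_library_types(aadr_library_type: str):
    """Split "ds.plus,ds.minus" into (["ds", "ds"], ["plus", "minus"])."""
    built, udg = [], []
    for token in aadr_library_type.split(","):
        token = token.strip()
        if not token:
            continue
        # Everything before the first dot is the build, the rest the UDG treatment.
        b, _, u = token.partition(".")
        built.append(b)
        udg.append(u)
    return built, udg


def collapse(values, mixed_label: str = "mixed") -> str:
    """Reduce a list to one value; differing values become mixed_label (assumed rule)."""
    unique = set(values)
    if not unique:
        return "n/a"
    return unique.pop() if len(unique) == 1 else mixed_label


built, udg = split_library_types("ds.plus,ds.plus,ds.plus,ds.minus,ds.minus")
print(len(built), collapse(built), collapse(udg))
```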

@stschiff
Member Author

As discussed in a meeting, list data for libraries is supported by the schema. But it's not necessary to take this over from AADR for now. We are just keen to get the Country data and other missing data in for now.
