Missing Countries in four packages #96

stschiff · 2022-10-01T18:52:46Z

The following four packages have missing Country entries:

trident list --individuals -d . -j Country --raw | awk '$4 == "n/a"' | cut -f1 | sort | uniq -c
   4 2014_RaghavanScience
  20 2020_Nakatsuka_SouthPatagonia
  40 2021_Kilinc_northeastAsia
 383 2021_Wang_EastAsia
   4 Reference_Genomes

Obviously, the last one should have n/a, but the others should have proper Countries. Should be easy to fix by checking the original papers. @dhananjaya93 (@93Boy) perhaps you could get to that. Thanks.
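
For anyone who prefers to run this check outside the shell, here is a rough Python sketch of the same count. It assumes tab-separated input with the package name in the first column and Country in the fourth, which is only inferred from the `cut -f1` and `$4` in the command above. The function name `count_missing_country` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_missing_country(tsv_text, package_col=0, country_col=3, missing="n/a"):
    """Count, per package, the rows whose Country field equals the missing marker."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    for row in reader:
        # Skip rows that are too short to contain both columns.
        if len(row) <= max(package_col, country_col):
            continue
        if row[country_col] == missing:
            counts[row[package_col]] += 1
    return counts


# Invented rows (package, individual, group, country), not real Poseidon data.
sample = (
    "PkgA\tind1\tGroupX\tn/a\n"
    "PkgA\tind2\tGroupX\tPeru\n"
    "PkgB\tind3\tGroupY\tn/a\n"
    "PkgB\tind4\tGroupY\tn/a\n"
)
missing_counts = count_missing_country(sample)
for package, n in sorted(missing_counts.items()):
    print(f"{n:4d} {package}")
```

Unlike the awk call, which splits on any whitespace, this sketch splits strictly on tabs, so fields containing spaces would not shift the column positions.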

93Boy · 2022-10-06T22:11:04Z

I will look into this

93Boy · 2022-10-12T13:14:06Z

2014_RaghavanScience was updated through #99 and 2020_Nakatsuka_SouthPatagonia was updated through #100

93Boy · 2022-11-02T18:59:57Z

The 383 individuals of 2021_Wang_EastAsia are not listed in the supplementary documents. The paper has only 169 newly reported ancient samples, but Poseidon already has 191 samples with complete information, apart from the 383 samples mentioned above. @AyGhal, can you give me any hint regarding this?

nevrome · 2023-07-26T14:17:55Z

@AyGhal and I looked into this.

2021_Kilinc_northeastAsia is a bare bones package with almost no information in the .janno file. So we should add information way beyond just the Country. This information could be extracted either from the paper supplement or from the AADR.

The same is true for the modern samples in 2021_Wang_EastAsia. Information for these modern ones can be found in the HO version of the AADR dataset here.

93Boy · 2023-11-16T12:31:05Z

I have gone through the AADR dataset mentioned above, and "2021_Kilinc_northeastAsia" has only 2 entries in AADR. Of those 2 entries, only "N2a" has a matching PoseidonID. 2021_Wang_EastAsia, however, has data for almost all the modern samples. I will upload the data.

AyGhal · 2023-11-16T13:54:17Z

All the individuals for "2021_Kilinc_northeastAsia" should be in AADR. Try looking for the publication "KilincSciAdv2021". They have added "_noUDG.SG" to the IDs.
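
A small sketch of how that suffix could be handled when matching IDs. It assumes the only difference between the two ID sets is the trailing "_noUDG.SG", which may not hold for every entry. The helper names and "sampleX" are hypothetical, and "brn008" as a Poseidon ID is assumed for illustration.

```python
def strip_aadr_suffix(aadr_id, suffix="_noUDG.SG"):
    """Return the AADR ID without the trailing suffix, if it has one."""
    if aadr_id.endswith(suffix):
        return aadr_id[: -len(suffix)]
    return aadr_id


def match_ids(poseidon_ids, aadr_ids, suffix="_noUDG.SG"):
    """Map each Poseidon ID to the AADR ID it corresponds to, or None if absent."""
    by_base = {strip_aadr_suffix(a, suffix): a for a in aadr_ids}
    return {p: by_base.get(p) for p in poseidon_ids}


# "sampleX" is an invented ID that has no AADR counterpart in this toy input.
matches = match_ids(
    ["brn008", "N2a", "sampleX"],
    ["brn008_noUDG.SG", "N2a"],
)
print(matches)
```

IDs that map to None are the ones that still need a manual look in the AADR table.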

93Boy · 2023-11-16T20:49:56Z

Got the information. I missed those entries because they are categorized under 2018 data instead of 2021 in AADR.

93Boy · 2023-11-17T21:17:00Z

I have partially added the information for KilincSciAdv2021 via PR #147, but I ran into some confusing points while curating the AADR data. I hope you can help me clear them up:

AADR has two Y-haplogroup fields: Y haplogroup (manual curation in terminal mutation format) and Y haplogroup (manual curation in ISOGG format). The latter is more reliable according to Google. Which one should I use?
The method of determining the date is given as "Direct: IntCal20", but the mean and the SD of the dates look suspicious.
There are numerous library types in a single entry, e.g. ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus,ds.minus. Is this a normal situation?

stschiff · 2023-11-21T07:20:48Z

Thanks, @93Boy. Some replies:

Re Y-haplogroups. This is what the schema says: "please follow syntax with main branch + most terminal derived Y-SNP (e.g. R1b-P312)". Can someone advise whether that is actually ISOGG format? @AyGhal @TCLamnidis ?
I don't understand your Date question. I think that simply means the date type should be "direct", right @nevrome ?
Re libraries. Poseidon Schema to the rescue. As you can see here, Library_Built is a list-field and allows multiple entries, which should be consistent with Nr_Libraries. Please separate by semi-colon.
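
To make the list-field point concrete, here is a minimal sketch of a consistency check. It reads "consistent" as "the number of Library_Built entries equals Nr_Libraries", which is an assumption about the intent rather than something the schema text quoted here spells out. The function names are hypothetical and not part of trident or any Poseidon tool.

```python
def parse_list_field(value):
    """Split a semicolon-separated .janno list field into its entries."""
    return [item.strip() for item in value.split(";") if item.strip()]


def libraries_consistent(library_built, nr_libraries):
    """True if the number of Library_Built entries equals Nr_Libraries (assumed rule)."""
    return len(parse_list_field(library_built)) == nr_libraries


ok = libraries_consistent("ds;ds;ds", 3)
mismatch = libraries_consistent("ds;ds", 3)
print(ok, mismatch)
```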

AyGhal · 2023-11-21T07:58:27Z

If you get the janno info from AADR_v54_1_p1_1240K_BeyondAncient-0.1.2, @nevrome has already converted it to our format. AADR_Y_Haplogroup_ISOGG is there. Library_Built and Nr_Libraries are there as well, alongside the original AADR_Library_Type.

nevrome · 2023-11-21T09:17:36Z

What @AyGhal says.

The aadr-archive should already have everything according to my decisions with the code available here. Please note the .csv file I compiled with a summary of the anno2janno mapping here.

So to answer the concrete questions:

Y haplogroup (manual curation in terminal mutation format) is the one that fits our requirements. To my understanding this is not the ISOGG format.
IntCal20 is the calibration curve. We don't have a column for this information. See the old age string parser script here for how I extract the age information from the AADR.
What the AADR summarises as library types is split across two columns in the .janno file: UDG and Library_Built. See the code to pull the information apart here.

If this all makes sense to you and you do not see any mistake in my code, then you can probably just copy the info from the respective aadr-archive packages, @93Boy.
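
For readers without the linked code at hand, the idea of pulling the AADR library string apart can be sketched roughly as follows. This is not the linked script. It assumes every entry has the form `<built>.<udg>` (as in "ds.minus"), and it deliberately stops at two parallel lists, because how they are collapsed into single .janno values is decided in the actual conversion code. The function name is hypothetical.

```python
def split_library_types(aadr_library_type):
    """Split a comma-separated AADR library string into parallel built/UDG lists."""
    entries = [e.strip() for e in aadr_library_type.split(",") if e.strip()]
    built, udg = [], []
    for entry in entries:
        # partition keeps everything before the first "." as the build type
        # and everything after it as the UDG treatment.
        b, _, u = entry.partition(".")
        built.append(b)
        udg.append(u)
    return built, udg


built, udg = split_library_types("ds.plus,ds.plus,ds.plus,ds.minus,ds.minus")
print(";".join(built))
print(";".join(udg))
```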

93Boy · 2023-11-23T22:29:10Z

Y haplogroup (manual curation in terminal mutation format) is almost empty or doesn't have meaningful data in AADR, but the manual curation in ISOGG format has values. May I use these data?
My concern about the UDG and library type data is that a single genetic_ID contains information for multiple libraries, e.g. brn008_noUDG.SG: ds.plus,ds.plus,ds.plus,ds.minus,ds.minus. I have not seen this kind of pattern in previous Poseidon data.

stschiff · 2023-11-29T16:38:04Z

As discussed in a meeting, list data for libraries is supported by the schema. But it's not necessary to take this over from AADR for now. We are just keen to get the Country data and other missing data in for now.

stschiff assigned 93Boy Nov 10, 2023
