Sampling by Country for YFull and FTDNA
#1
Repost from forum version 1.0, plus additional data for FTDNA.

YFull data

I recently came across the sampling ratio for different countries based on YFull. It is noteworthy that the sampling in some countries is so high that big surprises become less likely, while in other countries there are truly huge gaps. Here is the map:

phylogeographer.com/yfull-world-sampling-rate-map/?fbclid=IwAR19ab58aKPPbbnJUbCOZkaXgcNiX17aHgpn0fM58yUHv6q2WEIl8ldFDRs

For Europe and the Near East, some of the highest ratios are from the Southern Arab states, especially Saudi Arabia, which tops nearly everybody else. In Europe the best sampling on YFull is achieved in Ireland, Albania, Montenegro, Armenia, Sweden and Finland. On the other end of the spectrum are primarily Moldova, Romania and France, but also Austria, Germany, Ukraine and Spain.

The ratios get even worse when you consider that a large fraction of the testers from Romania, Moldova and Ukraine are Ashkenazi Jewish or from other ethnic minorities, so the main ethnic group of these states is even more severely underrepresented. The higher testing frequency of Ashkenazi (which is per se a good thing!) is also an issue for Austria and Germany, because it further lowers the local ethnic ratios.

Especially if taking the results from YFull at face value, these ratios put things into perspective. The chances for big surprises in some areas (like Ireland and Finland) are way smaller than in other areas (like Romania and France).

FTDNA data (STR-tested individuals; many, especially many Germans, are left out because they have no country of origin in their profiles)

Some countries have significantly higher or lower ratios at FTDNA, but some basic gaps remain. Here are the numbers for tested individuals from those countries in which I have (usually distant) STR matches, sorted by most to least tested individuals:

1. England: 46896
2. Ireland: 32824
3. United States: 28827
4. Germany: 25345
5. Scotland: 22883
6. United Kingdom: 16263
7. Saudi Arabia: 11762
8. Russian Federation: 10817
9. Sweden: 10072
10. Finland: 9159
11. Poland: 8620
12. France: 8248
13. Italy: 6878
14. Spain: 6130
15. Norway: 4705
16. Switzerland: 4147
17. Ukraine: 4087
18. Wales: 3712
19. Netherlands: 3375
20. Iraq: 3242
21. Northern Ireland: 3068
22. Hungary: 2422
23. Lithuania: 2290
24. Portugal: 2220
25. Mexico: 1990
26. Czech Republic: 1954
27. Turkey: 1912
28. Denmark: 1860
29. Belarus: 1810
30. United States (Native American): 1773
31. Canada: 1731
32. Greece: 1673
33. Austria: 1450
34. Romania: 1110
35. Slovakia: 1096
36. Armenia: 1068
37. Belgium: 1025
38. Syrian Arab Republic: 869
39. Georgia: 856
40. Bulgaria: 847
41. Qatar: 831
42. Indonesia: 707
43. Libya: 656
44. Morocco: 608
45. Croatia: 575
46. Latvia: 541
47. Brazil: 464
48. Puerto Rico: 460
49. Albania: 447
50. Slovenia: 406
51. Serbia: 402
52. Azerbaijan: 379
53. Philippines: 363
54. Bosnia and Herzegovina: 352
55. Tunisia: 322
56. Moldova: 203
57. Macedonia: 192
58. Iceland: 191
59. Jamaica: 188
60. Montenegro: 162
61. England (Cornish): 144
62. Luxembourg: 141
63. Dominican Republic: 118
64. Kosovo: 107
65. Russia (Republic of Adygea): 88
66. Ethiopia: 74
67. Liechtenstein: 14
68. Italy (South Tyrol): 1

The biggest discrepancy is visible in the Balkans, with Albanians in particular being much less well tested on FTDNA than on YFull. Still, if you sum Albania and Kosovo (447 + 107 = 554 testers), the ratio per total population is not bad at all.
For countries like Ukraine, Romania, Poland, Austria, Germany, etc., the very high proportion of Ashkenazi Jewish testers with origins in those countries remains a problem for properly assessing the regional ethnic testing frequency.

Ireland and Saudi Arabia are among the best tested countries worldwide, both on YFull and FTDNA.
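
For anyone who wants to redo the per-population comparison, here is a minimal sketch of the arithmetic in shell. The tester counts come from the FTDNA list above; the population figure is only a rough round number I am assuming for Albania plus Kosovo, not something taken from YFull or FTDNA, so substitute a figure from your own source.

```shell
# Testers per 100,000 inhabitants: per_100k <testers> <population>
# (per_100k is a hypothetical helper name used only for this illustration)
per_100k() {
  awk -v t="$1" -v p="$2" 'BEGIN { printf "%.1f\n", t / p * 100000 }'
}

# Albania (447) + Kosovo (107) from the FTDNA list above
testers=$((447 + 107))
# ASSUMPTION: roughly 4.5 million inhabitants combined; a placeholder, check a census source
per_100k "$testers" 4500000
```

The same helper can be run for any other row, which makes it easier to compare countries of very different size than the raw counts do.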
#2
(10-04-2023, 05:33 PM)Riverman Wrote: Repost from forum version 1.0, plus additional data for FTDNA.
...

There are dorkymon's threads about Romanian haplogroups. That isn't much, but it's a start.
#3
There are some samples appearing and disappearing in YFull. There was a sample of IJK*; it was removed a day later. There was a sample from Serbia, R1*; a day later it was removed. Now two strange samples have appeared: one from India, R1, and one from the Netherlands, R1b-M269*. Wondering what's going on at YFull?
#4
If they don't pay, they get deleted.
#5
(10-12-2023, 07:18 AM)VladMC Wrote: There are some samples appearing and disappearing in YFull. There was a sample of IJK*, it was removed a day later. There was a sample from Serbia R1*. A day later it was removed. Now two strange samples have appeared. One from India R1 and one from the Netherlands R1b-M269*. Wondering what's going on at YFull?

The problem is that many kits coming from Dante Labs seem to have corrupted / contaminated raw data, which leads to them falsely appearing high up in the YFull tree at basal positions. It isn't YFull's fault; it's a catastrophic failure on the part of Dante Labs, which is sending its customers garbage genomic raw data and reports based on such data.
#6
At this point, if you have a Dante Labs result with the MGI sequencer flow plate ID in the 6,000 range (see attached image), then you likely have a nonsense result. The sequencer ID is a part of every read segment name in the FASTQ and BAM files. There are usually 660 million or more such segments.
The nonsense results exhibit the following properties:
(a) Female samples appear more male-like, with far more reads mapping to the Y chromosome than expected and fewer to the X
(b) Y haplogroup calls land at the basal IJK level because very few variants are called on the Y chromosome
(c) Mito haplogroup calls land at the R haplogroup for similar reasons
(d) Far more heterozygous calls after pileup than a normal human genome exhibits (or than could exist, since the implied codons / amino acids in the known genes are nonsense)
(e) Autosomal segment matching (as done in genealogy and forensics) yields unrealistic match results against known good samples (whether the same person or unrelated; whether a direct microarray test or one extracted from a WGS result).
The result does not look corrupted to a typical DNA measurement, and maybe not to the sequencer QA test either. It still maps well to the human genome and appears to have the right balance of A, T, C and G entries in both the FASTQs and BAMs. So it does not clearly look like either (a) corrupt DNA, as in the low-mapping results delivered over the past four years (https://bit.ly/2U7s34t), or (b) swapped samples, where people get the wrong gender or a result that disagrees with other DNA testing across the whole genome (WES, WGS or microarray).
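
Property (a) can be checked roughly from per-chromosome mapped-read counts. The following is only a sketch: it assumes you feed it the tab-separated output of `samtools idxstats` (reference name, sequence length, mapped reads, unmapped reads) from an indexed BAM, and that the sex chromosomes are named X/Y or chrX/chrY in your reference. The helper name `xy_ratio` is made up for this illustration, and no cutoff is implied here; compare the number against a known good sample of the same sex.

```shell
# Print mapped-read counts on X and Y and their ratio,
# reading `samtools idxstats` output from stdin.
# ASSUMPTION: sex chromosomes are named X/Y or chrX/chrY.
xy_ratio() {
  awk -F'\t' '
    $1 == "X" || $1 == "chrX" { x = $3 }   # column 3 = mapped reads
    $1 == "Y" || $1 == "chrY" { y = $3 }
    END {
      if (x > 0) printf "X=%d Y=%d Y/X=%.4f\n", x, y, y / x
      else print "no X reads found"
    }'
}

# Usage (file name is a placeholder):
#   samtools idxstats your.bam | xy_ratio
```

A female sample with an unusually high Y/X value compared with other female samples would fit the pattern described in (a).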
You can determine your sequencer ID from the stats page of the current 4.44.2 version of WGS Extract (which does not recognize the new E200 flow plate ID; as shown in the image) or by executing one of the following commands in a shell / terminal:
samtools view your.bam | head -1 | cut -f1
zcat your.fastq.gz | head -1
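
To look at more than the first read, the read names can be tallied by their leading flow plate ID. This is a hedged sketch, not a definitive parser: it assumes MGI-style read names of the form <flowplate>L<lane>C<column>R<row><index>, which is my assumption about the naming and not something confirmed above, and the example name in the comment is invented. The helper name `flowcell_tally` is hypothetical too.

```shell
# Tally flow plate prefixes from read names given on stdin (one name per line).
# ASSUMPTION: names look like <flowplate>L<lane>C<col>R<row><index>,
# e.g. the made-up name E200001234L1C001R0010000001; adjust the pattern if yours differ.
flowcell_tally() {
  sed -E 's/^@//; s/^([A-Za-z0-9]+)L[0-9]C[0-9]{3}R[0-9]{3}.*/\1/' |
    sort | uniq -c | sort -rn
}

# Usage (file name is a placeholder, as in the commands above):
#   samtools view your.bam | head -100000 | cut -f1 | flowcell_tally
```

Names that do not match the assumed pattern are passed through unchanged, so a long tail of distinct entries in the tally is a hint that the pattern needs adjusting.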
The nonsense results use kit / file IDs starting with TSA and GFX. TSA is a new kit ID convention not seen before, unlike the older, most commonly used 608xxx (all numerals) or the newer GFX. (DTxx is also used for Dante USA samples processed by FTDNA.)
The cause of these nonsense results is still not clear, but the current hypothesis is that multiple DNA samples were mixed together in the output. Samples are normally mixed on a flow cell / plate, but unique tags are added to distinguish them in post-processing.
https://www.facebook.com/groups/consumer...172141205/
#7
(10-12-2023, 10:04 AM)VladMC Wrote: At this point, if you have a Dante Labs result with the MGI sequencer flow plate ID in the 6,000 range (see attached image), then you likely have a nonsense result.
...
You can determine your sequencer ID from the stats page of the current 4.44.2 version of WGS Extract (which does not recognize the new E200 flow plate ID; as shown in the image)
The nonsense results use kit / file IDs starting with TSA and GFX. TSA is a new kit ID convention not seen before. Unlike the old, mostly used 608xxx (all numerals), or newer GFX. (DTxx is also used for Dante USA samples processed by FTDNA.)

Thanks! Just to be sure, I checked two results with no TSA/GFX in the IDs, the first one year old (Illumina NS 6000) and the other almost four years old (Illumina NovaSeq 6000), both done in IT. So fortunately neither is one of the newer flawed results. It seems WGS at an affordable price (and with results in 2-4 months) remains a difficult task.
---
Main Projects: Tyrol DNA, Alpine DNA, J2-M172, J2a-M67, J2a-PF5197, ISOGG Wiki, GenWiki;
Focus on Y-DNA: J2a-M67-L210, J2a-PF5197-PF5169, R1a-M17, R1b-U106-Z372