TanTin · 11-04-2023, 12:10 AM

My suspicion is that this command does not apply enough quality control:
plink --bfile myfile --maf 0.2 --matrix
If you run the test on a small group, local structure or group-specific minor allele frequencies may take precedence over the global components.

Try for example this:
--geno 0.4 --mind 0.4 --maf 0.05

Gordius · 11-04-2023, 08:53 AM

(11-03-2023, 11:04 PM)TanTin Wrote:

Here are the projected individuals.
I also added Macedonian IA.
The Bulgarian HO seems to be very homogeneous.
However, for the moment I can't provide any explanation for your results.

Is that based on the first two eigenvectors? Try adding the coordinates of more eigenvectors for these samples (for example the first 20, which are output by default) to Vahaduo Admixture JS and look at the total distances from the Bulgarians to the other samples.
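
A minimal Python sketch of what such a distance comparison amounts to, assuming the distances are plain Euclidean distances over the eigenvector coordinates. The sample names and coordinates are made up for illustration, and only 3 eigenvectors are used instead of 20:

```python
import math

# Hypothetical eigenvector coordinates (in practice: the first 20 eigenvector
# columns for each sample). The values are illustrative only.
coords = {
    "SampleA": [0.012, -0.034, 0.005],
    "SampleB": [0.015, -0.030, 0.001],
    "SampleC": [-0.040, 0.022, 0.017],
}

def euclidean(u, v):
    """Euclidean distance between two equal-length coordinate vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of eigenvectors")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distances_to(target, coords):
    """Return (distance, name) pairs for all other samples, nearest first."""
    return sorted(
        (euclidean(coords[target], vec), name)
        for name, vec in coords.items()
        if name != target
    )

for dist, name in distances_to("SampleA", coords):
    print(f"{dist:.8f} {name}")
```

Using more eigenvectors simply adds more terms to the sum under the square root, so samples that look close on the first two axes can still end up far apart overall.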

Gordius · 11-04-2023, 09:18 AM

For example, here are the distances to a Lithuanian sample using the first 20 eigenvectors:

Quote:Distance to: lithuania10
0.21992272 LithuanianF1
0.23010972 lithuania9
0.23790990 lithuania2
0.27941602 lithuania8
0.27978699 LithuanianD1
0.28869115 lithuania3
0.29008908 LithuanianA1
0.29180564 Est377
0.31681006 LithuanianE2
0.32302504 Est375
0.34971117 HungarianH3
0.35593507 bel23s
0.36251756 lithuania1
0.39305695 UKR-1909
0.40088393 bel30s
0.41034067 UkrBel620
0.41140210 UkrBel733
0.42557209 UKR-1291
0.43179432 Est372
0.43302449 Est358
0.44097723 Est400
0.44789508 bel72c
0.45102730 UkrBel622
0.45467893 belarusian47zp
0.49839865 UKR-1292
0.50793217 UKR-1283
0.50842442 bel93c
0.50961822 UkrBel736
0.52640566 Est391
0.53767254 Est380
0.54087130 UKR-1399
0.54411328 bel110c
0.54765919 UKR-1951
0.57932469 hungary2
0.58321401 belarusian23vp
0.59995255 HungarianE5
0.61374997 HungarianC5
0.61923150 bel8s
0.62460403 UKR-1992
0.63077519 hungary6
0.65304883 UKR-1377
0.65337990 UKR-1978
0.70064242 UKR-2021
0.70689451 UKR-1913
0.74323183 HungarianD1
0.75284958 Mordovians5
0.75758444 BulgarianE2
0.78186581 hungary20
0.78553852 Mordovians17
0.78848709 Mordovians4
0.80026391 Mordovians22
0.80420545 Mordovians32
0.81629567 BulgarianC1
0.81990822 Mordovians1
0.83756722 hungary7
0.85073960 BulgarianB1
0.86101512 hungary15
0.86746591 Mordovians28
0.87787477 BulgarianF2
0.89899291 Mordovians31
0.90180353 Mordovians30
0.91198894 BulgarianF1
0.92964598 BulgarianB4
0.95119180 BulgarianH2
1.01875079 BulgarianA1
1.02491928 UKR-1903

And here are the distances to a Bulgarian sample:

Quote:Distance to: BulgarianA1
0.94572703 bel23s
0.95293104 UKR-1283
0.96015833 UkrBel733
0.96115894 lithuania1
0.97496356 Est372
0.98025472 Est377
0.98806378 UkrBel620
0.99441262 LithuanianA1
0.99987176 UkrBel622
1.00253444 HungarianH3
1.00324444 Est400
1.01396627 LithuanianF1
1.01465857 UKR-1909
1.01780595 UKR-1291
1.01802762 UKR-1399
1.01875079 lithuania10
1.02053261 lithuania2
1.02063376 UKR-1377
1.02156346 Est380
1.02405598 UkrBel736
1.02456470 LithuanianD1
1.02630024 belarusian47zp
1.02732391 lithuania3
1.03526752 UKR-1292
1.04393044 bel72c
1.04629657 LithuanianE2
1.05044500 Est358
1.05205747 bel30s
1.05438172 UKR-1913
1.05723287 HungarianE5
1.05898276 lithuania9
1.06065183 Est391
1.06106476 Est375
1.07248727 lithuania8
1.07996697 bel8s
1.08298743 UKR-1992
1.08705730 hungary6
1.09322048 hungary2
1.09340936 belarusian23vp
1.09686941 UKR-1951
1.10124049 bel93c
1.10332894 bel110c
1.11099548 HungarianD1
1.11641227 UKR-2021
1.12891238 HungarianC5
1.13083368 BulgarianE2
1.13614893 UKR-1978
1.14489540 hungary20
1.15792759 Mordovians4
1.19971026 Mordovians5
1.20967144 BulgarianF2
1.21453157 Mordovians32
1.21688617 BulgarianB1
1.21999189 BulgarianC1
1.22396943 hungary7
1.22480933 Mordovians22
1.22707414 Mordovians1
1.23437588 Mordovians17
1.23695995 Mordovians28
1.23967449 BulgarianF1
1.24327888 Mordovians30
1.27802247 hungary15
1.28774509 BulgarianB4
1.28813018 Mordovians31
1.30018591 BulgarianH2
1.38615026 UKR-1903

Gordius · 11-04-2023, 09:23 AM

The IBS matrix is attached as an Excel file. You can also see the distances between different samples there.

TanTin · 11-04-2023, 03:22 PM

You should first of all use the distance matrix that you just generated.
I tried to reproduce your results.
Here is my code:

Quote:# Compute the IBS similarity matrix with PLINK (writes plink.mibs and plink.mibs.id)
system("plink --memory 12000 --threads 2 --bfile ../V54.1_HO/v54.1_HO_public --keep 35D/temp1_IND.txt --maf 0.2 --matrix")

# Load the matrix and label its rows and columns with the individual IDs (column V2)
List.results <- read.table("plink.mibs", header = FALSE)
List.ids <- read.table("plink.mibs.id", header = FALSE)
rownames(List.results) <- List.ids$V2
colnames(List.results) <- List.ids$V2

Here are the distances directly from the distance matrix:

[Image: bulgarian-HO1.png]

TanTin · 11-04-2023, 03:43 PM

Here is the distance to B4:

[Image: bulgarian-HO-B4.png]

Since this is a similarity matrix, 1 is the similarity of a sample to itself and 0 means no similarity (most distant).
Depending on the selected reference, we see all 7 Bulgarian samples within the first 10, or at most within the first 17. So they still form a cluster.
The data provided by plink is consistent.
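
A small Python sketch of this kind of check, under the assumption that 1 - IBS is an acceptable distance. The labels and similarity values are invented, and in practice they would come from plink.mibs and the second column of plink.mibs.id:

```python
# Toy symmetric IBS similarity matrix (hypothetical values and labels).
ids = ["Bul1", "Bul2", "Bul3", "Lit1", "Lit2"]
ibs = [
    [1.00, 0.78, 0.77, 0.75, 0.74],
    [0.78, 1.00, 0.79, 0.74, 0.75],
    [0.77, 0.79, 1.00, 0.73, 0.74],
    [0.75, 0.74, 0.73, 1.00, 0.80],
    [0.74, 0.75, 0.74, 0.80, 1.00],
]

def neighbours(ref, ids, ibs):
    """Other samples as (similarity, distance, name), most similar first."""
    i = ids.index(ref)
    rows = [(ibs[i][j], 1.0 - ibs[i][j], ids[j]) for j in range(len(ids)) if j != i]
    return sorted(rows, reverse=True)

def worst_rank(ref, prefix, ids, ibs):
    """1-based rank of the least similar sample whose label starts with prefix."""
    ranked = [name for _, _, name in neighbours(ref, ids, ibs)]
    ranks = [r for r, name in enumerate(ranked, start=1) if name.startswith(prefix)]
    return max(ranks) if ranks else None

for sim, dist, name in neighbours("Bul1", ids, ibs):
    print(f"{name}\tsimilarity={sim:.2f}\tdistance={dist:.2f}")
print("Lowest rank of a Bul sample:", worst_rank("Bul1", "Bul", ids, ibs))
```

If the group forms a tight cluster, the lowest rank of its members stays close to the group size whichever member is taken as the reference.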

Gordius · 11-04-2023, 04:42 PM

(11-04-2023, 03:43 PM)TanTin Wrote: Here is the distance to B4:

Since this is a similarity matrix, 1 is the similarity of a sample to itself and 0 means no similarity (most distant).
Depending on the selected reference, we see all 7 Bulgarian samples within the first 10, or at most within the first 17. So they still form a cluster.
The data provided by plink is consistent.

Maybe that's because you used --maf 0.2? I mentioned in an earlier post that large --maf values (0.1-0.3) improve the distances for Bulgarians. But is that a valid approach?

TanTin · (This post was last modified: 11-04-2023, 05:04 PM by TanTin.)

(11-04-2023, 04:42 PM)Gordius Wrote: Maybe that's because you used --maf 0.2? I mentioned in an earlier post that large --maf values (0.1-0.3) improve the distances for Bulgarians. But is that a valid approach?

A commonly used threshold is --maf 0.05.

Minor allele frequency (MAF) is the frequency at which the second most common allele occurs in a given population. Low-MAF variants reportedly play a surprising role in heritability: variants that occur only once, known as "singletons", are thought to be under an enormous amount of selection.
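
To make the definition concrete, here is a minimal Python sketch of how the MAF of one biallelic variant can be computed and compared against a threshold. The function names and the toy genotypes are hypothetical, and keeping variants with MAF at or above the threshold mirrors what a --maf style filter does:

```python
def minor_allele_frequency(genotypes):
    """MAF for one biallelic variant.

    genotypes: per-individual counts of the A1 allele (0, 1 or 2),
    with None for a missing call. Missing calls are ignored.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None  # no data for this variant
    freq_a1 = sum(called) / (2 * len(called))
    return min(freq_a1, 1.0 - freq_a1)

def passes_maf(genotypes, threshold):
    """True if the variant would survive a MAF filter at this threshold."""
    maf = minor_allele_frequency(genotypes)
    return maf is not None and maf >= threshold

# Toy variant typed in 10 individuals: 3 copies of A1 among 20 alleles.
variant = [0, 0, 0, 1, 0, 0, 2, 0, 0, 0]
print(minor_allele_frequency(variant))  # 0.15
print(passes_maf(variant, 0.05), passes_maf(variant, 0.2))
```

The same variant is kept at a threshold of 0.05 and dropped at 0.2, which is why the choice of threshold changes which variants feed into the distance matrix.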

I usually don't play with --maf; I keep it at 0.05.

P.S. I used --maf 0.2 only to replicate your results.

Gordius · (This post was last modified: 11-04-2023, 05:38 PM by Gordius.)

(11-04-2023, 05:03 PM)TanTin Wrote:
(11-04-2023, 04:42 PM)Gordius Wrote: Maybe that's because you used --maf 0.2? I mentioned in an earlier post that large --maf values (0.1-0.3) improve the distances for Bulgarians. But is that a valid approach?

A commonly used threshold is --maf 0.05.

Minor allele frequency (MAF) is the frequency at which the second most common allele occurs in a given population. Low-MAF variants reportedly play a surprising role in heritability: variants that occur only once, known as "singletons", are thought to be under an enormous amount of selection.

I usually don't play with --maf; I keep it at 0.05.

P.S. I used --maf 0.2 only to replicate your results.

But the matrix you posted was calculated with --maf 0.2? With 0.05 I didn't get sensible results for the Bulgarians. The --geno 0.4 and --mind 0.4 options also turned out to make no difference, for example:

plink --bfile mydata --geno 0.4 --mind 0.4 --maf 0.05 --matrix

Quote:597573 variants loaded from .bim file.
67 people (60 males, 7 females) loaded from .fam.
67 phenotype values loaded from .fam.
0 people removed due to missing genotype data (--mind).
Using up to 2 threads (change this with --threads).
Before main variant filters, 67 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.994945.
0 variants removed due to missing genotype data (--geno).
214830 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
382743 variants and 67 people pass filters and QC.
Among remaining phenotypes, 0 are cases and 67 are controls.
Excluding 2143 variants on non-autosomes from distance matrix calc.
Distance matrix calculation complete.
IBS matrix written to plink.mibs , and IDs to plink.mibs.id .

Kale · 11-04-2023, 06:02 PM

(11-03-2023, 10:23 PM)Gordius Wrote:
(11-03-2023, 04:14 AM)Kale Wrote: That they plot together on G25 does not mean they have unique drift. It just means they behave similarly in relation to the other samples that compose the PCA.

They can be far from each other, but they must be closer to each other than to Lithuanians and Estonians. But this is not what I observe. The only option that slightly "improves" the distances for the Bulgarians is setting the --maf parameter (https://zzz.bwh.harvard.edu/plink/thresh.shtml#maf) very high, from 0.1 to 0.3, for example:

plink --bfile myfile --maf 0.2 --matrix

But I'm not sure this is the right approach.

Maybe one of us is misinterpreting the other. Here's a concrete example of what I'm talking about...
Mbuti.DG England_Yarnton_BellBeaker England_Netheravon_BellBeaker Latvia_Zvejnieki_HG 0.00089 2.54 843450
Mbuti.DG England_Netheravon_BellBeaker England_Yarnton_BellBeaker Latvia_Zvejnieki_HG 0.00087 2.46 843450

Despite the two BellBeakers being part of the same admixture event(s) and having the same ratio of components (EEF, WHG, Steppe), those components are significantly different from each other, so each Beaker is pulled more strongly toward the most bottlenecked of its components than toward the other Beaker.
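
For readers unfamiliar with this kind of statistic, here is a minimal Python sketch, assuming the two lines above are f4-style results of the form f4(A, B; C, D) computed from allele frequencies. The frequencies below are invented purely to show the sign convention. The Z-score in real output comes from a block jackknife, which is not shown here:

```python
def f4(pa, pb, pc, pd):
    """Plain f4(A, B; C, D): mean over SNPs of (a - b) * (c - d)."""
    n = len(pa)
    if n == 0 or not (n == len(pb) == len(pc) == len(pd)):
        raise ValueError("need four non-empty frequency lists of equal length")
    return sum((a - b) * (c - d) for a, b, c, d in zip(pa, pb, pc, pd)) / n

# Hypothetical allele frequencies at four SNPs.
outgroup = [0.50, 0.50, 0.50, 0.50]
beaker1 = [0.60, 0.40, 0.70, 0.30]
beaker2 = [0.55, 0.45, 0.60, 0.40]
hunter_gatherer = [0.90, 0.10, 0.90, 0.10]  # strongly drifted population

value = f4(outgroup, beaker1, beaker2, hunter_gatherer)
print(f"f4(outgroup, beaker1; beaker2, hunter_gatherer) = {value:.4f}")
```

A positive value means the second population shares more drift with the fourth than with the third, which is the pattern in the two lines quoted above.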

TanTin · (This post was last modified: 11-04-2023, 06:57 PM by TanTin.)

(11-04-2023, 05:37 PM)Gordius Wrote: But the matrix you posted was calculated with --maf 0.2? With 0.05 I didn't get sensible results for the Bulgarians. The --geno 0.4 and --mind 0.4 options also turned out to make no difference, for example:
plink --bfile mydata --geno 0.4 --mind 0.4 --maf 0.05 --matrix

--geno 0.4 --mind 0.4: these are standard quality-control filters. --mind 0.4 removes individuals with more than 40% missing genotypes, so every individual kept in the test has a call rate of at least 60%. --geno 0.4 does the same per variant: only variants called in at least 60% of the individuals are included in the calculations.
Likewise, --maf 0.2 changes which variants take part in the calculations: only variants with a minor allele frequency of at least 0.2 are kept.
In cases where coverage is good for all the individuals, --geno 0.4 --mind 0.4 may not change anything, because all the individuals and variants pass the QC (quality control). However, this is rare; most of the time many individuals will be rejected for not passing QC.
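
Here is a minimal Python sketch of the two missingness filters on a toy genotype matrix. The function name and the data are hypothetical, and this only mimics the logic rather than reproducing PLINK's implementation. Individuals are filtered first and variants second, which is the order reported in the PLINK log above:

```python
def apply_mind_geno(genos, mind, geno):
    """Return (kept individual indices, kept variant indices).

    genos[i][j] is the genotype of individual i at variant j (None = missing).
    An individual is dropped if its missing rate exceeds `mind`; a variant is
    dropped if its missing rate among the kept individuals exceeds `geno`.
    """
    n_var = len(genos[0])
    kept_ind = [
        i for i, row in enumerate(genos)
        if sum(g is None for g in row) / n_var <= mind
    ]
    if not kept_ind:
        return [], []
    kept_var = [
        j for j in range(n_var)
        if sum(genos[i][j] is None for i in kept_ind) / len(kept_ind) <= geno
    ]
    return kept_ind, kept_var

toy = [
    [0, 1, 2, None],        # 25% missing: kept
    [None, None, None, 1],  # 75% missing: removed by mind=0.4
    [1, 1, None, None],     # 50% missing: removed by mind=0.4
    [2, 0, 1, None],        # 25% missing: kept
]
print(apply_mind_geno(toy, mind=0.4, geno=0.4))
```

With a genotyping rate as high as the 0.994945 in the log, no individual or variant comes anywhere near a 40% missing rate, so it is expected that both filters remove nothing in that run.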

mqas · 12-17-2023, 08:16 AM

Hello everyone. I am currently processing some raw commercial data, and the first batch of 11 samples are all from FTDNA. I have been able to make binary plink files without issue, but when I started to merge the data into a consolidated dataset I ran into problems. The usual merge conflicts occurred; flipping strands and removing the variants that could not be flipped works, in the sense that plink is then able to merge the data. However, I now have a situation where a fair few variants have the same position. The message reads: Warning: Variants 'abc' and 'xyz' have the same position.

After checking the variants in the bim files, they either have the same alleles, or they are flipped. I haven't checked all of them, so some could have different alleles. There are some 2200 variant pairs with the same position.

I assume this is caused by different chips/human genome builds used at the time of testing. Is there a simple way to address this issue, or is it a case of living with less than ideal data?...
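
One way to check all the pairs instead of a sample of them is to scan the merged .bim file. Here is a minimal Python sketch, assuming the standard six-column .bim layout (chromosome, variant ID, cM, base-pair position, allele 1, allele 2). The variant IDs in the toy input are invented:

```python
from collections import defaultdict

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def parse_bim(lines):
    """Yield (chrom, variant_id, bp, a1, a2) from .bim lines."""
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        chrom, vid, _cm, bp, a1, a2 = fields
        yield chrom, vid, int(bp), a1.upper(), a2.upper()

def classify(alleles_x, alleles_y):
    """Compare two unordered allele pairs: same, flipped (strand) or different."""
    x, y = set(alleles_x), set(alleles_y)
    if x == y:
        return "same"
    if {COMPLEMENT.get(a, a) for a in x} == y:
        return "flipped"
    return "different"

def same_position_pairs(lines):
    """List (chrom, bp, id_x, id_y, status) for variants sharing a position."""
    by_pos = defaultdict(list)
    for chrom, vid, bp, a1, a2 in parse_bim(lines):
        by_pos[(chrom, bp)].append((vid, (a1, a2)))
    report = []
    for (chrom, bp), variants in by_pos.items():
        for i in range(len(variants)):
            for j in range(i + 1, len(variants)):
                (id_x, al_x), (id_y, al_y) = variants[i], variants[j]
                report.append((chrom, bp, id_x, id_y, classify(al_x, al_y)))
    return report

toy_bim = [
    "1 rs111 0 1000 A G",
    "1 kgp222 0 1000 G A",
    "2 rs333 0 5000 A C",
    "2 exm444 0 5000 T G",
    "3 rs555 0 7000 A G",
    "3 rs666 0 7000 A C",
    "4 rs777 0 9000 C T",
]
for row in same_position_pairs(toy_bim):
    print(*row, sep="\t")
```

On a real file, the lines would come from something like open("merged.bim"), where the filename is a placeholder. Note that A/T and C/G variants are reported as "same" here, because strand cannot be told apart from the alleles alone.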
