My suspicion is that with this command:
plink --bfile myfile --maf 0.2 --matrix
you don't do enough quality control.
If you run your test on a small group, local components or group-specific allele frequencies may take precedence over the global components.
Try for example this:
--geno 0.4 --mind 0.4 --maf 0.05
Here are the projected individuals.
I also added Macedonian IA.
The Bulgarian HO seems to be very homogeneous.
However for the moment I can't provide any explanation for your results.
Is it by the first two eigenvectors? Try to add the coordinates of more eigenvectors of these samples (for example, the first 20, which are issued by default) to Vahaduo Admixture JS and see the total distances from the Bulgarians to other samples.
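A minimal sketch of that comparison, assuming the eigenvector coordinates have already been read into plain lists. The sample names and numbers are invented for illustration (3 eigenvectors instead of 20), and plain Euclidean distance is used here as an assumption about what the distance tool reports:

```python
import math

def euclidean_distances(coords, target):
    """Return (sample, distance) pairs sorted by distance to `target`.

    coords maps a sample ID to its list of eigenvector coordinates.
    """
    ref = coords[target]
    result = []
    for sample, vec in coords.items():
        if sample == target:
            continue
        result.append((sample, math.dist(ref, vec)))
    # Smallest distance first: the closest sample comes out on top.
    return sorted(result, key=lambda pair: pair[1])

# Hypothetical coordinates, made-up values.
coords = {
    "Bulgarian_1": [0.10, 0.02, -0.01],
    "Bulgarian_2": [0.11, 0.03, -0.02],
    "Lithuanian_1": [0.30, -0.20, 0.05],
}

for sample, d in euclidean_distances(coords, "Bulgarian_1"):
    print(f"{sample}\t{d:.4f}")
```

With more eigenvectors per sample the same function works unchanged, since math.dist accepts vectors of any equal length.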
This is a similarity matrix: 1 is the similarity of a sample to itself, and values closer to 0 mean less similarity, i.e. more distant samples.
Depending on the selected reference, all 7 Bulgarian samples fall within the 10 nearest samples, or at most within the 17 nearest. So they still form a cluster.
The data provided by plink is consistent.
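For anyone who wants to reproduce that kind of ranking, here is a rough sketch. It assumes the similarity matrix is a full square, whitespace-separated table and that the sample IDs are available in the same order; the IDs and values below are made up, not taken from the real plink.mibs:

```python
def parse_square_matrix(text):
    """Parse a whitespace-separated square matrix (one row per line)."""
    return [[float(x) for x in line.split()] for line in text.strip().splitlines()]

def rank_by_similarity(ids, matrix, target):
    """Rank all other samples by similarity to `target`, most similar first."""
    row = matrix[ids.index(target)]
    others = [(sid, sim) for sid, sim in zip(ids, row) if sid != target]
    return sorted(others, key=lambda pair: pair[1], reverse=True)

# Hypothetical IDs and similarity values, for illustration only.
ids = ["B1", "B4", "LT1", "EE1"]
mibs_text = """
1 0.78 0.74 0.73
0.78 1 0.75 0.74
0.74 0.75 1 0.77
0.73 0.74 0.77 1
"""

matrix = parse_square_matrix(mibs_text)
ranking = rank_by_similarity(ids, matrix, "B4")
for rank, (sid, sim) in enumerate(ranking, start=1):
    print(rank, sid, sim)
```

Counting at which rank the last Bulgarian sample appears gives the "first 10" or "first 17" figure for each reference.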
(11-04-2023, 03:43 PM)TanTin Wrote: Here is the distance to B4:
Maybe because you used --maf 0.2? I said in one of the previous posts that big parameters for --maf (0.1-0.3) improve the distances for Bulgarians. But is it correct?
11-04-2023, 05:03 PM (This post was last modified: 11-04-2023, 05:04 PM by TanTin.)
(11-04-2023, 04:42 PM)Gordius Wrote: Maybe because you used --maf 0.2? I said in one of the previous posts that big parameters for --maf (0.1-0.3) improve the distances for Bulgarians. But is it correct?
A commonly used threshold is --maf 0.05 (if --maf is given without a value, PLINK itself defaults to 0.01).
Minor allele frequency (MAF) is the frequency at which the second most common allele occurs in a given population. Low-MAF variants play a surprising role in heritability, since variants which occur only once, known as "singletons", drive an enormous amount of selection.
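To make the definition concrete, here is a small sketch of how MAF can be computed for one biallelic variant and compared against a threshold. The genotype coding (0, 1 or 2 copies of one allele, None for a missing call) and the function names are my own assumptions for illustration, not PLINK internals:

```python
def minor_allele_frequency(genotypes):
    """MAF for one biallelic variant.

    genotypes: per-sample count of one allele (0, 1 or 2), None if missing.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))  # two alleles per diploid sample
    return min(freq, 1 - freq)              # the rarer of the two alleles

def passes_maf(genotypes, threshold=0.05):
    """True if the variant would survive a MAF filter at `threshold`."""
    return minor_allele_frequency(genotypes) >= threshold

# A singleton among 20 samples: one copy out of 40 alleles.
singleton = [1] + [0] * 19
# A more common variant among 10 samples: 5 copies out of 20 alleles.
common = [2, 1, 1, 0, 0, 1, 0, 0, 0, 0]

print(minor_allele_frequency(singleton), passes_maf(singleton, 0.05))
print(minor_allele_frequency(common), passes_maf(common, 0.2))
```

Raising the threshold from 0.05 to 0.2 removes every variant whose rarer allele is below 20%, which is why the set of variants entering the matrix changes so much.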
I usually don't play with --maf, I keep it at 0.05.
P.S. I used --maf 0.2 only to replicate your results.
11-04-2023, 05:37 PM (This post was last modified: 11-04-2023, 05:38 PM by Gordius.)
(11-04-2023, 05:03 PM)TanTin Wrote:
I usually don't play with --maf, I keep it at 0.05.
P.S. I used --maf 0.2 only to replicate your results.
But was the matrix you posted calculated using --maf 0.2? With 0.05 I didn't get normal results for the Bulgarians. The --geno 0.4 and --mind 0.4 options also turned out to be unnecessary, for example:
plink --bfile mydata --geno 0.4 --mind 0.4 --maf 0.05 --matrix
Quote:
597573 variants loaded from .bim file.
67 people (60 males, 7 females) loaded from .fam.
67 phenotype values loaded from .fam.
0 people removed due to missing genotype data (--mind).
Using up to 2 threads (change this with --threads).
Before main variant filters, 67 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.994945.
0 variants removed due to missing genotype data (--geno).
214830 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
382743 variants and 67 people pass filters and QC.
Among remaining phenotypes, 0 are cases and 67 are controls.
Excluding 2143 variants on non-autosomes from distance matrix calc.
Distance matrix calculation complete.
IBS matrix written to plink.mibs , and IDs to plink.mibs.id .
(11-03-2023, 04:14 AM)Kale Wrote: That they plot together on G25 does not mean they have unique drift. It just means they behave similarly in relation to the other samples that compose the PCA.
They can be far from each other but must be closer to each other than to Lithuanians and Estonians. But this is not observed. The only thing that slightly "improves" the distances for the Bulgarians is setting the --maf parameter (https://zzz.bwh.harvard.edu/plink/thresh.shtml#maf) very high, between 0.1 and 0.3, for example:
plink --bfile myfile --maf 0.2 --matrix
But I'm not sure this is the right approach.
Maybe one of us is misinterpreting the other, here's a concrete example of what I'm talking about...
Mbuti.DG England_Yarnton_BellBeaker England_Netheravon_BellBeaker Latvia_Zvejnieki_HG 0.00089 2.54 843450
Mbuti.DG England_Netheravon_BellBeaker England_Yarnton_BellBeaker Latvia_Zvejnieki_HG 0.00087 2.46 843450
Despite the two BellBeakers being part of the same admixture event(s) and having the same ratio of components (EEF, WHG, Steppe), because those components are significantly different from each other, the Beakers attract more to the most bottlenecked of their components than to each other.
11-04-2023, 06:40 PM (This post was last modified: 11-04-2023, 06:57 PM by TanTin.)
(11-04-2023, 05:37 PM)Gordius Wrote: But was the matrix you posted calculated using --maf 0.2? With 0.05 I didn't get normal results for the Bulgarians. The --geno 0.4 and --mind 0.4 options also turned out to be unnecessary, for example:
plink --bfile mydata --geno 0.4 --mind 0.4 --maf 0.05 --matrix
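--geno 0.4 --mind 0.4: these are standard quality-control filters. --mind 0.4 removes individuals with more than 40% missing genotypes, so every individual included in the test has at least 60% of the variants called. --geno 0.4 removes variants that are missing in more than 40% of the individuals, so only variants called in at least 60% of them are included in the calculations.
Likewise, --maf 0.2 changes which variants take part in the calculations.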
In some cases, where coverage is good for all individuals, --geno 0.4 --mind 0.4 may not change anything: all individuals pass QC (quality control). However, this is rare; most of the time many individuals will be rejected for failing QC.
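A toy sketch of what the two filters do, in the same order as in the log (samples first with --mind, then variants with --geno). The data layout, names and values are invented for illustration; this is not how PLINK stores genotypes internally:

```python
def missing_rate(values):
    """Fraction of missing calls (None) in a list of genotype calls."""
    return sum(v is None for v in values) / len(values)

def apply_mind_geno(genotypes, mind=0.4, geno=0.4):
    """Return (kept_samples, kept_variant_indices).

    genotypes: dict of sample ID -> list of calls, None meaning missing,
    with all samples sharing the same variant order.
    """
    # Sample filter: drop samples whose missing rate exceeds `mind`.
    kept_samples = [s for s, calls in genotypes.items()
                    if missing_rate(calls) <= mind]
    n_variants = len(next(iter(genotypes.values())))
    kept_variants = []
    # Variant filter: computed only over the samples that survived.
    for j in range(n_variants):
        column = [genotypes[s][j] for s in kept_samples]
        if column and missing_rate(column) <= geno:
            kept_variants.append(j)
    return kept_samples, kept_variants

# Hypothetical low-coverage example: S3 is missing 80% of its calls.
genotypes = {
    "S1": [0, 1, 2, None, 1],
    "S2": [1, 1, 2, None, 0],
    "S3": [None, None, None, None, 2],
}

samples, variants = apply_mind_geno(genotypes)
print(samples)   # samples that pass the sample filter
print(variants)  # indices of variants that pass the variant filter
```

With a high-coverage dataset like the 67-sample one (genotyping rate 0.994945), nothing exceeds either threshold, which matches the "0 people removed" and "0 variants removed" lines in the log.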
Hello everyone. I am currently processing some raw commercial data, and the first batch of 11 samples is all from FTDNA. I have been able to make binary plink files without issue, but when I started to merge the data into a consolidated dataset I ran into problems. The usual merge conflicts occurred; flipping strands and removing the variants that could not be flipped works, in the sense that plink is then able to merge the data. However, I now have a situation where a fair number of variants have the same position. The warning reads: Warning: Variants 'abc' and 'xyz' have the same position.
After checking the variants in the bim files, they either have the same alleles, or they are flipped. I haven't checked all of them, so some could have different alleles. There are some 2200 variant pairs with the same position.
I assume this is caused by different chips/human genome builds used at the time of testing. Is there a simple way to address this issue, or is it a case of living with less than ideal data?...
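As a starting point for checking all the pairs rather than a handful, something like the sketch below could list the same-position variants from a .bim file and sort them into "same alleles", "flipped" and "different alleles". The variant IDs and positions in the example are invented, and the classification is deliberately crude: A/T and C/G variants are strand-ambiguous, so they will always be reported as "same":

```python
from collections import defaultdict

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def parse_bim(text):
    """Parse .bim lines: chrom, variant ID, cM, bp position, allele 1, allele 2."""
    variants = []
    for line in text.strip().splitlines():
        chrom, vid, _cm, pos, a1, a2 = line.split()
        variants.append((chrom, int(pos), vid, a1, a2))
    return variants

def same_position_groups(variants):
    """Map (chrom, position) -> variants there, only where more than one exists."""
    by_pos = defaultdict(list)
    for chrom, pos, vid, a1, a2 in variants:
        by_pos[(chrom, pos)].append((vid, a1, a2))
    return {k: v for k, v in by_pos.items() if len(v) > 1}

def classify_pair(v1, v2):
    """Return 'same', 'flipped' or 'different' based on the allele sets."""
    set1 = {v1[1], v1[2]}
    set2 = {v2[1], v2[2]}
    if set1 == set2:
        return "same"
    # Complement each allele; unknown codes such as '0' are left unchanged.
    if {COMPLEMENT.get(a, a) for a in set1} == set2:
        return "flipped"
    return "different"

# Hypothetical .bim content, for illustration only.
bim_text = """
1 rs111 0 1000 A G
1 kgp222 0 1000 T C
1 rs333 0 2000 C T
2 rs444 0 1000 A G
2 exm555 0 1000 A G
3 rs666 0 500 A C
3 rs777 0 500 A G
"""

groups = same_position_groups(parse_bim(bim_text))
for (chrom, pos), members in sorted(groups.items()):
    print(chrom, pos, [m[0] for m in members], classify_pair(members[0], members[1]))
```

Pairs reported as "same" or "flipped" are probably the same variant under two names, while "different" pairs would need a closer look before deciding which one to keep.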