10-08-2023, 11:42 PM
--sample-diff perhaps???
Never messed with it
PLINK. Questions.
10-27-2023, 09:19 PM
It seems that IBS and --distance (effectively an analogue of IBS) are, to some extent, indicators of the distance between samples. As far as I understand, clustering and PCA are based on IBS distances. I have a situation with IBS that I can't explain. I extracted a small subset with several populations from the Reich HO file: Lithuanians, Estonians, Ukrainians, Belarusians, Hungarians, Bulgarians and Mordovians. When I built the IBS matrix and sorted it for, say, the Lithuanians, the closest samples were mostly Lithuanians, and the distances were relatively small. For the Estonians, the closest were likewise Estonians, again at small distances. But for Bulgarians, the nearest were not necessarily Bulgarians, but various samples, including Lithuanians, Estonians, Bulgarians, etc. The same holds for Hungarians and Mordovians. And the distances were large, three times greater than those from Lithuanians or Estonians to their nearest samples. Apparently something goes wrong with certain populations. I thought this might be the result of a high percentage of missing calls among the Bulgarians, but no: I checked with --missing, and the percentage is about the same as among the Lithuanians. What could be the problem?
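For reference, here is a minimal sketch of what an IBS comparison between two samples amounts to. It assumes genotypes coded as allele counts (0, 1 or 2) with None for a missing call, and it skips any SNP that is missing in either sample. The function names are made up for illustration, and PLINK's own handling of missing calls and any weighting may differ from this.

```python
from typing import Optional, Sequence


def ibs_similarity(a: Sequence[Optional[int]], b: Sequence[Optional[int]]) -> float:
    """Proportion of alleles shared IBS over SNPs called in both samples.

    Genotypes are allele counts (0, 1 or 2); None marks a missing call.
    """
    shared = 0    # alleles shared IBS, summed over usable SNPs
    compared = 0  # total alleles compared (2 per usable SNP)
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue  # assumption: SNPs missing in either sample are skipped
        shared += 2 - abs(ga - gb)
        compared += 2
    if compared == 0:
        raise ValueError("no SNPs called in both samples")
    return shared / compared


def ibs_distance(a: Sequence[Optional[int]], b: Sequence[Optional[int]]) -> float:
    """Distance on the 1-IBS scale."""
    return 1.0 - ibs_similarity(a, b)
```

On this scale 1.0 means identical genotypes at every jointly called SNP, and the distance is simply 1 - IBS.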
10-31-2023, 05:45 PM
(10-27-2023, 09:19 PM)Gordius Wrote: [...] But for Bulgarians, the nearest were not necessarily Bulgarians, but various samples, including Lithuanians, Estonians, Bulgarians, etc. [...] What could be the problem?
Could the effect that applies to f-stat derived calculations be at play here? That is, members of a recently admixed population whose sources are relatively divergent will be closer to the sources than to each other.
10-31-2023, 09:25 PM
(10-31-2023, 05:45 PM)Kale Wrote: (10-27-2023, 09:19 PM)Gordius Wrote: [...] But for Bulgarians, the nearest were not necessarily Bulgarians, but various samples, including Lithuanians, Estonians, Bulgarians, etc. [...] What could be the problem?
Bulgarians cannot be called a recently admixed population. Or rather, they may be admixed, but not to the extent of, for example, Brazilians, who have European, African and Amerindian components in different proportions from person to person. Some of those Bulgarian samples are on G25, and there they are definitely much closer to each other than to Lithuanians and Estonians. I'm not sure what the problem is. I've already used --geno to filter out the SNPs that might be present in some samples and not in others, but without result.
11-02-2023, 02:17 PM
(10-31-2023, 09:25 PM)Gordius Wrote: (10-31-2023, 05:45 PM)Kale Wrote: (10-27-2023, 09:19 PM)Gordius Wrote: [...] But for Bulgarians, the nearest were not necessarily Bulgarians, but various samples, including Lithuanians, Estonians, Bulgarians, etc. [...] What could be the problem?
Cluster identification in plink is based on more principal components than just the first 2. What you see on the first 2 PCs doesn't show the extra detail carried by the other, hidden components. To see the full picture for the clusters, you should look beyond the first 2 PCs. plink is not ideal for cluster identification, and it doesn't offer many options for specifying the clustering criteria. There are many other tools for hierarchical clustering. However, plink is a very powerful tool for generating PCA data, and the same data can be used by other hierarchical clustering apps.
11-02-2023, 05:46 PM
(11-02-2023, 02:17 PM)TanTin Wrote: (10-31-2023, 09:25 PM)Gordius Wrote: (10-31-2023, 05:45 PM)Kale Wrote: (10-27-2023, 09:19 PM)Gordius Wrote: [...] But for Bulgarians, the nearest were not necessarily Bulgarians, but various samples, including Lithuanians, Estonians, Bulgarians, etc. [...] What could be the problem?
Do you mean PCA? Yes, I know that on average each of the first 20 eigenvectors captures only about 1% (plus or minus) of the total variability, so its results should be treated very carefully. But I didn't mean PCA, but the IBS matrix (as well as the similar distance matrix). In these matrices, the total difference between genotypes should be calculated.
11-02-2023, 06:07 PM
(11-02-2023, 05:46 PM)Gordius Wrote: [...] But I didn't mean PCA, but the IBS matrix (as well as the similar distance matrix). In these matrices, the total difference between genotypes should be calculated.
https://www.cog-genomics.org/plink/1.9/strat
https://zzz.bwh.harvard.edu/plink/strat.shtml
Quote: Dimension reduction
PCA is just a reduction from the distance matrix. However, if you wish, you may increase the number of principal components and use a lot more. There is an option to specify the number of PCs.
By default, --pca extracts the top 20 principal components of the variance-standardized relationship matrix;
11-02-2023, 07:24 PM
(11-02-2023, 06:07 PM)TanTin Wrote: PCA is just a reduction from the distance matrix. However, if you wish, you may increase the number of principal components and use a lot more.
You didn't quite understand me. In this case, PCA is of less interest to me. I'm interested in the matrices produced by the following commands:
plink --bfile mydata --matrix
plink --bfile mydata --distance
plink --bfile mydata --distance-matrix
According to the results, some populations from the HO file behave strangely.
11-02-2023, 07:40 PM
(11-02-2023, 07:24 PM)Gordius Wrote: (11-02-2023, 06:07 PM)TanTin Wrote: PCA is just a reduction from the distance matrix. However, if you wish, you may increase the number of principal components and use a lot more.
By using --distance-matrix you will get plink.mdist as the result.
Quote: The default behavior of --matrix is to output similarities (proportions of alleles IBS). To generate a distance matrix (1-IBS) then use the command
So:
IBS is the similarity matrix // plink.mibs - symmetric matrix of the IBS distances (similarities)
1-IBS is the distance matrix
Quote: Backwards compatibility
plink.mdist is used as the raw data for the PCA.
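To redo the "sort the matrix for one sample" step in a script, here is a rough Python sketch. It assumes a square, whitespace-delimited matrix such as the one written to plink.mdist, plus a companion ID file with FID and IID on each line (PLINK 1.9 writes plink.mdist.id; older builds may not). The helper names are hypothetical and not part of PLINK.

```python
from typing import List, Tuple


def parse_square_matrix(text: str) -> List[List[float]]:
    """Parse a whitespace-delimited square matrix (one row per line)."""
    rows = [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]
    if any(len(row) != len(rows) for row in rows):
        raise ValueError("matrix is not square")
    return rows


def parse_ids(text: str) -> List[Tuple[str, str]]:
    """Parse (FID, IID) pairs, one sample per line, in matrix order."""
    ids = []
    for line in text.splitlines():
        if line.strip():
            fid, iid = line.split()[:2]
            ids.append((fid, iid))
    return ids


def nearest_neighbours(dist: List[List[float]],
                       ids: List[Tuple[str, str]],
                       iid: str,
                       k: int = 5) -> List[Tuple[float, Tuple[str, str]]]:
    """Return the k samples with the smallest distance to sample `iid`."""
    matches = [i for i, (_, sample) in enumerate(ids) if sample == iid]
    if not matches:
        raise KeyError(f"sample {iid!r} not found")
    row = dist[matches[0]]
    # Exclude the sample itself, then sort by distance (ascending).
    others = [(row[j], ids[j]) for j in range(len(ids)) if j != matches[0]]
    others.sort(key=lambda pair: pair[0])
    return others[:k]
```

With real output you would pass in the contents of plink.mdist and plink.mdist.id, e.g. `parse_square_matrix(open("plink.mdist").read())`. For a similarity matrix (plink.mibs) the sort order would have to be reversed, since larger values mean closer samples.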
11-02-2023, 09:09 PM
11-03-2023, 04:14 AM
(10-31-2023, 09:25 PM)Gordius Wrote: (10-31-2023, 05:45 PM)Kale Wrote: (10-27-2023, 09:19 PM)Gordius Wrote: [...] But for Bulgarians, the nearest were not necessarily Bulgarians, but various samples, including Lithuanians, Estonians, Bulgarians, etc. [...] What could be the problem?
Sure, maybe "recent" wasn't the best term. I mean populations without post-admixture bottlenecks. That they plot together on G25 does not mean they have unique drift. It just means they behave similarly in relation to the other samples that compose the PCA.
(11-02-2023, 09:09 PM)TanTin Wrote: (11-02-2023, 08:57 PM)Gordius Wrote: Yes, and according to plink.mdist, the distance from a Bulgarian sample to another Bulgarian sample is no closer than from a Bulgarian sample to a Lithuanian or Estonian.
Here they are in .fam format:
Quote: 2315 bel110c 0 0 1 1
They are from the file v50.0_HO_public. If you have this HO file in plink format, you can run the command directly after pasting this population list into keep.txt:
plink --bfile v50.0_HO_public --keep keep.txt --matrix
(11-03-2023, 04:14 AM)Kale Wrote: That they plot together on G25 does not mean they have unique drift. It just means they behave similarly in relation to the other samples that compose the PCA.
They can be far from each other, but they must be closer to each other than to Lithuanians and Estonians. But this is not observed. The only command that slightly "improves" the distances for the Bulgarians is setting the maf threshold https://zzz.bwh.harvard.edu/plink/thresh.shtml#maf very high, from 0.1 to 0.3, for example:
plink --bfile myfile --maf 0.2 --matrix
But I'm not sure this is the right approach.
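One way to test the expectation that Bulgarians "must be closer to each other than to Lithuanians and Estonians" is to average the distances within and between populations rather than reading individual rows. A minimal sketch, assuming the square distance matrix is already loaded as a list of lists and that you supply one population label per sample in the same order (the .fam line quoted above carries a numeric family ID, not a population name, so the labels have to come from elsewhere). The function name is made up for illustration.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def mean_distance_by_population(
    dist: Sequence[Sequence[float]],
    pops: Sequence[str],
) -> Dict[Tuple[str, str], float]:
    """Mean pairwise distance for every (population, population) pair.

    Keys are sorted label pairs, so ("A", "A") is the within-population
    mean and ("A", "B") the between-population mean.
    """
    if len(dist) != len(pops):
        raise ValueError("need exactly one population label per matrix row")
    sums: Dict[Tuple[str, str], float] = defaultdict(float)
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    # Each unordered pair of distinct samples is counted once.
    for i, j in combinations(range(len(pops)), 2):
        a, b = sorted((pops[i], pops[j]))
        sums[(a, b)] += dist[i][j]
        counts[(a, b)] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

If the within-population mean for one group comes out no smaller than its between-population means, that confirms the pattern across all pairs instead of only for a few nearest neighbours.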
11-03-2023, 11:04 PM
Here are the projected individuals. I also added Macedonian IA. The Bulgarian HO seems to be very homogeneous. However, for the moment I can't provide any explanation for your results.