PLINK. Questions.
#16
--sample-diff perhaps???

Never messed with it
#17
It seems that IBS and --distance (effectively an analogue of IBS) both indicate, to some extent, the distances between samples. As far as I understand, clustering and PCA are based on IBS distances. I have a situation with IBS that I can't explain. I extracted a small subset with several populations from the Reich HO file: Lithuanians, Estonians, Ukrainians, Belarusians, Hungarians, Bulgarians and Mordovians. When I built the IBS matrix and sorted it, the samples closest to each Lithuanian were mostly other Lithuanians, and the distances were relatively small. Likewise, the Estonians were closest to other Estonians, again at small distances. But for Bulgarians, the nearest samples were not necessarily Bulgarians; they were a mix that included Lithuanians, Estonians, Bulgarians and others. The same was true for Hungarians and Mordovians. Their distances were also large, about three times greater than the distances from Lithuanians or Estonians to their nearest samples. So some populations seem to be affected by a problem. I thought it might be caused by a high percentage of missing calls among the Bulgarians, but I checked with --missing and the percentage is about the same as among the Lithuanians. What could the problem be?
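
A minimal sketch of the sorting step described above, assuming the distance matrix has already been loaded into a list of lists in the same order as the sample IDs. The function name nearest_neighbours and the toy distances are made up for illustration; they are not values from the HO file.

```python
def nearest_neighbours(ids, dist, k=3):
    """Return {sample_id: [(neighbour_id, distance), ...]} with the k closest other samples."""
    result = {}
    for i, sample in enumerate(ids):
        # Pair every other sample with its distance to the current one.
        others = [(ids[j], dist[i][j]) for j in range(len(ids)) if j != i]
        others.sort(key=lambda pair: pair[1])
        result[sample] = others[:k]
    return result


# Hypothetical toy input: four sample IDs and a symmetric 1-IBS distance matrix.
ids = ["lithuania1", "lithuania2", "BulgarianA1", "BulgarianB1"]
dist = [
    [0.00, 0.20, 0.26, 0.27],
    [0.20, 0.00, 0.27, 0.26],
    [0.26, 0.27, 0.00, 0.28],
    [0.27, 0.26, 0.28, 0.00],
]

for sample, neighbours in nearest_neighbours(ids, dist, k=2).items():
    print(sample, neighbours)
```

In this invented matrix the nearest neighbour of each toy Bulgarian is a Lithuanian, which is the pattern described in the post.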
#18
Could the effect that applies to f-stat-derived calculations be at play here?
That is, members of a recently admixed population whose sources are relatively divergent will be closer to the sources than to each other.
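
A toy single-locus calculation of how this can happen with allele-sharing distances. It assumes random mating and a 50/50 admixture, and all three allele frequencies are invented; it is only an illustration of the arithmetic, not a claim about the Bulgarian samples.

```python
def mismatch(p, q):
    """Probability that one allele drawn at frequency p differs from one drawn at frequency q."""
    return p * (1 - q) + q * (1 - p)


# Hypothetical allele frequencies at one SNP.
p_source_a = 0.05   # source with little variation at this SNP
p_source_b = 0.60   # divergent second source
p_admixed = 0.5 * p_source_a + 0.5 * p_source_b  # 50/50 mixture

within_admixed = mismatch(p_admixed, p_admixed)  # two admixed individuals
admixed_vs_a = mismatch(p_admixed, p_source_a)   # admixed individual vs source A
admixed_vs_b = mismatch(p_admixed, p_source_b)   # admixed individual vs source B

print(within_admixed, admixed_vs_a, admixed_vs_b)
```

At this locus the admixed individuals are expected to mismatch each other more often than they mismatch source A, but less often than source B, so the effect depends on which source is less variable.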
#19
(10-31-2023, 05:45 PM)Kale Wrote:
Could the effect that applies to f-stat-derived calculations be at play here?
That is, members of a recently admixed population whose sources are relatively divergent will be closer to the sources than to each other.

Bulgarians cannot be called a recently admixed population. Or rather, they may be admixed, but not to the extent of, for example, Brazilians, who carry European, African and Amerindian components in proportions that differ from person to person. Some of those Bulgarian samples are on G25, and there they are definitely much closer to each other than to Lithuanians and Estonians. I'm not sure what the problem is. I've already used --geno to filter out SNPs that might be present in some samples and not in others, but it made no difference.
#20
(10-31-2023, 09:25 PM)Gordius Wrote:
Bulgarians cannot be called a recently admixed population. Or rather, they may be admixed, but not to the extent of, for example, Brazilians, who carry European, African and Amerindian components in proportions that differ from person to person. Some of those Bulgarian samples are on G25, and there they are definitely much closer to each other than to Lithuanians and Estonians. I'm not sure what the problem is. I've already used --geno to filter out SNPs that might be present in some samples and not in others, but it made no difference.

Cluster identification in plink is based on more principal components than just the first 2.
What you see on the first 2 PCs doesn't show the extra detail in the other, hidden components. To see all the detail in the clusters, you need to look at those further components. plink is not ideal for cluster identification, and it offers few options for specifying the clustering criteria. There are many other tools for hierarchical clustering. However, plink is a very powerful tool for generating PCA data, and the same data can be used by other hierarchical clustering apps.
#21
(11-02-2023, 02:17 PM)TanTin Wrote:
Cluster identification in plink is based on more principal components than just the first 2.
What you see on the first 2 PCs doesn't show the extra detail in the other, hidden components. To see all the detail in the clusters, you need to look at those further components. plink is not ideal for cluster identification, and it offers few options for specifying the clustering criteria. There are many other tools for hierarchical clustering. However, plink is a very powerful tool for generating PCA data, and the same data can be used by other hierarchical clustering apps.

Are you talking about PCA? Yes, I know that on average each of the first 20 eigenvectors captures only about 1% (give or take) of the total variability, so its results should be treated very carefully. But I didn't mean PCA; I meant the IBS matrix (and the similar distance matrix). These matrices should reflect the total difference between genotypes.
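
A small sketch of how the share of variability per eigenvector can be computed. It assumes you supply the total variance yourself (the sum of all eigenvalues, not just the top 20 that get reported); the function name and the numbers are hypothetical.

```python
def explained_share(eigenvalues, total_variance):
    """Fraction of the total variance captured by each listed eigenvalue."""
    if total_variance <= 0:
        raise ValueError("total_variance must be positive")
    return [value / total_variance for value in eigenvalues]


# Hypothetical values: three leading eigenvalues and an assumed total variance.
top_eigenvalues = [2.0, 1.5, 1.0]
total = 100.0  # assumption: sum of ALL eigenvalues, including the unreported ones

shares = explained_share(top_eigenvalues, total)
for rank, share in enumerate(shares, start=1):
    print(f"PC{rank}: {share:.1%}")
```

Dividing by the sum of only the reported eigenvalues would overstate each share, which is why the total is passed in separately here.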
#22
(11-02-2023, 05:46 PM)Gordius Wrote: Are you talking about PCA? Yes, I know that on average each of the first 20 eigenvectors captures only about 1% (give or take) of the total variability, so its results should be treated very carefully. But I didn't mean PCA; I meant the IBS matrix (and the similar distance matrix). These matrices should reflect the total difference between genotypes.

https://www.cog-genomics.org/plink/1.9/strat
https://zzz.bwh.harvard.edu/plink/strat.shtml

Quote:Dimension reduction
PLINK 1.9 provides two dimension reduction routines: --pca, for principal components analysis (PCA) based on the variance-standardized relationship matrix, and --mds-plot, for multidimensional scaling (MDS) based on raw Hamming distances. Top principal components are generally used as covariates in association analysis regressions to help correct for population stratification, while MDS coordinates help with visualizing genetic distances.

PCA is just a reduction from the distance matrix. However, if you wish, you can increase the number of principal components and use many more.
There is an option to specify the number of PCs.
By default, --pca extracts the top 20 principal components of the variance-standardized relationship matrix.
#23
(11-02-2023, 06:07 PM)TanTin Wrote: PCA is just a reduction from the distance matrix. However, if you wish, you can increase the number of principal components and use many more.
There is an option to specify the number of PCs.
By default, --pca extracts the top 20 principal components of the variance-standardized relationship matrix.

You didn't quite understand me. In this case I'm less interested in PCA. I'm only interested in the matrices produced by the following commands:
plink --bfile mydata --matrix          
plink --bfile mydata --distance       
plink --bfile mydata --distance-matrix

According to the results some populations from the HO file behave strangely.
#24
(11-02-2023, 07:24 PM)Gordius Wrote:
You didn't quite understand me. In this case I'm less interested in PCA. I'm only interested in the matrices produced by the following commands:
plink --bfile mydata --matrix          
plink --bfile mydata --distance       
plink --bfile mydata --distance-matrix

According to the results some populations from the HO file behave strangely.

Using --distance-matrix, you will get plink.mdist as the result.


Quote:The default behavior of --matrix is to output similarities (proportions of alleles IBS). To generate a distance matrix (1-IBS) then use the command
plink --file mydata --cluster --distance-matrix
instead. This will generate a file
    plink.mdist



So IBS is the similarity matrix (plink.mibs, a symmetric matrix of IBS similarities), and
1-IBS is the distance matrix.
Quote:Backwards compatibility
--distance-matrix
--ibs-matrix

These deprecated flags generate space-delimited text matrices, and are included for backwards compatibility with scripts relying on the corresponding PLINK 1.07 flags. New scripts should migrate to "--distance 1-ibs flat-missing" and "--distance ibs flat-missing".

Note that you are no longer required to use these flags in conjunction with --cluster.

plink.mdist is used as the raw data for the PCA.
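
To make the IBS versus 1-IBS relationship concrete, here is a rough sketch of a pairwise calculation on genotypes coded as 0, 1 or 2 copies of one allele, with None for a missing call. It simply skips sites where either call is missing, which is an assumption of this sketch and may differ from how PLINK's flat-missing modifier treats missing data. The function names and genotypes are made up.

```python
def ibs_similarity(g1, g2):
    """Proportion of alleles shared IBS over sites where both genotypes are called."""
    shared = 0
    called = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip sites with a missing call in either sample
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles shared at this site
        called += 1
    if called == 0:
        raise ValueError("no overlapping called sites")
    return shared / (2 * called)


def ibs_distance(g1, g2):
    """Distance defined as 1 - IBS similarity."""
    return 1.0 - ibs_similarity(g1, g2)


# Hypothetical genotypes for two samples at five SNPs.
sample_a = [0, 1, 2, None, 2]
sample_b = [0, 2, 0, 1, 2]

print(ibs_similarity(sample_a, sample_b))
print(ibs_distance(sample_a, sample_b))
```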
#25
(11-02-2023, 07:40 PM)TanTin Wrote: plink.mdist is used as the raw data for the PCA.


Yes, and according to plink.mdist, the distance from one Bulgarian sample to another is no smaller than the distance from a Bulgarian sample to a Lithuanian or Estonian.
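
One way to check this kind of claim systematically is to average the distances within and between populations. The sketch below assumes the matrix is already in memory in the same order as the sample IDs; the population labels, the function name and the distances are all invented for illustration.

```python
from collections import defaultdict


def mean_distance_by_population(ids, pops, dist):
    """Average pairwise distance for every (population, population) pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            # Sort the two labels so (A, B) and (B, A) share one key.
            key = tuple(sorted((pops[ids[i]], pops[ids[j]])))
            sums[key] += dist[i][j]
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical toy data: a symmetric 1-IBS matrix for four samples.
ids = ["lithuania1", "lithuania2", "BulgarianA1", "BulgarianB1"]
pops = {
    "lithuania1": "Lithuanian",
    "lithuania2": "Lithuanian",
    "BulgarianA1": "Bulgarian",
    "BulgarianB1": "Bulgarian",
}
dist = [
    [0.00, 0.20, 0.26, 0.27],
    [0.20, 0.00, 0.27, 0.26],
    [0.26, 0.27, 0.00, 0.28],
    [0.27, 0.26, 0.28, 0.00],
]

means = mean_distance_by_population(ids, pops, dist)
for pair, value in sorted(means.items()):
    print(pair, round(value, 4))
```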
#26
(11-02-2023, 08:57 PM)Gordius Wrote: Yes, and according to plink.mdist, the distance from one Bulgarian sample to another is no smaller than the distance from a Bulgarian sample to a Lithuanian or Estonian.

Please provide the IDs of these samples. There are many possible reasons.
#27
(10-31-2023, 09:25 PM)Gordius Wrote:
Bulgarians cannot be called a recently admixed population. Or rather, they may be admixed, but not to the extent of, for example, Brazilians, who carry European, African and Amerindian components in proportions that differ from person to person. Some of those Bulgarian samples are on G25, and there they are definitely much closer to each other than to Lithuanians and Estonians. I'm not sure what the problem is. I've already used --geno to filter out SNPs that might be present in some samples and not in others, but it made no difference.

Sure, maybe "recent" wasn't the best term. I mean populations without post-admixture bottlenecks.
That they plot together on G25 does not mean they have unique drift. It just means they behave similarly in relation to the other samples that compose the PCA.
#28
(11-02-2023, 09:09 PM)TanTin Wrote:
Please provide the IDs of these samples. There are many possible reasons.

Here they are in .fam format:

Quote:
2315 bel110c 0 0 1 1
2337 bel23s 0 0 1 1
2280 bel30s 0 0 1 1
2292 bel72c 0 0 1 1
2326 bel8s 0 0 1 1
2303 bel93c 0 0 1 1
2327 belarusian23vp 0 0 1 1
2338 belarusian47zp 0 0 1 1
2105 BulgarianA1 0 0 1 1
2116 BulgarianB1 0 0 1 1
2095 BulgarianB4 0 0 1 1
2127 BulgarianC1 0 0 1 1
2094 BulgarianE2 0 0 1 1
2138 BulgarianF1 0 0 1 1
2160 BulgarianF2 0 0 2 1
2148 BulgarianH2 0 0 2 1
2317 Est358 0 0 1 1
2305 Est372 0 0 1 1
2282 Est375 0 0 1 1
2295 Est377 0 0 1 1
2283 Est380 0 0 1 1
2294 Est391 0 0 1 1
2328 Est400 0 0 1 1
2106 HungarianC5 0 0 1 1
2161 HungarianD1 0 0 1 1
2117 HungarianE5 0 0 1 1
2118 HungarianH3 0 0 1 1
2128 hungary15 0 0 1 1
2149 hungary2 0 0 1 1
2139 hungary20 0 0 1 1
2096 hungary6 0 0 1 1
2107 hungary7 0 0 1 1
2140 lithuania1 0 0 1 1
2108 lithuania10 0 0 1 1
2150 lithuania2 0 0 1 1
2097 lithuania3 0 0 1 1
2141 lithuania8 0 0 1 1
2119 lithuania9 0 0 1 1
2129 LithuanianA1 0 0 1 1
2162 LithuanianD1 0 0 2 1
2130 LithuanianE2 0 0 2 1
2087 LithuanianF1 0 0 2 1
2319 Mordovians1 0 0 1 1
2318 Mordovians17 0 0 1 1
2329 Mordovians22 0 0 1 1
2284 Mordovians28 0 0 1 1
2296 Mordovians30 0 0 1 1
2307 Mordovians31 0 0 1 1
2341 Mordovians32 0 0 1 1
2306 Mordovians4 0 0 1 1
2353 Mordovians5 0 0 1 1
752 UKR-1283 0 0 1 1
753 UKR-1291 0 0 1 1
754 UKR-1292 0 0 1 1
755 UKR-1377 0 0 1 1
756 UKR-1399 0 0 1 1
757 UKR-1903 0 0 1 1
758 UKR-1909 0 0 1 1
759 UKR-1913 0 0 1 1
760 UKR-1951 0 0 1 1
761 UKR-1978 0 0 1 1
762 UKR-1992 0 0 1 1
763 UKR-2021 0 0 1 1
2281 UkrBel620 0 0 1 1
2293 UkrBel622 0 0 1 1
2304 UkrBel733 0 0 2 1
2316 UkrBel736 0 0 2 1

They are from the file v50.0_HO_public. If you have this HO file in plink format, you can run the command directly after pasting this sample list into keep.txt:

plink --bfile v50.0_HO_public --keep keep.txt --matrix
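
If you would rather not paste the full six-column lines, a short sketch like the one below reduces .fam-style text to the first two columns (family ID and individual ID). The helper name keep_lines is made up, and only three of the listed samples are used here as an example.

```python
def keep_lines(fam_text):
    """Extract 'FID IID' pairs from .fam-formatted text, skipping blank or short lines."""
    pairs = []
    for line in fam_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            pairs.append(f"{fields[0]} {fields[1]}")
    return pairs


# Three of the samples from the list above, in .fam format.
fam_text = """2105 BulgarianA1 0 0 1 1
2116 BulgarianB1 0 0 1 1

2140 lithuania1 0 0 1 1
"""

for pair in keep_lines(fam_text):
    print(pair)
```

Redirecting the printed output to keep.txt gives a two-column sample list.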
#29
(11-03-2023, 04:14 AM)Kale Wrote: That they plot together on G25 does not mean they have unique drift. It just means they behave similarly in relation to the other samples that compose the PCA.

They can be far from each other, but they should still be closer to each other than to Lithuanians and Estonians. That is not what I observe. The only thing that slightly "improves" the distances for the Bulgarians is setting the --maf parameter (https://zzz.bwh.harvard.edu/plink/thresh.shtml#maf) very high, between 0.1 and 0.3, for example:

plink --bfile myfile --maf 0.2 --matrix

But I'm not sure this is the right approach.
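
For reference, a rough sketch of what a minor allele frequency filter does, on genotypes coded as 0, 1 or 2 copies of one allele with None for missing calls. The function names and the toy sites are hypothetical, and this is only an approximation of the idea behind --maf, not PLINK's own code.

```python
def minor_allele_frequency(site):
    """MAF at one SNP from 0/1/2 genotype codes, ignoring missing (None) calls."""
    called = [g for g in site if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def filter_by_maf(sites, threshold):
    """Keep only the SNPs whose MAF is at least the threshold."""
    return [site for site in sites if minor_allele_frequency(site) >= threshold]


# Hypothetical genotypes: three SNPs (rows) typed in four samples (columns).
sites = [
    [0, 0, 0, 1],  # rare variant
    [1, 1, 2, 0],  # common variant
    [2, 2, 2, 2],  # monomorphic in this subset
]

print([minor_allele_frequency(site) for site in sites])
print(filter_by_maf(sites, 0.2))
```

With a threshold of 0.2 only the common variant survives, so a high --maf setting means the matrix is computed from a much smaller and different set of SNPs.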
#30
[Image: Bulgarian-HO.png]

Here are the projected individuals.
I also added Macedonian IA.
The Bulgarian HO samples seem to be very homogeneous.
However, for the moment I can't provide any explanation for your results.