What is the penalty and how do I use it?
#1
I am just learning how to use R+nMonte, and I found things like "pen=0" or "pen=0.001" in forums. I would like to know what the penalty is, whether my results would change, which penalty value is good and how I use it.
23andMe: 55.5% European, 33.7% Indigenous American, 4.2% WANA, 3.4% SSA and 3.2% Unassigned
AncestryDNA: 57.27% Europe, 35.81% Indigenous Americas-Mexico, 3.46% MENA and 3.45% SSA
FamilyTreeDNA: 56.9% Europe, 33% Americas, 8.2% MENA, <2% Horn of Africa and <1% Eastern India
Living DNA: 63.3% West Iberia, 34.3% Native Americas and 2.3% Yorubaland
MyHeritage DNA: 87.4% Indigenous in Mexico and 12.6% Spanish, Catalan & Basque 

[Image: IbEDd4z.png]
#2
In nMonte3 it is advisable to use pen=0.001 to eliminate noisy percentages in the results. It is similar to ADC 0.25x in Vahaduo, but in my opinion much more precise. With pen=0.001, the results can vary from one run to the next, so you must choose the one that best suits your ancestry.
23andMe: 98.8% Spanish & Portuguese, 0.3% Ashkenazi Jewish, 0.9% Trace Ancestry (0.4% Coptic Egyptian, 0.3% Nigerian, 0.2% Bengali & Northeast Indian).

“The truth doesn’t become more authentic because the whole world agrees with it”. RaMBaM

-M. De la Torre, converse of jew-
-M. Rivera López, converse of jew-
-D. de Castilla, converse of moor-
-M. de Navas, converse of moor-
#3
(02-01-2024, 09:45 PM)Rober_tce Wrote: In nMonte3 it is advisable to use pen=0.001 to eliminate noisy percentages in the results. It is similar to ADC 0.25x in Vahaduo, but in my opinion much more precise. With pen=0.001, the results can vary from one run to the next, so you must choose the one that best suits your ancestry.

"2. The nMonte results are overfitted models.

In machine learning, procedures directed at improving this situation are called 'regularization'.

In nMonte3 the results are regularized by some degree of penalizing reference samples with great distances to the target. In other words, admixtures with small distances to the target are preferred to admixtures with large distances to the target. By default the degree of penalization is set at 0.001."

https://www.dropbox.com/sh/1iaggxyc2alaf...Monte3.doc
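
To make the quoted idea concrete, here is a minimal Python sketch of a penalized fit score. It is not the nMonte3 source (which is an R script); the function names and the exact formula (squared distance of the mixture average to the target, plus pen times the average squared distance of each chosen sample to the target) are assumptions made only for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mixture_mean(samples):
    """Coordinate-wise average of the samples chosen for the mixture."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def penalized_score(samples, target, pen=0.001):
    """Hypothetical score: fit of the averaged mixture to the target,
    plus pen times the mean squared distance of each sample to the target.
    With pen=0 only the averaged fit counts."""
    fit = sq_dist(mixture_mean(samples), target)
    spread = sum(sq_dist(s, target) for s in samples) / len(samples)
    return fit + pen * spread


target = [0.0, 0.0]
near = [[0.1, 0.0], [-0.1, 0.0]]   # two sources close to the target
far = [[1.0, 0.0], [-1.0, 0.0]]    # two sources far away, same average

# Without a penalty the two mixtures tie; with a penalty the near one wins.
print(penalized_score(near, target, pen=0), penalized_score(far, target, pen=0))
print(penalized_score(near, target), penalized_score(far, target))
```

In this toy case both mixtures average exactly onto the target, so pen=0 cannot tell them apart, while any positive pen prefers the mixture built from nearby samples. That is the sense in which "admixtures with small distances to the target are preferred".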

About the point above: does that mean it is better to use pen=0.001 when the samples have a big genetic distance to the target (myself, for example, with my Spanish, Native American and SSA ancestries)? Or is it better to use pen=0.001 if you have ancestry from genetically very close populations (Spain and Northern Italy, for example)?

I have been testing and I'm not sure which one is better. With pen=0.001, some components match the papers/studies, but my Native American percentage is lower, so I have to use Pima instead of Nahua to raise it. With pen=0, some components do not match the papers/studies, but my Native American percentage increases so much that I have to use Nahua rather than Pima to balance the percentages and get a reasonable figure (and I am a descendant of Nahuas).

Here are some examples of what I'm saying:

Code:
[1] "penalty= 0.001"
[1] "Ncycles= 1000"
[1] "distance%=2.1766"

        Jalisciense

Roman,38
Native American Nahua,30
Iberian IA,26.4
North African,4
West African,1.6

[1] "penalty= 0.001"
[1] "Ncycles= 1000"
[1] "distance%=2.0138"

         Jalisciense

Native American Pima,32.2
Roman,31.8
Iberian IA,29
North African,5.4
West African,1.6

                                     Vs

[1] "penalty= 0"
[1] "Ncycles= 1000"
[1] "distance%=1.466"

        Jalisciense

Iberian IA,38.2
Native American Nahua,37.2
Roman,15.4
North African,6.4
West African,2.8

[1] "penalty= 0"
[1] "Ncycles= 1000"
[1] "distance%=1.3851"

         Jalisciense

Native American Pima,37.6
Iberian,36.8
Roman,13.8
North African,8.6
West African,3.2

So what one lacks, the other has...

Btw, in the data/source, can I use population averages, or should I use individuals? My results differ between the two, and I ask because I read this:

"THE REFERENCE DATA SHOULD BE RAW DATA, NOT AVERAGES OR MEDIANS."

https://www.dropbox.com/sh/1iaggxyc2alaf...Monte3.doc
#4
"nMonte3 reverses the workflow: it feeds unaveraged data into the nMonte algorithm and only afterwards aggregates the results by population"

How does one utilize this function?
I assume you have samples listed as Pop1_Sample1, Pop1_Sample2, Pop1_Sample3, etc. and you'll get (ex.) 5% Sample1, 10% sample2, 15% sample3, but the result will show 30% Pop1 (not listing the samples?) Is there a specific way it has to be formatted?
#5
I've never used the R nMonte but I would assume Vahaduo is based on that and works the same way.
ADC, which should be the equivalent of the penalty, adds the distance from the target to each source coordinate as an extra column, in G25's case as if it were a 26th component. I personally don't like it, as it will give you worse results when the sample is actually admixed with distant populations, and I don't see how it reduces overfitting.
Aggregate works with the unaveraged datasheet where you have samples in Pop:Sample format.
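
As a rough illustration of the ADC description above, the following Python sketch appends a distance column to every source. It is not Vahaduo's code: the function name, giving the target 0 in the new column, and treating the multiplier (e.g. 0.25x) as a simple scale factor on that column are all assumptions.

```python
import math


def add_distance_column(target, sources, factor=0.25):
    """Return (target_ext, sources_ext) with one extra coordinate.

    Each source gets factor * (its Euclidean distance to the target) as an
    additional component; the target gets 0.0 there (assumed), so distant
    sources sit further away in the extended space."""
    target_ext = list(target) + [0.0]
    sources_ext = {}
    for name, coords in sources.items():
        d = math.dist(target, coords)
        sources_ext[name] = list(coords) + [factor * d]
    return target_ext, sources_ext


target_ext, sources_ext = add_distance_column(
    [0.0, 0.0],
    {"near": [0.3, 0.4], "far": [3.0, 4.0]},
)
print(target_ext)
print(sources_ext)
```

Under these assumptions a mixture of far sources can no longer average exactly onto the target, because the extra column is always positive for them. That also shows the drawback mentioned above: a sample that really is a mix of distant populations gets pushed toward closer sources.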
#6
(02-06-2024, 07:03 PM)Kale Wrote: "nMonte3 reverses the workflow: it feeds unaveraged data into the nMonte algorithm and only afterwards aggregates the results by population"

How does one utilize this function?
I assume you have samples listed as Pop1_Sample1, Pop1_Sample2, Pop1_Sample3, etc. and you'll get (ex.) 5% Sample1, 10% sample2, 15% sample3, but the result will show 30% Pop1 (not listing the samples?) Is there a specific way it has to be formatted?

All samples that are labeled with the same name will be combined and added to the result (for example: if you have 3 Nahuas in the data, Nahua:1, Nahua:2 and Nahua:3, they will be just "Nahua" at the end of the run).

And as far as I know it is automatic and there is no way to turn it off.
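
The aggregation described in this exchange can be sketched in a few lines of Python. This is only an illustration of the idea, not nMonte3's own code; the function name and the assumption that the population is everything before the first ":" in the label are mine.

```python
from collections import defaultdict


def aggregate_by_population(sample_pcts, sep=":"):
    """Sum per-sample percentages into per-population totals.

    Labels are assumed to look like 'Pop:Sample'; everything before the
    first separator is treated as the population name."""
    totals = defaultdict(float)
    for label, pct in sample_pcts.items():
        pop = label.split(sep, 1)[0]
        totals[pop] += pct
    return dict(totals)


per_sample = {"Nahua:1": 5.0, "Nahua:2": 10.0, "Nahua:3": 15.0, "Iberian:1": 70.0}
print(aggregate_by_population(per_sample))
```

With the figures from the question (5%, 10% and 15% for three samples of one population), the population total comes out as 30%, and the individual samples are no longer listed.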
#7
So thought I'd share a few thoughts from the author of nMonte.... Dr. Mark Huijbregts.

#*******************************************************
# R script nMonte.R
# Find mixture composition which minimizes
# the averaged genetic distance to target.
# Penalizing of distant admixtures.
# Activate with: source('nMonte3_temp.R')
# Use: getMonte(datafile, targetfile, pen=0.01);
# both files should be comma-separated csv.
# Utilities:
# subset_data(): Collecting rows from datasheet
# aggr_pops(): Average populations
# tab2comma(): tab-separated to comma-separated
# last modified: headStrings
# v10.4 Huijbregts 8 jan 2018
#*******************************************************


Blogger huijbregts said...
I was surprised to find Matt yesterday explaining the place of nMonte in genetic history. I was not aware of this all.
Hopefully David will permit me a few supplementary remarks.
I wrote nMonte as an experiment because I was curious whether a simple random walk algorithm could identify relevant groups.
I was pleasantly surprised when it did. Actually it identified a single set of non-unique samples; next it used the naming labels within this set to infer distances to well known predefined groups.
I am still surprised that the simple trick worked so well.
By then (about 2015) I knew next to nothing about mathematical genetics.
By now I better understand the problems with this kind of algorithm.
In the first place there is the problem of overfitting. The ancients division of Global25 contains some five thousand samples. This permits no more than 12 binary choices (because 2^12 = 4096).
And if you select a subset, the number is still smaller. So if you use 25 dimensions, you are heavily overfitting.
Many guys have tried to repair this by what they call "scaling" the data, but which is really an anti-scaling and does not solve the problem of too many dimensions. It is much better to truncate the dimensions at a much lower value than 25.
In nMonte3 I have limited the damage of overfitting by applying a penalty on larger distances. Unfortunately I also offered the opportunity to switch off this penalizing by using the option pen=0. In spite of my repeated warnings not to use this if you do not perfectly understand what you are doing, many users interpreted this as an opportunity to prove their expert status :(
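
The arithmetic behind the "12 binary choices" remark can be checked with a short Python sketch. The helper name is hypothetical; it simply returns the largest k with 2^k not exceeding the number of samples.

```python
def max_binary_choices(n_samples):
    """Largest k such that 2**k <= n_samples (n_samples must be >= 1)."""
    return n_samples.bit_length() - 1


k = max_binary_choices(5000)
print(k, 2 ** k, 2 ** (k + 1))
```

For roughly five thousand samples this gives 12, since 2^12 = 4096 fits and 2^13 = 8192 does not, which is the basis for the claim that 25 dimensions is far more than the data can support.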

If I were younger, I might try to improve nMonte by using a Bayesian algorithm.
As it is, I incidentally use nMonte as a quick and dirty simple method.
But mostly I am content with visualizing the data with algorithms like UMAP, which I think is underestimated, especially at this forum.

January 23, 2022 at 2:41 AM


Blogger Matt said...
@huijbregts; Scaling just uses the eigenvalue data provided by the PCA to restore the property that distances computed from the data represent the underlying frequency differentiation captured by the PCA.

Unscaled data simply "throws away" this eigenvalue data, and distances in the data matrix do not represent the underlying frequency differentiation captured by the PCA.
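
A small Python sketch may help show what scaling changes. It assumes that "scaled" coordinates are the unscaled ones multiplied by the square root of each component's eigenvalue; that convention, the function name and the toy eigenvalues are assumptions for illustration, not something stated in the comment above.

```python
import math


def scale_by_eigenvalues(coords, eigenvalues):
    """Multiply each PCA coordinate by sqrt(eigenvalue) of its component
    (assumed convention), so high-variance components weigh more."""
    return [c * math.sqrt(ev) for c, ev in zip(coords, eigenvalues)]


eigenvalues = [4.0, 1.0]          # toy values: PC1 explains more than PC2
origin = [0.0, 0.0]
differs_on_pc1 = [1.0, 0.0]
differs_on_pc2 = [0.0, 1.0]

# Unscaled: both points are the same distance from the origin.
print(math.dist(origin, differs_on_pc1), math.dist(origin, differs_on_pc2))

# Scaled: the difference on the stronger component counts for more.
print(
    math.dist(origin, scale_by_eigenvalues(differs_on_pc1, eigenvalues)),
    math.dist(origin, scale_by_eigenvalues(differs_on_pc2, eigenvalues)),
)
```

In the unscaled toy data a unit difference on any component counts the same, whereas after scaling a difference on the first component counts twice as much, which is the sense in which scaled distances keep the eigenvalue information.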

Prive, as I understand it, "scales" his data in his preprints, where he relates distance across PCA dimensions to Fst, shows how it can be used to define subsets of participants in UK Biobank and other datasets, and describes how polygenic risk score (PRS) accuracy falls off with increasing distance. Peter "scales" PCA data in the preprint where he relates PCA distances to f-stats.

If either had used unscaled data, there would be no or far reduced relationship of either of these well established measures of genetic differentiation to PCA distances.

If an academic consensus emerges on the relationship of distances on PCA to mainstream established population genetics measures of differentiation, it will not be based on unscaled distance.

If you have an interest in understanding these topics, I would invite you to read these preprints.

January 23, 2022 at 3:20 AM
#8
(02-14-2024, 10:13 PM)kolompar Wrote: I've never used the R nMonte but I would assume Vahaduo is based on that and works the same way.
ADC, which should be the equivalent of the penalty, adds the distance from the target to each source coordinate as an extra column, in G25's case as if it were a 26th component. I personally don't like it, as it will give you worse results when the sample is actually admixed with distant populations, and I don't see how it reduces overfitting.
Aggregate works with the unaveraged datasheet where you have samples in Pop:Sample format.

Yeah, both are Monte Carlo algorithms.

I have been using it since then, and R+nMonte3 with the standard penalty gives me better results than Vahaduo (results that are even closer to scientific studies, at least for Latin Americans and Iberians).
#9
(02-14-2024, 10:21 PM)AimSmall Wrote: So thought I'd share a few thoughts from the author of nMonte.... Dr. Mark Huijbregts.


Thanks! But when Huijbregts wrote "larger distances", does he mean distances to the target, or to other populations in the data/source?

So I should stick with the standard penalty and not switch it off, got it.

About unscaled vs scaled: at least for me, my family, and the Latin American and Iberian friends I have modelled, the unscaled data in R+nMonte3 with the standard penalty more closely resembles the papers/studies that I have seen.