Dataset merging help
#16
(05-20-2024, 05:50 PM)Aramu Wrote:
(05-08-2024, 09:53 AM)Genetics189291 Wrote: It's okay, I seem to have found a way to merge it directly, which is much easier than converting it back and forth and takes a fraction of the time. I'll go through the steps when I get back home.

When you convert from eigenstrat to packedped, the sample group names are lost.

What is your direct way? Could you write it up, please?

This is a known issue.
It is caused by some special characters used in group names.
Example: ê, ë, ä, etc.
One fix: manually edit the group names and replace all special characters with regular characters.
Another way: if you have your files already converted, replace the column. Read the group column from the .ind file and paste it into the .fam file. (You can also do this in Excel.)
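
A minimal sketch of the column replacement described above, for anyone who prefers a script to Excel. It assumes the .ind file has three whitespace-separated columns (sample ID, sex, group) and the .fam file has six, with the sample ID in column 2. Which .fam column your downstream tool reads the group from is an assumption here: the sketch defaults to the first column (family ID), and the hypothetical `target_col` parameter lets you pick another one. Sample and group names in the example are made up.

```python
def copy_groups(ind_lines, fam_lines, target_col=0):
    """Return .fam lines with one column replaced by the group label from the .ind lines.

    target_col is a 0-based .fam column index (assumption: 0, the family ID).
    """
    # Map sample ID -> group label (third column of the .ind file).
    groups = {}
    for line in ind_lines:
        fields = line.split()
        if len(fields) >= 3:
            groups[fields[0]] = fields[2]

    out = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        sample_id = fields[1]  # column 2 of .fam holds the individual ID
        # Keep the existing value when the sample is missing from the .ind file.
        fields[target_col] = groups.get(sample_id, fields[target_col])
        out.append(" ".join(fields))
    return out


if __name__ == "__main__":
    ind = ["S1 M GroupA", "S2 F GroupB"]          # hypothetical .ind rows
    fam = ["1 S1 0 0 1 1", "2 S2 0 0 2 1"]        # hypothetical .fam rows
    for row in copy_groups(ind, fam):
        print(row)
```

Matching rows by sample ID rather than by line order avoids mislabelling samples if the two files were ever sorted differently.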
#17
I'm having an issue with file sizes right now. I recently tried to add the Zlaty Kun sample to the 1240k dataset. The geno file of the Zlaty Kun sample was around 2.4MB and the 1240k geno file is roughly 5GB. Can someone explain why, after using Eigenstrat to merge the two files, the resulting file ballooned to nearly 20GB, almost 4x the size? I wouldn't mind as much if it didn't also affect performance: running admixtools methods now takes a bit longer to complete, and the program constantly has to read the HDD on every run.
#18
Error while merging two PLINK files:

Code:
4016 MB RAM detected; reserving 2008 MB for main workspace.
44 people loaded from Pathak_v1.0.fam.
1 person to be merged from SDLG7.fam.
Of these, 1 is new, while 0 are present in the base dataset.
Warning: Multiple chromosomes seen for variant 'rs748651'.
Warning: Multiple chromosomes seen for variant 'rs3112911'.
Warning: Multiple chromosomes seen for variant 'rs2247450'.
Warning: Multiple chromosomes seen for variant 'rs1072841'.
Warning: Multiple chromosomes seen for variant 'rs34472859'.
Warning: Multiple chromosomes seen for variant 'rs1987475'.
Warning: Multiple chromosomes seen for variant 'rs11103281'.
Warning: Multiple chromosomes seen for variant 'rs1344098'.
Warning: Multiple chromosomes seen for variant 'rs2097266'.
Warning: Multiple positions seen for variant 'rs8051412'.
Warning: Multiple positions seen for variant 'rs35726347'.
Warning: Multiple chromosomes seen for variant 'rs814740'.
Warning: Multiple chromosomes seen for variant 'rs601338'.
Warning: Multiple chromosomes seen for variant 'rs672163'.
Warning: Multiple chromosomes seen for variant 'rs4129148'.
Warning: Multiple chromosomes seen for variant 'rs11480'.
Warning: Multiple chromosomes seen for variant 'rs14115'.
Warning: Multiple chromosomes seen for variant 'rs306875'.
Warning: Multiple chromosomes seen for variant 'rs707689'.
Warning: Multiple chromosomes seen for variant 'rs1736462'.
Warning: Multiple chromosomes seen for variant 'rs6642242'.
Warning: Multiple chromosomes seen for variant 'rs4893102'.
Warning: Multiple chromosomes seen for variant 'rs909439'.
Warning: Multiple chromosomes seen for variant 'rs5983831'.
Warning: Multiple chromosomes seen for variant 'rs5940653'.
Warning: Multiple chromosomes seen for variant 'rs5983854'.
Warning: Multiple chromosomes seen for variant 'rs3093457'.
Warning: Multiple chromosomes seen for variant 'rs731477'.
Warning: Multiple chromosomes seen for variant 'rs731478'.
716503 markers loaded from Pathak_v1.0.bim.
1233013 markers to be merged from SDLG7.bim.
Of these, 822412 are new, while 410601 are present in the base dataset.
Error: 73425 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
  result.missnp.
  (Warning: if the subsequent merge seems to work, strand errors involving SNPs
  with A/T or C/G alleles probably remain in your data.  If LD between nearby
  SNPs is high, --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
  that subset of the data to VCF (via e.g. '--recode vcf'), merging with
  another tool/script, and then importing the result; PLINK is not yet suited
  to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
#19
(06-06-2024, 01:48 PM)ModusOperandi Wrote: I'm having an issue with file sizes right now. I recently tried to add the Zlaty Kun sample to the 1240k dataset. The geno file of the Zlaty Kun sample was around 2.4MB and the 1240k geno file is roughly 5GB. Can someone explain why, after using Eigenstrat to merge the two files, the resulting file ballooned to nearly 20GB, almost 4x the size? I wouldn't mind as much if it didn't also affect performance: running admixtools methods now takes a bit longer to complete, and the program constantly has to read the HDD on every run.

You will need to "clean" the new data, to make sure the new SNPs match the SNPs in the dataset that you merge into.

1. First, create a file with the list of all SNPs in your current dataset (v52.2_1240K_public):
plink --allow-no-sex --bfile v52.2_1240K_public --write-snplist --out v52.2_1240K_clean

2. Before merging, filter the file to be added so that it keeps only the SNPs in the v52.2_1240K_clean list:
plink --bfile ZlatyKun_example --extract v52.2_1240K_clean.snplist --make-bed --out ZlatyKun_example_clean

Next, do the merge using the clean file (ZlatyKun_example_clean).

This way the size should not increase significantly.
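
Before running the two commands, it can help to see how many SNPs the filter would actually keep. The following is only a sketch of the same name-based matching done on the .bim files directly, assuming the standard six-column .bim layout with the variant ID in column 2. The function names are hypothetical and the example rows are made up.

```python
def snp_ids(bim_lines):
    """Collect variant IDs (column 2) from the lines of a PLINK .bim file."""
    return {line.split()[1] for line in bim_lines if line.strip()}


def split_by_reference(ref_bim_lines, new_bim_lines):
    """Split the new file's variant IDs into those present in the reference and those absent."""
    ref = snp_ids(ref_bim_lines)
    kept, dropped = [], []
    for line in new_bim_lines:
        if not line.strip():
            continue
        snp = line.split()[1]
        (kept if snp in ref else dropped).append(snp)
    return kept, dropped


if __name__ == "__main__":
    ref_bim = ["1 rs1 0 1000 A G", "1 rs2 0 2000 C T"]   # made-up reference rows
    new_bim = ["1 rs2 0 2000 C T", "1 rs9 0 9000 A C"]   # made-up rows to be merged
    kept, dropped = split_by_reference(ref_bim, new_bim)
    print(len(kept), "kept;", len(dropped), "dropped")
```

Note that this matches by SNP name only, like the --extract step, so it says nothing about whether positions or alleles agree.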
#20
(06-06-2024, 03:31 PM)Gabru77 Wrote: Error while merging two Plink files:-


I've seen this kind of error when merging old datasets. The main reason for it is that the 2 files do not follow the same specification.
Although both are PLINK files, they are mapped to different genome builds, for example hg19 and hg18.
Some older files were mapped using the hg18 build.
The SNP names can still be the same, but the SNP positions differ between hg19 and hg18.

This is why the log reports warnings such as: Multiple chromosomes seen for variant 'rs748651'. That means the same SNP was reported at 2 different locations in the two files (on different chromosomes here, or at different positions for the "Multiple positions" warnings).
One solution would be to convert the older file from hg18 to hg19.

This is called: Lift Genome Annotations
https://genome.ucsc.edu/cgi-bin/hgLiftOver
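
The liftOver page accepts BED input, so the positions in the older .bim file first have to be written out in that form. Below is a rough sketch of that conversion, assuming a standard six-column .bim file and PLINK's numeric chromosome codes (23 = X, 24 = Y, 26 = MT). BED coordinates are 0-based with an exclusive end, so a SNP at position p becomes the interval p-1 to p. The function name is hypothetical, and writing the lifted positions back into the .bim file is not covered here.

```python
# PLINK numeric codes for non-autosomes, mapped to UCSC chromosome names.
PLINK_TO_UCSC = {"23": "chrX", "24": "chrY", "26": "chrM"}


def bim_to_bed(bim_lines):
    """Convert .bim rows into 4-column BED lines (chrom, start, end, name)."""
    bed = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, snp_id, pos = fields[0], fields[1], int(fields[3])
        if chrom in PLINK_TO_UCSC:
            name = PLINK_TO_UCSC[chrom]
        elif chrom.isdigit() and 1 <= int(chrom) <= 22:
            name = "chr" + chrom
        else:
            continue  # skip codes such as 0 (unplaced) or 25 (XY region)
        if pos < 1:
            continue  # no usable position to lift
        # BED is 0-based, half-open: a single base at pos spans [pos-1, pos).
        bed.append(f"{name}\t{pos - 1}\t{pos}\t{snp_id}")
    return bed


if __name__ == "__main__":
    for row in bim_to_bed(["1 rs3094315 0 752566 G A"]):
        print(row)
```

Keeping the SNP name in the fourth BED column makes it possible to match the lifted coordinates back to the original variants afterwards.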
#21
For the error:
Error: 73425 variants with 3+ alleles present.

-> If the number of 3+ allele variants is small, I just ignore them.

Use the same kind of command as above: plink --bfile ZlatyKun_example --extract v52.2_1240K_clean.snplist --make-bed --out ZlatyKun_example_clean
but replace the "--extract" option with "--exclude":
--exclude result.missnp (this excludes only the conflicting 3+ allele SNPs!)

Excluding the 3+ allele variants gives you lower coverage, but no more errors about these conflicting SNPs.
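
To decide between flipping and excluding, it may help to look at which shared SNPs actually carry a third allele. The sketch below is an approximation that compares only the allele columns (5 and 6) of the two .bim files, which is an assumption about where the conflict shows up, not PLINK's own merge logic. It separates conflicts that disappear when the new file's alleles are complemented (likely strand flips) from those that do not. As the PLINK message warns, A/T and C/G SNPs cannot be checked this way. Function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def read_alleles(bim_lines):
    """Map variant ID -> set of its alleles, ignoring the missing code '0'."""
    alleles = {}
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 6:
            alleles[fields[1]] = {a.upper() for a in fields[4:6] if a != "0"}
    return alleles


def classify_conflicts(base_bim, new_bim):
    """Return (flip_fixes, unresolved) lists of shared variant IDs with 3+ alleles."""
    base, new = read_alleles(base_bim), read_alleles(new_bim)
    flip_fixes, unresolved = [], []
    for snp in sorted(base.keys() & new.keys()):
        if len(base[snp] | new[snp]) <= 2:
            continue  # alleles are compatible, no conflict
        flipped = {COMPLEMENT.get(a, a) for a in new[snp]}
        if len(base[snp] | flipped) <= 2:
            flip_fixes.append(snp)   # conflict disappears on the other strand
        else:
            unresolved.append(snp)   # still 3+ alleles after flipping
    return flip_fixes, unresolved


if __name__ == "__main__":
    base = ["1 rs1 0 100 A G", "1 rs2 0 200 A C"]   # made-up rows
    new = ["1 rs1 0 100 T C", "1 rs2 0 200 A G"]    # made-up rows
    print(classify_conflicts(base, new))
```

If most conflicts fall in the first list, the --flip route from the PLINK message is probably worth trying before excluding anything.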
#22
(06-06-2024, 06:15 PM)TanTin Wrote: You will need to "clean" the new data, to make sure the new SNPs match the SNPs in the dataset that you merge into. [...]

I did all the above steps and the file size is still roughly the same
#23
(06-11-2024, 04:48 PM)ModusOperandi Wrote: I did all the above steps and the file size is still roughly the same

There are 2 .bim files for the 2 datasets that you are merging.
Look at the list of SNPs in your .bim files. The SNP name and SNP position should match between the 2 files.
Example: SNP rs3094315 is at position 752566.

If your datasets are different versions, the SNP name may be the same but the SNP position could differ between the two files.
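
A quick way to run this check over the whole file rather than by eye, sketched under the assumption that both .bim files use the standard six-column layout (chromosome, SNP name, cM, position, allele 1, allele 2). The function names are hypothetical, and the mismatching position in the example is made up.

```python
def read_positions(bim_lines):
    """Map SNP name -> (chromosome, position) from the lines of a .bim file."""
    positions = {}
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 4:
            positions[fields[1]] = (fields[0], int(fields[3]))
    return positions


def position_mismatches(bim_a, bim_b):
    """List shared SNPs whose chromosome or position differs between the two files."""
    a, b = read_positions(bim_a), read_positions(bim_b)
    return [
        (snp, a[snp], b[snp])
        for snp in sorted(a.keys() & b.keys())
        if a[snp] != b[snp]
    ]


if __name__ == "__main__":
    bim_a = ["1 rs3094315 0 752566 G A"]
    bim_b = ["1 rs3094315 0 700000 G A"]  # made-up position for illustration
    for snp, loc_a, loc_b in position_mismatches(bim_a, bim_b):
        print(snp, loc_a, loc_b)
```

A long list of mismatches across many chromosomes would point to the two datasets being on different versions rather than to a handful of bad markers.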
#24
The problem could be the output format: you likely want packedancestrymap, not eigenstrat or ancestrymap.
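
A rough size estimate is consistent with this. An eigenstrat .geno file is plain text with one character per genotype, while packedancestrymap stores each genotype in 2 bits, so the same data is about 4x larger unpacked. The sketch below uses the 1,233,013 markers of the 1240k panel and a sample count of 16,000, which is purely an assumption chosen for illustration. The packed estimate ignores the file header, and the function names are hypothetical.

```python
import math


def eigenstrat_geno_bytes(n_snps, n_samples):
    """Text .geno: one character per genotype plus a newline per SNP row."""
    return n_snps * (n_samples + 1)


def packed_geno_bytes(n_snps, n_samples):
    """Packed .geno: 2 bits per genotype, each SNP row padded to whole bytes (header ignored)."""
    return n_snps * math.ceil(n_samples / 4)


if __name__ == "__main__":
    n_snps = 1233013      # marker count of the 1240k panel
    n_samples = 16000     # assumption: illustrative sample count only
    text_size = eigenstrat_geno_bytes(n_snps, n_samples)
    packed_size = packed_geno_bytes(n_snps, n_samples)
    print(f"eigenstrat: {text_size / 1e9:.1f} GB")
    print(f"packed:     {packed_size / 1e9:.1f} GB")
    print(f"ratio:      {text_size / packed_size:.1f}x")
```

Under these assumptions the packed file comes out near 5GB and the text file near 20GB, which matches the sizes reported above and suggests the merge wrote unpacked output.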