
**Anglesqueville** · 04-28-2024, 02:48 PM

Do any of you have experience with the ancIBD program? If so, here is my problem. I installed the software on my Linux machine without a problem, and I ran the example files provided without any problem either. As a first exercise, I extracted a handful of individuals (with vcftools) from Allentoft's vcfs (imputed with GLIMPSE, as required). The resulting vcfs are of normal size. But when I run ancIBD, the vcfs reduced to 1240K, which should logically be even smaller, are huge (more than 30 GB for chromosome 1!). This is of course abnormal. Where is the problem?

**Anglesqueville** · 05-05-2024, 07:11 AM

Okay, I've made progress. Having received no response from Ringbauer, I embarked on an enterprise that seemed a bit hopeless. I reduced the Allentoft vcfs to the vital minimum (the GT and GP fields resulting from the imputation). With bcftools the operation takes hours of machine time, so do not embark on this if you do not have a reasonably powerful computer. This step alone already halves the size of the vcfs. When I then extract the individuals that interest me from these vcfs, I obtain hdf5 files of a size comparable to those obtained with the example data provided by the authors. So, that's it, the data preparation program works on Allentoft. It remains to verify that the IBD detection program accepts my hdf5s and, if so, that the results are consistent. I will do this during the day. If in the end everything works, it will mean that ancIBD is within the reach of us poor amateurs, like qpAdm, qpWave, etc. in their time. Champagne!

**Anglesqueville** · 05-05-2024, 04:24 PM

Well, for once my optimism was premature. Everything goes well on the first three chromosomes and then goes wrong. I don't have the slightest idea what the problem is. No champagne yet (I don't like champagne anyway).

**Anglesqueville** · 05-14-2024, 02:17 PM

And finally .... Champagne! It works!

Cejo · 05-14-2024, 02:58 PM

(05-14-2024, 02:17 PM)Anglesqueville Wrote: And finally .... Champagne! It works!

Would you mind sharing a general summary of the workflow that was successful for you?

**Anglesqueville** · 05-14-2024, 03:24 PM

Cejo: It's complicated and I still need to do some checking. Right now I am looking at whether it is practically feasible to run ancIBD on the complete imputed data from Allentoft, or whether it is really necessary to extract individuals from the vcfs each time. When I have done this I will post a complete procedure. In essence, you must keep from the vcfs only the INFO annotations relating to allele frequencies (RAF and AF), and keep all the FORMAT annotations. All this is done with bcftools annotate. Then update the headers with bcftools reheader. It is obviously also necessary to adapt the names of the SNPs, since Allentoft does not use rsids. Wait a bit for the details.

**Anglesqueville** · 05-14-2024, 04:45 PM

The hdf5 file for chromosome 1 and all imputed individuals from Allentoft is 780 MB, so it's perfectly usable. Once the HDF5s have been produced, the hard part is over. This conversion on such large data takes a lot of time and a lot of machine resources, but when it's done, it's done once and for all. After that, the extraction of IBDs for a dozen individuals is almost instantaneous. It's been a long time since a computer job gave me this much satisfaction. I will summarize the procedure and post it in the next few hours (or days). If nothing comes, it's because I forgot, so complain.

**Anglesqueville** · (This post was last modified: 05-15-2024, 06:13 AM by Anglesqueville.)

Guide

1) Familiarization with the program
a) Of course all this is done under Linux. For me it is Ubuntu. I don't see why there would be problems with other Linux distributions, but I can't guarantee it.
b) Everything is programmed in Python. My experience with Python is basic, and I had no difficulty. But can someone with no Python experience jump into this? I do not know. I used an old Python 3 without experiencing any version issues. You must have Cython installed.
c) All the doc is at https://ancibd.readthedocs.io/en/latest/index.html. Read all this carefully, familiarizing yourself with the material contained in the "Vignettes". Everything is downloadable. Install the software and its dependencies in a dedicated directory. Once this is done, run the example provided with the software. There is no point in continuing if you cannot apply the programs to the example.

2) Application to data imputed by GLIMPSE.
I used the largest available so far, those from the study by Allentoft et al. 2023:
https://erda.ku.dk/archives/917f1ac64148...chive.html
This data consists of 22 tabix-indexed vcf files (1 per chromosome). There is no point in trying to apply ancIBD directly to these files: it won't work, even on the smallest one.
a) First of all, ancIBD uses a list of 1240K SNPs identified by their rsids (but you already know that if you have reached this stage). This list must be adapted to Allentoft, which codes the SNPs as chrom_position_ancestral-allele_derived-allele. Rewrite the list accordingly (with Unix tools, R, or anything else) and place it in the Map subdirectory. There is no need to update the files in the AFS directory, which only mention positions.
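
The renaming described in a) can be sketched in a few lines of Python. This is only an illustration under assumptions: the row layout (rsid, chromosome, position, two alleles), the underscore separator and the example rows are made up here, so check them against the ID column of your own vcf and against the actual format of the marker list before using anything like this.

```python
def allentoft_snp_id(chrom, pos, first_allele, second_allele, sep="_"):
    """Build an ID of the form chrom_position_allele_allele.

    The separator and the order of the two alleles are assumptions;
    they must match the ID column of the vcf you are working on.
    """
    return sep.join([str(chrom), str(pos), first_allele, second_allele])


def rename_snps(rows, sep="_"):
    """Map each rsid to its position-based ID.

    rows: iterable of (rsid, chrom, pos, first_allele, second_allele).
    """
    return {
        rsid: allentoft_snp_id(chrom, pos, a1, a2, sep)
        for rsid, chrom, pos, a1, a2 in rows
    }


# Made-up rows for illustration only, not real 1240K markers.
example_rows = [
    ("rs_example_1", 1, 1000, "A", "G"),
    ("rs_example_2", 1, 2500, "C", "T"),
]
renamed = rename_snps(example_rows)
for rsid, new_id in renamed.items():
    print(rsid, new_id)
```
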
For what follows you must have bcftools and vcftools installed. I explain everything using chromosome 1 as the example; its file is called 1.neo_impute_ph.vcf.gz
b) Now the serious things begin.
You must first filter the vcfs, in order to keep only the useful annotations. These are the "INFO" annotations RAF and AF, and ALL THE "FORMAT" annotations. Open a terminal in the directory containing the Allentoft data (and obviously bcftools). Here is the command:

Quote:./bcftools annotate -x ^INFO/RAF,INFO/AF 1.neo_impute_ph.vcf.gz --output neo_filtered_chr1.vcf.gz

I'm used to tabix indexing all vcfs, so I did that here too, but I'm not at all sure it's useful. In any case it does not represent a big effort:

Quote:tabix neo_filtered_chr1.vcf.gz
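
Since the same two commands have to be repeated for every chromosome, they can be generated in a loop. The sketch below only builds and prints the command lines; it assumes that the other chromosome files follow the same naming pattern as 1.neo_impute_ph.vcf.gz, which should be checked against the downloaded archive. The function name is just an illustrative choice.

```python
def filter_commands(chrom, bcftools="./bcftools"):
    """Return the annotate and tabix commands for one chromosome.

    The input file name pattern is an assumption extrapolated from
    the chromosome 1 file name.
    """
    src = f"{chrom}.neo_impute_ph.vcf.gz"
    dst = f"neo_filtered_chr{chrom}.vcf.gz"
    # Same options as the command above: drop every INFO tag except RAF and AF.
    annotate = [bcftools, "annotate", "-x", "^INFO/RAF,INFO/AF",
                src, "--output", dst]
    index = ["tabix", dst]
    return annotate, index


# Print the commands for the 22 autosomes instead of running them.
for chrom in range(1, 23):
    annotate, index = filter_commands(chrom)
    print(" ".join(annotate))
    print(" ".join(index))
```

To actually run them from Python, each list can be passed to subprocess.run(..., check=True), one chromosome at a time.
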

Once this is done, you have to update the header of the vcf. I haven't found anything better than the one offered by Salento on Eupedia. If any of the readers of this thread are registered on Eupedia, please send my warmest thanks to Salento.
First open the header of your vcf with the following command:

Quote:./bcftools head neo_filtered_chr1.vcf.gz > old_header_chr1.hr

Open this .hr file in a text editor. The last line, the one which contains the column titles and the individual codes, MUST NOT BE MODIFIED. Everything above that line should be replaced with the following header. The "contig" line indicates the chromosome number and therefore must be changed depending on the chromosome you are working on (I don't believe its length is useful, but in any case you'll find it in the contig lines of the original header):

Quote:##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=30/10/2021 - 05:10:05
##source=GLIMPSE_phase v1.0.0
##contig=<ID=1,assembly=b37,length=249250621>
##INFO=<ID=RAF,Number=A,Type=Float,Description="ALT allele frequency in the reference panel">
##INFO=<ID=AF,Number=A,Type=Float,Description="ALT allele frequency computed from DS/GP field across target samples">
##INFO=<ID=INFO,Number=A,Type=Float,Description="Imputation information or quality score">
##INFO=<ID=BUF,Number=A,Type=Integer,Description="Is it a variant site falling within buffer regions? (0=no/1=yes)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Unphased genotypes">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Genotype posteriors">
##FORMAT=<ID=HS,Number=1,Type=Integer,Description="Sampled haplotype pairs packed into intergers (max: 16 pairs, see NMAIN header line)">
##NMAIN=15
##INFO=<ID=pan_troglodytes,Number=1,Type=String,Description="allele observed in pan_troglodytes">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (high-quality bases)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=MASK_1000G,Number=0,Type=Flag,Description="SNP is in 1000G strict mask region">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
##bcftools_viewVersion=1.13+htslib-1.13

Save this modified header under a name, for example new_header_chr1.hr. You can now reheader your vcf:

Quote:./bcftools reheader -h new_header_chr1.hr neo_filtered_chr1.vcf.gz > neo_filtered_reheaded_chr1.vcf.gz
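
The manual editing of the header can also be scripted, which helps when it has to be repeated 22 times. Here is a minimal sketch: it throws away all the old "##" lines, keeps the "#CHROM" line untouched, and puts the new meta lines in front of it. The function names and the tiny example header (with made-up sample names) are only illustrative; in real use the new meta lines would be the full header quoted above, with the contig line adapted to the chromosome.

```python
def contig_line(chrom, length, assembly="b37"):
    """Build the ##contig line for a given chromosome."""
    return f"##contig=<ID={chrom},assembly={assembly},length={length}>"


def build_new_header(old_header_text, new_meta_lines):
    """Replace all meta lines, keeping the #CHROM line unchanged."""
    old_lines = old_header_text.splitlines()
    column_lines = [line for line in old_lines if line.startswith("#CHROM")]
    if len(column_lines) != 1:
        raise ValueError("expected exactly one #CHROM line in the old header")
    return "\n".join(list(new_meta_lines) + column_lines) + "\n"


# Tiny made-up header standing in for old_header_chr1.hr.
old_header = (
    "##fileformat=VCFv4.1\n"
    "##some_old_annotation=to_be_dropped\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB\n"
)
new_meta = [
    "##fileformat=VCFv4.2",
    contig_line(1, 249250621),  # chromosome 1 length as given in the header above
]
new_header = build_new_header(old_header, new_meta)
print(new_header, end="")
```

The resulting text would then be written to new_header_chr1.hr and passed to bcftools reheader as in the command above.
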

c) I think it is possible to run ancIBD on this last vcf, but in my opinion it would be a bit stupid to drag along all the non-imputed individuals present in the Allentoft files, which are identified by the suffix rnd. I therefore advise extracting from Allentoft the list of all imputed individuals (or only those that interest you, if you prefer to work on small files), saving it under any name, for example "klist", and keeping only these individuals. This can be done, for example, with vcftools using the following command:

Quote:./vcftools --gzvcf neo_filtered_reheaded_chr1.vcf.gz --keep klist --recode --recode-INFO-all --stdout | bgzip -c > allentoft_for_ibd_chr1.vcf.gz
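
The "klist" file itself (one individual code per line) can be built from the last line of the header. A minimal sketch, assuming that the non-imputed individuals are exactly those whose code ends with "rnd"; the sample names below are made up for the illustration, and the function name is hypothetical.

```python
def imputed_samples(column_line, suffix="rnd"):
    """Return the sample codes that do not end with the given suffix.

    column_line is the tab-separated #CHROM line of the vcf header;
    its first nine fields are the fixed VCF columns (#CHROM to FORMAT).
    """
    fields = column_line.rstrip("\n").split("\t")
    samples = fields[9:]
    return [name for name in samples if not name.endswith(suffix)]


# Made-up #CHROM line for illustration only.
column_line = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tsampleA\tsampleB_rnd\tsampleC\n"
)
keep = imputed_samples(column_line)
klist_text = "\n".join(keep) + "\n"  # content to save as "klist"
print(klist_text, end="")
```
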

Tabix-index this file and drop it into the vcf_raw directory of your ancIBD directory. This is the file ancIBD will work on. You can now verify that everything went well. The easiest way is to run ancIBD on your vcf. You may want to start with a small chromosome, like 22, before tackling the big ones. The work is the same for you, but not for your machine. In any case, it will have to be done for every chromosome. If everything goes well, you will be able to delete all the intermediate vcfs that have been created (they take up space!), and only keep the hdf5 files. If like me you decide to do this once and for all (keeping all imputed individuals, and that's a lot of people), be warned that this all takes time, even if you have a very fast processor.
To finish, a little disclaimer. I am in no way an expert, far from it. It is therefore possible, even probable, that simpler or faster procedures exist. If this is the case I will be the first to rejoice.

**Anglesqueville** · 05-15-2024, 06:08 AM

I apologize, there were 2 critical typos in my little guide: in a command, I wrote "FORMAT" in place of "INFO". This is now corrected.

**Anglesqueville** · 05-17-2024, 02:04 PM

There, it's finished. The 22 hdf5 files are snug in their box called “Allentoft_IBD”. I just spent an hour computing IBD stats all over the place. It's straightforward and blazingly fast. What a wonderful toy, Christmas in May! I can't wait for Ringbauer's data to be released.

***AimSmall*** · 05-18-2024, 12:58 AM

Any insights it has shown you?

**Anglesqueville** · 05-18-2024, 06:32 AM

(05-18-2024, 12:58 AM)AimSmall Wrote: Any insights it has shown you?

I haven't started using it systematically yet. My first attempts were more akin to the disordered behaviour of a child who has just been given a new toy. My only serious attempt so far has been on a small group made up of Russia_Minino_IA, the Karelians from Bolshoy Olenyi Ostrov and the Finns from Levänluhta, with the same observation as Allentoft (which is sad but reassuring): no significant genealogical link, at least above 8 cM, either between Minino and Levänluhta or between BOO and Levänluhta. This afternoon I'm going to try to see if anything interesting appears when we throw the people from Falköping, those from Stora Förvar and the HG from Latvia into this machine. If McColl is right, the Latvians should win the gold medal for IBDs shared with Falköping. That's not at all what I've seen using refinedIBD, so I'm curious about what ancIBD has to say about it.

Queequeg · 05-18-2024, 07:34 AM

(05-18-2024, 06:32 AM)Anglesqueville Wrote: My only serious attempt so far has been on a small group made up of Russia_Minino_IA, the Karelians from Bolshoy Olenyi Ostrov and the Finns from Levänluhta, with the same observation as Allentoft (which is sad but reassuring): no significant genealogical link, at least above 8 cM, either between Minino and Levänluhta or between BOO and Levänluhta.

Very interesting, thank you. I'd have expected some kind of more meaningful IBD link between Neo538 and the Levänluhta group. The "Karelians" of the BOO site are, as expected and not very surprisingly, a different story.

**Anglesqueville** · (This post was last modified: 05-18-2024, 12:01 PM by Anglesqueville.)

Precisely:
[Image: QKMxfpP.jpg]

You see that when I say "no significant genealogical link" I am being a bit disingenuous. Let's just say I was hoping for something else...

Queequeg · 05-18-2024, 04:46 PM

Angles: if I'm right, you have in any case confirmed the connection between DA238, the most eastern-shifted sample of the Levänluhta group, and NEO61 of the BOO site. Apparently/possibly one of the ancestors of DA238 descended from a group having links with the BOO people, even if the Levänluhta people were otherwise of different origin?
