Login

TanTin · 11-11-2023, 03:15 AM

There was a good topic in the previous forum (https://genarchivist.freeforums.net/) regarding alignment and converting from fastq to plink .bed . However all that info is not available anymore.

Could someone share all the steps again? Thanks .

***AimSmall*** · 11-11-2023, 02:27 PM

If starting with FASTQ files... I'd recommend getting WGSExtract installed.

It can convert FASTQ and call them into BAM files. From there it will even produce RAW data extracts like 23&Me, AncestryDNA, etc.

With the RAW files, simple process from there to to make your PLINK Bed files.

https://wgsextract.github.io/

crashdoc · 11-11-2023, 03:23 PM

Here's what I had posted on the other forum:

I'm sure there are more efficent methods, but I'll give you two methods, one on usegalaxy.org and the other locally on a linux computer.

I assume you already have plink and know how to merge datasets

1) SETTING UP
-GALAXY: create an account on usegalaxy.org and upload your data (fastq or BAM file). You can use the upload data button on the right side and then choose a local file from your computer or use the paste/fetch data button to paste a url. Alternatively you can use the download tools under the Get Data header of the tools pannel on the left of the main window.
-LOCAL: a) install needed tools: bwa, bcftools, samtools, tabix

b) getting the hg19 reference genome (not needed on Galaxy since built in)
wget -O hg19.tar.gz http://hgdownload.cse.ucsc.edu/goldenPat...mFa.tar.gz
gunzip hg19.tar.gz
tar -xf hg19.tar
cat *.fa >> hg19.fa
rm *.fa
bwa index -a bwtsw hg19.fa

2) MAKING THE BAM FROM THE FASTQ
-GALAXY: use the tool "map with bwa-mem" on your fastq file by
a) selecting the reference genome in the second drop down (hg19)
b) selecting the single or paired-end option the 3rd drop down (usually single with ancient dna)
c) selecting your fastq file in the next drop down(s)

-LOCAL:
bwa mem -t 1 hg19.fa YOURFILE.fastq > YOURFILE.sam (the -t option is the number of concurrent thread to use on your computer, it will mutiply RAM usage as well)
samtools view -@ 2 -bS YOURFILE.sam | samtools sort -@ 2 -m 2G -o YOURFILE.bam (the -@ option is the number of processors to use and the -m the amount of RAM)
samtools index -@ 2 YOURFILE.bam

3) MAKING THE BCF FILE FROM THE BAM
-GALAXY: a) use "bcftools mpileup" tool on the output file of preceding step and select the reference genome (same as before = hg19)
b) use "bcftools call" tool on the output file of preceding step and select "1 - Treat all samples as haploid" in "Select predefined ploidy" under "file format options"

-LOCAL: a) bcftools mpileup -O b -o YOURFILE_mpiled.bcf -f hg19.fa YOURFILE.bam
b) bcftools call --ploidy 1 -m -O b -o YOURFILE.bcf YOURFILE_mpiled.bcf

4.1) PREPARING A FILE TO ANNOTATE WITH SNP IDS

start with the BIM file of a 1240k set
we need to keep only columns 1,2&4 and invert 2&4. We also need to replace the chromosomes numbers in the first column by chr1,chr2,chr3...chr22,chrX,chrY

-GALAXY: a) upload the BIM file
b) use "cut" tool to get 1st and 4th column
c) use the "cut" tool again to get the 2nd column
d) use the "paste" tool to aggregate the 2 files together
e) use the "replace" tool to replace the chromosomes numbers of the first column, after having uploaded a file like this:
1 chr1
2 chr2
3 chr3
4 chr4
5 chr5
6 chr6
7 chr7
8 chr8
9 chr9
10 chr10
11 chr11
12 chr12
13 chr13
14 chr14
15 chr15
16 chr16
17 chr17
18 chr18
19 chr19
20 chr20
21 chr21
22 chr22
23 chrX
24 chrY

*** In the above steps you might lose the column numbering at some point, in the right panel where you see the resulting files, you might lose the blue line with the column numbers and the file type "tabular". There are two ways of correcting that, one is editing the file type (pencil icon and then second tab), the other is to use the "convert" tool.

-LOCAL: use the tools of your choice to achieve the same result: from the BIM file keep first column but change the chromosomes names, put the fourth column next and then the 2nd one and erase the rest.

4.2) ANNOTATING WITH SNP IDS
-GALAXY: use "bcftools annotate" a) Under Add annotations in column, type: CHROM,POS,ID
b) For Annotations File, select 'From a BED or tab-delimited file' and select your annotation file that we prepared in the preceding step
c) as usual click on run tool at the bottom

-LOCAL: let's say the prepared annotation file from the BIM in the step above is called chr1240k
a) bgzip chr1240k
b) tabix -s1 -b2 -e2 num1240k.gz
c) bcftools annotate -O b -o YOURFILE_annotated.bcf -a chr1240k.gz -c CHROM,POS,ID YOURFILE.bcf

5) FILTERING (note that it is possible to do it in one step by excluding the expression: ID=="." || QUAL<25 || QUAL=="."
-GALAXY: a) keep only positions with the 1240k snps IDs: "bcftools filter" under Restrict to in the exclude box type: ID=="."
b) filter for quality: "bcftools filter" under Restrict to in the exclude box type: QUAL<25
c) cleanup empty QUAL fields in case there are some: "bcftools filter" under Restrict to in the exclude box type: QUAL=="."
* at step c) select the uncompressed VCF output type just above the run tool button

-LOCAL: a) bcftools filter -O b -o YOURFILE_clean1.bcf -e 'ID=="."' YOURFILE_annotated.bcf
b) bcftools filter -O b -o YOURFILE_clean2.bcf -e 'QUAL<25' YOURFILE_clean1.bcf
c) bcftools filter -o YOURFILE_clean.vcf -e 'QUAL=="."' YOURFILE_clean2.bcf

**IN ONE STEP: bcftools filter -o YOURFILE_clean.vcf -e 'ID=="." || QUAL<25 || QUAL=="."' YOURFILE_annotated.bcf

6) CONVERT TO PLINK
-GALAXY: you have to download the last resulting file (disk icon that appears at the bottom when you click on your file in the history pannel) and do it LOCALLY after having optionnaly renamed it YOURFILE_clean.vcf

-LOCAL: plink --const-fid 0 --vcf YOURFILE_clean.vcf --make-bed --out YOURFILE

7) LAST CLEANUP AND MERGE
a) try a merge with a dataset you already have: plink --bfile DATASETNAME --bmerge YOURFILE.bed YOURFILE.bim YOURFILE.fam --indiv-sort 0 --allow-no-sex --make-bed --out NEWDATASETNAME
b) If it gives you an error message of the kind: "Error: xx variants with 3+ alleles present" you will have to remove the allele because plink only supports two allele possibility at each variant/snp. Fortunately when we get this error plink creates the necessary file (NEWDATASETNAME-merge.missnp) and we just have to run:
plink --bfile YOURFILE --exclude NEWDATASETNAME-merge.missnp --make-bed --out YOURFILE2
and attempt to merge again:
plink --bfile DATASETNAME --bmerge YOURFILE2.bed YOURFILE2.bim YOURFILE2.fam --indiv-sort 0 --allow-no-sex --make-bed --out NEWDATASETNAME

8) You're done except you might want to edit the fam file as usual.

Again, I'm sure there are more efficient ways, especially with multiple samples, but I went with what I know, hoping others will share their knowledge as well.

crashdoc · 11-18-2023, 07:29 PM

Here are some improvements in step 3) a)

TO IMPROVE QUALITY:
In UseGalaxy:
Under "Input Filtering Options" section "Quality Options" instead of default choose "Set base and mapping quality options"
-in "per-Base Alignment Quality" select "disable BAQ"
-in "Minimum mapping quality" write 30
-in "Minimum base quality" write 30

Locally:
To add to "bcftools mpileup" command line
-B -q 30 -Q 30

TO SIGNIFICANTLY SPEED-UP PROCESS:
(restrict to 1240k Regions right from the mpileup step)
In UseGalaxy:
Under "Restrict to" section "Regions" select "Operate on Regions specified in a history dataset"
and then as "Regions File" select a file with the first two columns of the one prepared in step 4.1)

Locally:
add -R <Text file with chr and location columns> to the "bcftools mpileup" command line

--------

Also to automate the process for many samples one can use pileupCaller
https://github.com/stschiff/sequenceTool.../README.md

However one loses the possibility to filter for quality at step 5) because the quality in the bcf for a position will be a combination of the quality of all the samples at this position, so I would not advise it for low quality samples.

crashdoc · (This post was last modified: 11-22-2023, 12:37 AM by crashdoc.)

Please ignore my last comment on losing the possibility to filter for quality at step 5) because of pileupcaller. Actually, for low quality samples, IT IS NOT RECOMMENDED TO FILTER QUAL<25 (only QUAL=. which pileupcaller already does) or else we lose too many transitions and transversions reads and end up with a heavy bias towards the reference genome.

Instead it is trimming that is recommended (removing 2,4, or 6 bases on each end of each segment), for instance with trimbam from bamutil.

Just wanted to prevent others from making the same error I made!

Merriku · 01-06-2024, 05:41 PM

I would like to add something to crashdoc's excellent posts. You can add rsids to your vcf by annotating them with either a dbsnp vcf file or from a tab-delimited file as described by crashdoc. But you won't be able to annotate it successfully unless both your vcf file and the file you are using to annotate are using same chromosomal notations (coulmn 1 of vcf files). You can verify that by uploading both your vcf file and tab file/dbsnp file and isolating their first column with 'cut' function on usegalaxy. You can then download the cut columns and inspect them on notepad. I guess most vcf files will be using any of these 3 chromosomal notations (see under alias sequence names)
https://genome.ucsc.edu/cgi-bin/hgTracks...mInfoPage=

Now you can match column 1 of your vcf with the annotating file by using bcftools (either locally or on usegalaxy).
First make a text file with-

Old_chromsome_name(space)New_chromosome_name/one chromosome per line/

Then-
*On usegalaxy - upload the text file>bcftools annotate>select you vcf as input>change annotations>Rename CHROM>select your text file>Run
*Locally-
bcftools annotate --rename-chrs yourchrrenamefile.txt yourfile.vcf

TanTin · 01-11-2024, 01:45 AM

Quote:4.1) PREPARING A FILE TO ANNOTATE WITH SNP IDS

start with the BIM file of a 1240k set
we need to keep only columns 1,2&4 and invert 2&4. We also need to replace the chromosomes numbers in the first column by chr1,chr2,chr3...chr22,chrX,chrY

-GALAXY: a) upload the BIM file

- At step 4 I have a VCF file with some data like this:

1 2 3
10 C 60014
10 C 60015
10 T 60016
10 A 60017
10 A 60018

Here I already replaced CHR10 to 10 in col 1 .
Column 2 is REF and col 3 is POS.

Regarding BIM file of a 1240k set : do I use an empty file , same as the example above with CHR1 - CHR24 (Y), or I can upload some bim file like an example: v52.2_1240K_public.bim from V52 dataset?

Merriku · 01-11-2024, 04:56 AM

(01-11-2024, 01:45 AM)TanTin Wrote:
Quote:4.1) PREPARING A FILE TO ANNOTATE WITH SNP IDS

start with the BIM file of a 1240k set
we need to keep only columns 1,2&4 and invert 2&4. We also need to replace the chromosomes numbers in the first column by chr1,chr2,chr3...chr22,chrX,chrY

-GALAXY: a) upload the BIM file

- At step 4 I have a VCF file with some data like this:

1 2 3
10 C 60014
10 C 60015
10 T 60016
10 A 60017
10 A 60018

Here I already replaced CHR10 to 10 in col 1 .
Column 2 is REF and col 3 is POS.

Regarding BIM file of a 1240k set : do I use an empty file , same as the example above with CHR1 - CHR24 (Y), or I can upload some bim file like an example: v52.2_1240K_public.bim from V52 dataset?

I would suggest to check for notations of all chromosomes in your VCF file as it might be using X and Y as opposed to 23 and 24 of the 1240k bim file.

You have to upload the bim file of any 1240k dataset. Isolating and rearranging columns of the bim file,as described in the steps, would produce a delimited tab file with CHROM, POS , ID columns.

TanTin · 01-11-2024, 06:18 PM

I was able to convert a file. Few notes on this process.
1. I think I get better results by using " Map with BWA - map short reads (< 100 bp) against reference genome"
-this option may better read for short reads.

2. For small files I used Uncompressed files as output format, which allows me to see the result at each step. If file is compressed Galaxy can not display the content of the file.

TanTin · (This post was last modified: 01-11-2024, 11:19 PM by TanTin.)

One of the wasting time steps here is the creation of bim file for the annotation.
Instead of using Galaxy tools for cut / paste / replace, here is a fast script to prepare the file locally in R:

library(tidyverse)
info.bim = read.table("SouthernArc_Public.bim")
info.bim1 = subset(info.bim,select = c(1,4,2) )
for (i in 1:22){
info.bim1$V1[info.bim1$V1 == i ] <- paste0("chr" , i)
}
info.bim1$V1[info.bim1$V1 == 23 ] <- paste0("chr" , "X")
info.bim1$V1[info.bim1$V1 == 24 ] <- paste0("chr" , "Y")

write.table(info.bim1 , file = paste0( "info.bim" ) , sep = "\t",
row.names = F, col.names = F , quote = F)

TanTin · 01-12-2024, 04:13 PM

I uploaded the converted files for Buran Kaya. ( https://ufile.io/e5kkh6hl )

Uploading also a 1240k bim file that can be used directly.
https://ufile.io/8e2d5rox

TanTin · 01-15-2024, 06:21 PM

Question for those with more experience:
How do we merge fastq files for the same individual ?

For example:
https://www.ebi.ac.uk/ena/browser/view/PRJEB67353

An individual with Sarmatian-related ancestry in Roman Britain
We have 5 fastq files

Sample Title:Offord Cluny 203645
All for the same individual: Sample Alias:C10271

I created several BAM files (map with bwa-mem),
then I tried mpileup on multiple files:
bcftools mpileup ( bcftools mpileup on data 8, data 7, and data 6)

At the end in the final fam file after converting vcf to plink I got separate lines for each initial fastq file. So it is not merging them but processing as separate individuals.

crashdoc · (This post was last modified: 01-16-2024, 02:12 AM by crashdoc.)

(01-15-2024, 06:21 PM)TanTin Wrote: Question for those with more experience:
How do we merge fastq files for the same individual ?

For example:
https://www.ebi.ac.uk/ena/browser/view/PRJEB67353

An individual with Sarmatian-related ancestry in Roman Britain
We have 5 fastq files

Sample Title:Offord Cluny 203645
All for the same individual: Sample Alias:C10271

I created several BAM files (map with bwa-mem),
then I tried mpileup on multiple files:
bcftools mpileup ( bcftools mpileup on data 8, data 7, and data 6)

At the end in the final fam file after converting vcf to plink I got separate lines for each initial fastq file. So it is not merging them but processing as separate individuals.

I am no expert, but usually people use pileupcaller and this is not a problem since it can ignore the read group names and take all read from the BAM.

However if you want to use the steps outlined above, before you merge the different samples with samtools (with the -c option if in commandline), all your sam/bams must have the same read group name. One way to achieve this would be to add the -R option in the bwa step,but I have not tried it myself. To use the -R option (in galaxy it is pretty straightforward, just select the set read group information option and fill in the read group ID) in commandline you have to specify: -R "@RG\tID:xxxxx\tPL:ILLUMINA" where xxxxx is the ID you want to assign, just make sure it is the same for all fastq you process for the same individual. Again I have not tried it, so let us know how it goes!

Merriku · 01-21-2024, 05:48 AM

(01-06-2024, 05:41 PM)Merriku Wrote: I would like to add something to crashdoc's excellent posts. You can add rsids to your vcf by annotating them with either a dbsnp vcf file or from a tab-delimited file as described by crashdoc. But you won't be able to annotate it successfully unless both your vcf file and the file you are using to annotate are using same chromosomal notations (coulmn 1 of vcf files). You can verify that by uploading both your vcf file and tab file/dbsnp file and isolating their first column with 'cut' function on usegalaxy. You can then download the cut columns and inspect them on notepad. I guess most vcf files will be using any of these 3 chromosomal notations (see under alias sequence names)
https://genome.ucsc.edu/cgi-bin/hgTracks...mInfoPage=

Now you can match column 1 of your vcf with the annotating file by using bcftools (either locally or on usegalaxy).
First make a text file with-

Old_chromsome_name(space)New_chromosome_name/one chromosome per line/

Then-
*On usegalaxy - upload the text file>bcftools annotate>select you vcf as input>change annotations>Rename CHROM>select your text file>Run
*Locally-
bcftools annotate --rename-chrs yourchrrenamefile.txt yourfile.vcf

Adding on to this-
*the hg19 aligned BAMs/VCFs use (chr1-chr22,chrX,chrY) chromosomal notation.
*GRCh37 aligned BAMs/VCFs (Dantelabs for example) use (1-22,X,Y) chromosomal notation.
*while 1240k bim file uses (1-24).
*recent builds of the dbsnp vcfs on the other hand use (NC.....,NW....,NT.....).

I might be wrong about this but I suspect that dbsnp vcfs (upto recent v156 at least),despite having more than a billion SNPs in their database, have poor overlap with 1240k SNPs.

Tomenable · (This post was last modified: 02-02-2024, 08:10 PM by Tomenable.)

Shouldn't this thread be moved to Tutorials > Other subforum?

Login
Username/Email:
Password:	Lost Password?
	Remember me