Hello guest, if you read this it means you are not registered. Click here to register in a few simple steps, you will enjoy all features of our Forum.

Converting from  fastq to plink
#1
There was a good topic in the previous forum (https://genarchivist.freeforums.net/) regarding alignment  and converting from  fastq to plink .bed . However all that info is not available anymore. 

Could someone share all the steps again? Thanks .
Capsian20 likes this post
Reply
#2
If starting with FASTQ files... I'd recommend getting WGSExtract installed.

It can convert FASTQ and call them into BAM files. From there it will even produce RAW data extracts like 23&Me, AncestryDNA, etc.

With the RAW files, simple process from there to to make your PLINK Bed files.

https://wgsextract.github.io/
Capsian20, TanTin, miquirumba And 1 others like this post
Reply
#3
Here's what I had posted on the other forum:

I'm sure there are more efficent methods, but I'll give you two methods, one on usegalaxy.org and the other locally on a linux computer.

I assume you already have plink and know how to merge datasets

1) SETTING UP
-GALAXY: create an account on usegalaxy.org and upload your data (fastq or BAM file). You can use the upload data button on the right side and then choose a local file from your computer or use the paste/fetch data button to paste a url. Alternatively you can use the download tools under the Get Data header of the tools pannel on the left of the main window.
-LOCAL: a) install needed tools: bwa, bcftools, samtools, tabix

b) getting the hg19 reference genome (not needed on Galaxy since built in)
wget -O hg19.tar.gz http://hgdownload.cse.ucsc.edu/goldenPat...mFa.tar.gz
gunzip hg19.tar.gz
tar -xf hg19.tar
cat *.fa >> hg19.fa
rm *.fa
bwa index -a bwtsw hg19.fa

2) MAKING THE BAM FROM THE FASTQ
-GALAXY: use the tool "map with bwa-mem" on your fastq file by
a) selecting the reference genome in the second drop down (hg19)
b) selecting the single or paired-end option the 3rd drop down (usually single with ancient dna)
c) selecting your fastq file in the next drop down(s)

-LOCAL:
bwa mem -t 1 hg19.fa YOURFILE.fastq > YOURFILE.sam  (the -t option is the number of concurrent thread to use on your computer, it will mutiply RAM usage as well)
samtools view -@ 2 -bS YOURFILE.sam | samtools sort -@ 2 -m 2G -o YOURFILE.bam  (the -@ option is the number of processors to use and the -m the amount of RAM)
samtools index -@ 2 YOURFILE.bam

3) MAKING THE BCF FILE FROM THE BAM
-GALAXY: a) use "bcftools mpileup" tool on the output file of preceding step and select the reference genome (same as before = hg19)
b) use "bcftools call" tool on the output file of preceding step and select "1 - Treat all samples as haploid" in "Select predefined ploidy" under "file format options"

-LOCAL: a) bcftools mpileup -O b -o YOURFILE_mpiled.bcf -f hg19.fa YOURFILE.bam
b) bcftools call --ploidy 1 -m -O b -o YOURFILE.bcf YOURFILE_mpiled.bcf

4.1) PREPARING A FILE TO ANNOTATE WITH SNP IDS

start with the BIM file of a 1240k set
we need to keep only columns 1,2&4 and invert 2&4.  We also need to replace the chromosomes numbers in the first column by chr1,chr2,chr3...chr22,chrX,chrY

-GALAXY: a) upload the BIM file
b) use "cut" tool to get 1st and 4th column
c) use the "cut" tool again to get the 2nd column
d) use the "paste" tool to aggregate the 2 files together
e) use the "replace" tool to replace the chromosomes numbers of the first column, after having uploaded a file like this:
1  chr1
2  chr2
3  chr3
4  chr4
5  chr5
6  chr6
7  chr7
8  chr8
9  chr9
10  chr10
11  chr11
12  chr12
13  chr13
14  chr14
15  chr15
16  chr16
17  chr17
18  chr18
19  chr19
20  chr20
21  chr21
22  chr22
23  chrX
24  chrY

*** In the above steps you might lose the column numbering at some point, in the right panel where you see the resulting files, you might lose the blue line with the column numbers and the file type "tabular".  There are two ways of correcting that, one is editing the file type (pencil icon and then second tab), the other is to use the "convert" tool.

-LOCAL: use the tools of your choice to achieve the same result: from the BIM file keep first column but change the chromosomes names, put the fourth column next and then the 2nd one and erase the rest.

4.2) ANNOTATING WITH SNP IDS
-GALAXY: use "bcftools annotate" a) Under Add annotations in column, type: CHROM,POS,ID
b) For Annotations File, select 'From a BED or tab-delimited file' and select your annotation file that we prepared in the preceding step
c) as usual click on run tool at the bottom

-LOCAL: let's say the prepared annotation file from the BIM in the step above is called chr1240k
a) bgzip chr1240k
b) tabix -s1 -b2 -e2 num1240k.gz
c) bcftools annotate -O b -o YOURFILE_annotated.bcf -a chr1240k.gz -c CHROM,POS,ID YOURFILE.bcf

5) FILTERING (note that it is possible to do it in one step by excluding the expression: ID=="." || QUAL<25 || QUAL=="."
-GALAXY: a) keep only positions with the 1240k snps IDs: "bcftools filter" under Restrict to in the exclude box type: ID=="."
b) filter for quality: "bcftools filter" under Restrict to in the exclude box type: QUAL<25
c) cleanup empty QUAL fields in case there are some: "bcftools filter" under Restrict to in the exclude box type: QUAL=="."
* at step c) select the uncompressed VCF output type just above the run tool button

-LOCAL: a) bcftools filter -O b -o YOURFILE_clean1.bcf -e 'ID=="."' YOURFILE_annotated.bcf
b)  bcftools filter -O b -o YOURFILE_clean2.bcf -e 'QUAL<25' YOURFILE_clean1.bcf
c)  bcftools filter -o YOURFILE_clean.vcf -e 'QUAL=="."' YOURFILE_clean2.bcf

**IN ONE STEP: bcftools filter -o YOURFILE_clean.vcf -e 'ID=="." || QUAL<25 || QUAL=="."' YOURFILE_annotated.bcf

6) CONVERT TO PLINK
-GALAXY: you have to download the last resulting file (disk icon that appears at the bottom when you click on your file in the history pannel) and do it LOCALLY after having optionnaly renamed it YOURFILE_clean.vcf

-LOCAL: plink --const-fid 0 --vcf YOURFILE_clean.vcf --make-bed --out YOURFILE

7) LAST CLEANUP AND MERGE
a) try a merge with a dataset you already have: plink --bfile DATASETNAME --bmerge YOURFILE.bed YOURFILE.bim YOURFILE.fam --indiv-sort 0 --allow-no-sex --make-bed --out NEWDATASETNAME
b) If it gives you an error message of the kind: "Error: xx variants with 3+ alleles present" you will have to remove the allele because plink only supports two allele possibility at each variant/snp. Fortunately when we get this error plink creates the necessary file (NEWDATASETNAME-merge.missnp) and we just have to run:
  plink --bfile YOURFILE --exclude NEWDATASETNAME-merge.missnp --make-bed --out YOURFILE2
and attempt to merge again:
plink --bfile DATASETNAME --bmerge YOURFILE2.bed YOURFILE2.bim YOURFILE2.fam --indiv-sort 0 --allow-no-sex --make-bed --out NEWDATASETNAME

8) You're done except you might want to edit the fam file as usual.

Again, I'm sure there are more efficient ways, especially with multiple samples, but I went with what I know, hoping others will share their knowledge as well.
AimSmall, TanTin, ArmandoR1b And 3 others like this post
Reply
#4
Here are some improvements in step 3) a)

TO IMPROVE QUALITY:
In UseGalaxy:
Under "Input Filtering Options" section "Quality Options" instead of default choose "Set base and mapping quality options"
-in "per-Base Alignment Quality" select "disable BAQ"
-in "Minimum mapping quality" write 30
-in "Minimum base quality" write 30

Locally:
To add to "bcftools mpileup" command line
-B -q 30 -Q 30

TO SIGNIFICANTLY SPEED-UP PROCESS:
(restrict to 1240k Regions right from the mpileup step)
In UseGalaxy:
Under "Restrict to" section "Regions" select "Operate on Regions specified in a history dataset"
and then as "Regions File" select a file with the first two columns of the one prepared in step 4.1)

Locally:
add -R <Text file with chr and location columns> to the "bcftools mpileup" command line

--------

Also to automate the process for many samples one can use pileupCaller
https://github.com/stschiff/sequenceTool.../README.md

However one loses the possibility to filter for quality at step 5) because the quality in the bcf for a position will be a combination of the quality of all the samples at this position, so I would not advise it for low quality samples.
kolompar, Merriku, miquirumba And 2 others like this post
Reply
#5
Please ignore my last comment on losing the possibility to filter for quality at step 5) because of pileupcaller.  Actually, for low quality samples, IT IS NOT RECOMMENDED TO FILTER QUAL<25 (only QUAL=. which pileupcaller already does) or else we lose too many transitions and transversions reads and end up with a heavy bias towards the reference genome.

Instead it is trimming that is recommended (removing 2,4, or 6 bases on each end of each segment), for instance with trimbam from bamutil.

Just wanted to prevent others from making the same error I made!
Merriku, TanTin, AimSmall And 1 others like this post
Reply
#6
I would like to add something to crashdoc's excellent posts. You can add rsids to your vcf by annotating them with either a dbsnp vcf file or from a tab-delimited file as described by crashdoc. But you won't be able to annotate it successfully unless both your vcf file and the file you are using to annotate are using same chromosomal notations (coulmn 1 of vcf files). You can verify that by uploading both your vcf file and tab file/dbsnp file and isolating their first column with 'cut' function on usegalaxy. You can then download the cut columns and inspect them on notepad. I guess most vcf files will be using any of these 3 chromosomal notations (see under alias sequence names)
https://genome.ucsc.edu/cgi-bin/hgTracks...mInfoPage=

Now you can match column 1 of your vcf with the annotating file by using bcftools (either locally or on usegalaxy). 
First make a text file with-

 Old_chromsome_name(space)New_chromosome_name/one chromosome per line/ 

Then- 
*On usegalaxy - upload the text file>bcftools annotate>select you vcf as input>change annotations>Rename CHROM>select your text file>Run 
*Locally- 
bcftools annotate --rename-chrs yourchrrenamefile.txt yourfile.vcf
TanTin and crashdoc like this post
Reply
#7
Quote:4.1) PREPARING A FILE TO ANNOTATE WITH SNP IDS

start with the BIM file of a 1240k set
we need to keep only columns 1,2&4 and invert 2&4.  We also need to replace the chromosomes numbers in the first column by chr1,chr2,chr3...chr22,chrX,chrY

-GALAXY: a) upload the BIM file

- At step 4 I have a VCF file with some data like this:

1 2 3
10 C 60014
10 C 60015
10 T 60016
10 A 60017
10 A 60018

Here I already replaced CHR10 to 10 in col 1 . 
Column 2 is REF and col 3 is POS.

Regarding  BIM file of a 1240k set : do I use an empty file , same as the example above with CHR1 - CHR24 (Y),  or I can upload some bim file like an example:   v52.2_1240K_public.bim  from V52 dataset?
Reply
#8
(01-11-2024, 01:45 AM)TanTin Wrote:
Quote:4.1) PREPARING A FILE TO ANNOTATE WITH SNP IDS

start with the BIM file of a 1240k set
we need to keep only columns 1,2&4 and invert 2&4.  We also need to replace the chromosomes numbers in the first column by chr1,chr2,chr3...chr22,chrX,chrY

-GALAXY: a) upload the BIM file

- At step 4 I have a VCF file with some data like this:

1 2 3
10 C 60014
10 C 60015
10 T 60016
10 A 60017
10 A 60018

Here I already replaced CHR10 to 10 in col 1 . 
Column 2 is REF and col 3 is POS.

Regarding  BIM file of a 1240k set : do I use an empty file , same as the example above with CHR1 - CHR24 (Y),  or I can upload some bim file like an example:   v52.2_1240K_public.bim  from V52 dataset?

I would suggest to check for notations of all chromosomes in your VCF file as it might be using X and Y as opposed to 23 and 24 of the 1240k bim file.

You have to upload the bim file of any 1240k dataset. Isolating and rearranging columns of the bim file,as described in the steps, would produce a delimited tab file with CHROM, POS , ID columns.
TanTin likes this post
Reply
#9
I was able to convert a file. Few notes on this process.
1. I think I get better results by using " Map with BWA - map short reads (< 100 bp) against reference genome"
-this option may better read for short reads.

2. For small files I used Uncompressed files as output format, which allows me to see the result at each step. If file is compressed Galaxy can not display the content of the file.
Merriku likes this post
Reply
#10
One of the wasting time steps here is the creation of bim file for the annotation.
Instead of using Galaxy tools for cut / paste / replace, here is a fast script to prepare the file locally in R:


library(tidyverse)
info.bim = read.table("SouthernArc_Public.bim")
info.bim1 = subset(info.bim,select =  c(1,4,2) )
for (i in 1:22){
info.bim1$V1[info.bim1$V1 == i  ] <- paste0("chr" , i)
}
info.bim1$V1[info.bim1$V1 == 23  ] <- paste0("chr" , "X")
info.bim1$V1[info.bim1$V1 == 24  ] <- paste0("chr" , "Y")

write.table(info.bim1 , file = paste0( "info.bim" ) , sep = "\t",
            row.names = F, col.names = F , quote = F)
crashdoc, AimSmall, Merriku like this post
Reply
#11
I uploaded the converted files for Buran Kaya. ( https://ufile.io/e5kkh6hl )

Uploading also a 1240k bim file that can be used directly.
https://ufile.io/8e2d5rox
Reply
#12
Question for those with more experience:
How do we merge fastq files for the same individual ?

For example:
https://www.ebi.ac.uk/ena/browser/view/PRJEB67353

An individual with Sarmatian-related ancestry in Roman Britain
We have 5 fastq files

Sample Title:Offord Cluny 203645
All for the same individual:  Sample Alias:C10271

I created several BAM  files (map with bwa-mem),
then I tried mpileup    on multiple files:
bcftools mpileup  ( bcftools mpileup on data 8, data 7, and data 6)

At the end in the final fam file after converting vcf to plink  I got separate lines for each initial fastq file.  So it is not merging them but processing as separate individuals.
Reply
#13
(01-15-2024, 06:21 PM)TanTin Wrote: Question for those with more experience:
How do we merge fastq files for the same individual ?

For example:
https://www.ebi.ac.uk/ena/browser/view/PRJEB67353

An individual with Sarmatian-related ancestry in Roman Britain
We have 5 fastq files

Sample Title:Offord Cluny 203645
All for the same individual:  Sample Alias:C10271

I created several BAM  files (map with bwa-mem),
then I tried mpileup    on multiple files:
bcftools mpileup  ( bcftools mpileup on data 8, data 7, and data 6)

At the end in the final fam file after converting vcf to plink  I got separate lines for each initial fastq file.  So it is not merging them but processing as separate individuals.

I am no expert, but usually people use pileupcaller and this is not a problem since it can ignore the read group names and take all read from the BAM.

However if you want to use the steps outlined above, before you merge the different samples with samtools (with the -c option if in commandline), all your sam/bams must have the same read group name.  One way to achieve this would be to add the -R option in the bwa step,but I have not tried it myself.  To use the -R option (in galaxy it is pretty straightforward, just select the set read group information option and fill in the read group ID) in commandline you have to specify: -R "@RG\tID:xxxxx\tPL:ILLUMINA" where xxxxx is the ID you want to assign, just make sure it is the same for all fastq you process for the same individual.  Again I have not tried it, so let us know how it goes!
Merriku, AimSmall, TanTin like this post
Reply
#14
(01-06-2024, 05:41 PM)Merriku Wrote: I would like to add something to crashdoc's excellent posts. You can add rsids to your vcf by annotating them with either a dbsnp vcf file or from a tab-delimited file as described by crashdoc. But you won't be able to annotate it successfully unless both your vcf file and the file you are using to annotate are using same chromosomal notations (coulmn 1 of vcf files). You can verify that by uploading both your vcf file and tab file/dbsnp file and isolating their first column with 'cut' function on usegalaxy. You can then download the cut columns and inspect them on notepad. I guess most vcf files will be using any of these 3 chromosomal notations (see under alias sequence names)
https://genome.ucsc.edu/cgi-bin/hgTracks...mInfoPage=

Now you can match column 1 of your vcf with the annotating file by using bcftools (either locally or on usegalaxy). 
First make a text file with-

 Old_chromsome_name(space)New_chromosome_name/one chromosome per line/ 

Then- 
*On usegalaxy - upload the text file>bcftools annotate>select you vcf as input>change annotations>Rename CHROM>select your text file>Run 
*Locally- 
bcftools annotate --rename-chrs yourchrrenamefile.txt yourfile.vcf

Adding on to this-
*the hg19 aligned BAMs/VCFs use (chr1-chr22,chrX,chrY) chromosomal notation.
*GRCh37 aligned BAMs/VCFs (Dantelabs for example) use (1-22,X,Y) chromosomal notation.
*while 1240k bim file uses (1-24).
*recent builds of the dbsnp vcfs on the other hand use (NC.....,NW....,NT.....).

I might be wrong about this but I suspect that dbsnp vcfs (upto recent v156 at least),despite having more than a  billion SNPs in their database, have poor overlap with 1240k SNPs.
TanTin likes this post
Reply
#15
Shouldn't this thread be moved to Tutorials > Other subforum?
Jalisciense and TanTin like this post
Reply


Forum Jump:


Users browsing this thread: 1 Guest(s)