注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

云之南

风声,雨声,读书声,声声入耳;家事,国事,天下事,事事关心

 
 
 

日志

 
 
关于我

专业背景:计算机科学 研究方向与兴趣: JavaEE-Web软件开发, 生物信息学, 数据挖掘与机器学习, 智能信息系统 目前工作: 基因组, 转录组, NGS高通量数据分析, 生物数据挖掘, 植物系统发育和比较进化基因组学

网易考拉推荐

SAMTOOLS & SNP/INDEL  

2014-04-10 13:55:45|  分类: 生信分析软件 |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |
http://www.novocraft.com/wiki/tiki-index.php?page=SNP+and+MicroIndel+detection
Remove PCR duplicates
It is highly recommended to remove PCR duplicates before proceeding with the SNP and Indel calls. We use samtools as an example to remove PCR duplicates.

#remove PCR duplicates from single-end reads
samtools rmdup -s reads1.bam reads1.sorted.bam

#For Paired reads
samtools rmdup reads1.bam reads1.pe.sorted.bam 

SNP and Microindel calling
In this example we use the samtools pileup SNP caller and accept reads with a minimum mapping quality of 20:

#The hg18.fasta reference genome sequence is required. samtools.pl is also required
samtools view -u -q 20 reads1.bam | samtools pileup -vc -f hg18.fasta - | samtools.pl varFilter -D 1000 -G 50 > variations.list

Filtering and final list generation
 A simple awk script can be used to filter our list of variations where the indel SNP quality is greater than 50 or 60 respectively:
awk '($3=="*"&&$6>=50)||($3!="*"&&$6>30)' variations.list > variations.filtered.list

Exporting to other formats
The samtools distribution contains a helper script called sam2vcf.pl that may be used to convert to the Variant Call Format (VCF):
sam2vcf.pl variations.filtered.list -r hg18.fasta > variations.vcf

MarkDuplicates is "more correct" in the strict sense. Rmdup is more efficient simply because it does handle those tough cases. Rmdup works for single-end, too, but it cannot do paired-end and single-end at the same time. It does not work properly for mate-pair reads if read lengths are different.

As to SNP calling, the possibly best strategy is to do realignment with GATK, cap base quality by samtools, call SNPs with GATK and then recalibrate variant quality with GATK afterwards (i.e., you still need samtools for one step). In addition, for now, GATK is (arguably) less accurate than samtools for multi-sample low-coverage SNP calling, but the next version of UnifiedGenotyper is coming out which completely closes the gap because GATK and samtools will be using the same SNP calling model. As to variant quality recalibration, I am little cautious of its use of ts/tv. This is just a personal opinion, though, and perhaps biased.
http://www.biostars.org/p/3917/
Useful samtools utilies:
1. samtools idxstats : This tool will provide statistics about how many reads have aligned to each sequence/chromosome in the reference genome. The input bam file must be sorted and indexed.
samtools idxstats <in.bam>
2. samtools flagstat : Simple stats about how many reads mapped to the reference, how many reads were paired properly etc. The input bam file must be sorted and indexed.
samtools flagstat <in.bam>
Example:
1. samtools mpileup -Euf reference.fna aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf
where reference.fna : reference, in fasta format
aln1.bam, aln2.bam : BAM files containing alignment results. You can use 1 or more alignment flies at a time.  Note that as of late 2011, the new BAQ filter seems to aggressively remove SNPs unless you "extend" it with the "-E" option.
2. bcftools view var.raw.bcf | vcfutils.pl varFilter -D10 > var.filtered.vcf
BCFtools does the actual calling of SNPS and the SNP information is stored in var.filtered.vcf. -D option is used to filter by depth of coverage at the SNP location.
Information about VCF file and other filter options at : http://samtools.sourceforge.net/mpileup.shtml
OLD VERSION: Commands to use samtools with a bam file, input.bam,
1. Use samtools pileup to call SNPs
samtools pileup -vcf reference.fna input.bam > out.pileup 2>out.log &
where reference.fna  : reference file, in fasta format
          input.bam  : BAM file containing alignment results
2. Filter the results further by snp quality:
samtools.pl out.pileup||awk '$6>=20' > out.final.pileup

#bowtie2-build roretzi.fasta roretzi
perl convert_FASTQ_quality.pl PE600_2.fq > ../PE600_2.fq

--fr/--rf/--ff 设定上下游reads和前导链paired-end比对的方向. --fr: 匹配时,
read1在5'端上游, 和前导链一致, read2在3'下游, 和前导链反向互补. 或者read2在
上游, read1在下游反向互补; --rf: read1在5'端上游, 和前导链反向互补, read2在
3'端下游, 和前导链一致; --ff: 两条reads都和前导链一致. Default: --fr. 默认
设置适合于Illumina的paired-end测序数据; 若是mate-paired, 则要选择—rf参数.

perl convert_FASTQ_quality.pl MP8000_1.fq > ../MP8000_1.fq &
perl convert_FASTQ_quality.pl MP8000_2.fq > ../MP8000_2.fq &

perl convert_FASTQ_quality.pl PE280April_1.fq > ../PE280April_1.fq &
perl convert_FASTQ_quality.pl PE280April_2.fq > ../PE280April_2.fq &

perl convert_FASTQ_quality.pl PE600_1.fq > ../PE600_1.fq
perl convert_FASTQ_quality.pl PE600_2.fq > ../PE600_2.fq

bowtie2-build roretzi.fasta roretzi
bowtie2 --phred33 -k 1 -I 1000 -X 9000 --no-mixed --no-discordant --rf -x roretzi -1 MP8000_1.fq -2 MP8000_2.fq -S roretzi_MP8000.sam
sleep 1s;
samtools view -bS roretzi_MP8000.sam > roretzi_MP8000.bam
sleep 1s;

bowtie2 --phred33 -k 1 -I 0 -X 500 --no-mixed --no-discordant --fr -x roretzi -1 PE280April_1.fq -2 PE280April_2.fq -S roretzi_PE280April.sam
sleep 1s;
samtools view -bS roretzi_PE280April.sam > roretzi_PE280April.bam
sleep 1s;

bowtie2 --phred33 -k 1 -I 0 -X 500 --no-mixed --no-discordant --fr -x roretzi -1 PE280August_1.fq -2 PE280August_2.fq -S roretzi_PE280August.sam
sleep 1s;
samtools view -bS roretzi_PE280August.sam > roretzi_PE280August.bam
sleep 1s;

bowtie2 --phred33 -k 1 -I 0 -X 1000 --no-mixed --no-discordant --fr -x roretzi -1 PE600_1.fq -2 PE600_2.fq -S roretzi_PE600.sam
sleep 1s;
samtools view -bS roretzi_PE600.sam > roretzi_PE600.bam
leep 1s;

#=============
samtools merge roretzi.bam roretzi_MP8000.bam roretzi_PE280April.bam roretzi_PE280August.bam roretzi_PE600.bam
sleep 1s;
samtools sort roretzi.bam roretzi.sorted
sleep 1s;
samtools mpileup -uf roretzi.fasta roretzi.sorted.bam > roretzi.sorted.bam.mpileup
sleep 1s;
bcftools view -bvcg roretzi.sorted.bam.mpileup > roretzi.sorted.bam.mpileup.bcf
#samtools mpileup -uf roretzi.fasta roretzi.sorted.bam | bcftools view -bvcg - > roretzi.sorted.bam.bcf
bcftools view roretzi.sorted.bam.mpileup.bcf > roretzi.sorted.bam.mpileup.vcf
vcfutils.pl varFilter -Q 20 -d 10 -p roretzi.sorted.bam.mpileup.vcf > roretzi.sorted.bam.mpileup.SNPs_INDELs

java -jar VarScan.v2.3.3.jar mpileup2snp HR_sequences/roretzi.sorted.bam.mpileup --output-vcf 1 > VarScan.vcf

#=============
nohup bash s_job_MP8000.sh > s_job_MP8000.nohup 2>&1 &
nohup bash s_job_PE280April.sh > s_job_PE280April.nohup 2>&1 &
nohup bash s_job_PE280August.sh > s_job_PE280August.nohup 2>&1 &
nohup bash s_job_PE600.sh > s_job_PE600.nohup 2>&1 &


fastq_quality_trimmer -Q 33 -t 20 -l 50 -i D0AJ9ACXX_AOL01_01P01_L1_L005_R1.fastq -o AOL01_01P01_R1 &
fastx_quality_stats: Invalid quality score value 
http://seqanswers.com/forums/archive/index.php/t-7399.html
I think as Simon suggests your (old) version of FASTX-toolkit is probably assuming you have Solexa/Illumina FASTQ (which have a narrow range of allowed characters), but you have Sanger FASTQ. Try the FASTX command line option -Q 33 here.:

Call SNPs and short INDELs for one diploid individual:
 samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf
 bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf
 The -D option of varFilter controls the maximum read depth, which should be adjusted to about twice the average read depth. One may consider to add -C50 to mpileup if mapping quality is overestimated for reads containing excessive mismatches. Applying this option usually helps BWA-short but may not other mappers.

java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticSE  -phred64 1_4000.fq 1_4000.fq.phred33 TOPHRED33
#!/usr/local/bin/perl -w
use strict;
use warnings;
# converts the quality values to standard phred+33 format
# Convert Solexa/Illumina FASTQ to the standard FASTQ
die "Usage: perl $0 phred+64 > phred+33\n" unless (@ARGV == 1);
# Solexa->Sanger quality conversion table
my @conv_table;
for (-64..64) {
  $conv_table[$_+64] = chr(int(33 + 10*log(1+10**($_/10.0))/log(10)+.499));
}
my $s = "";
open (IN, "$ARGV[0]") || die "cannot open $ARGV[0]\n";
while (<IN>) {
        $s = $_;
        $s =~ s/\s*$//g;
        if ($s =~ /^@/) {
          print "$s\n";
          $_ = <IN>;
          $s = $_;
          $s =~ s/\s*$//g;
          print "$s\n";
          $_ = <IN>;
          $_ = <IN>;
          $s = $_;
          $s =~ s/\s*$//g;
          my @t = split('', $s);
          my $qual = '';
          $qual .= $conv_table[ord($_)] for (@t);
          print "+\n$qual\n";
        }
}
close(IN);




Hi,

I wish you a happy new year and all the best for you and your family!

> I have got some results of SNP/INDEL analysis, including:

> 1. amino acid mutation in protein conding regions, (see
> roretzi.SNPs.mutant)
> 2. SNP destidution in whole genome regions, repeat, exon, intron and
> intergenic region, per 100 bp and 1000 bp window size.
> (roretzi.SNPs.region_100.stat, roretzi.SNPs.region_1000.stat)
>     Colud you please add the roretzi.SNPs.region_100.stat to Gbrowser?
>  so we can investigate which regions have more SNPs than others. For
> different regions, you can use differet colors.

I put in the same track the coding/intergenic/intron and repeat data
because for most of the time there is no overlap between them.
In the case of overlap the box will be superpose and the color will be a
mix of the colors associated.(and so the height of the box will not be
the sum -> may be is a disadvantage?)
Example : for the same position if there are repeat(yellow) and
intron(red) data the color of the box will be orange (yellow + red))

If you prefer 4 different tracks, tell me!

Note1 : It is possible in Gbrowse to display track either with public
permission, either with restricted permission (filter on the IP address
or domain). I put your track with public permission, but I can change it
if you prefer.

Note2 : If you want to display others tracks in Gbrowse later, is it
possible for you to ensure that the start and stop values ??of a defined
region does not exceed the length of the scaffold?
I have changed the roretzi.SNPs.region_100.stat file because I have
error to display it in Gbrowse.

Note3 : because your results depends of the size of window you choose
(100bp) there are sometimes overlap between SNPs "classified in intron"
with exons of the gene models.
If you want to reduce the size of the window (1bp?) the file size is not
a problem to display data in Gbrowse (it create binary file to reduce
the size).

> 3. donor and acceptor mutations in intron
> regions (roretzi.SNPs.intron.donor_acceptor)

> Besides, I only detected 411453 SNPs,  ~2.85 SNPs per 1k bp (411453 /
> 144236087)  (QUAL >= 20, depth >= 10, roretzi.SNPs.uniq)
> I think this frequency ( ~2.85 SNPs per 1k bp) is much smaller than
> you reported before, I am not sure the parameter of bowtie2 and
> samtools I used were too strict or not, so I pasted them at the bottom
> of this email. Do you have some suggestions or comments about them?
We give an approximation of the frequency of SNP, but I have not look in
details the real frequency. My comments below.

> In addition, for aurantium genome, I have not got the gene model, I am
> not sure you have predicated the gene or not.
indeed they were not predicted!

Best,

Christelle



> #===============================================================================
> bowtie2-build roretzi.fasta roretzi
> bowtie2 --phred33 -I 100 -X 12000 --fr -x roretzi -1 MP8000_1.fq -2
> MP8000_2.fq -S roretzi_MP8000.sam
the MP8000 is a Mate Pair library so you should to used --rf option and
not --fr (the first read is in the reverse side and the second in the
foward side)
you can try to be more stringent and take -I 1000 -X 9000
(the theorical insert size of MP8000 is 8000b, but if you plot the
distribution, you will see that the insert size is ~5000b)

> samtools view -bS roretzi_MP8000.sam > roretzi_MP8000.bam
Note : in the roretzi_MP8000.sam you have all aligments : if a read has
multiple position in the genome you will find several times this read in
the file with a match on different scaffolds. Maybe you should keep the
reads with only one match on the genome.

  评论这张
 
阅读(2059)| 评论(0)
推荐 转载

历史上的今天

在LOFTER的更多文章

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2016