登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

云之南

风声,雨声,读书声,声声入耳;家事,国事,天下事,事事关心

 
 
 

日志

 
 
关于我

专业背景:计算机科学 研究方向与兴趣: JavaEE-Web软件开发, 生物信息学, 数据挖掘与机器学习, 智能信息系统 目前工作: 基因组, 转录组, NGS高通量数据分析, 生物数据挖掘, 植物系统发育和比较进化基因组学

SAMtools的使用常见问题  

2010-06-08 11:20:33|  分类: 生物信息学 |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |

http://blog.sina.com.cn/s/blog_4af3f0d20100hzl0.html


About the SAM/BAM Format

How can I get alignments in the SAM/BAM format?

Many alignment programs generate SAM/BAM natively or output a format that can be converted to SAM/BAM. Please check out this page for the complete list. If your preferred software is not in this list, you may contact the developers or write your own, and then please let us know.


How unaligned reads are stored in SAM?

An unaligned reads must be flagged with 0x4. It may have no coordinate (i.e. a coordinate `*:0'), but may have an ordinary coordinate with the CIGAR field set to `*'. In SAM, if one read in a read pair is aligned but the mate is not, we strongly recommend to set the coordinate of the unmapped read the same as that of the mapped one such that in a position sorted SAM/BAM file, the unmapped read is adjacent to the mapped. This convention greatly helps local assembly when we want to collect all related reads in a small region.


CIGAR is `50M', but I see mismatches in the alignment.

CIGAR operation `M' means `alignment match' (i.e. not a gap). It may be a `sequence match' or a `sequence mismatch'. Mismatching information is stored in the `MD' tag which is optional but can be generated with the `calmd' samtools command. We are proposing new CIGAR operations `=' for sequence match and `X' for sequence mismatch, but they are not well supported by samtools.


About SAMtools

How to convert SAM to BAM?

If your SAM file has header @SQ lines, you may get BAM by:

samtools view -bS aln.sam > aln.bam  

If not, you need to have your reference file ref.fa and then do this:

samtools faidx ref.fa samtools view -bt ref.fa.fai aln.sam > aln.bam  

The second method also works if your SAM file has @SQ lines. After conversion, you would probably like to sort and index the alignment to enable fast random access:

samtools sort aln.bam aln-sorted samtools index aln-sorted.bam  

I want to call SNPs and short indels.

For a short answer, do this:

samtools pileup -vcf ref.fa aln.bam | tee raw.txt | samtools.pl varFilter -D100 > flt.txt awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' flt.txt > final.txt  

For a long answer, see this protocol. Please always remember to set the maximum depth (-D) in filtering.


I want to call SNPs from one chromosome only.

Index your alignment with the `index' command and:

samtools view -u aln.bam chr10 | samtools pileup -vcf ref.fa - > chr10.raw.txt  

Please read this page for more information.


The integer FLAG field is not friendly to eyes.

You may get string FLAG by:

samtools view -X aln.bam | less -S  

For more information, please check out:

samtools view -?  

I do not understand the columns in the pileup output.

This is explained in the manual page. Or briefly (when you invoke pileup with the -c option):


reference sequence name
reference coordinate
reference base, or `*' for an indel line
genotype where heterozygotes are encoded in the IUB code: M=A/C, R=A/G, W=A/T, S=C/G, Y=C/T and K=G/T; indels are indicated by, for example, */+A, -A+A or +A/*.
Phred-scaled likelihood that the genotype is wrong, which is also called `consensus quality'.
Phred-scaled likelihood that the genotype is identical to the reference, which is also called `SNP quality'. Suppose the reference base is A and in alignment we see 17 G and 3 A. We will get a low consensus quality because it is difficult to distinguish an A/G heterozygote from a G/G homozygote. We will get a high SNP quality, though, because the evidence of a SNP is very strong.
root mean square (RMS) mapping quality
# reads covering the position
read bases at a SNP line (check the manual page for more information); the 1st indel allele otherwise
base quality at a SNP line; the 2nd indel allele otherwise
indel line only: # reads directly supporting the 1st indel allele
indel line only: # reads directly supporting the 2nd indel allele
indel line only: # reads supporting a third indel allele

If pileup is invoked without `-c', indel lines and columns between 3 and 7 inclusive will not be outputted.


I see `*' in the pileup sequence column. What are they?

A star at the sequence column represents a deletion. It is a place holder to make sure the number of bases at that column matches the read depth column. Simply ignore `*' if you do not use this information.


I only want to use a subset of alignments in pileup.

If you want to filter on mapping quality, flags, one read group or one library, you may just use the view command. If want to apply more complex filters, you may write an awk command for SAM. For example, I only want to use alignment with two or fewer differences (mismatches+gaps):

samtools view -h aln.bam | perl -ne 'print if (/^@/||(/NM:i:(\d+)/&&$1<=2))' | samtools pileup -S - > out.txt  

or exclude all gapped alignments:

samtools view -h aln.bam | awk '$6!~/[ID]/' | samtools pileup -S -  

Does samtools generate the consensus sequence like Maq?

Yes. Try this:

samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq  

Again, remember to set -D according to your read depth. Note that pileup2fq applies fewer filters in comparison to varFilter, and you may see tiny inconsistency between the two outputs.


I want to get `unique' alignments from SAM/BAM.

We prefer to say an alignment is `reliable' rather than `unique' as `uniqueness' is not well defined in general cases. You can get reliable alignments by setting a threshold on mapping quality:

samtools view -bq 1 aln.bam > aln-reliable.bam  

You may want to set a more stringent threshold to get more reliable alignments.


In repetitive regions, SAMtools call all bases as 'A' although there are no 'A' bases in reads.

This is due to a floating underflow in the MAQ SNP calling model used by default and only happens in repetitive regions. These calls are always filtered out. However, if you are uncomfortable with this, you may use the simplified SOAPsnp model with:

samtools -avcf ref.fa aln.bam > raw.txt  

The MAQ model and SOAPsnp model usually deliver very similar SNP calls.


How are SNPs and indels called and filtered by SAMtools?

By default, SNPs are called with a Bayesian model identical to the one used in MAQ. A simplified SOAPsnp model is implemented, too. Indels are called with a simple Bayesian model. The caller does local realignment to recover indels that occur at the end of a read but appear to be contiguous mismatches. For an example, see this picture.

The varFilter filters SNPs/indels in the following order:


d: low depth
D: high depth
W: too many SNPs in a window (SNP only)
G: close to a high-quality indel (SNP only)
Q: low root-mean-square (RMS) mapping quality (SNP only)
g: close to another indel with more evidence (indel only)

The first letter indicates the reason why SNPs/indels are filtered when you invoke varFilter with the `-p' option. A SNP/indel filtered by a rule higher in the list will not be tested against other rules.


The Windows version of SAMtools does not work sometimes.

We are sorry that this is due to bugs in the Windows port. The Windows version is mainly meant to be a cross-platform viewer. Most of samtools functionality are not tested. For heavy use of samtools, please run it on Linux machines instead.


For Developers

How to make my aligner work best with samtools?

To get the best performance in SNP calling, we recommend the following rules.


Try to generate all the mandatory fields, in particular the mate coordinates and insert size. Samtools' rmdup relies on the ISIZE field and Picard MarkDuplicates requires the mate coordinates. One may use the fixmate command afterward, but that is very inefficient.
Choose a random position for a repetitive read. If an aligner discards repetitive reads, the read depth will be inaccurate, which may cause problems in filtering SNPs.
Write mapping quality. It is recommended to compute mapping quality as samtools SNP caller may take advantage of this information. However, computing mapping quality requires an aligner to look into suboptimal hits and thus slows down alignment. If your aligner cannot do this, write 0 for repetitive reads and 60 for `unique' reads.

Why mapping quality?

The plot below shows alignment accuracy for 108bp simulated reads under different configurations of BWA. If we only retain `unique' alignment, we get a single spot ungap-se-unqiue which corresponds to ~2300 wrong alignments out of 1.68 million mapped reads. If we look at the mapping quality generated by BWA and set a stringent threshold on that, it is possible to get an accuracy of 400/1.67M (the ungap-se line). That is saying we get >80% fewer false alignments at the cost of 1% loss in sensitivity. Setting a higher threshold further reduces false alignments and helps to reduce noises in identifying structural variations bridging unique regions. The plot may vary with the aligner in use, but it is generally true that an algorithm seeing more suboptimal alignments is more accurate. File:108.png

来源:http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ#Why_mapping_quality.3F

  评论这张
 
阅读(6053)| 评论(1)

历史上的今天

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2018