Many alignment programs generate SAM/BAM natively or output a format that can be converted to SAM/BAM. Please check out this page for the complete list. If your preferred software is not in this list, you may contact the developers or write your own, and then please let us know.
An unaligned read must be flagged with 0x4. It may have no coordinate (i.e. a coordinate `*:0'), or it may have an ordinary coordinate with the CIGAR field set to `*'. In SAM, if one read in a read pair is aligned but the mate is not, we strongly recommend setting the coordinate of the unmapped read to that of the mapped one, so that in a position-sorted SAM/BAM file the unmapped read is adjacent to the mapped one. This convention greatly helps local assembly when we want to collect all related reads in a small region.
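As a minimal sketch of the 0x4 convention, the awk filter below picks out unmapped records from SAM text by testing that bit in the FLAG column. The function name `unmapped_reads' and the two sample records are made up for illustration; they are not part of samtools.

```shell
# Print name, FLAG and coordinate of unmapped reads (FLAG bit 0x4 set)
# from SAM text on stdin. POSIX awk has no bitwise AND, so the bit is
# tested with integer division and modulo.
unmapped_reads() {
  awk -F'\t' '!/^@/ && int($2 / 4) % 2 == 1 { print $1, $2, $3 ":" $4 }'
}

# Hypothetical pair: the first read is mapped (FLAG 73), its mate is
# unmapped (FLAG 133) but placed at the coordinate of the mapped read.
printf 'r1\t73\tchr1\t100\t37\t5M\t=\t100\t0\tACGTA\tIIIII\nr1\t133\tchr1\t100\t0\t*\t=\t100\t0\tTTGCA\tIIIII\n' | unmapped_reads
```

On these two records only the second one is reported, and it carries the same coordinate as its mapped mate.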
CIGAR operation `M' means `alignment match' (i.e. not a gap). It may be a `sequence match' or a `sequence mismatch'. Mismatching information is stored in the `MD' tag which is optional but can be generated with the `calmd' samtools command. We are proposing new CIGAR operations `=' for sequence match and `X' for sequence mismatch, but they are not well supported by samtools.
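To make the meaning of `M' concrete, here is a small sketch that adds up the lengths of all `M' operations in a CIGAR string. The total is the number of aligned columns, regardless of whether the bases match. The function name `cigar_m_length' is hypothetical.

```shell
# Sum the lengths of all M operations in a CIGAR string given as $1.
# The result counts aligned columns, matching or mismatching alike.
cigar_m_length() {
  printf '%s\n' "$1" | awk '{
    s = $0; total = 0
    # Consume one <length><operation> pair per iteration.
    while (match(s, /[0-9]+[MIDNSHP=X]/)) {
      n  = substr(s, RSTART, RLENGTH - 1) + 0
      op = substr(s, RSTART + RLENGTH - 1, 1)
      if (op == "M") total += n
      s = substr(s, RSTART + RLENGTH)
    }
    print total
  }'
}

cigar_m_length 5M1I3M2D4M
```

For `5M1I3M2D4M' this prints 12. Telling matches from mismatches within those 12 columns still requires the `MD' tag or the reference sequence.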
If your SAM file has header @SQ lines, you may get BAM by:
samtools view -bS aln.sam > aln.bam
If not, you need to have your reference file ref.fa and then do this:
samtools faidx ref.fa
samtools view -bt ref.fa.fai aln.sam > aln.bam
The second method also works if your SAM file has @SQ lines. After conversion, you would probably like to sort and index the alignment to enable fast random access:
samtools sort aln.bam aln-sorted
samtools index aln-sorted.bam
For a short answer, do this:
samtools pileup -vcf ref.fa aln.bam | tee raw.txt | samtools.pl varFilter -D100 > flt.txt
awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' flt.txt > final.txt
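The awk step applies two different quality thresholds: lines with `*' in column 3 must reach 50 in column 6, and all other lines must reach 20. The sketch below runs the same expression on invented rows. It assumes that `*' in column 3 marks an indel line and that column 6 holds the SNP or indel quality. The rows are simplified to eight columns, and the names `sample_pileup' and `quality_filter' are hypothetical.

```shell
# Made-up, simplified pileup rows: two SNP lines and two indel lines.
sample_pileup() {
  printf 'chr1\t100\tA\tG\t40\t40\t60\t12\n'
  printf 'chr1\t200\tC\tT\t10\t10\t60\t8\n'
  printf 'chr1\t300\t*\t*/+A\t60\t60\t60\t15\n'
  printf 'chr1\t400\t*\t*/-T\t30\t30\t60\t9\n'
}

# Same expression as the final awk step: column 6 must be at least 50
# when column 3 is a star, and at least 20 otherwise.
quality_filter() {
  awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)'
}

# Show the positions (column 2) of the rows that pass.
sample_pileup | quality_filter | cut -f2
```

On these rows, positions 100 and 300 pass, while the low-quality SNP at 200 and the low-quality indel at 400 are dropped.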
For a long answer, see this protocol. Please always remember to set the maximum depth (-D) in filtering.
Index your alignment with the `index' command and then run:
samtools view -u aln.bam chr10 | samtools pileup -vcf ref.fa - > chr10.raw.txt
Please read this page for more information.
You may get the FLAG as a string with:
samtools view -X aln.bam | less -S
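If you only have the numeric FLAG, a few lines of shell can decode it. The sketch below is not what `view -X' prints; it assumes the bit meanings defined in the SAM specification and uses descriptive labels of our own choosing. The function name `decode_flag' is hypothetical.

```shell
# Decode a numeric SAM FLAG into descriptive names, one bit at a time.
# The names are illustrative labels, not the letters printed by view -X.
decode_flag() {
  flag=$1
  out=
  for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
              16:reverse 32:mate_reverse 64:first 128:second \
              256:secondary 512:qc_fail 1024:duplicate; do
    bit=${pair%%:*}     # numeric bit value before the colon
    name=${pair#*:}     # label after the colon
    if [ $(( flag & bit )) -ne 0 ]; then
      out="${out:+$out,}$name"
    fi
  done
  printf '%s\n' "$out"
}

decode_flag 99
```

For FLAG 99 this prints `paired,proper_pair,mate_reverse,first'.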
For more information, please check out:
samtools view -?
This is explained in the manual page. Or briefly (when you invoke pileup with the -c option):
If pileup is invoked without `-c', indel lines and columns 3 to 7 inclusive will not be output.
A star at the sequence column represents a deletion. It is a place holder to make sure the number of bases at that column matches the read depth column. Simply ignore `*' if you do not use this information.
If you want to filter on mapping quality, flags, one read group or one library, you may just use the view command. If you want to apply more complex filters, you may write an awk or perl command for SAM. For example, to use only alignments with two or fewer differences (mismatches+gaps):
samtools view -h aln.bam | perl -ne 'print if (/^@/||(/NM:i:(\d+)/&&$1<=2))' | samtools pileup -S - > out.txt
or exclude all gapped alignments:
samtools view -h aln.bam | awk '$6!~/[ID]/' | samtools pileup -S -
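The edit-distance filter can also be written in awk. The sketch below assumes the `NM' tag sits somewhere in the optional fields (column 12 onwards) and, like the perl one-liner, drops records that have no `NM' tag. The names `nm_filter' and `sample_sam' and the sample records are made up for illustration.

```shell
# Keep header lines and alignments whose NM tag is at most $1.
nm_filter() {
  awk -F'\t' -v max="$1" '
    /^@/ { print; next }
    {
      for (i = 12; i <= NF; i++) {
        if ($i ~ /^NM:i:/) {
          if (substr($i, 6) + 0 <= max) print
          next
        }
      }
    }'
}

# Made-up SAM text: one header line and three alignments.
sample_sam() {
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'r1\t0\tchr1\t100\t37\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0\n'
  printf 'r2\t0\tchr1\t200\t37\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:3\n'
  printf 'r3\t0\tchr1\t300\t37\t5M\t*\t0\t0\tACGTA\tIIIII\tXT:A:U\tNM:i:2\n'
}

sample_sam | nm_filter 2 | cut -f1
```

With a threshold of 2, the header, r1 and r3 are kept and r2 is dropped. In a real pipeline the function would sit between `samtools view -h' and `samtools pileup -S -'.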
Yes. Try this:
samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq
Again, remember to set -D according to your read depth. Note that pileup2fq applies fewer filters than varFilter, so you may see small inconsistencies between the two outputs.
We prefer to say an alignment is `reliable' rather than `unique' as `uniqueness' is not well defined in general cases. You can get reliable alignments by setting a threshold on mapping quality:
samtools view -bq 1 aln.bam > aln-reliable.bam
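To get a feel for how the threshold affects the number of retained alignments, you can count records at several cutoffs before committing to one. The sketch below works on SAM text and assumes the mapping quality is in column 5. The names `count_mapq_ge' and `sample_mapq_sam' and the three records are made up.

```shell
# Count alignments on stdin whose mapping quality (column 5) is at least $1.
count_mapq_ge() {
  awk -F'\t' -v q="$1" '!/^@/ && $5 + 0 >= q { n++ } END { print n + 0 }'
}

# Made-up alignments with mapping qualities 0, 37 and 60.
sample_mapq_sam() {
  printf 'r1\t0\tchr1\t100\t0\t5M\t*\t0\t0\tACGTA\tIIIII\n'
  printf 'r2\t0\tchr1\t200\t37\t5M\t*\t0\t0\tACGTA\tIIIII\n'
  printf 'r3\t0\tchr1\t300\t60\t5M\t*\t0\t0\tACGTA\tIIIII\n'
}

sample_mapq_sam | count_mapq_ge 1
sample_mapq_sam | count_mapq_ge 40
```

Here a threshold of 1 keeps two alignments and a threshold of 40 keeps one.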
You may want to set a more stringent threshold to get more reliable alignments.
This is due to a floating-point underflow in the MAQ SNP calling model used by default and only happens in repetitive regions. These calls are always filtered out. However, if you are uncomfortable with this, you may use the simplified SOAPsnp model with:
samtools pileup -avcf ref.fa aln.bam > raw.txt
The MAQ model and SOAPsnp model usually deliver very similar SNP calls.
By default, SNPs are called with a Bayesian model identical to the one used in MAQ. A simplified SOAPsnp model is implemented, too. Indels are called with a simple Bayesian model. The caller does local realignment to recover indels that occur at the end of a read but appear to be contiguous mismatches. For an example, see this picture.
The varFilter filters SNPs/indels in the following order:
The first letter indicates the reason why SNPs/indels are filtered when you invoke varFilter with the `-p' option. A SNP/indel filtered by a rule higher in the list will not be tested against other rules.
We are sorry that this is due to bugs in the Windows port. The Windows version is mainly meant to be a cross-platform viewer, and most samtools functionality is not tested on it. For heavy use of samtools, please run it on a Linux machine instead.
To get the best performance in SNP calling, we recommend the following rules.
The plot below shows alignment accuracy for 108bp simulated reads under different configurations of BWA. If we only retain `unique' alignments, we get a single spot ungap-se-unqiue which corresponds to ~2300 wrong alignments out of 1.68 million mapped reads. If we look at the mapping quality generated by BWA and set a stringent threshold on that, it is possible to get an accuracy of 400/1.67M (the ungap-se line). That is, we get >80% fewer false alignments at the cost of a 1% loss in sensitivity. Setting a higher threshold further reduces false alignments and helps to reduce noise in identifying structural variations bridging unique regions. The plot may vary with the aligner in use, but it is generally true that an algorithm seeing more suboptimal alignments is more accurate.
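The two percentages quoted above follow directly from the counts. A quick check, using a hypothetical helper `percent_drop' that reports the relative decrease from a first value to a second one:

```shell
# Relative decrease from $1 to $2, in percent with one decimal place.
percent_drop() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

percent_drop 2300 400     # wrong alignments
percent_drop 1.68 1.67    # mapped reads, in millions
```

This prints 82.6 and 0.6: wrong alignments fall by about 83%, while the number of mapped reads falls by less than 1%.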