MIRA-related notes (MIRA: An Automated Genome and EST Assembler; author: Bastien Chevreux; latest version: V2.9.46, 22/06/2009)
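MIRA is a multi-pass assembler, used mainly for genome or EST (expressed sequence tag) data. It can handle mixed data types such as Solexa, 454 and 3730, and it can be used to detect repeats and SNPs.
However, MIRA takes a fairly long time to run, especially when you are processing mate pair (paired-end) information.
MIRA refines the assembled contigs step by step over successive passes.
MIRA uses a strategy for handling SNPs similar to the one used by DNPTrapper.
MIRA gets confused if your input DNA sequences come from different individuals, because they will differ at many SNP sites. In that case you must tell MIRA that your sequences are genomic sequences from different individuals.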
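Some suggestions:
Check the assembly with AMOS.
Pre-screen the genome for n-mer repeats. This gives you an idea of what to expect from your assembly.
Pre-screen your reads for the repeats you expect to be present, and mask them.
BLAST your contigs: do the results make sense?
BLAST your singlets: are they what you expected to see (e.g. sequences that translate into missing elements of pathways)?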
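The n-mer pre-screening and masking suggested above can be sketched roughly as follows. This is only a minimal illustration under simplifying assumptions (forward strand only, exact matches, everything held in memory); it is not how MIRA or AMOS does it, and the function names `count_kmers`, `repeated_kmers`, `mask_repeats` and the `min_count` threshold are made up for this example.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in one sequence (forward strand only)."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def repeated_kmers(sequence, k, min_count=2):
    """Return (k-mer, count) pairs seen at least min_count times, most frequent first."""
    counts = count_kmers(sequence, k)
    return [(kmer, n) for kmer, n in counts.most_common() if n >= min_count]


def mask_repeats(sequence, k, min_count=2):
    """Replace every base covered by a repeated k-mer with 'N'."""
    seq = sequence.upper()
    counts = count_kmers(seq, k)
    masked = list(seq)
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k]] >= min_count:
            masked[i:i + k] = "N" * k
    return "".join(masked)


if __name__ == "__main__":
    toy = "ACGTACGTTTGCA"  # hypothetical toy sequence, not real data
    print(repeated_kmers(toy, 4))
    print(mask_repeats(toy, 4))
```

For real genomes a dedicated k-mer counter would be a more realistic choice; the sketch only shows the idea of finding over-represented n-mers before assembly and masking them in the reads.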
Assembly and analysis software
gsAssembler - aka Newbler, from Roche. Version 2.3 was just released before the workshop. GenePool users can download the software and manuals.
Velvet
Curtain
MIRA
Abyss
The R libraries ShortRead and Rolexa
Minimus2
Phrap
Celera assembler and CABOG
Bambus - plus a document on how to use Bambus with MIRA
Visualising assemblies
gsAssembler - aka Newbler, has a graphical interface.
Hawkeye
Tablet
Consed
Gap5
CLCBio
EagleView
Other software
simhtsd - Given a reference sequence, simhtsd creates a large set of short nucleotide reads, simulating the output from high throughput DNA sequencers such as the Illumina Genome Analyzer II.
MAQ simulate - another way to simulate short nucleotide reads. Part of the MAQ suite.
Make - a utility program, usually used for specifying the building of executable programs. Also suggested as a useful
scripting tool for running pipelines such as assembly pipelines.
Maker - a genome annotation pipeline that lets smaller eukaryotic and prokaryotic genome projects annotate their
genomes and create genome databases.
RAST annotation server
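I have also written a sequence simulation program of my own, simulateSeq (v0.1.9, not yet released). It simulates DNA sequences while taking into account variation (SNPs, indels, SVs), methylation, sequencing errors, the various NGS read formats, single-end or paired-end reads, BAC sequences, and added adapters, primers, vectors and so on.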
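To make the idea of read simulation concrete, here is a very small sketch of sampling single-end reads from a reference with uniform substitution errors. It is not the code of simulateSeq, simhtsd or MAQ simulate; the function name `simulate_reads`, its parameters and the uniform error model are assumptions chosen for illustration, and indels, paired ends and quality values are left out.

```python
import random


def simulate_reads(reference, read_length, n_reads, error_rate=0.01, seed=0):
    """Sample reads uniformly from the reference and add substitution errors.

    Returns a list of (start_position, read_sequence) tuples.
    """
    if read_length > len(reference):
        raise ValueError("read_length is longer than the reference")
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        read = list(reference[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # substitute with one of the other three bases
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(read)))
    return reads


if __name__ == "__main__":
    reference = "ACGT" * 50  # hypothetical toy reference
    for start, read in simulate_reads(reference, 36, 3):
        print(start, read)
```

Keeping the true start position next to each read makes it possible to check later whether an assembler or mapper placed the read correctly.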
Follow-up work:
Outcomes
Script everything. This helps cut down unnecessary work, and aids in reproducibility and accountability. On
Share scripts. A discussion was held about how and where a useful script repository could be set up. Tools such as trac and subversion were mentioned. This is probably a topic to be followed up on. In the meantime, the NERC Environmental Bioinformatics Centre has a webpage where utility scripts for handling new sequencing da
Use metrics. Mark Blaxter prepared an overview of the types of metrics discussed and trialled during the workshop. These are a place to start when assembling sequence da
Base level metrics
high length of contigs (and full length relative to expected genome size)
high N50, few contigs in N50
many contigs over 1 kb, over 10 kb
longest contig
All of these need to be balanced with quality of the assembly of course.
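As a rough sketch, the base level metrics listed above can be computed from a plain list of contig lengths. The definition used here is the common one: N50 is the length of the shortest contig in the smallest set of largest contigs that together cover at least half of the total assembly length. The function names `n50_stats` and `base_level_metrics` are assumptions for this example and do not come from the workshop material.

```python
def n50_stats(contig_lengths):
    """Return (N50, number of contigs needed to reach half the total length)."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running * 2 >= total:
            return length, count
    raise ValueError("no contigs given")


def base_level_metrics(contig_lengths):
    """Collect the simple length-based metrics for one assembly."""
    n50, contigs_in_n50 = n50_stats(contig_lengths)
    return {
        "total_length": sum(contig_lengths),
        "n50": n50,
        "contigs_in_n50": contigs_in_n50,
        "contigs_over_1kb": sum(1 for n in contig_lengths if n > 1000),
        "contigs_over_10kb": sum(1 for n in contig_lengths if n > 10000),
        "longest_contig": max(contig_lengths),
    }


if __name__ == "__main__":
    # hypothetical contig lengths in bases
    print(base_level_metrics([12000, 5000, 1500, 800, 200]))
```

Dividing `total_length` by the expected genome size gives the "full length relative to expected genome size" figure mentioned in the list.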
Comparative assembly metrics
evenness of coverage of contigs
same assembly achieved with several parameter sets
same assembly achieved with different programs/algorithms
same assembly achieved with different da
Read distribution metrics
low proportion of reads rejected as singletons (genome on
even coverage of reads over assembly (congruent with expected fold coverage)
correct spacing of paired reads in assembly
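The read distribution metrics above can be approximated with a few small helpers. This is a sketch under stated assumptions: per-base depths and observed insert sizes are taken as already extracted from the assembly, evenness is summarised with the coefficient of variation, and "correct spacing" means within a chosen tolerance of the expected insert size. The function names and the tolerance parameter are illustrative only.

```python
from statistics import mean, pstdev


def expected_fold_coverage(n_reads, read_length, genome_size):
    """Expected average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def coverage_cv(depths):
    """Coefficient of variation of per-base depth; lower means more even coverage."""
    average = mean(depths)
    if average == 0:
        return float("inf")
    return pstdev(depths) / average


def fraction_correctly_spaced(insert_sizes, expected, tolerance):
    """Fraction of read pairs whose insert size is within tolerance of expected."""
    ok = sum(1 for size in insert_sizes if abs(size - expected) <= tolerance)
    return ok / len(insert_sizes)


if __name__ == "__main__":
    # hypothetical numbers: one million 36 bp reads on a 3.6 Mb genome
    print(expected_fold_coverage(1_000_000, 36, 3_600_000))
    print(coverage_cv([9, 10, 11, 10]))
    print(fraction_correctly_spaced([190, 200, 210, 400], 200, 20))
```

Comparing the observed mean depth with `expected_fold_coverage` is one simple way to check whether coverage is congruent with what the sequencing run should have produced.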
Biological affirmation
synteny with related genome
breaks/contig ends map to likely repetitive elements (Tn)
congruent with transcriptome da
congruent with restriction (optical) map or genetic map
Source:
Summary report on the Next Generation Sequencing Assembly workshop and NextGenBug meeting held from
Nov 30 - Dec 2, 2009 in Edinburgh.
Next Generation Sequencing Assembly Workshop
eScience Institute, Edinburgh, November 30 - December 2, 2009
Full program and list of presenters. All of the presentations will be available on the eScience Institute website.
This workshop was funded by the Scottish Bioinformatics Forum, the National eScience Institute and the GenePool