DNA Sequence Assembly and Analysis – It’s a Whole New World!
By DNASTAR | www.dnastar.com
Next-generation DNA sequencing instruments, introduced to the market in the past few years, are beginning to dominate the landscape of DNA sequencing projects, and rightfully so. The new technologies currently on the market have advanced the state of the art in DNA sequencing by 10,000-fold or more over traditional Sanger sequencing methods. The throughput is orders of magnitude greater, the cost per base is dramatically reduced, and the volume of data available for assembly and analysis is far larger than it has ever been. Scientists today have three choices for handling this abundance of data:
- Develop their own software;
- Use freeware; or
- Use commercial software.
The first option, developing their own software, is generally open only to larger, well-funded institutions that have the resources to devote to software development. While attractive in some instances, such software tends to have limitations, since the developers are producing it for a small audience and do not have to be concerned with the demands of a large, diverse group of researchers.
The second option at first sounds attractive: get something helpful without having to pay for it. In reality, much of the freeware presently available for next-generation sequencing applications is very narrowly focused. Researchers who want to perform a variety of analyses seldom find that one or even two freeware packages address all of their needs. Because the software is not written with the intention of addressing broad market requirements, it is frequently difficult to use and, unfortunately, technical support for such products is rarely available.
What is a scientist to do then?
The answer for the majority of scientists lies in the third option: using updated tools from a trusted provider of DNA sequence assembly and analysis software, such as DNASTAR, which has been providing such software to life scientists for the past 25 years. DNASTAR is uniquely positioned to provide molecular biologists with the software tools they need to work with the emerging technologies because it is the only company that independently develops and sells software for use on multiple next-generation sequencing platforms and in support of microarray gene expression analysis.
Following is a description of the DNASTAR tools currently available to meet the wide variety of life scientists’ needs today and a brief overview of what is coming soon. The discussion will be divided into four distinct parts:
- Whole genome sequence assembly and analysis
- Mutation detection and analysis
- RNA-Seq and related sequence depth quantification analyses
- Other applications
Whole Genome Sequence Assembly and Analysis
DNASTAR was founded on work from the lab of Dr. Frederick Blattner at the University of Wisconsin, who in 1997 led the first team to complete the sequencing of the E. coli genome, one of the largest genomes completed to that date. Who better to provide tools to life scientists than the company that supported Dr. Blattner in the accomplishment of his work in the 1990s?
With the help of a National Institutes of Health grant, DNASTAR developed a tool for whole genome assembly that addresses the most complicated issues involved with these assemblies (repeated regions, insertions, deletions, etc.). This tool, now called SeqMan NGen, has been developed and fine-tuned to work well with Illumina®, Roche 454® and Sanger sequence data. It uses paired-end reads (when they are available) to assist in the assembly process and, with user tuning of a wide variety of available parameters, it has helped many researchers achieve outstanding results in assembling a whole genome. It is well suited for both de novo and templated assembly projects.
SeqMan NGen runs on a 64-bit Windows or Macintosh computer and is a trusted tool for working with data generated in several lanes of Illumina’s next-generation sequencing instrument, a full Roche 454 next-generation sequencing run and/or a comparable amount of Sanger data. Paired-end reads are critical to completing a whole genome sequence assembly project, and SeqMan NGen handles them well. In addition, when addressing the final details of a whole genome assembly, it can often be helpful to use sequences generated by multiple platform technologies (e.g. Illumina and Roche 454, or Roche 454 plus Sanger reads). SeqMan NGen handles multiple sequence types sequentially, in parallel, or in combination: you choose your preferred approach and SeqMan NGen can handle it.
SeqMan NGen is strictly an assembly tool. Its companion product, the SeqMan® module within the Lasergene® software suite, helps scientists put the finishing touches on their sequence assembly projects. Lasergene, including SeqMan, has been the workhorse for thousands of molecular biology laboratories globally for many years. As sequencing technology has advanced, SeqMan and the rest of Lasergene have also advanced to the point where more than 10,000 authors have cited the company’s sequence analysis software when publishing their work.
SeqMan provides visualization of sequences and contigs for deep analysis, allowing the scientist to complete any work that could not be completed in the initial assembly. To assist with the final assembly work, annotations can be brought into SeqMan with sequence reads from SeqMan NGen, created directly in SeqMan, or collected through a BLAST search in SeqMan.
At DNASTAR, where E. coli-sized projects have been the company’s bread and butter for many years, a combination of SeqMan NGen and SeqMan was recently used to assemble sequences from several previously unpublished E. coli strains against annotated template sequences. The entire process took just a few days from receiving the data to completing the finished genomes.
Thanks to the effective use of paired-end read data, de novo assemblies can also be completed using SeqMan NGen and Lasergene. These normally take additional time to complete, but the DNASTAR toolset supports these projects very well.
DNASTAR’s whole genome assembly and analysis technology is currently limited by the constraints of desktop computers. While much of the assembly and analysis for smaller projects can be performed on standard laptop computers, larger assembly projects, where gigabytes of data must be processed, require newer 64-bit computers with larger amounts of RAM to work efficiently and generate results quickly. Work continues on improvements to existing algorithms and on new approaches to increase the size of genomes that DNASTAR’s tools can handle effectively. With the company’s goal of releasing upgraded versions of each of its software products on a virtually continuous basis, DNASTAR’s tools will be able to assemble human genome data on desktop computers in the not too distant future.
Mutation Detection and Analysis
Much of the research being performed today relates to identifying mutations when comparing different samples. This includes research in support of identifying causes of various diseases (e.g. detection of Single Nucleotide Polymorphisms (SNPs)) as well as more general work in support of identifying mutations in one sample versus another or when compared to a reference sample. Next generation DNA sequencing has dramatically reduced the cost of this type of work. In some instances, by focusing on a small set of genes, researchers can gain deep coverage of many experiments for a few thousand dollars or less.
Mutation detection research hits a sweet spot for DNASTAR, given the tools developed in support of this area of research in the past 25 years, including new capabilities added to deal with the unique challenges posed by the shorter read lengths and vastly increased depth of coverage associated with next generation DNA sequencing technologies.
SeqMan NGen is once again the starting point for mutation detection work, performing the initial assembly of target sequences against a reference or template. A special parameter has been added to SeqMan NGen to allow the user to run one or more “SNP passes” after the initial assembly is completed. These passes attempt to assemble the sequence reads against themselves once the initial templated assembly has done the best job it can. Work with next-generation samples has shown that SNP passes help fill in gaps left by the initial assembly, so that the end result conforms more closely to expected mutation results with as many gaps as possible filled.
After SeqMan NGen’s work is completed, the assembly project is once again passed to SeqMan for in-depth review and analysis. As mentioned previously, annotations can be included in the project in a variety of ways, including the creation of new annotations to allow the scientist to mark areas of interest in any manner desired. In addition, DNASTAR has added SNP filtering capabilities to SeqMan, whereby the user can set a minimum and/or maximum level of SNPs as a percent of total coverage at a particular base pair location. This capability allows for homozygous as well as heterozygous SNPs to be displayed, depending on percentages selected (e.g. selecting SNP percentages between 30% and 70% of total reads would allow for easier detection of heterozygous SNPs).
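The percentage-based filter described above can be illustrated with a minimal sketch. This is not SeqMan's implementation; the `SnpCall` record, the `filter_snps` function, and the 90% cut-off used for homozygous calls are all hypothetical names and values chosen for illustration. Only the 30% to 70% heterozygous window comes from the text.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    """Hypothetical record for one candidate SNP (not a SeqMan data type)."""
    position: int     # 1-based position on the reference
    ref_base: str     # base call in the reference
    alt_base: str     # variant base seen in the reads
    alt_reads: int    # reads supporting the variant base
    total_reads: int  # total depth of coverage at this position

    @property
    def percent(self) -> float:
        """Variant reads as a percentage of total coverage."""
        if self.total_reads == 0:
            return 0.0
        return 100.0 * self.alt_reads / self.total_reads


def filter_snps(calls, min_percent=0.0, max_percent=100.0):
    """Keep calls whose variant percentage lies in [min_percent, max_percent]."""
    return [c for c in calls if min_percent <= c.percent <= max_percent]


calls = [
    SnpCall(101, "A", "G", 48, 100),  # 48% of reads carry the variant
    SnpCall(250, "C", "T", 97, 100),  # 97% of reads carry the variant
    SnpCall(399, "G", "A", 3, 100),   # 3% of reads carry the variant
]

# The 30-70% window from the text favors heterozygous SNPs.
heterozygous = filter_snps(calls, min_percent=30, max_percent=70)
# Assumed cut-off: treat 90% and above as homozygous.
homozygous = filter_snps(calls, min_percent=90)

print([c.position for c in heterozygous])
print([c.position for c in homozygous])
```

Setting both a minimum and a maximum keeps low-frequency calls, which are more likely to be sequencing errors, out of the heterozygous list while also excluding positions where nearly every read carries the variant.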
Details regarding specific SNPs are provided, including comparison of the reference base call to the read base call. From this information, SeqMan identifies codon changes and, as shown in the display below, allows for the insertion and naming of a feature specific to that location, if desired.
SeqMan has multiple views within the program, including:
- SNP statistics. This view includes the SNP summary report (shown above) or a more detailed table listing characteristics SNP by SNP for in-depth analysis.
- Alignment view, showing the detail of the individual reads compared to the reference sequence, highlighting individual SNPs and distinctly marking putative SNPs, confirmed SNPs and rejected SNPs.
- Strategy view, showing the “20,000 foot” view of the sequence and related reads. This view, in which the user can zoom in or zoom out at will, also includes a histogram displaying depth of coverage along the reference sequence.
- Quality scores for individual base calls are available for all sequence types, including a flow-gram visualization for Roche 454 data (shown below, including quality scores, which are optional).
Many scientists keep multiple windows open for the various visualizations and scroll through each view using the filtered SNP report as the control to easily view all aspects of key mutations at one time. Whatever way it is done, the tools available in SeqMan NGen and Lasergene provide powerful assistance to the researcher interested in mutation detection.
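The depth-of-coverage histogram mentioned for the strategy view rests on a simple calculation: how many reads cover each position of the reference. The following is a minimal sketch of that calculation, assuming each aligned read is given as a 0-based, end-exclusive (start, end) interval. The function names are hypothetical and are not part of SeqMan.

```python
def coverage_depth(reference_length, alignments):
    """Return the number of reads covering each reference position.

    alignments is an iterable of (start, end) pairs, 0-based, end exclusive.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (reference_length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, reference_length)
        if start < end:
            delta[start] += 1
            delta[end] -= 1

    # A running sum of the difference array gives the depth at each position.
    depth = []
    running = 0
    for change in delta[:reference_length]:
        running += change
        depth.append(running)
    return depth


def binned_mean_depth(depth, bin_size):
    """Average depth over consecutive windows, suitable for a coarse histogram."""
    means = []
    for i in range(0, len(depth), bin_size):
        window = depth[i:i + bin_size]
        means.append(sum(window) / len(window))
    return means


reads = [(0, 5), (2, 8), (4, 10), (4, 6)]
depth = coverage_depth(10, reads)
print(depth)
print(binned_mean_depth(depth, 5))
```

Binning the per-base depth into windows is what makes a zoomed-out view practical: wider bins give the overall picture, narrower bins show local detail.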
RNA-Seq and Related Applications
With the advent of low-cost next-generation DNA sequencing technologies, many scientists are beginning to use DNA sequence data for applications for which they previously used microarrays. The precision and accuracy obtainable from DNA sequence information make it a preferred approach to looking at mRNA or DNA gene expression, provided the cost of using these technologies is affordable, which it has now become.
In an RNA-Seq experiment, the scientist uses the depth of coverage for sequencing reads as a measure of the level of expression, similar to the use of signal intensity in microarray experiments. The primary tools required to analyze RNA-Seq experiments are:
- Tools to read large volumes of sequence reads.
- Tools to analyze depth of coverage, convert it to expression values by gene or other area of interest, and visualize multiple experiments compared to each other.
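As a rough illustration of the second requirement, the sketch below counts reads per gene and converts the counts to expression values. It assumes RPKM (reads per kilobase of gene per million mapped reads), one widely used normalization; this article does not state which measure DNASTAR's software uses. The gene coordinates, read positions, and function names are hypothetical.

```python
def count_reads_per_gene(genes, read_starts):
    """Count reads whose start position falls inside each gene.

    genes maps a gene name to a (start, end) pair, 0-based, end exclusive.
    """
    counts = {name: 0 for name in genes}
    for pos in read_starts:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts


def rpkm(counts, genes, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    values = {}
    for name, n in counts.items():
        start, end = genes[name]
        length_bp = end - start
        # n / (length_bp / 1e3) / (total / 1e6), rearranged into one division
        values[name] = n * 1_000_000_000 / (length_bp * total_mapped_reads)
    return values


# Hypothetical annotation: geneA is 2,000 bp long, geneB is 1,000 bp long.
genes = {"geneA": (0, 2000), "geneB": (5000, 6000)}

# Hypothetical read start positions: 40 in geneA, 40 in geneB, 20 between genes.
read_starts = (
    list(range(0, 2000, 50))
    + list(range(5000, 6000, 25))
    + list(range(3000, 4000, 50))
)

counts = count_reads_per_gene(genes, read_starts)
expression = rpkm(counts, genes, total_mapped_reads=len(read_starts))
print(counts)
print(expression)
```

Both genes collect the same number of reads, but geneB is half as long, so its normalized expression value is twice that of geneA. Dividing by the total number of mapped reads is what lets experiments with different sequencing depths be compared with each other.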
DNASTAR’s background with DNA sequence assembly and analysis, including next generation sequencing technologies, has positioned the company exceptionally well to support RNA-Seq and related applications. In addition, beginning in 2007, DNASTAR ventured into the microarray gene expression world with a new tool, ArrayStar, that provides microarray analysis and visualization capabilities.
By combining its sequence assembly expertise with its microarray gene expression tools, DNASTAR has developed an RNA-Seq module for ArrayStar. This new module reads in sequence data and summarizes it as expression levels. The gene expression capabilities that already existed in ArrayStar (scatter plots, heat maps, clustering algorithms, etc.) can then be used to compare multiple RNA-Seq experiments, an RNA-Seq experiment to a related microarray experiment, or other combinations, as desired by the scientist.
The RNA-Seq capability, incorporated as a module of the ArrayStar gene expression product, is only a first step in combining DNA sequence and microarray capabilities in DNASTAR’s software. As the DNA sequencing and microarray worlds converge, software tools that give molecular biologists the analysis capabilities needed for more in-depth genomic studies will become even more important.
Other Applications
As the world of next generation DNA sequencing continues to grow and expand, many other new applications of this technology are evolving. One of the most prominent of these new applications is ChIP-Seq, closely related to ChIP-chip experiments that have historically been performed. Both approaches refer to a procedure used to determine whether a given protein binds to a specific DNA sequence, one approach using DNA sequence data and the other approach relying on microarray chips. Both approaches will likely continue to have applicability in the future, with the microarray approach providing approximate information at low cost, while the sequence based approach can provide a level of specific information that will be worth the additional cost. Also, many labs have voluminous amounts of microarray data already available. To avoid reproducing this data using DNA sequencing techniques, users will want software tools that can handle both sequence and microarray data together in appropriate visualization and analysis tools.
As these and other applications emerge, DNASTAR is committed to providing tools to support these new applications and combinations of new applications with historical approaches in a comprehensive, easy to use set of software tools, just as the company has done for more than 25 years. The next generation DNA sequencing technologies truly have opened up a whole new world of scientific knowledge and possibilities. DNASTAR is committed to maintaining its leadership position in this new world by providing timely and appropriate analytical solutions for life scientists involved in using these new technologies.
For more information on any or all of the products or applications discussed here, please visit www.dnastar.com.