Cytoscape is a bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data. Additional features are available as plugins. Plugins are available for network and molecular profiling analyses, new layouts, additional file format support and connection with databases. Plugins may be developed using the Cytoscape open Java software architecture by anyone and plugin community development is encouraged. Cytoscape supports the following features:
Input and construct molecular interaction networks from raw interaction files (SIF format) containing lists of protein-protein and/or protein-DNA interaction pairs. For yeast and other model organisms, large sources of pairwise interactions are available through the BIND and TRANSFAC databases. User-defined interaction types are also supported.
Load and save previously-constructed interaction networks in GML format (Graph Modelling Language).
Input mRNA expression profiles from tab- or space-delimited text files.
Load and save arbitrary attributes on nodes and edges. For example, input a set of custom annotation terms for your proteins, or create a set of confidence values for your protein-protein interactions.
Import gene functional annotations from the Gene Ontology (GO) and KEGG databases.
Customize network data display using powerful visual styles.
View a superposition of gene expression ratios and p-values on the network. Expression data can be mapped to node color, label, border thickness, or border color, etc. according to user-configurable colors and visualization schemes.
Layout networks in two dimensions. A variety of layout algorithms are available, including cyclic and spring-embedded layouts.
Zoom in/out and pan for browsing the network.
Use the network manager to easily organize multiple networks.
Use the bird’s eye view to easily navigate large networks.
Plugins available for network and molecular profile analysis. For example:
Filter the network to select subsets of nodes and/or interactions based on the current data. For instance, users may select nodes involved in a threshold number of interactions, nodes that share a particular GO annotation, or nodes whose gene expression levels change significantly in one or more conditions according to p-values loaded with the gene expression data.
Find active subnetworks / pathway modules. The network is screened against gene expression data to identify connected sets of interactions, i.e. interaction subnetworks, whose genes show particularly high levels of differential expression. The interactions contained in each subnetwork provide hypotheses for the regulatory and signaling interactions in control of the observed expression changes.
Find clusters (highly interconnected regions) in any network loaded into Cytoscape. Depending on the type of network, clusters may mean different things. For instance, clusters in a protein-protein interaction network have been shown to be protein complexes and parts of pathways. Clusters in a protein similarity network represent protein families.
More plugins available on the plugins page.
Cytoscape was initially made public in July 2002 (v0.8); the second release (v0.9) was in November 2002, and v1.0 was released in March 2003. Version 1.1.1 is the last stable release for the 1.0 series. Version 1.1.1 has some of the features listed above and some additional features, such as the ability to add and remove nodes and undo actions.
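The SIF interaction files and the degree-based filter mentioned in the feature list above can be illustrated with a small sketch. This is not Cytoscape code; it assumes the common SIF layout of one source node, one interaction type, and one or more target nodes per line, with tabs taking precedence over spaces when a line contains any tab. The gene names and the helper names `parse_sif` and `nodes_with_min_degree` are hypothetical.

```python
from collections import defaultdict


def parse_sif(text):
    """Return a list of (source, interaction_type, target) edges from SIF text."""
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: a line containing a tab is tab-delimited, otherwise
        # it is split on any whitespace.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) < 3:
            continue  # a lone node without interactions contributes no edge
        source, interaction = fields[0], fields[1]
        for target in fields[2:]:
            edges.append((source, interaction, target))
    return edges


def nodes_with_min_degree(edges, threshold):
    """Select nodes that take part in at least `threshold` interactions."""
    degree = defaultdict(int)
    for source, _, target in edges:
        degree[source] += 1
        degree[target] += 1
    return sorted(node for node, count in degree.items() if count >= threshold)


# Hypothetical example: "pp" = protein-protein, "pd" = protein-DNA.
example = "geneA pp geneB\ngeneA pd geneC geneD\ngeneE\n"
edges = parse_sif(example)
print(edges)
print(nodes_with_min_degree(edges, 2))  # ['geneA']
```

The same edge list could be filtered on other loaded data (for example GO terms or p-values) by swapping the degree test for a lookup in a node attribute table.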
GenMAPP is a free computer application designed to visualize gene expression and other genomic data on maps representing biological pathways and groupings of genes. Integrated with GenMAPP are programs to perform a global analysis of gene expression or genomic data in the context of hundreds of pathway MAPPs and thousands of Gene Ontology Terms (MAPPFinder), import lists of genes/proteins to build new MAPPs (MAPPBuilder), and export archives of MAPPs and expression/genomic data to the web.
The main features underlying GenMAPP are:
Draw pathways with easy to use graphics tools
Color genes on MAPP files based on user-imported genomic data
Query genomic data against MAPPs and the GeneOntology with MAPPFinder
View the full specifications and features available in GenMAPP v.2.0 (GenMAPP Online Help). Download the full GenMAPP v2.0 software package and additional software accessories. Access GenMAPP related publications. Download interactive demos and program tutorials that illustrate how GenMAPP allows users to visualize data. "GenMAPP Concept" is a conceptual overview of GenMAPP. "How GenMAPP Works" is a brief introduction to the functions of GenMAPP, while the GenMAPP tutorials provide step-by-step instructions for how to create new MAPPs, import genomic data and various other functions for GenMAPP.
GenePath (Jun 22 2004)
GenePath is a web-enabled intelligent assistant for the analysis of genetic data and for discovery of genetic networks.
GenePath uses abductive inference to elucidate network constraints and logic to derive consistent networks. Typically, it starts with a set of genetic experiments, uses a set of embedded rules (patterns) to infer relations between genes and outcome, and based on these relations constructs a genetic network. Below, for instance, is an example of mutant data (left) and a corresponding genetic network constructed by GenePath (right):
Notice that this is a new, second version of GenePath. It is developed with Microsoft ASP.NET technology, has a completely rebuilt interface, and is much faster than the previous version. Compared to the previous version which we describe in the paper published in Bioinformatics, it includes a number of new features:
expression data analysis
handling confidences assigned to genetic data
derivation of genetic networks with confidences computed for relations in the network
dealing with conflicts
treatment of cycles
enhanced what-if analysis
Read the GenePath guide to learn about the GenePath web interface and how to use GenePath to enter, manage and analyze your data and discover genetic networks.
A. SOM Clustering and Gene-Ontology Analysis of Microarray Data
GSCope2 calculates 2D Self-Organizing-Map clustering of DNA-microarray data, and assigns statistically significant functional definitions to each cluster. The definitions include various sorts of functional classifications, e.g. Gene Ontology, Metabolic Pathways, or any user-defined classes using FBML (Functional Bio-network Markup Language). (A tutorial is available for more details.)
B. Integrated viewer for biomolecular network graphs
We present new visualization software for understanding multi-linked complex biomolecular network graphs by combining the advantages of a fisheye conversion function and clipping of connected structures. In the analysis of biomolecular network graphs, it helps us to understand microscopic relationships between molecules, as well as to grasp an overview of a large whole graph. Biomolecular networks extracted from various sorts of biological data tend to be highly complicated and incomprehensible, because the connections found among the molecules are often far from simple and are intercrossed with many other connection lines.
C. Viewing and Analyzing expression data on biological pathways
DNA microarrays are widely used to measure the expression levels of thousands of genes simultaneously. GSCope is designed for viewing and analyzing gene expression data in the context of biological pathways. GSCope has a function that filters thousands of gene expression data to extract statistically similar expression profiles. According to requests from the researchers in GSC, this filter is one of the most necessary functions for an integrated database of microarrays and is useful for biological network finding by expert biologists. GSCope will also be helpful for the analysis of protein expression levels.
D. Safe analysis of DNA microarray data through the Internet
It is very important that the users' experiment data should be kept secret while using the databases. GSCope safely achieves this by integrating the pathway database and the users' expression data on client computers (integration is achieved not at the database server, but on the users' computers), whereas several public pathway resources do such integration at the database servers and force users to submit their precious experiment data to the server.
INCLUSive is a suite of web-based tools aimed at the automatic multistep analysis of microarray data. The goal is to provide an integrated platform where several sources of information can be linked together to facilitate the analysis of microarray data. Currently, preprocessing of microarray data, adaptive quality-based clustering, information retrieval of genes in clusters, retrieval of upstream sequences and motif finding algorithms are accessible from this website or from the results pages of the different analysis steps.
To facilitate integration we also provide a set of SOAP web services for the different tools in INCLUSive. Check them out at the web portal.
Name: Description
Maran: Microarray Analysis using ANOVA.
AQBC: Adaptive quality-based clustering of gene expression profiles.
Intergenic Select: Select intergenic regions based on the accession number and the name of the gene of interest.
MotifSampler: Detection of over-represented motifs in the upstream region of coregulated genes.
MotifScanner: Detection of pre-defined motif models in DNA sequences.
ModuleSearcher: ModuleSearcher localizes modules of cooperating transcription factors.
Go4G: Find statistically significant GO terms in a set of genes.
Mailing List
If you would like to be informed of updates and new versions of our programs, you should join the INCLUSive mailing list.
Gibbs BiClustering is a web service that searches for specific patterns of subsets of genes and conditions within a microarray data matrix.
TOUCAN is a workbench for regulatory sequence analysis on metazoan genomes, especially for detecting significant transcription factor binding sites. It is a platform independent, standalone Java application and was built using the BioJava package.
Mailing List
If you would like to be informed about TOUCAN, you should join the TOUCAN mailing list.
Endeavour is a software application for the computational prioritisation of test genes, based on a set of training genes. The ranking of a test gene is based on its similarity with the training genes, using many different information sources.
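The idea of ranking test genes by their similarity to training genes across several information sources can be sketched as follows. The fusion method Endeavour actually uses is not described here; this sketch swaps in a plain mean-rank combination as a stand-in, and it assumes every test gene has a precomputed similarity score in every source. The source names, gene names and the function `rank_genes` are hypothetical.

```python
def rank_genes(scores_by_source):
    """Order genes by their mean rank over all information sources.

    scores_by_source maps a source name to {gene: similarity}, where a
    higher similarity means the gene resembles the training genes more.
    Assumption: every gene appears in every source.
    """
    rank_sums = {}
    for scores in scores_by_source.values():
        # Rank 1 is the most similar gene; ties are broken by gene name.
        ordered = sorted(scores, key=lambda gene: (-scores[gene], gene))
        for rank, gene in enumerate(ordered, start=1):
            rank_sums[gene] = rank_sums.get(gene, 0) + rank
    n_sources = len(scores_by_source)
    return sorted(rank_sums, key=lambda gene: (rank_sums[gene] / n_sources, gene))


# Hypothetical similarity scores for three test genes in two sources.
scores = {
    "expression": {"g1": 0.9, "g2": 0.2, "g3": 0.5},
    "literature": {"g1": 0.4, "g2": 0.1, "g3": 0.8},
}
print(rank_genes(scores))  # ['g1', 'g3', 'g2']
```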
TXTGate is a web-service that combines literature indices of selected public biological resources in a flexible text-mining system designed towards the analysis of groups of genes. By means of tailored vocabularies, selected textual fields and MEDLINE abstracts of LocusLink and SGD are indexed. Subclustering and links to external resources allow for an in-depth analysis of the resulting term profiles.
The accompanying article "TXTGate: profiling gene groups with text-based information", was published in Genome Biology.
GeneTimer fetches temporal profiles for HUGO Gene Names from PUBMED abstracts. When you enter a gene name, a graph will pop up and show how this gene has emerged in biomedical literature throughout time. The application looks as far back as 1970, and includes PUBMED abstracts up until mid 2004.
M@CBETH is an online benchmarking tool to build an optimal two-class classifier with your microarray data.
Here are the links to the pages where you can download software or specific data sets.
INCLUSive Motif finding tools: Download stand alone executable and auxiliary programs for our suite of motif finding tools.
Background Models: Select and retrieve pre-compiled background models.
Motif Models: Matrix files of Transfac, SCPD, PlantCARE and Jaspar that can be used with TOUCAN, MotifScanner, MotifLocator, ...
Expected Frequency: Files with expected frequencies that can be used with TOUCAN.
ModuleSearcher: jar file to run ModuleSearcher in a standalone fashion.
RMAGEML: RMAGEML provides a link between cDNA microarray data stored in MAGE-ML format and the Bioconductor framework for preprocessing, visualization and analysis of microarray experiments.
biomaRt: biomaRt is a new Bioconductor package that integrates BioMart data resources such as Ensembl, with data analysis in Bioconductor.
Dragon Promoter Finder
DRAGON Promoter Finder is part of the portal DRAGON Genome Explorer of the Knowledge Extraction Lab (KEL), Singapore. It searches for and locates promoters in anonymous genomic size DNA sequences. The program attempts to recognize the exact location of the transcription start site (TSS), i.e. the +1 position relative to the TSS. The analysis is strand specific.
The first output is a list of potential TSS. The program also includes very nice follow-up analyses, like BLAST against the EPD, or prediction of TF sites.
NNPP - Neural Network Promoter Prediction
NNPP, provided in the context of the Berkeley Drosophila Genome Project, is a widely used method that finds eukaryotic and prokaryotic promoters in a DNA sequence.
Note that test runs showed that NNPP is significantly less stringent than other promoter prediction programs, which results in a higher number of potential promoter sequence regions.
Promoter 2.0 Prediction Server
Promoter2.0 predicts transcription start sites of vertebrate PolII promoters in DNA sequences. It has been developed as an evolution of simulated transcription factors that interact with sequences in promoter regions. It builds on principles that are common to neural networks and genetic algorithms.
1. Specify the input sequences
The sequences intended for processing can be input in the following two ways:
Paste a single sequence (just the nucleotides) or a number of sequences in FASTA format into the upper window of the main server page.
Select a FASTA file on your local disk, either by typing the file name into the lower window or by browsing the disk.
Both ways can be employed at the same time: all the specified sequences will be processed.
The allowed input alphabet is A, C, G, T and X (unknown); all the other symbols will be converted to X before processing.
2. Select the output format
Click on the "Full output" button if you want the input sequences to be included in the server output. The default output format shows the predictions only.
3. Submit the job
Click on the "Submit" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in your browser window.
NOTE: At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.
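To check in advance how a sequence will look after the alphabet conversion described in step 1, the rule can be mimicked locally. This is only a sketch of the stated rule (A, C, G, T and X are kept, everything else becomes X), not the server's code. Two details are assumptions: lowercase letters are upper-cased before the check, and whitespace inside a sequence is dropped. The function names and the default record name "unnamed" are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text (or one bare sequence) into (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            if name is None:
                name = "unnamed"  # bare nucleotides without a header line
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def normalize_sequence(seq):
    """Keep A, C, G, T, X; convert every other symbol to X."""
    allowed = set("ACGTX")
    # Assumption: case is ignored and whitespace is removed first.
    cleaned = (ch for ch in seq.upper() if not ch.isspace())
    return "".join(ch if ch in allowed else "X" for ch in cleaned)


for name, seq in read_fasta(">s1\nACGN\nRT\n"):
    print(name, normalize_sequence(seq))  # s1 ACGXXT
```

Counting the X characters in the normalized sequence gives a quick indication of how much of the input falls outside the accepted alphabet.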
Mammalian promoter prediction
PromoterInspector - highly specific prediction of promoter regions in mammalian genomic sequences
Predicting polymerase II promoter regions is a challenging task in bioinformatics. Until now there was no approach which provides sufficient specificity to be useful for experimental or in depth in silico analysis.
Polymerase II promoters are quite different in their individual organization but are recognized in a cell because of their common genomic context:
The functional organization of promoters is known to consist of individual transcription factor binding sites which are often combined in the form of functional promoter modules.
Genomatix offers a full range of tasks and libraries to detect such promoter elements (GEMS Launcher). However, these tasks are best suited for the in depth analysis of known promoter regions, not for their localization.
PromoterInspector is a highly complementary tool as it locates promoters as a first step for subsequent functional analysis with GEMS Launcher.
Prediction is based on context specific features which were extracted from training sequences (all mammalian sequences) by a heuristic free approach.
The novel idea of the PromoterInspector approach is the way of feature definition: Features are defined by equivalence classes of IUPAC groups which allow a fuzzy description of the promoter context. A prediction is based on the analysis of feature frequencies.
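The notion of a feature described by IUPAC groups and evaluated by its frequency can be made concrete with a small sketch. This is not the PromoterInspector algorithm, whose feature classes and scoring are not given here; it only shows how a pattern written in standard IUPAC nucleotide codes matches several concrete sequences and how its frequency in a sequence can be counted. The function names and the example patterns are hypothetical.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(pattern, window):
    """True if every base in the window belongs to the pattern's IUPAC group."""
    return all(base in IUPAC[code] for code, base in zip(pattern, window))


def feature_frequency(pattern, seq):
    """Fraction of sliding windows in seq that match the IUPAC pattern."""
    seq = seq.upper()
    k = len(pattern)
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        return 0.0
    hits = sum(matches(pattern, seq[i:i + k]) for i in range(n_windows))
    return hits / n_windows


print(feature_frequency("TATAWA", "GGTATAAAGG"))  # 0.2
print(feature_frequency("SS", "GCGC"))            # 1.0
```

Because one IUPAC pattern covers a whole group of concrete words, frequencies computed this way tolerate sequence variation, which is the "fuzzy description" the text refers to.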
Function: Predicts Promoter regions based on scoring homologies with putative eukaryotic Pol II promoter sequences.
The analysis is done using the PROSCAN Version 1.7 suite of programs developed by Dr. Dan Prestridge. Information on PROSCAN, including details on obtaining a copy, is maintained at the Advanced Biosciences Computing Center, University of Minnesota.
Promoter extraction tools embedded in integrated sequence analysis suites
TOUCAN is a workbench for regulatory sequence analysis, especially for detecting significant transcription factor binding sites across species, and for detecting cis-regulatory modules (combinations of binding sites) in sets of coexpressed/coregulated genes. It is a platform independent, standalone Java application that is tightly linked with Ensembl, and was built using the BioJava package. TOUCAN was developed by the bioinformatics group at the department of electrical engineering (ESAT) at the Katholieke Universiteit Leuven (Belgium). In Aug. 2004, version 2 was released, which has some very nice improved features.
1) RSAT-Retrieve Sequence: allows the automatic extraction of 5'-flanking sequences (potential promoters) for your genes of interest. You have to choose the organism; in the case of human there are 2 different databases, "Homo sapiens" (NCBI RefSeq sequences) and "Homo sapiens EnsEMBL" (EnsEMBL sequences). In test runs, there was no big difference in the output between these two.
The gene names must be separated by carriage returns, because only the first word of each line is considered as a query. Genes can be specified either by the systematic ORF identifier or by a common name. Synonyms are also supported. Note that the option "prevent overlap with upstream ORFs" should be inactivated when working with eukaryotes. "From To" describes the limits of the region to retrieve. For upstream sequences, the default reference position is the ORF start* (and NOT the transcription start!). Negative coordinates are used to indicate sequences located upstream of the start codon; a reasonable pair of values could be: From -800 to -1.
Note that you might want to re-check the obtained sequence via BLAT search at UCSC.
*Please note that for genes which do NOT have the start ATG in the first exon, correct promoter retrieval might be a problem, because in these cases the tool will retrieve sequence from the first intron and NOT the promoter sequence! However, the user can now choose between different "Feature types", like CDS (Coding Sequence), mRNA, tRNA, etc. The advantage of using mRNA is that, if the mRNA is complete (which is not always the case), the upstream regions are retrieved relative to the transcription start site (TSS) rather than the start codon. If you want to see a nice example, you can try to extract the upstream sequence (e.g. -500 to -1) of the gene "SELE" (E-Selectin), and compare the output when choosing "CDS" versus "mRNA" as "feature type".
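The effect of the reference position on what a "From/To" request returns can be sketched as follows. This is not RSAT code; it assumes that position -1 is the base immediately upstream of the reference base (+1), that there is no position 0, and that the reference is given as a 0-based index on the forward strand. The toy sequence, the indices and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def upstream_region(genome, ref_index, start, end, strand="+"):
    """Return the bases from `start` to `end` (both negative, start <= end)
    relative to the reference base at forward-strand index `ref_index`."""
    if not (start <= end < 0):
        raise ValueError("expected negative coordinates with start <= end")
    if strand == "+":
        lo, hi = ref_index + start, ref_index + end + 1
    else:
        # On the minus strand, upstream lies at higher forward-strand indices.
        lo, hi = ref_index - end, ref_index - start + 1
    if lo < 0 or hi > len(genome):
        raise ValueError("requested region runs off the sequence")
    region = genome[lo:hi]
    return region if strand == "+" else reverse_complement(region)


# Toy locus: promoter (T), transcribed 5' UTR (G), then the start codon.
genome = "T" * 10 + "G" * 10 + "ATGCCC"
tss_index, atg_index = 10, 20

print(upstream_region(genome, atg_index, -5, -1))  # GGGGG (5' UTR, not promoter)
print(upstream_region(genome, tss_index, -5, -1))  # TTTTT (true upstream region)
```

In the toy locus, the request relative to the start codon returns transcribed sequence, while the same request relative to the TSS returns the actual 5'-flanking region, which mirrors the CDS versus mRNA comparison suggested above.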
Release 5.0 (April, 2005)
Based on UCSC hg17, mm5
Most of the cDNA sequences stored in current databases lack the precise information of 5' end termini. To overcome this difficulty, DBTSS stores human sequences which were produced by the oligo-capping method to obtain full-length cDNAs. Sequence comparison between DBTSS and reference sequence database, RefSeq, revealed that 4,802 (34.2 %) of RefSeq sequences should be extended towards the 5' ends. DBTSS also mapped each sequence on the human genome sequence to identify its transcriptional start site.
DBTSS offers very good query options: RefSeq ID, UniGene ID, LocusLink ID, Gene Symbol, Ensembl Transcript ID, and more.
The output consists of very nice graphs showing the positions of RefSeqs and Ensembl-transcripts in relation to the positions of individual Oligo-capped cDNAs. You then can select "your favourite reference position" for the TSS, either RefSeq, ENST, or the longest Oligo-capped cDNA, and download the potential promoter region.
Note that DBTSS does NOT support batch queries !
Note that ESTs are NOT represented in DBTSS !
Mirror: DBTSS mirror in Germany.
EPD - Eukaryotic Promoter Database
The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters, for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. The annotation part of an entry includes description of the initiation site mapping data, cross-references to other databases, and bibliographic references.
NOTE: It is only possible to perform SRS-like keyword searches in the EPD database but NO sequence searches (like BLAST). BUT you can download the promoter sequences into a simple FASTA file.
NOTE that you can BLAST the EPD at Embnet.ch.
NOTE: Compare with PRESTA database !
FIE, which is part of the portal DRAGON Genome Explorer of the Institute for Infocomm Research (I2R), Singapore, is a tool to retrieve the region upstream and/or downstream of the 'start of exon 1' (Transcription Start Site, TSS) for a particular gene. This user-specified region requires the LocusLink ID or Gene/Protein Name and Organism Type as well as the Upstream and Downstream length with respect to the 'start of exon 1'. This reference position is determined by the longest annotated mRNA (RefSeq AND others). ESTs are not considered. NO batch retrieval option. Currently only available for human genes.
NOTE: Version 2.0 is considerably improved, as it lists all mRNA sequences (RefSeqs, which also include uncharacterized potentially full-length mRNAs like 'DKFZ', 'KIAA', or 'FLJ') individually, so the user can decide which upstream region to extract.
GPD - Genomatix Promoter Database
(Genomatix Inc., Munich, Germany)
What is GPD?
The most complete collection of eukaryotic promoters available to date.
- The ONLY one containing promoters for alternative transcripts
High-quality promoter sequences for eukaryotic organisms.
Narrowed down to the REAL promoters rather than a 5kb 5' upstream guess.
Promoters for alternative transcripts, identified by Genomatix proprietary technology.
Available for Homo sapiens, Mus musculus, Rattus norvegicus, Gallus gallus, Pan troglodytes, Canis familiaris, Drosophila melanogaster, Anopheles gambiae, Plasmodium falciparum, Oryza sativa, Arabidopsis thaliana, commercial microarrays, and customized gene lists
Annotation for transcription factor binding sites (by MatInspector)
Annotation for promoter modules (based on Genomatix proprietary libraries)
Who needs GPD?
GPD is the source of choice for researchers looking for functional promoter regions rather than mere 5' upstream sequences. GPD gives a precise annotation for designing experiments. In addition, GPD is a valuable tool for bioinformaticians performing large scale analysis of specific regulatory regions.
High quality promoter sequences are a prerequisite for any kind of transcriptional regulatory sequence analysis, such as:
transcriptional mechanisms of gene regulation
experimental design of functional promoter studies
construction of expression vectors
transcriptional network analysis
unveiling transcriptional events hidden in microarray data
If you only need small groups of promoter sequences, have a look at Gene2Promoter.
Why is it unique?
GPD sets a new standard in promoter quality
"Promoter" sequences available from other sources are usually based on the 5' upstream regions of annotated genes. This is often misleading, since eukaryotic genes usually have 5' untranslated regions (5' UTRs). Since the 5' UTR may also be split over several exons the real regulatory region for a gene frequently is far away from the coding sequence (up to several kb). GPD contains the precise annotation of the promoter sequences.
More than 50% of all genes have alternative transcripts. In addition to alternative splicing, these genes are frequently regulated by different promoters. Only GPD includes such alternative promoters.
All promoters are thoroughly annotated and validated according to highest scientific standards, using Genomatix proprietary technology (incl. PromoterInspector, mapping of oligo-capped cDNAs, comparative genomics).
Going far beyond the official annotation of genomic sequences, GPD contains:
Promoters mapped from 5' complete RNA sequences
Promoter annotation confirmed by PromoterInspector
Promoters derived from comparative genomics across species
Promoters supported by multiple evidence
Three possible quality levels assigned to each transcript associated with a promoter make it easy to assess promoter quality and accuracy
Only promoters derived from comparative genomics are not explicitly assigned to a transcript.
gold: experimentally verified 5' complete transcript
silver: transcript with 5' end confirmed by PromoterInspector prediction
bronze: annotated transcript, no confirmation for 5' completeness
GPD is available for
Entire genomes: Human, Mouse, Rat, Chicken, Drosophila, Anopheles, Plasmodium, Arabidopsis, Rice
Commercial microarrays: e.g. Affymetrix HG-U133A, ClonTech Atlas Human Apoptosis...
Customized Gene Lists
Excel readable file with promoter/gene information
Promoter sequences in FASTA or GenBank format, -500bp/+100bp
Excel readable file with transcription factor binding sites and their position plus description of all transcription factor binding sites and families
Excel readable file with promoter modules and their position plus description of all promoter modules
(Genomatix Inc., Munich, Germany)
Gene2Promoter is part of the commercial Genomatix suite of products. Gene2Promoter allows for automated extraction of groups of promoters from a list of accession numbers or gene IDs. Gene2Promoter is an optional module for ElDorado. If you need large scale extraction of promoter sequences, please have a look at GPD (the Genomatix Promoter Database).
You can query Gene2Promoter using human DNA accession numbers, like genomic sequences including a potential promoter region, but also with cDNAs like RefSeq accessions. It is NOT possible to use copy/paste sequences as query.
NOTE that you only have 5 free runs (with at most 5 accession numbers each) per month !!! Genomatix has termed the free academic access "evaluation account". Note that there is not only a limitation in the number of analyses but also in the functionality of the obtained data !
Easy checkbox selection of identified, ranked promoters from a list of matches, sortable by molecular function, biological process, or cellular component.
Immediate interactive graphical analysis for common transcription factor binding sites.
Full ElDorado analysis at one click for each promoter and its gene.
Direct submission of selected promoters to GEMS FrameWorker analysis for common motifs of TF binding sites.
PRESTA - PRomoter EST Association
(Academy of Sciences of the Czech Republic)
PRESTA is a tool/database that combines EST databases and putative GenBank/EMBL promoters to yield datasets of predicted promoters at high accuracy. PRESTA was developed at the Institute of Entomology, Academy of Sciences of the Czech Republic.
A high stringency BLAST search reveals ESTs that assist in transcription start-site verification.
In principle, PRESTA would therefore be useful for promoter verification by mapping EST 5' ends. BUT: Limited query options (NO LocusLink IDs, NO RefSeq IDs etc.), NO batch query, NO user-definition of region to extract, many genes simply NOT included. Solely based on ESTs, RefSeqs are not considered.
PRESTA can be either used as a 1) standalone Windows programme or as a 2) searchable public database, described under the section "About".
2a) Searchable public database: The PRESTA algorithm has been used on the complete sets of human and mouse promoters to extract databases of curated promoters. Subsets of these databases can be extracted via EST quality parameters, via tissue origin, or via gene name, GB accession numbers or EST accessions.
2b) Download PRESTA databases: You can download the complete sets of human and mouse promoters into simple FASTA text files. NOTE that these entries only comprise the "pure" upstream 5'-flanking regions and not (what PRESTA calls) the downstream sequence tags (which are the first bases of the transcribed sequence). In addition, the entries just have one number or word in the definition line (referring to the "LOCUS" of an entry), so if you want to know more about one entry you may:
- Search with this term within PRESTA, at the "alternate query" page under "Gene GenBank/EMBL Acc" to retrieve the full database entry, including the downstream sequence tag and the list of relevant 5' ESTs.
- Search at NCBI-ENTREZ, under "Search Nucleotide for:"
NOTE: Compare with EPD - Eukaryotic Promoter Database !
(ZLAB, Boston University)
PromoSer is a service for promoter extraction for human, mouse, and rat genes provided as part of the Gene Regulation Tools of the Zlab, which belongs to the Boston University Bioinformatics.
PromoSer comes with a compact, but very instructive Help-file describing all the different options, making PromoSer one of the best tools for this purpose.
- You can use lists of GenBank accession numbers as input (RefSeqs, mRNAs, and ESTs). There is no option to use e.g. Affymetrix IDs.
- Define the region upstream and downstream of the TSS (Transcription Start Site) which you want to extract.
- Choose the "Quality" and the "Support" levels. The TSS "Quality" is a rating system (between 0 and 4) which describes the composition of the sequences that support this TSS (described in the Help-file).
- Extraction of alternative promoters: This is in fact a great feature allowing the user to select which of the mRNA sequences to define as reference for the location of the TSS. The option "only the one that is best supported and is 5' most" defines the TSS at the position which is best supported by RefSeq, mRNAs and ESTs. Otherwise, you may choose to extract only the promoter that starts 5' most (most aggressive extension). In the case of the presence of ESTs containing "5'-upstream first exons" as compared to the RefSeq, a totally different promoter may be extracted. The option "ignore all extension info and return the immedite upstream region" extracts the 5'-flanking genomic region relative to the supplied accession number, meaning that also single ESTs can be defined as reference point for the promoter definition.
- Result table: First, the extracted sequences are presented in the form of a table which is highly instructive as it lists the exact genomic positions, chromosome number, the quality level, the number of supporting sequences, and the "genomic extension", which means the amount of genomic sequence added at 5' (positive value) relative to the accession number provided. In case that the promoter is extracted at a downstream (3') position, a negative value is indicated.
- FASTA sequences: Finally, the promoter sequences can also be displayed (copied) as a FASTA sequence file, and thereby be transferred to other applications (e.g. TOUCAN).
The Transcript Sequence Retriever (TRASER) provides rapid retrieval of transcript and upstream (putative promoter-containing) sequences for predicted human genome mRNAs. The underlying database is built using the human genome annotation files provided by the NCBI.
The program accepts ONLY LocusLink IDs as input but allows batch-submission ! You can choose the length of sequence to retrieve.
NOTE that the database is solely based on RefSeq sequences, but is able to retrieve more than one upstream region for a gene in cases where several RefSeqs exist.
NOTE that the output sequences follow the UPPER/lower case model for EXON1/upstream sequences.
NOTE that there are 2 output formats: a FASTA sequence file, or tab-delimited text (which makes it possible, for example, to paste the sequences into an EXCEL sheet of pre-existing data!).
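The two notes on output can be illustrated with a short sketch. It assumes that each retrieved sequence is a lower-case upstream part followed directly by an upper-case exon 1 part, and that a tab-delimited record is simply "identifier, tab, sequence". The function names are hypothetical and the exact TRASER layout may differ.

```python
import re


def split_upstream_exon(seq):
    """Split 'acgtACGT' into (upstream, exon1) by letter case.

    Assumes the lower-case upstream part comes first and the upper-case
    exon 1 part follows without interruption.
    """
    match = re.fullmatch(r"([a-z]*)([A-Z]*)", seq)
    if match is None:
        raise ValueError("sequence does not follow the lower/UPPER layout")
    return match.group(1), match.group(2)


def fasta_to_tab(fasta_text):
    """Convert FASTA text into 'header<TAB>sequence' lines, one per record."""
    rows = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                rows.append(header + "\t" + "".join(chunks))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines of one record are joined, keeping their case.
            chunks.append(line)
    if header is not None:
        rows.append(header + "\t" + "".join(chunks))
    return "\n".join(rows)
```

The tab-delimited text produced this way can be pasted into a spreadsheet next to existing per-gene data.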
A Database of Plant Promoter Sequences. Release 2002.01
PlantProm DB is an annotated, non-redundant collection of proximal promoter sequences for RNA polymerase II with experimentally determined transcription start site(s), TSS, from various plant species. It was developed by the Department of Computer Science at Royal Holloway, University of London, in collaboration with Softberry Inc. (USA) and is also available at www.softberry.com. The current release of PlantProm DB contains 305 entries including 71, 220 and 14 promoters from monocot, dicot and other plants, respectively. The following criteria were applied when collecting plant gene promoters.
· There is experimental evidence of the TSS position(s) of the gene, published in the literature. For genes with multiple TSSs, the one nearest to the CDS start position is taken if no additional information on the predominance of one of them is available (positions of other TSSs are given in the name line of the sequence written in the FASTA format).
· The length of known promoter sequence upstream of the chosen TSS is 200 bp or more; all stored promoter sequences have the same length, 251 bp, where position 201 corresponds to the TSS, i.e. collected sequences occupy the region [-200 : +51], with the TSS at position +1, and thus represent the proximal promoters mentioned above.
· An entry corresponds to the gene mapped on the genomic sequences.
· Various alleles of a gene are presented in the database by a single entry.
· Genes with more than one non-allelic copy in the genome as well as paralogous genes are taken as different entries.
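The coordinate convention in the criteria above (251 stored bases, index 201 being the TSS at +1, region [-200 : +51]) implies that there is no position 0. A minimal sketch of the conversion, with hypothetical function names:

```python
SEQ_LENGTH = 251   # length of every stored promoter sequence
TSS_INDEX = 201    # 1-based index of the TSS within the stored sequence


def index_to_relative(index):
    """Map a 1-based sequence index (1..251) to a TSS-relative position."""
    if not 1 <= index <= SEQ_LENGTH:
        raise ValueError("index must be between 1 and 251")
    offset = index - TSS_INDEX
    # The TSS is +1 and the base before it is -1; position 0 is skipped.
    return offset + 1 if offset >= 0 else offset


def relative_to_index(position):
    """Map a TSS-relative position (-200..-1, +1..+51) to a 1-based index."""
    if position == 0 or not -200 <= position <= 51:
        raise ValueError("position must be in -200..-1 or +1..+51")
    return TSS_INDEX + position - 1 if position > 0 else TSS_INDEX + position
```

Index 1 maps to -200, index 200 to -1, index 201 to +1 and index 251 to +51, which accounts for all 251 bases.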
PlantProm DB provides the following information.
1. DNA sequences of 305 promoter regions [-200:+51], with the TSS at the fixed position 201 of each sequence, from various plant species, in the FASTA format, including:
1.1. 71 promoters of monocots,
1.2. 220 promoters of dicots,
1.3. 14 promoters from other plants,
1.4. 175 TATA promoters, consisting of 41 monocot, 131 dicot and 3 other plant species sequences, respectively.
1.5. 130 TATA-less promoters, consisting of 30 monocot, 89 dicot and 11 other plant species sequences, respectively.
2. Taxonomic and promoter type classification of promoters, including:
2.1. List of species represented in the PlantProm DB,
2.2. List of genes/gene products and promoter types represented in the PlantProm DB.
3. Nucleotide Frequency Matrices for canonical promoter elements (TATA-box, CCAAT-box, and TSS-motif or Initiator element, Inr), including:
3.1 TATA-matrices for various promoter collections,
3.2 CCAAT-matrices for various promoter collections,
3.3 TSS-motif-matrices for various promoter collections
4. Location of TATA-boxes in some of the promoter collections mentioned above, including:
4.1. 171 unrelated promoters from various plant species,
4.2. 128 unrelated promoters from dicot plants,
5. Location of CCAAT-boxes in some of the promoter collections mentioned above, including:
5.1. 131 unrelated promoters of both (TATA and TATA-less) types from various plant species,
5.2. 71 unrelated TATA promoters from various plant species,
5.3. 60 unrelated TATA-less promoters from various plant species.
6. Location of TSS-motifs in some of the promoter collections mentioned above, including:
6.1. 70 unrelated promoters of both (TATA and TATA-less) types from monocot plants,
6.2. 217 unrelated promoters of both (TATA and TATA-less) types from dicot plants,
6.3. 171 unrelated TATA promoters from various plant species,
6.4. 30 unrelated TATA-less promoters from various plant species.
7. Short description of the computation of nucleotide frequency matrices for
various promoter elements.
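As an illustration of what a nucleotide frequency matrix is, the sketch below derives one from a set of aligned sites of equal length. It is a generic construction and not necessarily the procedure used for PlantProm DB; the function name and the example sites are made up for the illustration.

```python
from collections import Counter


def frequency_matrix(sites):
    """Return one {base: frequency} dict per column of the aligned sites."""
    if not sites:
        raise ValueError("at least one site is required")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for column in zip(*sites):
        counts = Counter(base.upper() for base in column)
        matrix.append({base: counts[base] / len(sites) for base in "ACGT"})
    return matrix


# Made-up TATA-like sites, used only to show the shape of the result.
example_sites = ["TATAAA", "TATATA", "TATAAA", "CATAAA"]
example_matrix = frequency_matrix(example_sites)
```

Here `example_matrix[0]` gives T a frequency of 0.75 and C a frequency of 0.25, because three of the four made-up sites start with T.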
Reference: Ilham A. Shahmuradov, Alex J. Gammerman, John M. Hancock, Peter M. Bramley and Victor V. Solovyev. PlantProm: A Database of Plant Promoter Sequences. Nucl. Acids Res. (2002) (in publication).
The Promoter Database of Saccharomyces cerevisiae
Genes: Explore the promoter regions of ~6000 genes and ORFs in the yeast genome
Provide information on genes with mapped regulatory regions
Annotate putative regulatory sites of all genes and ORFs
Locate intergenic regions
Retrieve sequence of the promoter region
Regulatory elements and transcription factors
Provide information on transcriptionally related genes
Matrix and Consensus sequence
Correlation between elements
Binding affinity and expression
Retrieve promoter sequences
Find ORFs from gene names
Search existing motifs
Search putative regulatory elements using matrix and consensus
Group genes according to function categories
Repetitive sequence analyzer (Java)
Motif distribution (Java)
K-tuple relative information
Multisequence alignment by GibbsDNA
Submit records to SCPD
Submit a gene record
Submit a consensus record
Submit a matrix record
Collection of binding affinity and expression data
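Searching a promoter for putative regulatory elements with a consensus sequence, as listed among the tools above, can be sketched generically by expanding IUPAC ambiguity codes into a regular expression. This is not SCPD's own implementation; the function name and the example sequence are illustrative assumptions.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_consensus(sequence, consensus):
    """Return 0-based start positions of all matches, overlapping included."""
    pattern = "".join(IUPAC[code] for code in consensus.upper())
    # A lookahead reports every start position, even for overlapping hits.
    return [m.start() for m in re.finditer("(?=" + pattern + ")", sequence.upper())]
```

With the made-up sequence `GGTATAAATATATA`, the consensus `TATAWA` is found at positions 2 and 8 (0-based). Only the given strand is scanned; the reverse complement would need a second pass.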
CpGProD (CpG Island Promoter Detection)
Use of CpGProD
CpGProD is a program dedicated to the prediction of promoters associated with CpG islands (CGIs) in mammalian genomic sequences. CpGProD is available either via a web server, useful for a small dataset, or as a standalone application for a larger dataset (see below). You only need an entry sequence (or file) in FASTA format which has been masked by RepeatMasker.
Method of CpGProD
In vertebrate genomes, the CpG dinucleotides are present at about 25% of their expected frequency. This deficiency is due to the methylation of cytosine at CpG dinucleotides and the very high mutation rate of the methylated cytosines.
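A depletion of this kind is usually expressed as an observed/expected (o/e) ratio. The sketch below uses the widely used definition (number of CpGs multiplied by the sequence length, divided by the product of the C and G counts); whether CpGProD computes its CpGo/e ratio in exactly this way is an assumption, and the function name is hypothetical.

```python
def cpg_obs_exp(sequence):
    """Observed/expected CpG ratio of a DNA sequence.

    Expected count = (#C * #G) / length, so
    o/e = (#CpG * length) / (#C * #G).
    Returns 0.0 when the sequence has no C or no G.
    """
    seq = sequence.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    cpg_count = seq.count("CG")
    return cpg_count * len(seq) / (c_count * g_count)
```

A value near 0.25 corresponds to the bulk depletion described above, while values approaching 1 mean that CpG occurs about as often as the base composition predicts.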
CGIs are stretches of DNA that escape methylation and exhibit a high G+C content and CpG frequency relative to the bulk DNA (Bird, 1986). CGIs are several hundred bases to several kilobases long and are dispersed throughout the genome. 50%-60% of human genes exhibit a CGI over their Transcription Start Site (TSS), but not all CGIs are associated with a TSS.
Some studies (Ioshikhes and Zhang, 2000; Ponger et al., 2001) have shown that the CGIs located over the TSS (start CGIs) have a particular structure compared to other CGIs (no-start CGIs): the start CGIs are longer and display a greater CpGo/e ratio and G+C level than no-start CGIs. From the length, the G+C content and the CpGo/e ratio of each CGI, CpGProD computes a score corresponding to the probability that the CGI is located over the TSS (start-p value). Moreover, two compositional biases between the plus and the minus strand (Lobry, 1996) were observed in the start CGIs (Ponger, unpublished data). The CGIs located over the plus strand exhibit an excess of T compared to A and an excess of G compared to C. Conversely, the CGIs located over the minus strand exhibit a depletion of T compared to A and a depletion of G compared to C. These biases are estimated with two parameters, the AT-skew and the GC-skew. CpGProD calculates these parameters to predict the strand of each potential promoter and the probability that the promoter lies on this strand.
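The two skew parameters can be sketched with the common definitions AT-skew = (A - T)/(A + T) and GC-skew = (G - C)/(G + C). Both the sign convention and the simple sign rule for the strand are assumptions made for this illustration: CpGProD derives a probability, and its actual formula is not given here. The function names are hypothetical.

```python
def skews(sequence):
    """Return (AT-skew, GC-skew) using (A-T)/(A+T) and (G-C)/(G+C)."""
    seq = sequence.upper()
    a, c, g, t = (seq.count(base) for base in "ACGT")
    at_skew = (a - t) / (a + t) if a + t else 0.0
    gc_skew = (g - c) / (g + c) if g + c else 0.0
    return at_skew, gc_skew


def guess_strand(sequence):
    """Crude strand call based on the signs of both skews.

    Excess of T over A and of G over C suggests '+'; the opposite
    pattern suggests '-'; anything else is left undecided (None).
    """
    at_skew, gc_skew = skews(sequence)
    if at_skew < 0 and gc_skew > 0:
        return "+"
    if at_skew > 0 and gc_skew < 0:
        return "-"
    return None
```

Under this convention a plus-strand start CGI is expected to show a negative AT-skew together with a positive GC-skew, and reverse-complementing the sequence flips both signs.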
McPromoter MM:II -- The Markov Chain Promoter Prediction Server
Massachusetts Institute of Technology
A statistical tool for the prediction of transcription start sites in eukaryotic DNA
The Drosophila release 3 genome special issue of Genome Biology, including a genome-wide analysis of Drosophila core promoters and McPromoter predictions, has been published here!
McPromoter is a program aiming at the exact localization of eukaryotic RNA polymerase II transcription start sites. We offer specific models designed for either human/vertebrate or D. melanogaster DNA. This service is free for non-commercial use and restricted to sequences up to 20 kb. The results will be mailed to you, so providing an email address is *mandatory*. We are always thankful for feedback -- especially if you obtain useful results...
Read the Q & A site first for information about the recent changes and instructions on how to use McPromoter!
(Also contains links to the McPromoter papers and Uwe's thesis.)
Sensitivity and specificity are now at about 40-50% for both vertebrate and Drosophila at the transcript level.
Hint: If you don't get a prediction, have a look at the graphic that comes with the email to find optima that might have been missed by the pre-specified sensitivity level. Note for Drosophila: If you want to retrieve predictions for a large fly region, use the GadFly analysis interface. Type in a chromosome arm under "Query sequence" (either 2L, 2R, 3L, 3R, 4, or X), and choose "promoter" as Program. This will give you all predictions in this arm with scores over 0.85, without any filtering for repeats or low-complexity regions. For specific regions, these predictions also show up in the FlyBase Genome Browser if you choose "Dump view as TABLE" in the "display settings".
Cold Spring Harbor Laboratory
Mammalian Promoter Database (Version 2.0)
Homo sapiens Promoter Database (HsPD), Version 2.35 -- Assembly of May 2004
Mus musculus Promoter Database (MmPD), Version 2.05 -- Assembly of May 2004
Rattus norvegicus Promoter Database (RnPD), Version 2.03 -- Assembly of Jun. 2003
Transcription Regulatory Element Database (TRED)