https://lists.soe.ucsc.edu/pipermail/genome/2001-June/000451.html
[Genome] Adding tigr gene index annotation
tom tom at cyber-dyne.com
Fri Jun 1 16:33:58 PDT 2001
>consult http://www.cse.ucsc.edu/~kent/gbd.html for track submission format
>Also, could you send me a URL for Gene Index?

Sure

http://www.tigr.org/tdb/tgi.shtml

"Integrating data from international EST sequencing and gene research projects, the Gene Indices are an analysis of the transcribed sequences represented in the world's public EST data."

Update version and dates:

Cattle           4.0   3-24-01
Human            6.0   6-30-00
Mouse            5.0   10-11-00
Porcine          2.0   4-17-01
Rat              5.0   1-26-01
Xenopus laevis   1.1   3-15-01
Zebrafish        6.1   3-15-01

Current Release - Version 6.0, Released June 30, 2000

Nucleotide or Protein Sequence Identifier (EST, HT, THC, GB)
Tissue, cDNA Library Name or cDNA Library Identifier (cat#)
Gene Product Name (Example: insulin)
Search by Radiation Hybrid Map Location
Putative Identifications through The Nature Genome Directory

Data Availability

Total sequences
          in THCs     singletons   total
ESTs      1,448,166   297,466      1,745,632
HTs       51,452      6,648        58,100
Totals    1,499,618   304,114      1,803,732

Total unique sequences
THCs             83,892
singleton HTs    6,648
singleton ESTs   297,466
Total            388,006

HGI data is available free of charge only to researchers at non-profit institutions using it for non-commercial purposes. Please go to our licensing agreement and follow the instructions there to obtain the HGI data files. If you represent a for-profit organization, please contact us by email for details on how to obtain a commercial license for any of the data files described below. Please read the copyright notice governing use of this data.

Data Files

An index file containing the complete, minimally redundant Human Gene Index.
A fasta file containing the complete set of THC sequences in the Index, with previous THC identities in the definition line.
A file containing the THC ids and the ESTs that comprise them.

Data Definitions and Protocols
Frequently Asked Questions
TIGR Gene Index Publications

Send mail to www at tigr.org for WWW specific Comments/Questions.
Send mail to hgi at tigr.org for HGI Comments/Questions.

TCs: Tentative Consensus sequences are created by assembling ESTs into virtual transcripts. In some cases, TCs contain full or partial cDNA sequences (ETs) obtained by classical methods. TCs contain information on the source library and abundance of ESTs and in many cases represent full-length transcripts. Alternative splice forms are built into separate TCs. TCs are actual assemblies, with a consensus sequence, and not simply clusters of overlapping sequences.

ESTs: Expressed Sequence Tags are partial, single-pass sequences from either end of a cDNA clone. The EST strategy was developed to allow rapid identification of expressed genes by sequence analysis.

ETs: TIGR's Expressed Gene Anatomy Database (EGAD) contains a non-redundant set of nucleotide sequences that represent mature transcripts (ETs). The ETs are curated for nomenclature and cellular roles, and links have been made to related accessions. Sequences were either loaded directly from GenBank (cDNAs) or were derived from genomic sequences when cDNAs were not available. Where available, 5' and 3' non-coding regions were included. Alternative splice forms of genes are explicitly represented.

Singleton ESTs: Also referred to as singletons, these are ESTs that are not contained in an assembly. They went through the assembly process but did not meet the match criteria (see below) to be assembled with any other EST in the collection of ESTs and other GenBank sequences used to create the consensus sequences for a particular Gene Index.

Protocol for Assembly of ESTs and Transcripts

Preparation of EST data

Sequences were extracted from dbEST and were subjected to quality control screening (vector, E. coli, polyA, T, or CT removal, minimum length = 100 bp, < 3% N).

Preparation of non-redundant transcript (ET) database

All sequences from the appropriate division of GenBank were extracted.
Non-coding sequences were discarded, and cDNAs and coding sequences from genomic entries were saved. Redundant entries for the same gene were removed, retaining the link to the accession number. Sequences and related information are stored in TIGR's Expressed Gene Anatomy Database (EGAD). The curated ET data set is available as a multiple FastA format file. See the EGAD main page for more information.

Assembly

Cleaned EST sequences and non-redundant transcript (ET) sequences were combined. Sequences were assembled into contigs using the CAP3 Sequence Assembly Program. TCs are consensus sequences based on two or more ESTs (and possibly an ET) that overlap for at least 40 bases with at least 95% sequence identity. These strict criteria help minimize the creation of chimeric contigs. These contigs are assigned a TC (Tentative Consensus) number. TCs may comprise ESTs derived from different tissues.

The best hits for TCs were assigned by searching the TC set against a non-redundant amino acid database (nraa) using a search method developed in house called DNA-Protein Search, DPS (Microbial and Comparative Genomics, Vol 1, Number 4, 1996). The top five hits based on score (cutoff value of 350) were selected and displayed for each TC.

Caveats

- TCs are only as good as the ESTs underlying them; there may be unspliced or chimeric ESTs and thus TCs.
- There is still redundancy in the TC set because sequences must match end to end and at a certain percent identity to be combined.
- Directionality of the TCs should not be assumed.
- Not all TCs contain protein-coding regions.
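
For anyone who wants to apply the same cutoffs to their own sequences before comparing against the index, the numeric thresholds quoted above (minimum length 100 bp, less than 3% N, at least 40 overlapping bases at 95% identity or better) can be written down as a rough Python sketch. This is not TIGR's code and it does not call CAP3: the function names `passes_qc` and `may_join` are made up for illustration, and the vector, E. coli, and polyA trimming steps are assumed to have been done already.

```python
def passes_qc(seq, min_length=100, max_n_fraction=0.03):
    """Length and N-content screen for an already-trimmed EST.

    Thresholds come from the quoted protocol; the function itself is
    a hypothetical illustration, not part of any TIGR tool.
    """
    seq = seq.upper()
    if not seq or len(seq) < min_length:
        return False
    # The protocol says "< 3% N", so the comparison is strict.
    return seq.count("N") / len(seq) < max_n_fraction


def may_join(overlap_length, identity, min_overlap=40, min_identity=0.95):
    """Check the quoted match criteria for placing two sequences in one TC.

    overlap_length is in bases; identity is a fraction between 0 and 1.
    How the overlap is found (the alignment itself) is out of scope here.
    """
    return overlap_length >= min_overlap and identity >= min_identity


if __name__ == "__main__":
    est = "ACGT" * 30                  # 120 bp, no ambiguous bases
    print(passes_qc(est))              # long enough and clean
    print(passes_qc("ACGT" * 10))      # only 40 bp, too short
    print(may_join(55, 0.97))          # satisfies both cutoffs
    print(may_join(35, 0.99))          # overlap shorter than 40 bases
```

Keeping the thresholds as keyword arguments makes it easy to see how the singleton count would shift under looser or stricter criteria.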