
云之南

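The sounds of wind, of rain, and of reading aloud all reach the ear; the affairs of family, of state, and of the world are all of concern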

 
 
 


 
 
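About me

Background: computer science. Research interests: JavaEE web software development, bioinformatics, data mining and machine learning, intelligent information systems. Current work: genomes, transcriptomes, NGS high-throughput data analysis, biological data mining, plant phylogenetics and comparative evolutionary genomics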

gene index assembly protocol  

2011-04-25 14:04:55 | Category: Bioinformatics


https://lists.soe.ucsc.edu/pipermail/genome/2001-June/000451.html

[Genome] Adding tigr gene index annotation
tom tom at cyber-dyne.com
Fri Jun 1 16:33:58 PDT 2001




> consult http://www.cse.ucsc.edu/~kent/gbd.html for track submission format
> Also, could you send me a URL for Gene Index?

Sure: http://www.tigr.org/tdb/tgi.shtml

"Integrating data from international EST sequencing and gene research projects, the Gene Indices are an analysis of the transcribed sequences represented in the world's public EST data."

Update version and dates:

Species          Version   Date
Cattle           4.0       3-24-01
Human            6.0       6-30-00
Mouse            5.0       10-11-00
Porcine          2.0       4-17-01
Rat              5.0       1-26-01
Xenopus laevis   1.1       3-15-01
Zebrafish        6.1       3-15-01

Current Release - Version 6.0, Released June 30, 2000

Search by:
- Nucleotide or Protein Sequence Identifier (EST, HT, THC, GB)
- Tissue, cDNA Library Name or cDNA Library Identifier (cat#)
- Gene Product Name (Example: insulin)
- Radiation Hybrid Map Location
- Putative Identifications through The Nature Genome Directory

Data Availability

Total sequences   in THCs     singletons   total
ESTs              1,448,166   297,466      1,745,632
HTs               51,452      6,648        58,100
Totals            1,499,618   304,114      1,803,732

Total unique sequences
THCs              83,892
singleton HTs     6,648
singleton ESTs    297,466
Total             388,006

HGI data is available free of charge only to researchers at non-profit institutions using it for non-commercial purposes. Please go to our licensing agreement and follow the instructions there to obtain the HGI data files. If you represent a for-profit organization, please contact us by email for details on how to obtain a commercial license for any of the data files described below. Please read the copyright notice governing use of this data.

Data Files
- An index file containing the complete, minimally redundant Human Gene Index.
- A fasta file containing the complete set of THC sequences in the Index, with previous THC identities in the definition line.
- A file containing the THC IDs and the ESTs that comprise them.
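As a quick sanity check on the Data Availability counts quoted above, the row, column, and unique-sequence totals can be recomputed. This is only a sketch: the dictionary layout and the function names are my own, and the figures are copied from the message as quoted.

```python
# Counts copied from the Data Availability tables quoted in the message.
# The dictionary layout and function names are illustrative assumptions.
counts = {
    "ESTs": {"in_thcs": 1_448_166, "singletons": 297_466, "total": 1_745_632},
    "HTs": {"in_thcs": 51_452, "singletons": 6_648, "total": 58_100},
}
totals = {"in_thcs": 1_499_618, "singletons": 304_114, "total": 1_803_732}
unique = {"THCs": 83_892, "singleton HTs": 6_648, "singleton ESTs": 297_466}
unique_total = 388_006


def check_counts() -> bool:
    """Return True if every stated total matches the sum of its parts."""
    rows_ok = all(
        row["in_thcs"] + row["singletons"] == row["total"]
        for row in counts.values()
    )
    cols_ok = all(
        sum(row[col] for row in counts.values()) == totals[col]
        for col in totals
    )
    unique_ok = sum(unique.values()) == unique_total
    return rows_ok and cols_ok and unique_ok


def mean_sequences_per_thc() -> float:
    """Average number of assembled sequences (ESTs plus HTs) per THC."""
    return totals["in_thcs"] / unique["THCs"]


if __name__ == "__main__":
    print("tables consistent:", check_counts())
    print("mean sequences per THC:", round(mean_sequences_per_thc(), 2))
```

The second function is just a derived figure from the same numbers: the sequences placed in THCs divided by the number of THCs.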
Data Definitions and Protocols
Frequently Asked Questions
TIGR Gene Index Publications

Send mail to www at tigr.org for WWW-specific comments/questions.
Send mail to hgi at tigr.org for HGI comments/questions.

TCs: Tentative Consensus sequences are created by assembling ESTs into virtual transcripts. In some cases, TCs contain full or partial cDNA sequences (ETs) obtained by classical methods. TCs contain information on the source library and abundance of ESTs and in many cases represent full-length transcripts. Alternative splice forms are built into separate TCs. TCs are actual assemblies, with a consensus sequence, and not simply clusters of overlapping sequences.

ESTs: Expressed Sequence Tags are partial, single-pass sequences from either end of a cDNA clone. The EST strategy was developed to allow rapid identification of expressed genes by sequence analysis.

ETs: TIGR's Expressed Gene Anatomy Database (EGAD) contains a non-redundant set of nucleotide sequences that represent mature transcripts (ETs). The ETs are curated for nomenclature and cellular roles, and links have been made to related accessions. Sequences were either loaded directly from GenBank (cDNAs) or were derived from genomic sequences when cDNAs were not available. Where available, 5' and 3' non-coding regions were included. Alternative splice forms of genes are explicitly represented.

Singleton ESTs: Also referred to as singletons, these are ESTs that are not contained in an assembly. These ESTs went through the assembly process but did not meet the match criteria (see below) to be assembled with any other EST in the collection of ESTs and other GenBank sequences used to create the consensus sequences for a particular Gene Index.

Protocol for Assembly of ESTs and Transcripts

Preparation of EST data

Sequences were extracted from dbEST and were subjected to quality control screening (vector, E. coli, polyA, T, or CT removal, minimum length = 100 bp, < 3% N).

Preparation of non-redundant transcript (ET) database

All sequences from the appropriate division of GenBank were extracted. Non-coding sequences were discarded and cDNAs and coding sequences from genomic entries were saved. Redundant entries for the same gene were removed, retaining the link to the accession number. Sequences and related information are stored in TIGR's Expressed Gene Anatomy Database (EGAD). The curated ET data set is available as a multiple FastA format file. See the EGAD main page for more information.

Assembly

Cleaned EST sequences and non-redundant transcript (ET) sequences were combined. Using the CAP3 Sequence Assembly Program, sequences were assembled into contigs. TCs are consensus sequences based on two or more ESTs (and possibly an ET) that overlap for at least 40 bases with at least 95% sequence identity. These strict criteria help minimize the creation of chimeric contigs. These contigs are assigned a TC (Tentative Consensus) number. TCs may comprise ESTs derived from different tissues. The best hits for TCs were assigned by searching the TC set against a non-redundant amino acid database (nraa) using a search method developed in house called DNA-Protein Search, DPS (Microbial and Comparative Genomics, Vol 1, Number 4 1996). The top five hits based on score (cutoff value of 350) were selected and displayed for each TC.

Caveats

- TCs are only as good as the ESTs underlying them; there may be unspliced or chimeric ESTs and thus TCs.
- There is still redundancy in the TC set because sequences must match end to end and at a certain percent identity to be combined.
- Directionality of the TCs should not be assumed.
- Not all TCs contain protein-coding regions.
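To make the numeric thresholds in the protocol above concrete, here is a minimal Python sketch of the two filters it states: the EST screen (minimum length 100 bp, less than 3% N) and the overlap criteria (at least 40 bases at 95% identity or better). It is not TIGR's pipeline and not how CAP3 works internally. The function names are hypothetical, the contaminant trimming steps (vector, E. coli, polyA) are not modelled, and the overlap test is a simple ungapped suffix/prefix comparison, whereas a real assembler aligns with gaps.

```python
from typing import Optional, Tuple


def passes_qc(seq: str, min_length: int = 100, max_n_fraction: float = 0.03) -> bool:
    """Length and N-content screen applied to an already trimmed EST."""
    seq = seq.upper()
    if len(seq) < min_length:
        return False
    # The protocol states "< 3% N", so the comparison is strict.
    return seq.count("N") / len(seq) < max_n_fraction


def best_overlap(a: str, b: str, min_overlap: int = 40) -> Optional[Tuple[int, float]]:
    """Best ungapped overlap between the end of `a` and the start of `b`.

    Returns (overlap_length, identity), or None when the sequences are too
    short to overlap by `min_overlap` bases. Simplification: no gaps.
    """
    best = None
    for k in range(min_overlap, min(len(a), len(b)) + 1):
        tail, head = a[-k:], b[:k]
        identity = sum(x == y for x, y in zip(tail, head)) / k
        if best is None or identity > best[1]:
            best = (k, identity)
    return best


def can_join(a: str, b: str, min_overlap: int = 40, min_identity: float = 0.95) -> bool:
    """True if `a` and `b` meet the stated overlap and identity thresholds."""
    hit = best_overlap(a, b, min_overlap)
    return hit is not None and hit[1] >= min_identity


if __name__ == "__main__":
    shared = "ACGT" * 15                 # 60 bases shared by both reads
    left = "G" * 50 + shared
    right = shared + "T" * 50
    print(passes_qc(left), can_join(left, right))
```

Because the comparison only looks at one orientation and one end of each read, it also illustrates the caveats above: reads in the opposite orientation, or reads that differ beyond the identity threshold, stay separate and leave redundancy in the set.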
