云之南

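The sounds of wind, rain, and reading aloud all reach the ear; the affairs of family, state, and the world all concern the heart.

About the author

Background: computer science. Research interests: JavaEE web software development, bioinformatics, data mining and machine learning, and intelligent information systems. Current work: genomes, transcriptomes, high-throughput NGS data analysis, biological data mining, and plant phylogenetics and comparative evolutionary genomics.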

Major English Bioinformatics Terms and Their Definitions

2009-11-05 21:28:08 | Category: Bioinformatics

P value (P值/概率值)
The probability of an alignment occurring with the score in question or better. The p value is calculated by relating the observed alignment score, S, to the expected distribution of HSP scores from comparisons of random sequences of the same length and composition as the query to the database. The most highly significant P values will be those close to 0. P values and E values are different ways of representing the significance of the alignment.
Pair-wise sequence alignment(双序列联配)
An alignment performed between two sequences.
PAM (可接受突变百分率/可以观察到的突变百分率; can serve as a unit of evolutionary time)
Percent Accepted Mutation. A unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein sequence. One PAM unit is the amount of evolution that will change, on average, 1% of the amino acids in a protein sequence. A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence.
Paralogous (旁系同源)
Homologous sequences within a single species that arose by gene duplication. Genes that are related through gene duplication events. These events may lead to the production of a family of related proteins with similar biological functions within a species. Paralogous gene families within a species are identified by using an individual protein as a query in a database similarity search of the entire proteome of an organism. The process is repeated for the entire proteome and the resulting sets of related proteins are then searched for clusters that are most likely to have a conserved domain structure and should represent a paralogous gene family.
Parametric sequence alignment
An algorithm that finds a range of possible alignments based on varying the parameters of the scoring system for matches, mismatches, and gap penalties. An example is the Bayes block aligner.
PDB (one of the major protein structure databases)
Brookhaven Protein Data Bank. A database and file format describing the 3D structure of a protein or nucleic acid, as determined by X-ray crystallography or nuclear magnetic resonance (NMR) imaging. The molecules described by the files are usually viewed locally by dedicated software, but can sometimes be visualised on the world wide web.
Pearson correlation coefficient(Pearson相关系数)
A measure of the correlation between two variables that reflects the degree to which the two variables are related. For example, the coefficient is used as a measure of similarity of gene expression in a microarray experiment. See also Correlation coefficient.
Percent identity
The percentage of the columns in an alignment of two sequences that includes identical amino acids. Columns in the alignment that include gaps are not scored in the calculation.
Percent similarity(相似百分率)
The percentage of the columns in an alignment of two sequences that includes either identical amino acids or amino acids that are frequently found substituted for each other in sequences of related proteins (conservative substitutions). These substitutions may be found in an amino acid substitution matrix such as the Dayhoff PAM and Henikoff BLOSUM matrices. Columns in the alignment that include gaps are not scored in the calculation.
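As a rough illustration of the two percentages defined above, the sketch below scores a pair of already-aligned sequences. The `CONSERVATIVE_GROUPS` table is a hypothetical stand-in for this example only; a real analysis would take conservative substitutions from a PAM or BLOSUM matrix.

```python
# Hypothetical groups of amino acids treated as conservative substitutions.
# A real analysis would use positive scores from a PAM or BLOSUM matrix instead.
CONSERVATIVE_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]


def _is_conservative(a, b):
    """True when both residues fall in the same illustrative group."""
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def percent_identity_and_similarity(seq1, seq2, gap="-"):
    """Return (percent identity, percent similarity) for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    # Columns that contain a gap are not scored.
    columns = [(a, b) for a, b in zip(seq1, seq2) if a != gap and b != gap]
    if not columns:
        return 0.0, 0.0
    identical = sum(a == b for a, b in columns)
    similar = sum(a == b or _is_conservative(a, b) for a, b in columns)
    return 100.0 * identical / len(columns), 100.0 * similar / len(columns)
```

For example, `percent_identity_and_similarity("MKV-LDE", "MRVALDD")` scores six gap-free columns, four of which are identical and two of which are conservative under the toy groups.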
Perceptron(感知器; a pattern-recognition machine modeled on the control system of the human optic nerve)
A neural network in which input and output states are directly connected without intervening hidden layers.
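A minimal sketch of such a network, assuming binary targets and the classic perceptron learning rule (the function names, learning rate, and epoch count are choices made for this example, not part of the glossary):

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Train a perceptron with no hidden layer.

    samples is a list of (inputs, target) pairs with target 0 or 1.
    """
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = predict(weights, bias, inputs)
            error = target - output
            # Inputs connect directly to the output, so one rule updates everything.
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


def predict(weights, bias, inputs):
    """Threshold the weighted sum of the inputs."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0
```

Because there is no hidden layer, this only learns linearly separable functions such as logical AND.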
PHRED (a widely used program for analysing raw sequence that calls each base and assesses its quality)
A widely used computer program that analyses raw sequence to produce a 'base call' with an associated 'quality score' for each position in the sequence. A PHRED quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a PHRED quality score of 30 corresponds to 99.9% accuracy for the base call in the raw read.
PHRAP (a widely used program for assembling raw sequence)
A widely used computer program that assembles raw sequence into sequence contigs and assigns to each position in the sequence an associated 'quality score', on the basis of the PHRED scores of the raw sequence reads. A PHRAP quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a PHRAP quality score of 30 corresponds to 99.9% accuracy for a base in the assembled sequence.
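The relationship between a quality score and its error probability can be sketched in a few lines. The helper names are illustrative; summing the per-base probabilities to get an expected error count is a common convenience, not something PHRED or PHRAP is stated to report.

```python
def error_probability(quality):
    """Approximate error probability for a PHRED- or PHRAP-style quality score."""
    return 10 ** (-quality / 10)


def expected_errors(qualities):
    """Expected number of wrong bases in a read, given per-base quality scores."""
    return sum(error_probability(q) for q in qualities)
```

A score of 30 gives an error probability of 0.001, which is the 99.9% accuracy quoted above.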
Phylogenetic studies(系统发育研究)
PIR (one of the major protein sequence databases, translated from GenBank)
A database of translated GenBank nucleotide sequences. PIR is a redundant (see Redundancy) protein sequence database. The database is divided into four categories:
PIR1 - Classified and annotated.
PIR2 - Annotated.
PIR3 - Unverified.
PIR4 - Unencoded or untranslated.
Poisson distribution(帕松分布)
Used to predict the occurrence of infrequent events over a long period of time or when there are a large number of trials. In sequence analysis, it is used to calculate the chance that one pair of a large number of pairs of unrelated sequences may give a high local alignment score.
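As a small sketch of that calculation: if `lam` is the expected number of high-scoring chance alignments (an assumed input here), the Poisson distribution gives the probability of seeing exactly `k` of them, and of seeing at least one.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def prob_at_least_one(lam):
    """Probability that one or more events occur, i.e. 1 - P(0 events)."""
    return 1.0 - math.exp(-lam)
```

When `lam` is small, `prob_at_least_one(lam)` is close to `lam` itself.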
Position-specific scoring matrix (PSSM)(特定位点记分矩阵; used by search programs such as PSI-BLAST)
The PSSM gives the log-odds score for finding a particular matching amino acid in a target sequence. It represents the variation found in the columns of an alignment of a set of related sequences. Each subsequent matrix column corresponds to the next column in the alignment and each row corresponds to a particular sequence character (one of four bases in DNA sequences or 20 amino acids in protein sequences). Matrix values are log-odds scores obtained by taking the count of each residue in an alignment column, dividing by the expected number of counts based on sequence composition, and converting the ratio to a log score. The matrix is moved along sequences to find similar regions by adding the matching log-odds scores and looking for high values. There is no allowance for gaps. Also called a weight matrix or scoring matrix.
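The construction and scanning steps described above might be sketched as follows. This is a simplified illustration: it assumes gap-free aligned sequences, a uniform background composition, and a pseudocount of 1 per residue, none of which is prescribed by the definition.

```python
import math


def build_pssm(aligned_seqs, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds PSSM: one dict of residue scores per alignment column."""
    background = 1.0 / len(alphabet)  # assumption: uniform sequence composition
    n = len(aligned_seqs)
    pssm = []
    for column in zip(*aligned_seqs):
        scores = {}
        for residue in alphabet:
            count = column.count(residue) + pseudocount
            freq = count / (n + pseudocount * len(alphabet))
            scores[residue] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def scan(pssm, sequence):
    """Slide the matrix along the sequence (no gaps); return (best score, start)."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col[res] for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

For instance, a matrix built from `["TATA", "TATA", "TACA", "TATT"]` scores highest at the `TATA` window of `"GGCTATAGG"`.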
Posterior (Bayesian analysis)
A conditional probability based on prior knowledge and newly evaluated relationships among variables using Bayes rule. See also Bayes rule.
Prior (Bayesian analysis)
The expected distribution of a variable based on previous data.
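A minimal sketch of how a prior is turned into a posterior with Bayes rule. The hypothesis names and numbers in the usage note are invented purely for illustration.

```python
def posterior(prior, likelihood):
    """Apply Bayes rule over a set of hypotheses.

    prior[h] is P(h); likelihood[h] is P(data | h). Returns P(h | data).
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(data)
    return {h: value / evidence for h, value in unnormalized.items()}
```

With a prior of 0.1 that two sequences are related, and data that is 16 times more likely under "related" (0.8 versus 0.05), the posterior probability of "related" rises to 0.64.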
Profile(分布型)
A matrix representation of a conserved region in a multiple sequence alignment that allows for gaps in the alignment. The rows include scores for matching sequential columns of the alignment to a test sequence. The columns include substitution scores for amino acids and gap penalties. See also PSSM.
Profile hidden Markov model(分布型隐马尔可夫模型)
A hidden Markov model of a conserved region in a multiple sequence alignment that includes gaps and may be used to search new sequences for similarity to the aligned sequences.
Proteome (蛋白质组)
The entire collection of proteins that are encoded by the genome of an organism. Initially the proteome is estimated by gene prediction and annotation methods but eventually will be revised as more information on the sequence of the expressed genes is obtained.
Proteomics (蛋白质组学)
Systematic analysis of protein expression of normal and diseased tissues that involves the separation, identification and characterization of all of the proteins in an organism.
Pseudocounts
A small number of counts added to the columns of a scoring matrix, either to avoid zero counts or to add more variation than was found in the sequences used to produce the matrix.
PSI-BLAST (one of the BLAST family of programs)
Position-Specific Iterative BLAST. An iterative search using the BLAST algorithm. A profile is built after the initial search and is then used in subsequent searches. The process may be repeated, if desired, with the new sequences found in each cycle used to refine the profile. (Altschul et al.)
PSSM (特定位点记分矩阵)
See position-specific scoring matrix and profile.
Public sequence databases (公共序列数据库; refers to GenBank, EMBL, and DDBJ)
The three coordinated international sequence databases: GenBank, the EMBL data library and DDBJ.
Q20 (Quality score 20)
A quality score of 20 or higher indicates that there is at most a 1 in 100 chance that the base call is incorrect. These are consequently high-quality bases. Specifically, the quality value "q" assigned to a basecall is defined as:
q = -10 x log10(p)
where p is the estimated error probability for that basecall. Note that high quality values correspond to low error probabilities, and conversely.
Quality trimming
This is an algorithm that uses a sliding window of 50 bases and trims from the 5' end of the read, followed by the 3' end. Within each window, the number of low-quality bases (quality 10 or less) is counted. If more than 5 bases are below the threshold quality, the window is advanced by one base and the process is repeated. When a window no longer fails the low-quality test, the position where it stopped is recorded. The parameters for window length, low-quality threshold, and number of low-quality bases tolerated are fixed. The positions of the 5' and 3' boundaries of the quality region are noted in the plot of quality values presented in the "Chromatogram Details" report.
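One possible reading of this procedure is sketched below. The description leaves the exact boundary convention open, so this version assumes the quality region runs from the start of the first acceptable window (scanning from the 5' end) to the end of the last acceptable window (scanning from the 3' end). The function name and return format are hypothetical.

```python
def quality_trim(qualities, window=50, low_q=10, max_low=5):
    """Return (start, end) of the quality region as a half-open interval, or None."""
    n = len(qualities)
    if n < window:
        return None

    def acceptable(start):
        # A window passes when it holds no more than max_low low-quality bases.
        low = sum(1 for q in qualities[start:start + window] if q <= low_q)
        return low <= max_low

    # Advance one base at a time from the 5' end until a window passes.
    left = next((s for s in range(n - window + 1) if acceptable(s)), None)
    if left is None:
        return None
    # Repeat from the 3' end, moving toward the 5' end.
    right = next(s for s in range(n - window, -1, -1) if acceptable(s))
    return left, right + window
```

The defaults mirror the fixed parameters in the description; smaller values are convenient for experimenting with short toy reads.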
Query (待查序列/搜索序列)
The input sequence (or other type of search term) with which all of the entries in a database are to be compared.
Radiation hybrid (RH) map (辐射杂交图谱)
A genome map in which STSs are positioned relative to one another on the basis of the frequency with which they are separated by radiation-induced breaks. The frequency is assayed by analysing a panel of human–hamster hybrid cell lines, each produced by lethally irradiating human cells and fusing them with recipient hamster cells such that each carries a collection of human chromosomal fragments. The unit of distance is centirays (cR), denoting a 1% chance of a break occurring between two loci.
Raw Score (初值; the alignment score S as first obtained)
The score of an alignment, S, calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (see PAM, BLOSUM). Gap scores are typically calculated from G, the gap opening penalty, and L, the gap extension penalty. For a gap of length n, the gap cost would be G+Ln. The choice of gap costs G and L is empirical, but it is customary to choose a high value for G (10-15) and a low value for L (1-2).
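The G+Ln gap cost and the resulting raw score can be sketched as follows. The `substitution` argument is a caller-supplied scoring function standing in for a PAM or BLOSUM look-up, and the defaults G=11 and L=1 are simply values within the customary ranges mentioned above. For simplicity, adjacent gap columns are treated as one gap even if they fall in different sequences.

```python
def gap_cost(length, gap_open=11, gap_extend=1):
    """Cost of a gap of the given length under the G + L*n scheme."""
    return gap_open + gap_extend * length


def raw_score(seq1, seq2, substitution, gap_open=11, gap_extend=1, gap="-"):
    """Sum substitution scores and subtract gap costs for two aligned sequences."""
    score = 0
    run = 0  # length of the gap currently being read
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            run += 1
            continue
        if run:
            score -= gap_cost(run, gap_open, gap_extend)
            run = 0
        score += substitution(a, b)
    if run:  # a gap that reaches the end of the alignment
        score -= gap_cost(run, gap_open, gap_extend)
    return score
```

With a toy scoring function of +5 for a match and -4 for a mismatch, aligning `ACGT--AC` to `ACGTTTAC` gives 6 matches (30) minus one gap of length 2 (13), for a raw score of 17.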
Raw sequence (原始序列/读胶序列)
Individual unassembled sequence reads, produced by sequencing of clones containing DNA inserts.
Receiver operator characteristic
The receiver operator characteristic (ROC) curve describes the probability that a test will correctly declare the condition present against the probability that the test will declare the condition present when actually absent. This is shown through a graph of the test's sensitivity against one minus the test specificity for different possible threshold values.
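A short sketch of how the points on such a curve might be computed from scored test cases. It assumes that a higher score means "condition present" and that both positive and negative cases exist; the function name is illustrative.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per distinct threshold.

    labels[i] is True when the condition is actually present; a case is
    declared positive when its score is at or above the threshold.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points
```

Plotting the second value of each pair against the first gives the ROC curve.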
Redundancy (冗余)
The presence of more than one identical item represents redundancy. In bioinformatics, the term is used with reference to the sequences in a sequence database. If a database is described as being redundant, more than one identical (redundant) sequence may be found. If the database is said to be non-redundant (nr), the database managers have attempted to reduce the redundancy. The term is ambiguous with reference to genetics, and as such, the degree of non-redundancy varies according to the database manager's interpretation of the term. One can argue whether or not two alleles of a locus define the limit of redundancy, or whether the same locus in different, closely related organisms constitutes redundancy. Non-redundant databases are, in some ways, superior, but are less complete. These factors should be taken into consideration when selecting a database to search.
Regular expressions
This computational tool provides a method for expressing the variations found in a set of related sequences including a range of choices at one position, insertions, repeats, and so on. For example, these expressions are used to characterize variations found in protein domains in the PROSITE catalog.
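To make this concrete, the sketch below converts a simple PROSITE-style pattern into a Python regular expression. It handles only a small subset of the syntax (`x`, `[..]`, `{..}`, and `(n)` or `(n,m)` repeat counts), and the converter itself is a hypothetical helper written for this example.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ABC} means "any residue except A, B, or C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) is a repeat count.
        element = element.replace("(", "{").replace(")", "}")
        # x means "any residue".
        element = element.replace("x", ".")
        parts.append(element)
    return "".join(parts)
```

For example, the N-glycosylation site pattern `N-{P}-[ST]-{P}` becomes `N[^P][ST][^P]`, which `re.finditer` can then locate in a protein sequence.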
Regularization
A set of techniques for reducing data overfitting when training a model. See also Overfitting.
Relational database(关系数据库)
Organizes information into tables where each column represents a field of information that can be stored in a single record. Each row in the table corresponds to a single record. A single database can have many tables and a query language is used to access the data. See also Object-oriented database.
Scaffold (支架; assembled by joining contigs)
The result of connecting contigs by linking information from paired-end reads from plasmids, paired-end reads from BACs, known messenger RNAs or other sources. The contigs in a scaffold are ordered and oriented with respect to one another.

Scoring matrix(记分矩阵)
See Position-specific scoring matrix.
SEG (a program for filtering low-complexity regions in protein sequences)
A program for filtering low complexity regions in amino acid sequences. Residues that have been masked are represented as "X" in an alignment. SEG filtering is performed by default in the blastp subroutine of BLAST 2.0. (Wootton and Federhen)
Selectivity (in database similarity searches)(数据库相似性搜索的选择准确性)
The ability of a search method to locate members of a protein family without making a false-positive classification of members of other families.
Sensitivity (in database similarity searches)(数据库相似性搜索的灵敏性)
The ability of a search method to locate as many members of a protein family as possible, including distant members of limited sequence similarity.
Sequence Tagged Site (序列标签位点)
Short cDNA sequences of regions that have been physically mapped. STSs provide unique landmarks, or identifiers, throughout the genome. Useful as a framework for further sequencing.
Significance(显著水平)
A significant result is one that has not simply occurred by chance, and therefore is probably true. Significance levels show how likely a result is due to chance, expressed as a probability. In sequence analysis, the significance of an alignment score may be calculated as the chance that such a score would be found between random or unrelated sequences. See Expect value.
Similarity score (sequence alignment) (相似性值)
Similarity means the extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST, similarity refers to a positive matrix score. The similarity score is the sum of the number of identical matches and conservative (high-scoring) substitutions in a sequence alignment divided by the total number of aligned sequence characters. Gaps are usually ignored.
Simulated annealing
A search algorithm that attempts to solve the problem of finding global extrema. The algorithm was inspired by the physical cooling process of metals and the freezing process in liquids where atoms slow down in movement and line up to form a crystal. The algorithm traverses the energy levels of a function, always accepting energy levels that are smaller than previous ones, but sometimes accepting energy levels that are greater, according to the Boltzmann probability distribution.
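A minimal sketch of the idea for a one-dimensional energy function. The cooling schedule, step size, and fixed random seed are arbitrary choices for this illustration.

```python
import math
import random


def simulated_annealing(energy, start, steps=5000, temp=1.0,
                        cooling=0.999, step_size=0.5, seed=0):
    """Search for a minimum of energy(x); return (best x, best energy)."""
    rng = random.Random(seed)
    current, current_e = start, energy(start)
    best, best_e = current, current_e
    for _ in range(steps):
        candidate = current + rng.uniform(-step_size, step_size)
        candidate_e = energy(candidate)
        delta = candidate_e - current_e
        # Always accept lower energy; accept higher energy with
        # Boltzmann probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_e = candidate, candidate_e
            if current_e < best_e:
                best, best_e = current, current_e
        temp *= cooling  # gradual cooling makes uphill moves rarer
    return best, best_e
```

Calling `simulated_annealing(lambda x: (x - 2) ** 2, start=-5.0)` should end close to the minimum at x = 2.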
Single-linkage cluster analysis
An analysis of a group of related objects, e.g., similar proteins in different genomes to identify both close and more distant relationships, represented on a tree or dendrogram. The method joins the most closely related pairs by the neighbor-joining algorithm by representing these pairs as outer branches on the tree. More distant objects are then progressively added to lower tree branches. The method is also used to predict phylogenetic relationships by distance methods. See also Hierarchical clustering, Neighbor-joining method.
Smith-Waterman algorithm(Smith-Waterman算法)
Uses dynamic programming to find local alignments between sequences. The key feature is that all negative scores calculated in the dynamic programming matrix are changed to zero in order to avoid extending poorly scoring alignments and to assist in identifying local alignments starting and stopping anywhere within the matrix.
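A compact sketch of the algorithm, simplified to a fixed match/mismatch score and a linear gap penalty (the original method allows general substitution matrices and gap functions; the parameter values here are arbitrary):

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned seq1 segment, aligned seq2 segment)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            # Negative scores are replaced by zero, so an alignment can start anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            a1.append(seq1[i - 1])
            a2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            a1.append(seq1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))
```

For `"TTTACGTAAA"` and `"GGACGTCC"` the best local alignment is the shared `ACGT` segment.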
SNP (单核苷酸多态性)
Single nucleotide polymorphism, or a single nucleotide position in the genome sequence for which two or more alternative alleles are present at appreciable frequency (traditionally, at least 1%) in the human population.
Space or time complexity(时间或空间复杂性)
An algorithm's complexity is the maximum amount of computer memory or time required by the algorithmic steps needed to solve a problem.
Specificity (in database similarity searches)(数据库相似性搜索的特异性)
The ability of a search method to locate members of one protein family, including distantly related members.
SSR (简单序列重复)
Simple sequence repeat, a sequence consisting largely of a tandem repeat of a specific k-mer (such as (CA)15). Many SSRs are polymorphic and have been widely used in genetic mapping.
Stochastic context-free grammar
A formal representation of groups of symbols in different parts of a sequence; i.e., not in the same context. An example is complementary regions in RNA that will form secondary structures. The stochastic feature introduces variability into such regions.
Stringency
Refers to the minimum number of matches required within a window. See also Filtering.
STS (abbreviation of Sequence Tagged Site)
See Sequence Tagged Site
Substitution (替换)
The presence of a non-identical amino acid at a given position in an alignment. If the aligned residues have similar physico-chemical properties the substitution is said to be "conservative".
Substitution Matrix (替换矩阵)
A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Such matrices are constructed by assembling a large and diverse sample of verified pairwise alignments of amino acids. If the sample is large enough to be statistically significant, the resulting matrices should reflect the true probabilities of mutations occurring through a period of evolution.
Sum of pairs method
Sums the substitution scores of all possible pair-wise combinations of sequence characters in one column of a multiple sequence alignment.
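The sum of pairs calculation for a single column might look like the sketch below. The `substitution` function is supplied by the caller, and scoring any pair that involves a gap as `gap_score` (0 by default) is an assumption, since the definition does not say how gaps are treated.

```python
from itertools import combinations


def sum_of_pairs(column, substitution, gap="-", gap_score=0):
    """Sum substitution scores over every pair of characters in one column."""
    total = 0
    for a, b in combinations(column, 2):
        if a == gap or b == gap:
            total += gap_score  # assumption: gap pairs get a fixed score
        else:
            total += substitution(a, b)
    return total
```

A column of four sequences contributes six pairs; with +1 for a match and -1 for a mismatch, the column `AAAC` scores 3 - 3 = 0.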
SWISS-PROT (one of the major protein sequence databases)
A non-redundant (See Redundancy) protein sequence database. Thoroughly annotated and cross referenced. A subdivision is TrEMBL.
Synteny
The presence of a set of homologous genes in the same order on two genomes.
Threading
In protein structure prediction, the aligning of the sequence of a protein of unknown structure with a known three-dimensional structure to determine whether the amino acid sequence is spatially and chemically compatible with that structure.
TrEMBL (a protein database, translated from EMBL)
A protein sequence database of Translated EMBL nucleotide sequences.
Uncertainty(不确定性)
From information theory, a logarithmic measure of the average number of choices that must be made for identification purposes. See also Information content.
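One common way to compute this measure is the Shannon entropy of the symbols in an alignment column, sketched here in bits. Using observed frequencies directly, with no small-sample correction, is a simplification.

```python
import math
from collections import Counter


def uncertainty(column):
    """Shannon uncertainty (in bits) of the symbols observed in one column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A column with four equally frequent bases has an uncertainty of 2 bits, while a perfectly conserved column has none.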
Unified Modeling Language (UML)
A standard sanctioned by the Object Management Group that provides a formal notation for describing object-oriented design.
UniGene (a database of human genes)
Database of unique human genes, at NCBI. Entries are selected by near identical presence in GenBank and dbEST databases. The clusters of sequences produced are considered to represent a single gene.
Unitary Matrix (一元矩阵)
Also known as Identity Matrix. A scoring system in which only identical characters receive a positive score.
URL(统一资源定位符)
Uniform resource locator.
Viterbi algorithm
Calculates the optimal path of a sequence through a hidden Markov model of sequences using a dynamic programming algorithm.
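A small sketch of the algorithm in log space. It assumes every start, transition, and emission probability is nonzero (otherwise `math.log` fails), and the two-state model in the usage note is a toy invented for this example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, most probable state path) for the observations."""
    # V[t][s]: best log probability of any path that ends in state s at step t.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
          for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (V[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        V.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return V[-1][last], path
```

With an "AT" state that favors A and T, a "GC" state that favors G and C, and a 0.9 probability of staying in the same state, the sequence `AATTGGCC` is decoded as four AT states followed by four GC states.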
Weight matrix
See Position-specific scoring matrix.