
云之南

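The sounds of wind, of rain, and of reading aloud all reach my ears; the affairs of family, of state, and of the world all concern me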

 
 
 


 
 
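About me

Background: computer science. Research interests: JavaEE web software development, bioinformatics, data mining and machine learning, intelligent information systems. Current work: genomics, transcriptomics, NGS high-throughput data analysis, biological data mining, plant phylogenetics and comparative evolutionary genomics.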

N50 vs L50 Statistic (Genome assembly)

2010-04-27 21:14:20 | Category: Bioinformatics


http://en.wikipedia.org/wiki/N50_Statistic_%28Genome_assembly%29

http://www.zer00ne.com/category/statistics/

N50 Statistic (Genome assembly)

From Wikipedia, the free encyclopedia


N50 is a statistical measure of average length of a set of sequences. It is used widely in genomics, especially in reference to contig lengths within a draft assembly.

Definition

N50 is defined as the contig length such that using equal or longer contigs produces half the bases of the genome. The N50 size is computed by sorting all contigs from largest to smallest and finding the minimum set of contigs whose sizes total at least 50% of the entire genome; the N50 is the length of the smallest contig in that set.

The N50 size is the length such that 50% of the assembled genome lies in blocks of the N50 size or longer.

Alternative definition: Given a set of sequences of varying lengths, the N50 length is defined as the largest length N for which at least 50% of all bases in the sequences are in a sequence of length L ≥ N.
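One way to write these definitions down formally is sketched here. The symbols \ell_i, T, and k^* are notation introduced for illustration and do not come from the quoted sources, and the total T is taken to be the summed contig length, which is an assumption when the true genome size differs from the assembly size.

```latex
\[
\ell_1 \ge \ell_2 \ge \dots \ge \ell_n, \qquad
T = \sum_{i=1}^{n} \ell_i, \qquad
k^{*} = \min\Bigl\{\, k : \sum_{i=1}^{k} \ell_i \ge \tfrac{T}{2} \Bigr\}, \qquad
\mathrm{N50} = \ell_{k^{*}} .
\]
```

Here the contigs are indexed from longest to shortest, and N50 is the length of the contig at which the running sum first reaches half of the total.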



The N50 contig size is a weighted median value and defined as
the length of the smallest contig S in the sorted list of all
contigs where the cumulative length from the largest contig to
contig S is at least 50% of the total length


The N90 contig size is a weighted median value and defined as
the length of the smallest contig S in the sorted list of all
contigs where the cumulative length from the largest contig to
contig S is at least 90% of the total length

http://seqanswers.com/forums/showthread.php?t=2766
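The N50 and N90 definitions above differ only in the percentage, so they can be sketched as a single function. This is a minimal illustration, not code from the cited thread: the function name nx and the example lengths are hypothetical, and the total is assumed to be the summed contig length.

```python
def nx(lengths, x):
    """Return the Nx value: the length of the smallest contig S such that
    the cumulative length from the largest contig down to S is at least
    x percent of the total length (x=50 gives N50, x=90 gives N90)."""
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    ordered = sorted(lengths, reverse=True)  # largest contig first
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical contig lengths, total = 30.
example = [10, 8, 6, 4, 2]
print(nx(example, 50))  # 10 + 8 = 18 >= 15, so N50 = 8
print(nx(example, 90))  # 10 + 8 + 6 + 4 = 28 >= 27, so N90 = 4
```

Because a larger percentage requires walking further down the sorted list, N90 can never be larger than N50 for the same set of contigs.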


How to calculate the N50 contig size?

Posted on January 25th, 2010 under Statistics by admin

1.) N50 is a statistical measure of average length of a set of sequences. It is used widely in genomics, especially in reference to contig or supercontig lengths within a draft assembly.

http://www.broadinstitute.org/crd/wiki/index.php/N50

2.) The N50 size of a set of entities (e.g., contigs or scaffolds) represents the largest entity E such that at least half of the total size of the entities is contained in entities of size E or larger. For example, if we have a collection of contigs with sizes 7, 4, 3, 2, 2, 1, and 1 kb (total size = 20 kb), the N50 length is 4 kb because we can cover at least 10 kb with contigs of 4 kb or larger (7 + 4 = 11 kb), but not with contigs larger than 4 kb (only 7 kb).

http://www.cbcb.umd.edu/research/castats.shtml

3.) N50 contig size is the value X such that at least half of the genome is contained in contigs of size X or larger. N50 contig count is the number of contigs of size X or larger.

Zimin et al. Genome Biology 2009 10:R42 doi:10.1186/gb-2009-10-4-r42

4.)  N50 size is a very useful statistic for comparing genome assemblies

 

5.) N50 length is the length of the shortest contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all contigs.
For example's sake, let's say an assembler has created contigs of the following lengths (in descending order):

91 77 70 69 62 56 45 29 16 4

The sum of these is 519 bp, so the sum of all contigs equal to or greater than N50 must be equal to or greater than 519/2, or 259.5 bp.
We can see by brute force that
91+77=168
91+77+70=238
91+77+70+69=307 (that’ll do)
so the N50 for this assembly is 69 bp

Another way to look at this: at least half the nucleotides in this assembly belong to contigs of size 69 bp or longer.

http://jermdemo.blogspot.com/2008/11/calculating-n50-from-velvet-output.html

6.) N50 length : the largest length such that 50% of all base-pairs are contained in contigs of this length or larger

http://www.sciencemag.org/cgi/content/full/291/5507/1257#ref1

7.) N50 is the contig size such that the contigs of that size or larger contain 50% of the bases of the assembly.

http://www.genome.umd.edu/reconciliation.htm
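The definitions collected in this list come in two flavours: "sort and accumulate until half the total is reached" (as in item 5) and "the largest length such that contigs of that length or longer hold at least half the bases" (as in item 6). The sketch below implements both and checks them against the two worked examples quoted above. The function names are hypothetical and the total is assumed to be the summed contig length.

```python
def n50_cumulative(lengths):
    """Sort descending and accumulate until at least half the total is covered."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must not be empty")


def n50_largest_length(lengths):
    """Largest length L such that contigs of length >= L hold at least half the bases.

    Only observed contig lengths need to be tried, because the covered
    total changes only at those values."""
    total = sum(lengths)
    for candidate in sorted(set(lengths), reverse=True):
        covered = sum(length for length in lengths if length >= candidate)
        if 2 * covered >= total:
            return candidate
    raise ValueError("lengths must not be empty")


kb_example = [7, 4, 3, 2, 2, 1, 1]                       # item 2, total 20 kb
bp_example = [91, 77, 70, 69, 62, 56, 45, 29, 16, 4]     # item 5, total 519 bp

print(n50_cumulative(kb_example), n50_largest_length(kb_example))  # 4 4
print(n50_cumulative(bp_example), n50_largest_length(bp_example))  # 69 69
```

Both formulations give 4 kb and 69 bp, matching the values stated in the quoted examples.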



L50

http://www.acgt.me/blog/2015/6/11/l50-vs-n50-thats-another-fine-mess-that-bioinformatics-got-us-into

https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics

N50 is a statistic that is widely used to describe genome assemblies. It describes an average length of a set of sequences, but the average is not the mean or median length. Rather, it is the length of the sequence that takes the sum length of all sequences (when summing from longest to shortest) past 50% of the total size of the assembly. The reason for using N50, rather than the mean or median length, is something that I've written about before in detail.

The number of sequences evaluated at the point when the sum length exceeds 50% of the assembly size is sometimes referred to as the L50 number. Admittedly, this is somewhat confusing: N50 describes a sequence length whereas L50 describes a number of sequences. This oddity has led to many people inverting the usage of these terms. This doesn't help anyone and leads to confusion and to debate.
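To make the distinction concrete, here is a minimal sketch that returns both values in one pass: N50 as a length and L50 as a count. The function name n50_and_l50 is hypothetical, and the sketch assumes that reaching exactly 50% of the total counts as crossing the threshold, in line with the "at least 50%" definitions quoted earlier.

```python
def n50_and_l50(lengths):
    """Return (N50, L50): the length of the contig that takes the running
    sum to at least half the total, and how many contigs were needed."""
    ordered = sorted(lengths, reverse=True)  # longest to shortest
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if 2 * running >= total:
            return length, count
    raise ValueError("lengths must not be empty")


# The ten contig lengths from the worked example quoted earlier (total 519 bp).
contigs = [91, 77, 70, 69, 62, 56, 45, 29, 16, 4]
n50, l50 = n50_and_l50(contigs)
print(n50, l50)  # 69 4: four contigs (91 + 77 + 70 + 69 = 307) pass 259.5
```

In this usage N50 carries units of base pairs while L50 is a plain count of sequences.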

I believe that the aforementioned definition of N50 was first used in the 2001 publication of the human genome sequence:

We used a statistic called the ‘N50 length’, defined as the largest length L such that 50% of all nucleotides are contained in contigs of size at least L.

I also have a vague memory that other genome sequences — that were made available by the Sanger Institute around this time — also included statistics such as N60, N70, N80 etc. (at least I recall seeing these details in README files on an FTP site). Deanna also pointed out that the Celera Human Genome paper (published in Science, also in 2001) describes something that we might call N25 and N90, even though they didn't use these terms in the paper:

More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger.

I don't know when L50 first started being used to describe lengths, but I would bet it was after 2001. If I'm wrong, please comment below and maybe we can settle this once and for all. Without evidence for an earlier use of L50 to describe lengths, I think people should stick to the 2001 definition of N50 (which I would also argue is the most common definition in use today).

References:
A beginner’s guide to eukaryotic genome annotation