注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

云之南

风声,雨声,读书声,声声入耳;家事,国事,天下事,事事关心

 
 
 

日志

 
 
关于我

专业背景:计算机科学 研究方向与兴趣: JavaEE-Web软件开发, 生物信息学, 数据挖掘与机器学习, 智能信息系统 目前工作: 基因组, 转录组, NGS高通量数据分析, 生物数据挖掘, 植物系统发育和比较进化基因组学

网易考拉推荐

The New K-mer Counter and a Genome Assembly  

2014-07-15 14:52:43|  分类: 生信分析软件 |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |

Now that Sebastian Deorowicz et al. published a damn good k-mer counter, let us put it to some real work. The first half of this two-part series will present some benchmarks on the program, and the second half will discuss our use on a real-life genome assembly problem.

Installation

The installation of kmc 2 is fairly straightforward. We went to their website, downloaded the source code, compiled and installed the program. The compilation needs Boost library, and if you have it properly installed, everything else should go well to give you two programs – kmc and kmc_dump. They also have pre-compiled binaries, but we downloaded the source code to be able to poke around in the near future.

Description of our project

We have two sets of Illumina PE reads from a vertebrate species that swims. The first set (set1) has 1,588,478,432 100nt reads and the second set (set2) has 1,376,496,476 100nt reads. We did k-mer counting in set1, set2 and set1+set2 for k=21, 27, 33, 39 and 45. We also did k-mer counting for set1 at k=51, set2 at k=51 and set1+set2 at k=63. Yes, that is a lot of k-mer counting.

This was done in a 32-core AMD server with 64-bit Linux, non-SSD hard-drive and 512 GB RAM. The amount of RAM is somewhat irrelevant for this highly efficient program.

Time

I started the k-mer counting tasks around midnight and got all results back by ~7AM. Wow! They were run serially, while the server was also running twenty blastp jobs from another user.

set1 21mer counting took 585.523s.
set2 21mer counting took 567.74s.
set1+set2 21mer counting took 1507.44s.

set1 27mer counting took 612.069s.
set2 27mer counting took 534.832s.
set1+set2 27mer counting took 1361.61s.

set1 33mer counting took 1197.23s.
set1 33mer counting took 1037.95s.
set1 33mer counting took 2440.34s.

We noticed a jump in time needed for k>32, but no big differences after that for k=33, 39, 45 and 51.

Here is the output file of first k-mer counting job (set1 for k=21) so that you have an idea about all temporary space requirements.

1st stage: 258.373s
2nd stage: 327.15s
Total : 585.523s
Tmp size : 129830MB

Stats:
No. of k-mers below min. threshold : 3342104765
No. of k-mers above max. threshold : 0
No. of unique k-mers : 4734755445
No. of unique counted k-mers : 1392650680
Total no. of k-mers : 126120517209
Total no. of reads : 1588478432
Total no. of super-k-mers : 15387969776
1st stage: 252.294s
2nd stage: 315.446s
Total : 567.74s
Tmp size : 112898MB

Command to Run and Output Files

The command kmc can be run either on a single fasta/fastq file or multiple files. The first two commands below are for single files and the third one is for a file named ‘list’, which lists both fasta files.

./kmc -k21 -m200 -fa s200-data.fa out21-200 tmp
./kmc -k21 -m200 -fa s300-data.fa out21-300 tmp
./kmc -k21 -m200 -fa @list out21-both tmp

The program kmc produces two binary output files (out21-200.kmc_pre – 1.1G and out21-200.kmc_suf 5.2G). Interestingly, in all k-mer counting examples we ran, the size of the ‘pre’ file varied between 1073807452 bytes and 67174492 bytes, and we do not have much clue why. This is where the code will come handy. One clue – the file size appears to be k-dependent.

The program kmc_dump takes those two binary files, and gives you the k-mer counts in two-column text format. Command to run -

./kmc_dump -cx0 out21-200 o

If the above commentary appears rather drab, the next commentary will surely add some color to it.


In the last commentary, we have shown a few examples of doing k-mer counting with the new super-efficient program – kmc-2. This commentary will provide the context. Readers unfamiliar with how to interpret k-mer distributions of short reads are also encouraged to go through an old blog post – Maximizing Utility of Available RAMs in K-mer World.

Here is our story. We started assembling two sets of Illumina libraries discussed in previous commentary using Minia and noticed something odd. At first, we assembled the set2 (see previous commentary for reference) at k=63 using Minia. Total size of all assembled contigs came to be around 700Mb with N50=982. So far, so good.

Another bioinformatician helped us assemble set1 and set2 together with Minia. Since the depth of sequencing was very high, we chose to do this assembly at k=73. The total assembly size came to be roughly the same, but, quite puzzlingly, N50 dropped to 219 !

To figure out what was going on, we took set2 and did k-mer counting at k=63. The distribution came out to be usual (unimodal with peak around 23, ignoring the noise peak near count=1). As explained in the earlier commentary, the location of the mode of the distribition is attributable to the depth of sequencing and the amount of repeats in the genome. The peak at 23 was somewhat low given how deeply this genome was sequenced (~50 was expected), but we also knew that this organism had extensive duplication. Therefore, there was nothing odd with the k-mer distribution as such.

—————————————————–

Armed with the new k-mer counter, we started to do forensic analysis on the libraries and here is what we found. The following three figures show k-mer distributions for k=21. Especially pay attention to the third chart below.

first set, k=21

set21-200

second set, k=21

set21-300

both sets, k=21

set21-both

The above chart showing k-mer distribution of the combined libraries has two distinct peaks and one at the right location (~120) derived from the coverage, whereas the other one is at half of that number. The above pattern very likely indicates high degree of heterozygosity in our sample. Two chromosomes in diploid genomes are usually very similar and therefore total k-mer coverage is unimodal. When the difference between the chromosomes start to increase, a second peak appears at location that is half of the primary peak.

We first thought that the two PE libraries (set1 and set2) were prepared with samples from two different animals, but our collaborator confirmed that they were from the same animal. In fact, k-mer distribution from each set also shows two separate peaks, even though those peaks are somewhat less pronounced than when the sets are combined together.

————————————————————–

When we plotted the distribution at increasing k-values, quite an interesting picture emerged. First check the following figures and then see the bottom of the post for interpretation.

first set, k=27

set27-200

second set, k=27

set27-300

both sets, k=27

set27-both

————————————————————–

first set, k=33

set33-200

second set, k=33

set33-300

both sets, k=33

set33-both

————————————————————–

first set, k=39

set39-200

second set, k=39

set39-300

both sets, k=39

set39-both

————————————————————–

first set, k=45

set45-200

second set, k=45

set45-300

both sets, k=45

set45-both

————————————————————–

first set, k=51
set51-200

second set, k=51
set51-300

————————————————————–

both sets, k=63

set63-both

————————————————————–

What is going on? We suspect the differences between the two chromosomes are in many short patches and are present all over the genome. Also, we can guess that their relative distances are around 40 (27 + 27/2). So, when we take k-mer distribution at k=21, a large number of k-mers from two chromosomes appear to be identical, thus making the peak at higher coverage taller than the peak at lower coverage. As k is increased from 21 to 27, 33, 39, 45, 51 and finally 63, two chromosomes start to differ everywhere, thus leaving with very few identical k-mers from both copies of the chromosomes. Therefore, the peak at higher value disappears at k=63.

That would certainly be very odd and these charts will keep us awake all night.

Other interpretations are welcome ! Also please let me know, if the descriptions provided above of what we are doing are not sufficient.


http://www.homolog.us/blogs/blog/2014/07/12/the-new-k-mer-counter-and-a-genome-assembly-i/

http://www.homolog.us/blogs/blog/2014/07/12/the-new-k-mer-counter-and-a-genome-assembly-ii/
  评论这张
 
阅读(878)| 评论(0)
推荐 转载

历史上的今天

在LOFTER的更多文章

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2016