Now that Sebastian Deorowicz et al. published a damn good k-mer counter, let us put it to some real work. The first half of this two-part series will present some benchmarks on the program, and the second half will discuss our use on a real-life genome assembly problem.
The installation of kmc 2 is fairly straightforward. We went to their website, downloaded the source code, compiled and installed the program. The compilation needs Boost library, and if you have it properly installed, everything else should go well to give you two programs – kmc and kmc_dump. They also have pre-compiled binaries, but we downloaded the source code to be able to poke around in the near future.
Description of our project
We have two sets of Illumina PE reads from a vertebrate species that swims. The first set (set1) has 1,588,478,432 100nt reads and the second set (set2) has 1,376,496,476 100nt reads. We did k-mer counting in set1, set2 and set1+set2 for k=21, 27, 33, 39 and 45. We also did k-mer counting for set1 at k=51, set2 at k=51 and set1+set2 at k=63. Yes, that is a lot of k-mer counting.
This was done in a 32-core AMD server with 64-bit Linux, non-SSD hard-drive and 512 GB RAM. The amount of RAM is somewhat irrelevant for this highly efficient program.
I started the k-mer counting tasks around midnight and got all results back by ~7AM. Wow! They were run serially, while the server was also running twenty blastp jobs from another user.
set1 21mer counting took 585.523s.
set2 21mer counting took 567.74s.
set1+set2 21mer counting took 1507.44s.
set1 27mer counting took 612.069s.
set2 27mer counting took 534.832s.
set1+set2 27mer counting took 1361.61s.
set1 33mer counting took 1197.23s.
set1 33mer counting took 1037.95s.
set1 33mer counting took 2440.34s.
We noticed a jump in time needed for k>32, but no big differences after that for k=33, 39, 45 and 51.
Here is the output file of first k-mer counting job (set1 for k=21) so that you have an idea about all temporary space requirements.
1st stage: 258.373s
2nd stage: 327.15s
Total : 585.523s
Tmp size : 129830MB
No. of k-mers below min. threshold : 3342104765
No. of k-mers above max. threshold : 0
No. of unique k-mers : 4734755445
No. of unique counted k-mers : 1392650680
Total no. of k-mers : 126120517209
Total no. of reads : 1588478432
Total no. of super-k-mers : 15387969776
1st stage: 252.294s
2nd stage: 315.446s
Total : 567.74s
Tmp size : 112898MB
Command to Run and Output Files
The command kmc can be run either on a single fasta/fastq file or multiple files. The first two commands below are for single files and the third one is for a file named ‘list’, which lists both fasta files.
./kmc -k21 -m200 -fa s200-data.fa out21-200 tmp
./kmc -k21 -m200 -fa s300-data.fa out21-300 tmp
./kmc -k21 -m200 -fa @list out21-both tmp
The program kmc produces two binary output files (out21-200.kmc_pre – 1.1G and out21-200.kmc_suf 5.2G). Interestingly, in all k-mer counting examples we ran, the size of the ‘pre’ file varied between 1073807452 bytes and 67174492 bytes, and we do not have much clue why. This is where the code will come handy. One clue – the file size appears to be k-dependent.
The program kmc_dump takes those two binary files, and gives you the k-mer counts in two-column text format. Command to run -
./kmc_dump -cx0 out21-200 o
If the above commentary appears rather drab, the next commentary will surely add some color to it.
In the last commentary, we have shown a few examples of doing k-mer counting with the new super-efficient program – kmc-2. This commentary will provide the context. Readers unfamiliar with how to interpret k-mer distributions of short reads are also encouraged to go through an old blog post – Maximizing Utility of Available RAMs in K-mer World.
Here is our story. We started assembling two sets of Illumina libraries discussed in previous commentary using Minia and noticed something odd. At first, we assembled the set2 (see previous commentary for reference) at k=63 using Minia. Total size of all assembled contigs came to be around 700Mb with N50=982. So far, so good.
Another bioinformatician helped us assemble set1 and set2 together with Minia. Since the depth of sequencing was very high, we chose to do this assembly at k=73. The total assembly size came to be roughly the same, but, quite puzzlingly, N50 dropped to 219 !
To figure out what was going on, we took set2 and did k-mer counting at k=63. The distribution came out to be usual (unimodal with peak around 23, ignoring the noise peak near count=1). As explained in the earlier commentary, the location of the mode of the distribition is attributable to the depth of sequencing and the amount of repeats in the genome. The peak at 23 was somewhat low given how deeply this genome was sequenced (~50 was expected), but we also knew that this organism had extensive duplication. Therefore, there was nothing odd with the k-mer distribution as such.
Armed with the new k-mer counter, we started to do forensic analysis on the libraries and here is what we found. The following three figures show k-mer distributions for k=21. Especially pay attention to the third chart below.
first set, k=21
second set, k=21
both sets, k=21
The above chart showing k-mer distribution of the combined libraries has two distinct peaks and one at the right location (~120) derived from the coverage, whereas the other one is at half of that number. The above pattern very likely indicates high degree of heterozygosity in our sample. Two chromosomes in diploid genomes are usually very similar and therefore total k-mer coverage is unimodal. When the difference between the chromosomes start to increase, a second peak appears at location that is half of the primary peak.
We first thought that the two PE libraries (set1 and set2) were prepared with samples from two different animals, but our collaborator confirmed that they were from the same animal. In fact, k-mer distribution from each set also shows two separate peaks, even though those peaks are somewhat less pronounced than when the sets are combined together.
When we plotted the distribution at increasing k-values, quite an interesting picture emerged. First check the following figures and then see the bottom of the post for interpretation.
first set, k=27
second set, k=27
both sets, k=27
first set, k=33
second set, k=33
both sets, k=33
first set, k=39
second set, k=39
both sets, k=39
first set, k=45
second set, k=45
both sets, k=45
both sets, k=63
What is going on? We suspect the differences between the two chromosomes are in many short patches and are present all over the genome. Also, we can guess that their relative distances are around 40 (27 + 27/2). So, when we take k-mer distribution at k=21, a large number of k-mers from two chromosomes appear to be identical, thus making the peak at higher coverage taller than the peak at lower coverage. As k is increased from 21 to 27, 33, 39, 45, 51 and finally 63, two chromosomes start to differ everywhere, thus leaving with very few identical k-mers from both copies of the chromosomes. Therefore, the peak at higher value disappears at k=63.
That would certainly be very odd and these charts will keep us awake all night.
Other interpretations are welcome ! Also please let me know, if the descriptions provided above of what we are doing are not sufficient.