28 Jan 2021

Benchmarking Nanopore basecallers: some observations on the Bonito basecaller

We have sequenced several fish genomes on our MinION. Whenever a new version of the Guppy basecaller comes out, I re-basecall a small dataset from each species and align the raw reads to previously published, independent references. Using Heng Li's one-liner for sequence identity, I get an estimate of the raw error rate of the reads.
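As a rough sketch of what such an identity estimate involves, the Python below computes BLAST-style and gap-compressed identity for a single alignment from its edit distance (the `NM` tag) and CIGAR string (the `cg` tag), as written by `minimap2 -c` in PAF format. It is not the one-liner itself; the function names `identities` and `identities_from_paf` are made up for illustration, and it assumes `M` operations cover both matches and mismatches.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def identities(nm, cigar):
    """Return (blast_identity, gap_compressed_identity) for one alignment.

    nm    -- edit distance from the NM tag (mismatches + inserted + deleted bases)
    cigar -- CIGAR string, e.g. the value of the cg:Z: tag in PAF output
    Assumes the alignment contains at least one aligned base.
    """
    aligned = gap_bases = gap_opens = 0
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":          # aligned columns (matches and mismatches)
            aligned += length
        elif op in "ID":         # each I/D run is one gap opening
            gap_bases += length
            gap_opens += 1
    mismatches = nm - gap_bases
    matches = aligned - mismatches
    blast = matches / (aligned + gap_bases)           # every gap base is a column
    gap_compressed = matches / (aligned + gap_opens)  # every gap counts once
    return blast, gap_compressed

def identities_from_paf(line):
    """Parse one PAF line (12 fixed columns, then TAG:TYPE:VALUE fields)."""
    fields = line.rstrip("\n").split("\t")
    tags = {f[:2]: f[5:] for f in fields[12:]}
    return identities(int(tags["NM"]), tags["cg"])
```

Applied to every primary alignment (`tp:A:P`) in a PAF file, the per-read values give a percent-identity distribution; whether this matches the exact definition used for the plots here is an assumption.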




Frequency distributions of percent identity to reference for Species 1.


The Bonito 441 basecaller (using the res_dna_r941_min_crf_v031 model from Rerio) gives a nice improvement in raw accuracy. At the moment this comes at the cost of basecalling speed (about 3 times slower on our GTX 1080 GPU). According to ONT, a speed upgrade should be coming soon with a new Guppy release!

In my tests Bonito produced slightly fewer total bases, but a slightly higher proportion of its reads mapped to the reference (using minimap2 and SAMtools).
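For anyone wanting to reproduce the mapped-proportion figure without extra tooling, here is a minimal sketch that counts mapped reads directly from SAM text (for example the output of `samtools view`). It assumes single-end reads and counts each read once by skipping secondary and supplementary records; the helper name `mapped_fraction` is hypothetical, and this is not necessarily how the numbers above were obtained.

```python
def mapped_fraction(sam_lines):
    """Fraction of reads with a mapped primary alignment in SAM text."""
    total = mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue                      # skip blank lines and header lines
        flag = int(line.split("\t")[1])   # column 2 is the bitwise FLAG
        if flag & 0x900:                  # 0x100 secondary, 0x800 supplementary
            continue
        total += 1
        if not flag & 0x4:                # 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0
```

Passing an open SAM file handle works, since the function only iterates over lines.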
 



In Bonito, the low default chunk_size of 720 may be slightly reducing accuracy. Setting chunk_size to 1000 instead gave a small improvement in accuracy. Setting it to 1200 or higher caused it to crash.

Lastly, the FASTQ quality scoring seems to be broken in Bonito, since there is no relation between the quality scores and the percent match when mapping the reads to the reference genome (unlike in Guppy):





Plots were made with NanoPlot.
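The quality-score check described above can also be put into a single number. The sketch below assumes Phred+33 encoded FASTQ quality strings, averages quality through error probabilities rather than raw Q values, and computes a plain Pearson correlation against per-read percent identity. The function names are illustrative only, and how per-read quality and identity are paired up is left to the reader.

```python
import math

def mean_phred(quality_string, offset=33):
    """Mean read quality, averaged in error-probability space."""
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (non-zero variance).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

With one mean quality and one percent identity per mapped read, a correlation near zero would correspond to the "no relation" pattern seen for Bonito, while a clearly positive value would be expected when the quality scores are informative.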