27 Aug 2020

Benchmarking Nanopore basecallers

We have sequenced several fish genomes on our MinION. Whenever a new version of the Guppy basecaller is released, I re-basecall a small dataset from each species and align the raw reads to previously published, independent references. Using Heng Li's one-liner for sequence identity, I then estimate the raw error rate of the reads.
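To make the calculation concrete, here is a minimal Python sketch of this kind of error estimate. It is not Heng Li's one-liner itself. It assumes the alignments are in minimap2's PAF format with the `NM` and `cg` tags present (as written when minimap2 is run with `-c`), and it pools all primary alignments into one estimate. The function name `paf_error_rates` is hypothetical, and the sketch returns both a BLAST-like and a gap-compressed error rate, since the two definitions give different numbers.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def paf_error_rates(lines):
    """Pool primary PAF alignments and return (blast_like, gap_compressed)
    error rates, or None if no usable alignment was found.

    Assumes each usable line carries tp:A:P, NM:i:<edit distance> and
    cg:Z:<CIGAR> tags.
    """
    edit_distance = 0  # sum of NM: mismatches + inserted + deleted bases
    match_columns = 0  # aligned columns (M, = or X), matches and mismatches
    gap_bases = 0      # total inserted + deleted bases
    gap_opens = 0      # number of insertion/deletion events

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        tags = {}
        for field in fields[12:]:
            parts = field.split(":", 2)
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        # Keep primary alignments that have both NM and a CIGAR string.
        if tags.get("tp") != "P" or "NM" not in tags or "cg" not in tags:
            continue
        edit_distance += int(tags["NM"])
        for length, op in CIGAR_OP.findall(tags["cg"]):
            n = int(length)
            if op in "M=X":
                match_columns += n
            elif op in "ID":
                gap_bases += n
                gap_opens += 1

    if match_columns == 0:
        return None

    # BLAST-like: every gap base counts as one error and one column.
    blast_like = edit_distance / (match_columns + gap_bases)
    # Gap-compressed: each gap counts once, whatever its length.
    mismatches = edit_distance - gap_bases
    gap_compressed = (mismatches + gap_opens) / (match_columns + gap_opens)
    return blast_like, gap_compressed
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `paf_error_rates(open("alignments.paf"))`, where `alignments.paf` is a placeholder file name.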




Each color in the figure represents a different fish species. All datasets were basecalled with the high-accuracy (HAC) model, which in every case was more accurate than the fast model. As the figure shows, new versions often brought no improvement in accuracy, although they may have improved in other respects, such as speed, stability, or new functionality. This figure will be updated as new basecaller versions are released.


The estimated error rate may seem somewhat optimistic. It corresponds roughly to the peak of the distribution of error rates, and this distribution has a long tail of higher error rates (although quality filtering the reads removes some of that tail). Also, it is not clear that an increase in raw accuracy always leads to better assemblies in terms of contig sizes and gene completeness. In our (still limited) experience, slightly older models have often given better assemblies. It may therefore be a good idea to keep older basecaller versions around and not automatically assume that the latest is always the best.
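To illustrate how a long tail affects a single summary number, the following sketch summarizes a list of per-read error rates. It assumes one error rate per read has already been computed. The function name `summarize_error_rates` and the 10% tail cutoff are hypothetical choices for illustration, not values from our analysis.

```python
import statistics


def summarize_error_rates(per_read_errors, tail_cutoff=0.10):
    """Summarize per-read error rates (fractions between 0 and 1).

    Returns the median, the mean, and the fraction of reads whose error
    rate exceeds tail_cutoff. With a long right tail the mean sits above
    the median, so the choice of summary statistic matters.
    """
    rates = list(per_read_errors)
    if not rates:
        raise ValueError("no error rates given")
    in_tail = sum(1 for r in rates if r > tail_cutoff)
    return {
        "median": statistics.median(rates),
        "mean": statistics.fmean(rates),
        "fraction_above_cutoff": in_tail / len(rates),
    }
```

For made-up rates such as `[0.04, 0.05, 0.05, 0.06, 0.30]`, the single high-error read pulls the mean well above the median, which is the effect of the tail described above.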