
In Bioinformatics (Oxford, England)

MOTIVATION : We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers, and minimally-overlapping words, were developed by heuristic intuition, and are not optimal.
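For readers unfamiliar with sequence sampling, the following is a minimal sketch of one of the earlier methods named above, window minimizers. It is not the approach presented in this paper. The function name `minimizer_positions`, the parameters `k` (word length) and `w` (window size), and the use of plain lexicographic order to rank words are all assumptions made for illustration; practical tools typically rank words by a hash instead.

```python
def minimizer_positions(seq, k, w):
    """Return the sorted start positions sampled by a simple minimizer scheme.

    Illustrative assumptions: words are ranked lexicographically and ties
    are broken by taking the leftmost position in the window.
    """
    # All overlapping words of length k (empty if the sequence is too short).
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    # Slide a window over w consecutive words and keep the smallest one.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add(best)
    return sorted(picked)


if __name__ == "__main__":
    # Only a subset of positions is kept, which is what makes
    # downstream comparison faster.
    print(minimizer_positions("ACGTACGTAC", k=3, w=4))
```

Because the choice within each window depends only on the words in that window, two sequences that share a long enough identical stretch sample the same positions inside it, which is why sampled positions can still be used to find matches.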

RESULTS : We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly-evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once), and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible.

AVAILABILITY AND IMPLEMENTATION : Source code freely available at https://gitlab.com/mcfrith/noverlap.

SUPPLEMENTARY INFORMATION : Supplementary data are available at Bioinformatics online.

Frith Martin C, Shaw Jim, Spouge John L

2023-Jan-25