minibwa, Unpacked, Part 1: The Algorithm
I recently came across a bioRxiv paper from Heng Li, the godfather of modern bioinformatics, about a new mapping method called minibwa. I had a working idea of how read mapping worked, but I had never really chased the details all the way down. With Claude and Codex as study partners, digging into a new method at the level my scientific curiosity wanted stopped feeling like a chore and started feeling like an actual learning trip. So I got to work. This three-part series is the set of questions I had about how minibwa compares with bwa-mem and minimap2, and what makes it stand out.
A 150-base read shows up from the sequencer. Somewhere in the 3.1-billion-base human reference, there is probably a place it came from. The whole algorithm is a way to avoid comparing that read against all 3.1 billion bases directly.
The trick is seed, chain, align: find cheap exact matches, connect the ones that make geometric sense, and only then pay for dynamic programming in the tiny gaps that remain. minibwa keeps bwa-memβs FM-index and SMEM seeding, then borrows minimap2-style chaining and alignment for the back half.
The job
You have a few hundred million reads, each 100–250 bp, sequenced with errors and from a genome that differs from your reference by SNPs and indels. For each read you want the position in the reference it most likely came from, an alignment (the CIGAR), and a confidence (the mapping quality). That’s mapping.
Why you can’t just align everything
The textbook answer to “where does this string best fit in that
string” is Smith–Waterman: a dynamic-programming table of size
read length × reference length. For one human read that
is one row of 100–odd cells across all 3.1 billion reference bases.
Multiply by hundreds of millions of reads and the arithmetic is absurd. Turn
the knobs and watch:
The bar for brute force runs off the screen because it is off the screen — it’s thousands of CPU-years. The trick that makes mapping possible is to never run DP against the whole genome. Instead:
The skeleton: seed, chain, align
Almost every modern aligner is the same three moves. Step through them:
Seeding finds short stretches that match the reference exactly. Exact matching is cheap because we can look it up in an index instead of computing DP — that’s what the FM-index section below is about. Each seed says “this little piece of the read is here, here, and here in the genome.”
Chaining takes the scattered seeds and finds a set of them that line up — same diagonal, increasing in both read and reference coordinates. A good colinear chain is a strong hypothesis for where the read belongs. Most of the genome gets no chain at all, so it never costs us anything.
Alignment only now runs Smith–Waterman — and only in the small gaps between chained seeds, over a few hundred bases, not three billion. This is where the CIGAR and the exact base-level differences come from.
Where the three tools differ
Given that everyone uses the same skeleton, what distinguishes the tools is which engine drives each step and how fast those engines run. minibwa’s thesis is that bwa-mem had the best seeding idea and minimap2 had the best chaining and alignment code — so take both, and then make the seeding engine scream. The rest of this post is that sentence, unpacked.
| Step | bwa-mem | minimap2 | minibwa |
|---|---|---|---|
| Index | FM-index (BWT) | minimizer hash | FM-index |
| Seeds | SMEMs (exact, variable length) | minimizers (fixed k-mers) | SMEMs |
| Chaining | tree-based, heuristic | colinear DP | colinear DP |
| Alignment | own extension code | ksw2 SIMD | ksw2 SIMD |
You know the BWT already. What you may not have built is the realization that the BWT plus two small lookup tables is a search engine for exact substrings. minibwa’s index is almost identical to bwa-mem’s, so this section is really about understanding the tool both share.
The Burrows–Wheeler matrix, and why sorting helps
Take the reference, append a sentinel $ that sorts before
everything, and list all rotations sorted alphabetically. The first column
(call it F) is just the sorted letters of the genome. The last column
(L) is the BWT. The magic is the LF-mapping: the i-th
occurrence of a letter in L is the same physical genome position as
the i-th occurrence of that letter in F. That correspondence
lets us walk the index.
Backward search
To find a pattern, we feed its letters in reverse. We keep a
half-open interval [lo, hi) of rows in the matrix — the
rows whose sorted suffix starts with the part of the pattern matched so far.
Each new letter c shrinks the interval with two table lookups:
C[c] is how many genome letters sort before c;
Occ(c, i) is how many c’s appear in the BWT
above row i. The width hi − lo is the number
of occurrences — the moment it hits zero, the pattern isn’t in
the genome. Try it:
From interval to genome positions
The interval gives a count, but seeding also needs the actual coordinates. Each row in the interval corresponds to one suffix-array entry — one genome position. To save memory, only a fraction of the suffix array is stored (minibwa samples 1 in 16, up from bwa-mem’s 1 in 32), and the rest are recovered by LF-walking until you hit a sampled row. More samples means fewer walk steps — a small memory cost for a speed win that matters once you’re doing this billions of times.
How minibwa’s index differs in practice
The data structure is bwa-mem’s, but the file format and build path
are not. minibwa builds the BWT with
libsais by default — multi-threaded and
much faster than bwa-mem’s old construction, at the cost of more RAM
(the classic bwa-mem builder is still available for tight-memory machines).
It also stores the plain 2-bit genome in a new
l2bit format (like UCSC’s
2-bit, but able to address contigs longer than 2³² bases), which
alignment reads from directly.
At the API level minibwa adopts ropebwt3’s half-open intervals instead of bwa-mem’s closed ones — a small bookkeeping change that makes the batched seeding logic cleaner.
What makes a seed “super-maximal”
A read that truly came from some genomic locus matches it perfectly except at the handful of positions where there’s a SNP or a sequencing error. Those mismatch positions chop the read into runs of exact agreement. The longest such runs — the ones you cannot extend left or right without hitting a mismatch, and which aren’t contained inside a longer match — are the supermaximal exact matches (SMEMs).
minibwa hunts for (19, 1)-SMEMs: matches at least 19 bp long that occur at least once. The length floor of 19 is no accident — shorter than that and a match occurs so often by chance that it’s useless. Click a base in the read below to introduce a mismatch (a SNP or sequencing error) and watch the seeds split:
The repeat problem, and the second pass
Here’s the catch that makes seeding subtle. A long SMEM is great when it occurs in just one place. But genomes are repetitive: a long match can still occur in dozens of places, and worse, a long SMEM can swallow a more informative shorter seed that sits inside it. If you only ever take the single longest match, you can miss the seed that would have placed the read correctly.
bwa-mem’s fix, kept by minibwa, is a
second seeding pass. After collecting the
standard SMEMs, minibwa re-seeds inside any SMEM that is both long
and repetitive — in the code, when its span is at least twice the
minimum and its occurrence count is modest. It looks for
(mid, occ+1)-SMEMs: shorter matches,
starting from the middle, that occur more often than the parent
— the buried, more-specific anchors. You can see both passes in
mb_seed_intv():
bwa-mem could only afford one cheap second pass with its old engine. minibwa’s reimplemented SMEM finder makes the second pass nearly free, so it can lean on re-seeding harder for better accuracy in repeats. The toy widget above shows occurrence counts so you can see which seeds would trigger a second pass — the long ones that still match in more than one place.
The dotplot picture
Put read position on one axis and reference position on the other. Every seed match is a point (really a short diagonal segment, since a seed has length). A read that came from one contiguous locus shows up as points strung along a single diagonal — as you move right along the read, you move right along the reference in lockstep. Spurious seeds scatter off that diagonal.
Chaining’s job: find the highest-scoring colinear set — anchors increasing in both coordinates, close to one diagonal. Gaps between consecutive anchors are allowed (that’s an indel or an error-induced break), but they cost score, and the cost grows with how far apart and how off-diagonal they are. Hit Run chaining:
How the chaining decides — one DP recurrence
It’s dynamic programming over anchors sorted by position. For anchor
i, the best chain ending at i is its own weight
plus the best predecessor j that sits below-left of it, minus
the cost of the jump from j to i:
The gap cost is the crux. A jump where read-distance and reference-distance match (same diagonal) is nearly free — that’s just the read continuing. A jump that’s longer on one axis than the other implies an insertion or deletion, and it’s penalized by how big the discrepancy is. Anchors too far apart aren’t even considered. Toggle the indel checkbox above: the chain survives a moderate diagonal jump because the gap cost stays below the reward of extending the chain.
The wrinkle minibwa had to fix: variable-length seeds
minimap2 chains minimizers — fixed-length, regularly spaced anchors. minibwa feeds it SMEMs, which vary wildly in length and can overlap each other (recall the two-pass re-seeding above deliberately produces overlapping seeds). If you naively scored each seed by its full length, overlapping seeds would double-count the bases they share and inflate the chain score.
So minibwa adapts the chaining to count only the new bases each anchor contributes beyond the previous one — the overlap-adjusted weight in the recurrence above. This is the “adapted for variable-length seeds” line from the design notes, and it’s the one place minimap2’s chaining couldn’t be used unmodified.
A read can produce more than one good chain — think segmental duplications or a read spanning a structural variant. minibwa keeps up to the 50 best chains for a short read and sends each to the alignment stage, where base-level alignment begins.
Base alignment: only pay DP in small boxes
Chaining gives a plausible genomic location and a run of anchors, but it does not tell you the exact mismatches, insertions, deletions, or CIGAR string. That is where Smith–Waterman dynamic programming comes back, just in a much smaller box: the short gaps between chained seeds and the read ends.
minibwa uses minimap2’s ksw2 routines for that base-level work: vectorized Smith–Waterman with affine gap costs. The algorithmic shape stays the same: exact seeds explain most of the read, the chain chooses the candidate locus, and DP cleans up only the places exact matching could not explain.
That is the algorithm. The next layer is why this same algorithm gets much faster when you reshape the hot loops around memory latency and cheap filters.
Sources and credit: this explanation is based on Heng Li and Nils Homer's minibwa paper, and on the design lineage from BWA / bwa-mem and minimap2. The interactive tutorial and widgets were drafted with Opus 4.8.