minibwa, Unpacked, Part 1: The Algorithm

I recently came across a bioRxiv paper from Heng Li, the godfather of modern bioinformatics, about a new mapping method called minibwa. I had a working idea of how read mapping worked, but I had never really chased the details all the way down. With Claude and Codex as study partners, digging into a new method at the level my scientific curiosity wanted stopped feeling like a chore and started feeling like an actual learning trip. So I got to work. This three-part series is the set of questions I had about how minibwa compares with bwa-mem and minimap2, and what makes it stand out.

A 150-base read shows up from the sequencer. Somewhere in the 3.1-billion-base human reference, there is probably a place it came from. The whole algorithm is a way to avoid comparing that read against all 3.1 billion bases directly.

The trick is seed, chain, align: find cheap exact matches, connect the ones that make geometric sense, and only then pay for dynamic programming in the tiny gaps that remain. minibwa keeps bwa-mem’s FM-index and SMEM seeding, then borrows minimap2-style chaining and alignment for the back half.

The job

You have a few hundred million reads, each 100–250 bp, sequenced with errors and from a genome that differs from your reference by SNPs and indels. For each read you want the position in the reference it most likely came from, an alignment (the CIGAR), and a confidence (the mapping quality). That’s mapping.

Why you can’t just align everything

The textbook answer to “where does this string best fit in that string” is Smith–Waterman: a dynamic-programming table of size read length × reference length. For one human read that is one row of 100–odd cells across all 3.1 billion reference bases. Multiply by hundreds of millions of reads and the arithmetic is absurd. Turn the knobs and watch:

The bar for brute force runs off the screen because it is off the screen — it’s thousands of CPU-years. The trick that makes mapping possible is to never run DP against the whole genome. Instead:

The skeleton: seed, chain, align

Almost every modern aligner is the same three moves. Step through them:

Seeding finds short stretches that match the reference exactly. Exact matching is cheap because we can look it up in an index instead of computing DP — that’s what the FM-index section below is about. Each seed says “this little piece of the read is here, here, and here in the genome.”

Chaining takes the scattered seeds and finds a set of them that line up — same diagonal, increasing in both read and reference coordinates. A good colinear chain is a strong hypothesis for where the read belongs. Most of the genome gets no chain at all, so it never costs us anything.

Alignment only now runs Smith–Waterman — and only in the small gaps between chained seeds, over a few hundred bases, not three billion. This is where the CIGAR and the exact base-level differences come from.

The whole point Exact matching is something an index can do in microseconds. Approximate matching (DP) is expensive. So we use cheap exact matching to throw away 99.999% of the genome, and pay for expensive DP only on the tiny survivors.

Where the three tools differ

Given that everyone uses the same skeleton, what distinguishes the tools is which engine drives each step and how fast those engines run. minibwa’s thesis is that bwa-mem had the best seeding idea and minimap2 had the best chaining and alignment code — so take both, and then make the seeding engine scream. The rest of this post is that sentence, unpacked.

Step	bwa-mem	minimap2	minibwa
Index	FM-index (BWT)	minimizer hash	FM-index
Seeds	SMEMs (exact, variable length)	minimizers (fixed k-mers)	SMEMs
Chaining	tree-based, heuristic	colinear DP	colinear DP
Alignment	own extension code	ksw2 SIMD	ksw2 SIMD

Say it out loud “Mapping is impossible by brute force, so we seed with cheap exact matches, chain the ones that line up, and run real alignment only in the gaps. minibwa keeps bwa-mem’s seeds and minimap2’s chain-and-align.”

You know the BWT already. What you may not have built is the realization that the BWT plus two small lookup tables is a search engine for exact substrings. minibwa’s index is almost identical to bwa-mem’s, so this section is really about understanding the tool both share.

The Burrows–Wheeler matrix, and why sorting helps

Take the reference, append a sentinel $ that sorts before everything, and list all rotations sorted alphabetically. The first column (call it F) is just the sorted letters of the genome. The last column (L) is the BWT. The magic is the LF-mapping: the i-th occurrence of a letter in L is the same physical genome position as the i-th occurrence of that letter in F. That correspondence lets us walk the index.

Backward search

To find a pattern, we feed its letters in reverse. We keep a half-open interval [lo, hi) of rows in the matrix — the rows whose sorted suffix starts with the part of the pattern matched so far. Each new letter c shrinks the interval with two table lookups:

lo ← C[c] + Occ(c, lo) hi ← C[c] + Occ(c, hi)

C[c] is how many genome letters sort before c; Occ(c, i) is how many c’s appear in the BWT above row i. The width hi − lo is the number of occurrences — the moment it hits zero, the pattern isn’t in the genome. Try it:

From interval to genome positions

The interval gives a count, but seeding also needs the actual coordinates. Each row in the interval corresponds to one suffix-array entry — one genome position. To save memory, only a fraction of the suffix array is stored (minibwa samples 1 in 16, up from bwa-mem’s 1 in 32), and the rest are recovered by LF-walking until you hit a sampled row. More samples means fewer walk steps — a small memory cost for a speed win that matters once you’re doing this billions of times.

The detail that drives the rest of the tutorial Look at the table during a search: the active rows jump around unpredictably. Each backward step lands on a different, far-apart region of the BWT in memory. On a real 3 GB index that means a cache miss almost every step — the CPU stalls hundreds of cycles waiting for RAM. Hold that thought. It is the entire reason the performance-engineering post exists.

How minibwa’s index differs in practice

The data structure is bwa-mem’s, but the file format and build path are not. minibwa builds the BWT with libsais by default — multi-threaded and much faster than bwa-mem’s old construction, at the cost of more RAM (the classic bwa-mem builder is still available for tight-memory machines). It also stores the plain 2-bit genome in a new l2bit format (like UCSC’s 2-bit, but able to address contigs longer than 2³² bases), which alignment reads from directly.

At the API level minibwa adopts ropebwt3’s half-open intervals instead of bwa-mem’s closed ones — a small bookkeeping change that makes the batched seeding logic cleaner.

Say it out loud “The BWT plus a count table lets me find every exact occurrence of a string by feeding its letters backward and shrinking an interval. It’s fast in operations — but each step is a random jump in memory, and that’s what we’ll have to fight.”

What makes a seed “super-maximal”

A read that truly came from some genomic locus matches it perfectly except at the handful of positions where there’s a SNP or a sequencing error. Those mismatch positions chop the read into runs of exact agreement. The longest such runs — the ones you cannot extend left or right without hitting a mismatch, and which aren’t contained inside a longer match — are the supermaximal exact matches (SMEMs).

minibwa hunts for (19, 1)-SMEMs: matches at least 19 bp long that occur at least once. The length floor of 19 is no accident — shorter than that and a match occurs so often by chance that it’s useless. Click a base in the read below to introduce a mismatch (a SNP or sequencing error) and watch the seeds split:

The repeat problem, and the second pass

Here’s the catch that makes seeding subtle. A long SMEM is great when it occurs in just one place. But genomes are repetitive: a long match can still occur in dozens of places, and worse, a long SMEM can swallow a more informative shorter seed that sits inside it. If you only ever take the single longest match, you can miss the seed that would have placed the read correctly.

bwa-mem’s fix, kept by minibwa, is a second seeding pass. After collecting the standard SMEMs, minibwa re-seeds inside any SMEM that is both long and repetitive — in the code, when its span is at least twice the minimum and its occurrence count is modest. It looks for (mid, occ+1)-SMEMs: shorter matches, starting from the middle, that occur more often than the parent — the buried, more-specific anchors. You can see both passes in mb_seed_intv():

do { // pass 1: standard SMEMs
    x = mb_bwt_smem(bwt, len, seq, x, min_len, 1, &p);
    ...
} while (x < len);
for (i = 0; i < n_a0; ++i) { // pass 2: sub-SMEMs inside long, repetitive ones
    if (en - st < min_len * 2 || v->a[i].size > max_sub_occ) continue;
    ...
    x = mb_bwt_smem(bwt, en, seq, x, sub_min_len, v->a[i].size + 1, &p);
}
  

bwa-mem could only afford one cheap second pass with its old engine. minibwa’s reimplemented SMEM finder makes the second pass nearly free, so it can lean on re-seeding harder for better accuracy in repeats. The toy widget above shows occurrence counts so you can see which seeds would trigger a second pass — the long ones that still match in more than one place.

Why SMEMs and not minimizers minimap2 seeds with fixed-length minimizers — great for long, noisy reads. bwa-mem’s variable-length SMEMs are more sensitive for short, accurate reads, because a single SMEM can span almost the whole read when there are no errors, giving an unambiguous anchor. minibwa keeps SMEMs precisely because its target is short, accurate reads.

Say it out loud “Seeds are the read’s exact-match runs between mismatches. I take the longest ones — SMEMs — and then make a second pass inside the long repetitive ones to dig out the shorter, more-specific seeds I’d otherwise miss.”

The dotplot picture

Put read position on one axis and reference position on the other. Every seed match is a point (really a short diagonal segment, since a seed has length). A read that came from one contiguous locus shows up as points strung along a single diagonal — as you move right along the read, you move right along the reference in lockstep. Spurious seeds scatter off that diagonal.

Chaining’s job: find the highest-scoring colinear set — anchors increasing in both coordinates, close to one diagonal. Gaps between consecutive anchors are allowed (that’s an indel or an error-induced break), but they cost score, and the cost grows with how far apart and how off-diagonal they are. Hit Run chaining:

How the chaining decides — one DP recurrence

It’s dynamic programming over anchors sorted by position. For anchor i, the best chain ending at i is its own weight plus the best predecessor j that sits below-left of it, minus the cost of the jump from j to i:

f(i) = max_j [ f(j) + overlap-adjusted weight(i) − gap_cost(j→i) ]

The gap cost is the crux. A jump where read-distance and reference-distance match (same diagonal) is nearly free — that’s just the read continuing. A jump that’s longer on one axis than the other implies an insertion or deletion, and it’s penalized by how big the discrepancy is. Anchors too far apart aren’t even considered. Toggle the indel checkbox above: the chain survives a moderate diagonal jump because the gap cost stays below the reward of extending the chain.

The wrinkle minibwa had to fix: variable-length seeds

minimap2 chains minimizers — fixed-length, regularly spaced anchors. minibwa feeds it SMEMs, which vary wildly in length and can overlap each other (recall the two-pass re-seeding above deliberately produces overlapping seeds). If you naively scored each seed by its full length, overlapping seeds would double-count the bases they share and inflate the chain score.

So minibwa adapts the chaining to count only the new bases each anchor contributes beyond the previous one — the overlap-adjusted weight in the recurrence above. This is the “adapted for variable-length seeds” line from the design notes, and it’s the one place minimap2’s chaining couldn’t be used unmodified.

Why chaining, not bwa-mem’s extension bwa-mem extended each seed independently and glued results together with heuristics — fine for 100 bp reads, shaky for anything longer or with structural differences. Colinear chaining reasons about all seeds jointly, so it handles larger indels and longer reads gracefully. This is why minibwa calls >10 bp indels slightly better than bwa-mem in the variant-calling benchmark, and why it works on long reads at all.

A read can produce more than one good chain — think segmental duplications or a read spanning a structural variant. minibwa keeps up to the 50 best chains for a short read and sends each to the alignment stage, where base-level alignment begins.

Say it out loud “On a read-vs-reference dotplot, the true alignment is a string of seeds on one diagonal. Chaining is the dynamic program that finds that string, paying a penalty for off-diagonal jumps. minibwa uses minimap2’s chaining, tweaked so overlapping variable-length seeds don’t double-count.”

Base alignment: only pay DP in small boxes

Chaining gives a plausible genomic location and a run of anchors, but it does not tell you the exact mismatches, insertions, deletions, or CIGAR string. That is where Smith–Waterman dynamic programming comes back, just in a much smaller box: the short gaps between chained seeds and the read ends.

minibwa uses minimap2’s ksw2 routines for that base-level work: vectorized Smith–Waterman with affine gap costs. The algorithmic shape stays the same: exact seeds explain most of the read, the chain chooses the candidate locus, and DP cleans up only the places exact matching could not explain.

Say it out loud “minibwa maps by finding SMEM seeds with a BWT/FM-index, chaining the seeds that sit on one diagonal, and running SIMD alignment only in the small gaps left between anchors.”

That is the algorithm. The next layer is why this same algorithm gets much faster when you reshape the hot loops around memory latency and cheap filters.

Sources and credit: this explanation is based on Heng Li and Nils Homer's minibwa paper, and on the design lineage from BWA / bwa-mem and minimap2. The interactive tutorial and widgets were drafted with Opus 4.8.