minibwa, Unpacked, Part 2: Performance Engineering
The most important performance fact in minibwa is not a clever alignment trick. It is that RAM is slow, and the FM-index makes you jump around RAM constantly.
Once you see that, the engineering choices line up: batch independent searches, prefetch what the next step will need, cache the common prefixes, and run cheap tests before expensive dynamic programming.
The problem: the index doesn’t fit in cache
Recall from the FM-index discussion that each backward-search step lands on an unpredictable, far-apart row of the BWT. The human index is about 3 GB — hundreds of times larger than the CPU’s last-level cache. So almost every step is a cache miss: the CPU asks RAM for a byte and then stalls for a couple hundred cycles doing nothing while the data crawls back.
The actual work — the C[c] + Occ(c,·) arithmetic
— takes only a handful of cycles. So in naive seeding the CPU spends
the overwhelming majority of its time idle, waiting on memory. Watch a
single query crawl through its steps:
The fix: batch many queries and prefetch
The queries are independent — read A’s seeding has nothing to do with read B’s. So instead of finishing one query before starting the next, minibwa keeps a batch of queries in flight. The trick has two halves working together:
- Prefetch. Before the CPU will need a BWT block, it issues a
__builtin_prefetchhint telling the hardware to start pulling that block into cache now. - Interleave. Then, instead of stalling, the CPU switches to a different query in the batch and does its compute. By the time it cycles back, the prefetched data has arrived. The wait is hidden behind useful work.
Drag the batch slider above past 3 and the gray stall bars vanish —
the compute blocks tile densely because each query’s memory wait is
filled by its neighbors’ arithmetic. That is the entire mechanism.
The real bwt.c is littered with it:
minibwa applies this everywhere a random memory access hides: SMEM finding
(mb_bwt_smem_batch, ~2.5× faster),
suffix-array lookups (mb_bwt_sa_batch, over
4× faster), even across both seeding
passes (mb_seed_intv_batch). bwa-mem
couldn’t do the two-pass-with-prefetch trick because its older SMEM
algorithm wasn’t structured for it; minibwa reimplemented SMEM finding
(borrowing Travis Gagie’s formulation from ropebwt3) specifically
because the simpler structure makes batching possible.
One more trick: the 10-mer cache
The first few backward-search steps of every seed are the same kind
of work on the same tiny set of short prefixes. So minibwa precomputes the
SA interval for all 410 possible 10-mers once
(mb_bwt_cache) and stores them in a flat table. A seed then
skips its first 10 random-memory backward steps and starts from a single
cache lookup. Ten unpredictable RAM jumps collapse into one:
DP only in the gaps — and only when needed
The chain already pins down most of the read with exact seeds. Real DP runs only in the short stretches between chained seeds and at the read ends, using minimap2’s ksw2 routines — vectorized Smith–Waterman with a dual affine-gap model, hand-written in SSE4.1 on x86 and NEON on ARM. (The authors tried AVX2 and saw no clear improvement, so they didn’t bother.) Even so, this DP is about 35% of total CPU time on whole-genome short reads — the single biggest cost after seeding.
So minibwa adds a fast path: it first attempts an ungapped alignment of the gap. If that comes out clean — few enough mismatches — it accepts it and skips the DP entirely. Only when the ungapped attempt looks bad (suggesting a real indel) does it fall back to full SIMD Smith–Waterman. Most gaps in most reads have no indel, so the fast path fires constantly.
Pairing, and the mate-rescue trap
Short reads usually come in pairs from the two ends of a fragment, so the two mates should land close together in the correct orientation. minibwa pairs them with bwa-mem’s logic. The expensive part is mate rescue: when one mate maps confidently but the other doesn’t, you search the region near the mapped mate for the missing one — running Smith–Waterman over that whole window.
The trap: most of the time the missing mate genuinely isn’t there (it
was junk, or off in a repeat), and you’ve burned a full DP for
nothing. minibwa adds a pre-alignment filter
(mb_ungap) to predict, cheaply, whether a real alignment even
exists before paying for DP.
The filter: voting on a diagonal, Hough-style
The idea is borrowed from the Hough transform for finding lines. If the
missing mate really sits in this window, then many short q-mers
(minibwa uses q = 7) of the read will match the
reference at the same diagonal offset — they all “vote”
for the same line. If the read isn’t really there, its q-mer matches
scatter randomly across offsets and no offset gets many votes.
So: slide the read across the window and tally how many q-mers match at each offset. If the best offset gets at least 10 votes, run the full Smith–Waterman. Otherwise, skip it — there’s no line to find. Add mismatches to the read below and watch the matching q-mers on the true diagonal disappear:
The real threshold is 10 votes with 7-mers; the widget uses 6 votes with shorter q-mers so the idea fits in a toy sequence. The payoff is real: the paper notes this filter lets minibwa skip unsuccessful alignments, and disabling it is exactly what narrows minibwa’s lead over bwa-mem2 on Hi-C and other data where mate rescue is heavy. A cheap vote replaces an expensive DP that was going to fail anyway.
This is the performance story in one sentence: minibwa does not magically remove the hard work, but it refuses to pay for it while the CPU is idle or when a cheap test already says the expensive step will fail.
Sources and credit: this explanation is based on Heng Li and Nils Homer's minibwa paper, and on the design lineage from BWA / bwa-mem and minimap2. The interactive tutorial and widgets were drafted with Opus 4.8.