minibwa, Unpacked, Part 2: Performance Engineering

The most important performance fact in minibwa is not a clever alignment trick. It is that RAM is slow, and the FM-index makes you jump around RAM constantly.

Once you see that, the engineering choices line up: batch independent searches, prefetch what the next step will need, cache the common prefixes, and run cheap tests before expensive dynamic programming.

The problem: the index doesn’t fit in cache

Recall from the FM-index discussion that each backward-search step lands on an unpredictable, far-apart row of the BWT. The human index is about 3 GB — hundreds of times larger than the CPU’s last-level cache. So almost every step is a cache miss: the CPU asks RAM for a byte and then stalls for a couple hundred cycles doing nothing while the data crawls back.

The actual work — the C[c] + Occ(c,·) arithmetic — takes only a handful of cycles. So in naive seeding the CPU spends the overwhelming majority of its time idle, waiting on memory. Watch a single query crawl through its steps:

The fix: batch many queries and prefetch

The queries are independent — read A’s seeding has nothing to do with read B’s. So instead of finishing one query before starting the next, minibwa keeps a batch of queries in flight. The trick has two halves working together:

Prefetch. Before the CPU will need a BWT block, it issues a __builtin_prefetch hint telling the hardware to start pulling that block into cache now.
Interleave. Then, instead of stalling, the CPU switches to a different query in the batch and does its compute. By the time it cycles back, the prefetched data has arrived. The wait is hidden behind useful work.

Drag the batch slider above past 3 and the gray stall bars vanish — the compute blocks tile densely because each query’s memory wait is filled by its neighbors’ arithmetic. That is the entire mechanism. The real bwt.c is littered with it:

// bwt.c — prefetch the BWT blocks the NEXT iteration will touch
mb_bwt_block_prefetch(bwt, s->p.x[1]);
mb_bwt_block_prefetch(bwt, s->p.x[1] + s->p.size);
// ...then move on to another query in the batch instead of stalling
  

minibwa applies this everywhere a random memory access hides: SMEM finding (mb_bwt_smem_batch, ~2.5× faster), suffix-array lookups (mb_bwt_sa_batch, over 4× faster), even across both seeding passes (mb_seed_intv_batch). bwa-mem couldn’t do the two-pass-with-prefetch trick because its older SMEM algorithm wasn’t structured for it; minibwa reimplemented SMEM finding (borrowing Travis Gagie’s formulation from ropebwt3) specifically because the simpler structure makes batching possible.

The reframe that matters The algorithm got simpler, not smarter, and that’s the point. A simpler loop with no data-dependent control flow is one you can software- pipeline: issue the loads early, interleave independent work, never stall. On modern hardware, “don’t wait for RAM” beats “do fewer operations” almost every time.

One more trick: the 10-mer cache

The first few backward-search steps of every seed are the same kind of work on the same tiny set of short prefixes. So minibwa precomputes the SA interval for all 4¹⁰ possible 10-mers once (mb_bwt_cache) and stores them in a flat table. A seed then skips its first 10 random-memory backward steps and starts from a single cache lookup. Ten unpredictable RAM jumps collapse into one:

In plain English The index is too big for cache, so every seeding step would stall the CPU waiting on RAM. minibwa keeps many reads in flight at once and prefetches — so while one read waits for memory, the CPU does another read’s arithmetic. The waiting disappears, and that is where the speed comes from.

DP only in the gaps — and only when needed

The chain already pins down most of the read with exact seeds. Real DP runs only in the short stretches between chained seeds and at the read ends, using minimap2’s ksw2 routines — vectorized Smith–Waterman with a dual affine-gap model, hand-written in SSE4.1 on x86 and NEON on ARM. (The authors tried AVX2 and saw no clear improvement, so they didn’t bother.) Even so, this DP is about 35% of total CPU time on whole-genome short reads — the single biggest cost after seeding.

So minibwa adds a fast path: it first attempts an ungapped alignment of the gap. If that comes out clean — few enough mismatches — it accepts it and skips the DP entirely. Only when the ungapped attempt looks bad (suggesting a real indel) does it fall back to full SIMD Smith–Waterman. Most gaps in most reads have no indel, so the fast path fires constantly.

The shape of the work Exact seeds (free, from the index) cover most of the read → ungapped check (cheap) handles the clean gaps → SIMD DP (expensive) runs only on the few gaps that actually contain an indel. Each tier is an order of magnitude rarer and costlier than the last. That layering is the whole performance philosophy of the aligner in miniature.

Pairing, and the mate-rescue trap

Short reads usually come in pairs from the two ends of a fragment, so the two mates should land close together in the correct orientation. minibwa pairs them with bwa-mem’s logic. The expensive part is mate rescue: when one mate maps confidently but the other doesn’t, you search the region near the mapped mate for the missing one — running Smith–Waterman over that whole window.

The trap: most of the time the missing mate genuinely isn’t there (it was junk, or off in a repeat), and you’ve burned a full DP for nothing. minibwa adds a pre-alignment filter (mb_ungap) to predict, cheaply, whether a real alignment even exists before paying for DP.

The filter: voting on a diagonal, Hough-style

The idea is borrowed from the Hough transform for finding lines. If the missing mate really sits in this window, then many short q-mers (minibwa uses q = 7) of the read will match the reference at the same diagonal offset — they all “vote” for the same line. If the read isn’t really there, its q-mer matches scatter randomly across offsets and no offset gets many votes.

So: slide the read across the window and tally how many q-mers match at each offset. If the best offset gets at least 10 votes, run the full Smith–Waterman. Otherwise, skip it — there’s no line to find. Add mismatches to the read below and watch the matching q-mers on the true diagonal disappear:

The real threshold is 10 votes with 7-mers; the widget uses 6 votes with shorter q-mers so the idea fits in a toy sequence. The payoff is real: the paper notes this filter lets minibwa skip unsuccessful alignments, and disabling it is exactly what narrows minibwa’s lead over bwa-mem2 on Hi-C and other data where mate rescue is heavy. A cheap vote replaces an expensive DP that was going to fail anyway.

The recurring move This is the same pattern for a third time: a cheap test that predicts whether the expensive operation will succeed, run before the expensive operation. Prefetch hid latency; the ungapped fast path skipped needless DP; the Hough filter skips needless mate rescue — the same instinct applied in three different places.

In plain English Real alignment runs only in the gaps between seeds, and only when an ungapped check says there is an indel worth the DP. For paired reads, a cheap 7-mer vote decides whether the missing mate is even findable before spending a full Smith–Waterman looking for it.

This is the performance story in one sentence: minibwa does not magically remove the hard work, but it refuses to pay for it while the CPU is idle or when a cheap test already says the expensive step will fail.

Sources and credit: this explanation is based on Heng Li and Nils Homer's minibwa paper, and on the design lineage from BWA / bwa-mem and minimap2. The interactive tutorial and widgets were drafted with Opus 4.8.