Building XGBoost from Scratch in Rust, Part 3 — Scaling to Terabytes
In Part 2, we built a working gradient boosted tree implementation. It produces correct, XGBoost-compatible models. But try running it on 100 million rows and you’ll be waiting a while. Let’s understand why, and how production systems solve it.
The Problem: Where Does Time Go?
Before optimizing anything, we need to know what’s slow. Here’s a simplified view of our training loop:
flowchart TD
subgraph "Per Boosting Round"
G["Compute Gradients<br/>O(n)"]
G --> T
subgraph T["Build Tree"]
N["For each node..."]
N --> F["For each feature..."]
F --> S["Sort samples by value<br/>O(k log k)"]
S --> C["Scan for best split<br/>O(k)"]
C --> F
end
T --> P["Update Predictions<br/>O(n)"]
end
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef highlight fill:none,stroke:#f472b6,stroke-width:2px
classDef output fill:none,stroke:#34d399,stroke-width:2px
class G input
class S,C highlight
class P output
Let’s profile where time actually goes:
| Operation | Complexity | % of Training Time |
|---|---|---|
| Split finding (sort + scan) | O(f × k log k) per node, k = samples at the node | 80-90% |
| Tree prediction | O(n × depth) per tree | 10-15% |
| Gradient computation | O(n) per round | 3-5% |
The bottleneck is clear: sorting samples for every feature at every node.
The Exact Greedy Algorithm
Our Part 2 implementation uses exact greedy split finding:
for &col_idx in feature_indices {
// Collect (value, grad, hess) tuples
let mut tuples: Vec<(f32, f32, f32)> = indices
.iter()
.map(|&i| (data.get(i, col_idx), gradients[i], hessians[i]))
.collect(); // ← Allocation: O(k)
tuples.sort_by(|a, b| a.0.total_cmp(&b.0)); // ← Sort: O(k log k)
// Linear scan for best split point
// ...
}
For a node with 1 million samples and 100 features:
- 100 allocations of 1M × 12 bytes = 1.2 GB of temporary memory
- 100 sorts of 1M elements = ~20M comparisons each
This is the exact algorithm—it considers every possible split point. Accurate, but expensive.
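To make the cost concrete, here is a minimal, self-contained sketch of exact greedy split finding for a single feature, using the simplified score g²/h + λ from Part 2. The function name and shape are illustrative, not the Part 2 API; `lambda` is the L2 regularization term.

```rust
// Exact greedy split for one feature: sort, then scan the prefix sums.
// Returns (threshold, gain) for the best split, if any.
fn exact_greedy_split(values: &[f32], grads: &[f32], hess: &[f32], lambda: f32) -> Option<(f32, f32)> {
    let mut idx: Vec<usize> = (0..values.len()).collect();
    idx.sort_by(|&a, &b| values[a].total_cmp(&values[b])); // O(k log k) sort per feature
    let (total_g, total_h): (f32, f32) = (grads.iter().sum(), hess.iter().sum());
    let score = |g: f32, h: f32| g * g / (h + lambda);
    let parent = score(total_g, total_h);
    let (mut lg, mut lh) = (0.0f32, 0.0f32);
    let mut best: Option<(f32, f32)> = None; // (threshold, gain)
    for w in idx.windows(2) {
        lg += grads[w[0]];
        lh += hess[w[0]];
        if values[w[0]] == values[w[1]] {
            continue; // can't split between equal values
        }
        let gain = score(lg, lh) + score(total_g - lg, total_h - lh) - parent;
        if best.map_or(true, |(_, g)| gain > g) {
            best = Some(((values[w[0]] + values[w[1]]) / 2.0, gain));
        }
    }
    best
}
```

The sort and the scan both happen again for every feature at every node, which is exactly the redundancy the rest of this post attacks.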
Thinking Distributed: The Communication Challenge
Now imagine we have 1 billion rows across 100 machines. The obvious approach:
flowchart LR
subgraph "Worker 0"
W0["Rows 0-10M"]
end
subgraph "Worker 1"
W1["Rows 10M-20M"]
end
subgraph "Worker N"
WN["Rows ..."]
end
W0 --> |"Send all (value, g, h)"| M["Master<br/>Sort globally<br/>Find split"]
W1 --> M
WN --> M
M --> |"Broadcast split"| W0
M --> |"Broadcast split"| W1
M --> |"Broadcast split"| WN
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef highlight fill:none,stroke:#f472b6,stroke-width:2px
class W0,W1,WN input
class M highlight
This requires O(n) communication per feature per node. With 1B rows and 100 features, we’re moving terabytes of data through the network just to find one split.
The insight: we don’t need exact split points. We need good enough split points with minimal communication.
The Histogram Trick
Instead of sorting continuous values, we bin them into discrete buckets:
flowchart TD
subgraph "Continuous Values"
V["0.1, 0.3, 0.7, 0.2, 0.9, 0.4, 0.8, 0.5"]
end
subgraph "Binned (4 buckets)"
B0["Bin 0: [0, 0.25)<br/>0.1, 0.2"]
B1["Bin 1: [0.25, 0.5)<br/>0.3, 0.4"]
B2["Bin 2: [0.5, 0.75)<br/>0.7, 0.5"]
B3["Bin 3: [0.75, 1.0)<br/>0.9, 0.8"]
end
V --> B0
V --> B1
V --> B2
V --> B3
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef output fill:none,stroke:#34d399,stroke-width:2px
class V input
class B0,B1,B2,B3 output
Now instead of tracking individual samples, we track histogram bins:
\[\text{Histogram}[f][b] = \left( \sum_{i \in \text{bin } b} g_i, \sum_{i \in \text{bin } b} h_i, \text{count} \right)\]
Building Histograms: O(n) Single Pass
fn build_histogram(
data: &DMatrix,
indices: &[usize],
gradients: &[f32],
hessians: &[f32],
bin_boundaries: &[Vec<f32>], // Pre-computed quantiles per feature
) -> Vec<Vec<HistBin>> {
let n_features = data.n_cols;
let n_bins = 256; // Typical choice; must match the pre-computed boundaries
// Initialize: O(features × bins)
let mut histograms = vec![vec![HistBin::default(); n_bins]; n_features];
// Single pass over data: O(samples × features)
for &row_idx in indices {
let g = gradients[row_idx];
let h = hessians[row_idx];
for feature in 0..n_features {
let value = data.get(row_idx, feature);
let bin = find_bin(value, &bin_boundaries[feature]);
histograms[feature][bin].sum_grad += g;
histograms[feature][bin].sum_hess += h;
histograms[feature][bin].count += 1;
}
}
histograms
}
struct HistBin {
sum_grad: f32,
sum_hess: f32,
count: u32,
}
Complexity comparison:
| Approach | Split Finding | Memory |
|---|---|---|
| Exact greedy | O(f × k log k) | O(f × k) per node |
| Histogram | O(f × k + f × bins) | O(f × bins) fixed |
With 256 bins, histogram building is O(k) per feature and split finding is O(256) per feature, independent of the sample count.
Finding Splits from Histograms
fn find_best_split_from_histogram(
histogram: &[HistBin],
parent_stats: &SplitStats,
) -> Option<(usize, f32)> { // (bin_index, gain)
let mut left_stats = SplitStats::default();
let mut best: Option<(usize, f32)> = None;
    // Linear scan: O(bins)
    for (bin_idx, bin) in histogram.iter().enumerate() {
        left_stats.sum_grad += bin.sum_grad;
        left_stats.sum_hess += bin.sum_hess;
        left_stats.count += bin.count as usize;
        if bin_idx + 1 == histogram.len() {
            break; // splitting after the last bin leaves the right child empty
        }
        let right_stats = *parent_stats - left_stats;
        let gain = calculate_gain(&left_stats, &right_stats, parent_stats);
        if gain > best.map_or(0.0, |(_, g)| g) {
            best = Some((bin_idx, gain));
        }
    }
best
}
The split threshold becomes a bin boundary rather than an exact feature value. With 256 bins we lose some precision but gain a massive speedup.
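The two halves of the method fit together in a few lines. Here is a self-contained sketch for a single feature: bin the values, accumulate gradient statistics per bin in one O(k) pass, then scan the bins. The names and the simplified gain score g²/h (regularization omitted) are illustrative, not XGBoost's API.

```rust
#[derive(Clone, Copy, Default)]
struct HistBin { sum_grad: f32, sum_hess: f32, count: u32 }

fn find_bin(value: f32, boundaries: &[f32]) -> usize {
    // boundaries[i] is the upper edge of bin i; the last bin is open-ended
    boundaries.iter().position(|&b| value < b).unwrap_or(boundaries.len())
}

fn best_split_bin(values: &[f32], grads: &[f32], hess: &[f32], boundaries: &[f32]) -> (usize, f32) {
    let n_bins = boundaries.len() + 1;
    let mut hist = vec![HistBin::default(); n_bins];
    for i in 0..values.len() {                       // single O(k) pass over samples
        let b = find_bin(values[i], boundaries);
        hist[b].sum_grad += grads[i];
        hist[b].sum_hess += hess[i];
        hist[b].count += 1;
    }
    let (tg, th): (f32, f32) = (grads.iter().sum(), hess.iter().sum());
    let score = |g: f32, h: f32| if h > 0.0 { g * g / h } else { 0.0 };
    let (mut lg, mut lh) = (0.0, 0.0);
    let (mut best_bin, mut best_gain) = (0, f32::MIN);
    for b in 0..n_bins - 1 {                         // O(bins) scan; last bin excluded
        lg += hist[b].sum_grad;
        lh += hist[b].sum_hess;
        let gain = score(lg, lh) + score(tg - lg, th - lh) - score(tg, th);
        if gain > best_gain { best_gain = gain; best_bin = b; }
    }
    (best_bin, best_gain)
}
```

The returned bin index maps back to a threshold via the bin's upper boundary.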
Distributed Histogram Aggregation
Here’s where histograms shine for distributed training:
flowchart TD
subgraph "Worker 0"
D0["Local Data<br/>Rows 0-10M"]
H0["Local Histogram<br/>256 bins × 100 features"]
end
subgraph "Worker 1"
D1["Local Data<br/>Rows 10M-20M"]
H1["Local Histogram"]
end
subgraph "Worker 2"
D2["Local Data<br/>Rows 20M-30M"]
H2["Local Histogram"]
end
D0 --> H0
D1 --> H1
D2 --> H2
H0 --> AR["All-Reduce<br/>(sum histograms)"]
H1 --> AR
H2 --> AR
AR --> GH["Global Histogram<br/>= H0 + H1 + H2"]
GH --> S0["Worker 0:<br/>Find best split"]
GH --> S1["Worker 1:<br/>Find best split"]
GH --> S2["Worker 2:<br/>Find best split"]
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef result fill:none,stroke:#a78bfa,stroke-width:2px
classDef highlight fill:none,stroke:#f472b6,stroke-width:2px
classDef output fill:none,stroke:#34d399,stroke-width:2px
class D0,D1,D2 input
class H0,H1,H2 result
class AR,GH highlight
class S0,S1,S2 output
Communication cost:
- Per node: O(features × bins × 3 floats)
- With 100 features, 256 bins: 307 KB per tree node
- vs. exact method: O(n) samples × 12 bytes per feature
This is the key insight: histogram aggregation is additive. Each worker computes local histograms, then a single all-reduce produces the global histogram.
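The merge step itself is just element-wise addition. A sketch, assuming the `HistBin` struct defined earlier (with derives added so results can be compared):

```rust
#[derive(Clone, Copy, Default, PartialEq, Debug)]
struct HistBin { sum_grad: f32, sum_hess: f32, count: u32 }

// Summing per-bin statistics across workers yields exactly the histogram
// a single machine would have built over the full dataset.
fn merge_histograms(workers: &[Vec<HistBin>]) -> Vec<HistBin> {
    let n_bins = workers[0].len();
    let mut global = vec![HistBin::default(); n_bins];
    for local in workers {
        for (g, l) in global.iter_mut().zip(local) {
            g.sum_grad += l.sum_grad;
            g.sum_hess += l.sum_hess;
            g.count += l.count;
        }
    }
    global
}
```

Because addition is associative and commutative, the same merge works whether it runs on a master node or inside an all-reduce.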
All-Reduce: Efficient Global Aggregation
Ring all-reduce is the standard pattern:
flowchart LR
subgraph "Ring All-Reduce"
W0["Worker 0"] --> |"chunk 0"| W1["Worker 1"]
W1 --> |"chunk 1"| W2["Worker 2"]
W2 --> |"chunk 2"| W3["Worker 3"]
W3 --> |"chunk 3"| W0
end
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
class W0,W1,W2,W3 input
Each worker sends 1/k of the data to the next worker, receives 1/k from the previous, and accumulates. After k − 1 such steps every worker holds one fully reduced chunk (reduce-scatter); another k − 1 steps circulate those chunks around the ring so everyone has the full global sum (all-gather).
Bandwidth: O(data_size) total, not O(workers × data_size).
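The two phases can be simulated in a single process. This sketch runs ring all-reduce over k "workers", each holding a vector of equal length (assumed divisible by k); real implementations overlap the sends and receives over a network, but the chunk bookkeeping is the same.

```rust
// Single-process simulation of ring all-reduce. Phase 1 (reduce-scatter)
// leaves each worker with one fully reduced chunk; phase 2 (all-gather)
// circulates those chunks. Per-worker traffic is about 2*(k-1)/k of the
// data size, independent of the number of workers.
fn ring_all_reduce(mut bufs: Vec<Vec<f32>>) -> Vec<Vec<f32>> {
    let k = bufs.len();
    let n = bufs[0].len();
    assert_eq!(n % k, 0, "buffer length must be divisible by worker count");
    let chunk = n / k;
    let range = |c: usize| c * chunk..(c + 1) * chunk;
    // Phase 1: reduce-scatter, k - 1 steps. At step s, worker w sends
    // chunk (w - s) mod k to its right neighbor, which adds it in place.
    for step in 0..k - 1 {
        let sends: Vec<Vec<f32>> =
            (0..k).map(|w| bufs[w][range((w + k - step) % k)].to_vec()).collect();
        for w in 0..k {
            let src = (w + k - 1) % k;     // left neighbor
            let c = (src + k - step) % k;  // chunk arriving this step
            for (i, v) in sends[src].iter().enumerate() {
                bufs[w][c * chunk + i] += v;
            }
        }
    }
    // Phase 2: all-gather, k - 1 steps: fully reduced chunks circulate
    // around the ring until every worker holds all of them.
    for step in 0..k - 1 {
        let sends: Vec<Vec<f32>> =
            (0..k).map(|w| bufs[w][range((w + 1 + k - step) % k)].to_vec()).collect();
        for w in 0..k {
            let src = (w + k - 1) % k;
            let c = (src + 1 + k - step) % k;
            bufs[w][range(c)].copy_from_slice(&sends[src]);
        }
    }
    bufs
}
```

For histogram aggregation, the flattened (grad, hess, count) arrays are the buffers being reduced.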
Computing Bin Boundaries: Quantile Sketches
Before training, we need to determine where the bin boundaries are. We want quantiles (e.g., 256-quantiles) of each feature’s distribution.
For distributed data, we can’t sort globally. Instead, we use approximate quantile sketches:
flowchart TD
subgraph "Each Worker"
L["Local data"]
S["Build local sketch<br/>(compressed summary)"]
end
L --> S
S --> M["Merge sketches<br/>(all-reduce)"]
M --> Q["Extract global quantiles<br/>from merged sketch"]
Q --> B["Broadcast bin boundaries<br/>to all workers"]
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef result fill:none,stroke:#a78bfa,stroke-width:2px
classDef highlight fill:none,stroke:#f472b6,stroke-width:2px
classDef output fill:none,stroke:#34d399,stroke-width:2px
class L input
class S result
class M,Q highlight
class B output
The key property: sketches are mergeable. A sketch of size O(1/ε) can answer quantile queries within ε error, regardless of data size.
XGBoost uses a weighted quantile sketch (Section 3.2 of the original paper) where samples are weighted by their Hessians:
\[\text{rank}(x) = \frac{\sum_{i: x_i < x} h_i}{\sum_i h_i}\]
This means regions where the model is uncertain (high Hessian) get finer-grained bins. The sketch algorithm provides theoretical guarantees on quantile accuracy while being mergeable across distributed workers.
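To see what Hessian-weighted boundaries look like, here is a sketch that computes exact weighted quantile cuts from (value, hessian) pairs. This is the easy in-memory version for clarity; XGBoost's weighted quantile sketch produces approximately the same boundaries with an O(1/ε)-size, mergeable summary instead of the full sorted data.

```rust
// Cut points chosen so each bin holds roughly total_hessian / n_bins of
// Hessian mass. High-Hessian regions therefore get more boundaries.
fn weighted_quantile_boundaries(mut pairs: Vec<(f32, f32)>, n_bins: usize) -> Vec<f32> {
    pairs.sort_by(|a, b| a.0.total_cmp(&b.0));
    let total_h: f32 = pairs.iter().map(|p| p.1).sum();
    let mut boundaries = Vec::with_capacity(n_bins - 1);
    let mut cum = 0.0f32;
    let mut next_cut = 1;
    for &(value, h) in &pairs {
        cum += h;
        // Emit a boundary each time cumulative weight crosses a quantile target
        while next_cut < n_bins && cum >= total_h * next_cut as f32 / n_bins as f32 {
            boundaries.push(value);
            next_cut += 1;
        }
    }
    boundaries
}
```

With uniform weights this degenerates to ordinary quantiles; skewing the weights pulls the boundaries toward the heavy region.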
The Histogram Subtraction Trick
When building a tree, we compute histograms for many nodes. But there’s redundancy:
\[\text{Histogram}_{\text{parent}} = \text{Histogram}_{\text{left}} + \text{Histogram}_{\text{right}}\]
Therefore:
\[\text{Histogram}_{\text{right}} = \text{Histogram}_{\text{parent}} - \text{Histogram}_{\text{left}}\]
flowchart TD
P["Parent Node<br/>Histogram: computed"]
P --> L["Left Child<br/>Histogram: computed<br/>(smaller child)"]
P --> R["Right Child<br/>Histogram: parent - left<br/>(FREE!)"]
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef result fill:none,stroke:#a78bfa,stroke-width:2px
classDef done fill:none,stroke:#34d399,stroke-width:2px
class P input
class L result
class R done
Strategy: Always compute the histogram for the smaller child, derive the larger one by subtraction.
This gives a 2x speedup in histogram building with zero additional communication.
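The derivation is a single bin-by-bin subtraction, with no pass over the data. A sketch, reusing the `HistBin` shape from earlier (derives added for comparison):

```rust
#[derive(Clone, Copy, Default, PartialEq, Debug)]
struct HistBin { sum_grad: f32, sum_hess: f32, count: u32 }

// right = parent - left, per bin. O(bins) work regardless of sample count.
fn subtract_histogram(parent: &[HistBin], left: &[HistBin]) -> Vec<HistBin> {
    parent.iter().zip(left).map(|(p, l)| HistBin {
        sum_grad: p.sum_grad - l.sum_grad,
        sum_hess: p.sum_hess - l.sum_hess,
        count: p.count - l.count,
    }).collect()
}
```

One caveat worth knowing: repeated float subtraction accumulates rounding error down the tree, which is one reason production implementations often use f64 accumulators for histogram sums.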
Sparsity-Aware Split Finding
Real-world data is often sparse—missing values, zeros from one-hot encoding, or genuine nulls. XGBoost handles this with a sparsity-aware algorithm (Section 3.4 of the original paper) that learns the optimal default direction for missing values at each split.
flowchart TD
subgraph "Split on Feature X"
N["Node: 1000 samples<br/>800 have value, 200 missing"]
N --> L["Left: value < 0.5"]
N --> R["Right: value ≥ 0.5"]
N --> M["Missing: which way?"]
end
M --> |"Try left"| GL["Gain if missing → left"]
M --> |"Try right"| GR["Gain if missing → right"]
GL --> D["Pick direction with<br/>higher gain"]
GR --> D
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef highlight fill:none,stroke:#f472b6,stroke-width:2px
classDef result fill:none,stroke:#a78bfa,stroke-width:2px
classDef output fill:none,stroke:#34d399,stroke-width:2px
class N input
class L,R result
class M,GL,GR highlight
class D output
The algorithm:
fn find_split_with_default_direction(
    histogram: &[HistBin],     // Bins cover samples with observed values only
    parent_stats: &SplitStats, // Includes samples where this feature is missing
) -> SplitCandidate {
    let mut best = SplitCandidate::default();
    // Forward scan: missing values go RIGHT
    // (left accumulates observed bins; right = parent - left picks up the missing mass)
    let mut left = SplitStats::default();
    for (bin, stats) in histogram.iter().enumerate() {
        left += stats;
        let right = *parent_stats - left; // Includes missing
        let gain = calculate_gain(&left, &right, parent_stats);
        if gain > best.gain {
            best = SplitCandidate { bin, gain, default_left: false, ..Default::default() };
        }
    }
    // Backward scan: missing values go LEFT
    let mut right = SplitStats::default();
    for (bin, stats) in histogram.iter().enumerate().rev() {
        right += stats;
        let left = *parent_stats - right; // Includes missing
        let gain = calculate_gain(&left, &right, parent_stats);
        if gain > best.gain {
            best = SplitCandidate { bin, gain, default_left: true, ..Default::default() };
        }
    }
    best
}
This is powerful: the model learns where missing values should go based on the training data, rather than requiring imputation.
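Here is the two-scan search as a self-contained, runnable sketch. The histogram covers only samples with observed values, while `parent` additionally includes the missing-value mass, so the forward scan implicitly routes missing samples right and the backward scan routes them left. The simplified score g²/h (regularization omitted) and all names are illustrative.

```rust
#[derive(Clone, Copy, Default)]
struct Stats { g: f32, h: f32 }

fn score(s: Stats) -> f32 { if s.h > 0.0 { s.g * s.g / s.h } else { 0.0 } }

// Returns (bin after which to split, gain, default direction for missing).
fn best_split_with_default(hist: &[Stats], parent: Stats) -> (usize, f32, bool) {
    let (mut best_bin, mut best_gain, mut default_left) = (0, f32::MIN, false);
    // Forward scan: missing -> right (right = parent - left includes missing)
    let mut left = Stats::default();
    for (b, s) in hist.iter().enumerate().take(hist.len() - 1) {
        left.g += s.g;
        left.h += s.h;
        let right = Stats { g: parent.g - left.g, h: parent.h - left.h };
        let gain = score(left) + score(right) - score(parent);
        if gain > best_gain { best_gain = gain; best_bin = b; default_left = false; }
    }
    // Backward scan: missing -> left (left = parent - right includes missing)
    let mut right = Stats::default();
    for (b, s) in hist.iter().enumerate().skip(1).rev() {
        right.g += s.g;
        right.h += s.h;
        let left = Stats { g: parent.g - right.g, h: parent.h - right.h };
        let gain = score(left) + score(right) - score(parent);
        if gain > best_gain { best_gain = gain; best_bin = b - 1; default_left = true; }
    }
    (best_bin, best_gain, default_left)
}
```

If the missing samples carry mostly negative gradients, the backward scan wins and they are routed toward the negative-gradient side, which is exactly the "learned default direction" in action.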
Cache-Aware Block Structure
Memory access patterns matter—a lot. Modern CPUs can process data 100x faster from L1 cache than from main memory. XGBoost addresses this with a block structure for efficient cache utilization (Section 4.1 of the original paper).
The problem with row-major data:
flowchart LR
subgraph "Row-Major Layout"
R["Row 0: f0, f1, f2, f3<br/>Row 1: f0, f1, f2, f3<br/>Row 2: f0, f1, f2, f3"]
end
subgraph "Building histogram for f0"
A["Access row 0, col 0 ✓<br/>Skip f1, f2, f3...<br/>Access row 1, col 0 ✓<br/>Skip f1, f2, f3..."]
end
R --> A
A --> C["Cache misses!<br/>Loading unused data"]
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef progress fill:none,stroke:#fbbf24,stroke-width:2px
classDef negative fill:none,stroke:#f87171,stroke-width:2px
class R input
class A progress
class C negative
XGBoost’s solution: Column Block structure.
flowchart TD
subgraph "Block Structure"
B1["Block 1<br/>Feature 0: sorted (value, row_idx) pairs<br/>Feature 1: sorted (value, row_idx) pairs"]
B2["Block 2<br/>Feature 2: sorted (value, row_idx) pairs<br/>Feature 3: sorted (value, row_idx) pairs"]
end
subgraph "Benefits"
C1["Sequential memory access"]
C2["Pre-sorted for split finding"]
C3["Parallelizable by block"]
end
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef positive fill:none,stroke:#4ade80,stroke-width:2px
class B1,B2 input
class C1,C2,C3 positive
Each block stores a subset of features in column-major, pre-sorted order:
struct ColumnBlock {
// For each feature in this block:
// - Values sorted ascending
// - Corresponding row indices
// - Pre-computed bin indices (for histogram method)
features: Vec<SortedFeature>,
}
struct SortedFeature {
values: Vec<f32>, // Sorted feature values
row_indices: Vec<u32>, // Which row each value came from
bin_indices: Vec<u8>, // Pre-computed histogram bins
}
Benefits:
- Sequential reads when scanning a feature (cache-friendly)
- Pre-sorted eliminates per-node sorting for exact method
- Blocks enable parallelism — different threads process different blocks
- Pre-computed bins avoid repeated binary search during histogram building
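Building one such pre-sorted column is a one-time O(k log k) cost paid at load, amortized over every node of every tree. A sketch, assuming the `SortedFeature` layout above and quantile boundaries computed beforehand:

```rust
struct SortedFeature {
    values: Vec<f32>,      // Sorted feature values
    row_indices: Vec<u32>, // Which row each value came from
    bin_indices: Vec<u8>,  // Pre-computed histogram bins
}

// Sort one column once; later scans read values, row indices, and bin
// indices sequentially, never re-sorting or re-binning.
fn build_sorted_feature(column: &[f32], boundaries: &[f32]) -> SortedFeature {
    let mut order: Vec<u32> = (0..column.len() as u32).collect();
    order.sort_by(|&a, &b| column[a as usize].total_cmp(&column[b as usize]));
    let values: Vec<f32> = order.iter().map(|&r| column[r as usize]).collect();
    let bin_indices = values.iter()
        .map(|&v| boundaries.iter().position(|&b| v < b).unwrap_or(boundaries.len()) as u8)
        .collect();
    SortedFeature { values, row_indices: order, bin_indices }
}
```

Note the u8 bin index: with at most 256 bins, each binned value costs one byte, a 4x compression over the raw f32.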
Out-of-Core Computation
What if your data doesn’t fit in memory? XGBoost supports external memory training (also described in Section 4.3 of the original paper).
flowchart LR
subgraph "Disk"
D1["Block 1"]
D2["Block 2"]
D3["Block 3"]
D4["Block N"]
end
subgraph "Memory (limited)"
M["Active blocks<br/>(fits in RAM)"]
end
subgraph "Processing"
P["Build histograms<br/>for loaded blocks"]
end
D1 --> |"Load"| M
D2 --> |"Prefetch"| M
M --> P
P --> |"Accumulate"| H["Global histogram"]
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef progress fill:none,stroke:#fbbf24,stroke-width:2px
classDef highlight fill:none,stroke:#f472b6,stroke-width:2px
classDef output fill:none,stroke:#34d399,stroke-width:2px
class D1,D2,D3,D4 input
class M progress
class P highlight
class H output
The strategy:
- Partition data into blocks that fit in memory
- Stream blocks from disk, prefetching the next block while processing current
- Accumulate histograms across all blocks
- Compress blocks on disk (sorted data compresses well)
This trades speed for scale—you can train on datasets 10x larger than RAM, at ~2-3x slowdown.
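The core of out-of-core training is that histogram accumulation only ever needs one block resident at a time. This sketch reduces the idea to its essence, simulating on-disk blocks with an iterator of in-memory chunks and tracking only (sum_grad, count) per bin; a real implementation would stream from disk and prefetch the next block while processing the current one.

```rust
// Accumulate a histogram block by block. Each block is a list of
// (feature value, gradient) rows; only one block is "loaded" at a time.
fn accumulate_over_blocks<I>(blocks: I, boundaries: &[f32]) -> Vec<(f32, u32)>
where
    I: IntoIterator<Item = Vec<(f32, f32)>>,
{
    let n_bins = boundaries.len() + 1;
    let mut hist = vec![(0.0f32, 0u32); n_bins]; // (sum_grad, count) per bin
    for block in blocks {
        for (value, grad) in block {
            let b = boundaries.iter().position(|&e| value < e).unwrap_or(boundaries.len());
            hist[b].0 += grad;
            hist[b].1 += 1;
        }
    }
    hist
}
```

Because the accumulator is tiny (bins, not samples), the memory footprint is fixed no matter how many blocks stream through.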
Level-wise vs Leaf-wise Growth
Two strategies for building trees:
Level-wise (XGBoost default)
flowchart TD
R["Root"]
R --> L1["Level 1"]
R --> L2["Level 1"]
L1 --> L3["Level 2"]
L1 --> L4["Level 2"]
L2 --> L5["Level 2"]
L2 --> L6["Level 2"]
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef result fill:none,stroke:#a78bfa,stroke-width:2px
classDef done fill:none,stroke:#34d399,stroke-width:2px
class R input
class L1,L2 result
class L3,L4,L5,L6 done
Build all nodes at depth d before depth d+1.
Pros:
- Balanced trees
- Predictable communication pattern
- Natural parallelism (all nodes at a level are independent)
Cons:
- May build splits that don’t improve the model
Leaf-wise (LightGBM default)
flowchart TD
R["Root"]
R --> L1["Leaf A<br/>gain=0.5"]
R --> L2["Split here!<br/>gain=0.8"]
L2 --> L3["Leaf B<br/>gain=0.3"]
L2 --> L4["Leaf C<br/>gain=0.9"]
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef result fill:none,stroke:#a78bfa,stroke-width:2px
classDef highlight fill:none,stroke:#f472b6,stroke-width:2px
class R input
class L1,L3 result
class L2,L4 highlight
Always split the leaf with the highest gain.
Pros:
- Faster convergence (fewer nodes for same accuracy)
- Lower loss with fewer splits
Cons:
- Can overfit (deeper trees)
- Less parallelism
- Unbalanced communication
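Leaf-wise growth is naturally driven by a max-heap keyed on split gain. This sketch shows only the selection policy: gains are supplied up front for illustration, whereas a real trainer would recompute candidate gains from child histograms after each split and push the children back onto the heap.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

#[derive(PartialEq)]
struct Candidate { gain: f32, leaf: usize }
impl Eq for Candidate {}
impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> { Some(self.cmp(other)) }
}
impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering { self.gain.total_cmp(&other.gain) }
}

// Always expand the highest-gain leaf, stopping at a leaf budget or when
// no remaining split has positive gain. Returns leaves in expansion order.
fn grow_leaf_wise(candidate_gains: &[(usize, f32)], max_leaves: usize) -> Vec<usize> {
    let mut heap: BinaryHeap<Candidate> = candidate_gains
        .iter()
        .map(|&(leaf, gain)| Candidate { gain, leaf })
        .collect();
    let mut expanded = Vec::new();
    while expanded.len() < max_leaves {
        match heap.pop() {
            Some(c) if c.gain > 0.0 => expanded.push(c.leaf), // split the best leaf
            _ => break, // no positive-gain split remains
        }
    }
    expanded
}
```

The `max_leaves` cap is the standard guard against the overfitting risk listed above (it is the `num_leaves` idea in LightGBM).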
Putting It All Together: The Distributed Training Loop
sequenceDiagram
participant W as Workers (parallel)
participant AR as All-Reduce
Note over W: INIT: Partition data, compute quantile sketches
W->>AR: Local sketches
AR->>W: Global bin boundaries
loop Each Boosting Round
Note over W: Compute local gradients (no communication)
loop Each Tree Node
W->>W: Build local histogram O(local_samples)
W->>AR: Local histogram (300KB)
AR->>W: Global histogram
W->>W: Find best split from histogram
W->>W: Partition samples to children
end
Note over W: Update local predictions (no communication)
end
Communication Summary
| Operation | Frequency | Size | Total |
|---|---|---|---|
| Quantile sketches | Once | O(bins × features) | ~100KB |
| Histogram all-reduce | Per node | O(bins × features × 3) | ~300KB |
| Split broadcast | Per node | O(1) | ~20B |
For a model with 100 trees, depth 6 (63 nodes each):
- Exact method: ~6,300 splits × O(n) sample statistics ≈ terabytes of network traffic
- Histogram method: ~6,300 splits × ~300 KB ≈ 2 GB total
Performance Comparison
| Optimization | Speedup | Trade-off |
|---|---|---|
| Histogram binning | 10-50x | Slight precision loss (~256 candidate splits vs all) |
| Histogram subtraction | 2x | None |
| Weighted quantile sketch | Enables distributed | O(1/ε) accuracy guarantee |
| Sparsity-aware splits | 2x on sparse data | None (learns optimal direction) |
| Column block structure | 2-5x | Memory for pre-sorting |
| Combined | 50-200x | Minimal quality impact |
XGBoost at Scale
XGBoost provides multiple backends for distributed training:
| Backend | Use Case | Communication |
|---|---|---|
| Dask | Python data science workflows | In-memory, single machine to cluster |
| Spark | Enterprise data pipelines | HDFS/S3, fault-tolerant |
| Ray | ML platforms, hyperparameter tuning | Actor-based, flexible |
All backends use the same core algorithm: histogram aggregation via all-reduce.
GPU Acceleration
XGBoost’s CUDA implementation parallelizes across samples and features:
- Histogram building: one thread block per feature
- Split finding: parallel reduction
- 10-50x speedup over CPU for large datasets
Other Libraries
Other gradient boosting libraries take different approaches:
- LightGBM: Leaf-wise tree growth, gradient-based one-side sampling (GOSS), exclusive feature bundling (EFB). Often faster training, especially on large datasets. See LightGBM Features and the LightGBM paper (Ke et al., 2017).
- CatBoost: Ordered boosting to reduce prediction shift, native categorical feature handling without one-hot encoding, symmetric (oblivious) trees for faster inference. See CatBoost documentation and research papers.
Key Takeaways
- The bottleneck is split finding — sorting samples per feature per node dominates training time
- Histograms change the game — O(n log n) → O(n) for split finding, O(n) → O(bins) for communication
- Distributed training is about minimizing communication — histogram aggregation sends kilobytes instead of gigabytes
- Approximations are fine — 256 bins lose minimal precision but enable massive scale
- Data layout matters — column blocks with pre-sorting turn random access into sequential scans
- Handle missing data natively — learning default directions beats imputation
The path from our Part 2 implementation to production scale isn’t about clever tricks—it’s about recognizing that exact answers aren’t necessary when good-enough answers are orders of magnitude cheaper.
References
Original Paper
The algorithmic innovations discussed here—weighted quantile sketch, sparsity-aware split finding, and cache-aware block structure—are described in detail in the original XGBoost paper:
- Chen, T. and Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of KDD '16.
Official Documentation
| Topic | Link |
|---|---|
| Tree Methods (histogram, approx) | xgboost.readthedocs.io/en/stable/treemethod.html |
| External Memory Training | xgboost.readthedocs.io/en/stable/tutorials/external_memory.html |
| GPU Acceleration | xgboost.readthedocs.io/en/stable/gpu/ |
| Distributed with Dask | xgboost.readthedocs.io/en/stable/tutorials/dask.html |
| Distributed with Spark | xgboost.readthedocs.io/en/stable/tutorials/spark_estimator.html |
| Distributed with Ray | xgboost.readthedocs.io/en/stable/tutorials/ray.html |
| Parameters Reference | xgboost.readthedocs.io/en/stable/parameter.html |
Part 3 of the “XGBoost from Scratch” series. Part 1 covers theory, Part 2 covers implementation.