Building XGBoost from Scratch in Rust, Part 2 — Implementation
In Part 1, we covered the theory behind gradient boosting. Now let’s implement it. We’ll build a gradient boosted tree library in Rust that produces XGBoost-compatible models.
Architecture Overview
graph TD
subgraph "Data Layer"
DMatrix["DMatrix<br/>Row-major feature matrix"]
end
subgraph "Core Algorithm"
Objective["Objective<br/>Loss, gradient, Hessian"]
TreeBuilder["TreeBuilder<br/>Find splits, build tree"]
Tree["Tree<br/>Nodes + prediction"]
end
subgraph "Orchestration"
Booster["GradientBooster<br/>Training loop"]
end
subgraph "I/O"
Serialize["Serialize<br/>XGBoost JSON format"]
end
DMatrix --> Booster
Objective --> Booster
Booster --> TreeBuilder
TreeBuilder --> Tree
Booster --> Serialize
Let’s build each component.
1. The Objective Trait: Encoding the Loss Function
The objective defines what we’re optimizing. It computes gradients and Hessians for each sample.
pub trait Objective: Send + Sync {
/// First derivative: g = ∂L/∂ŷ
fn gradient(&self, pred: f32, label: f32) -> f32;
/// Second derivative: h = ∂²L/∂ŷ²
fn hessian(&self, pred: f32, label: f32) -> f32;
/// Transform raw score to final prediction (e.g., sigmoid)
fn transform_prediction(&self, raw: f32) -> f32 { raw }
/// Loss value for progress reporting
fn loss(&self, pred: f32, label: f32) -> f32;
}
Squared Error (Regression)
For $L = \frac{1}{2}(y - \hat{y})^2$:
\[g = \frac{\partial L}{\partial \hat{y}} = \hat{y} - y\] \[h = \frac{\partial^2 L}{\partial \hat{y}^2} = 1\]
pub struct SquaredError;
impl Objective for SquaredError {
fn gradient(&self, pred: f32, label: f32) -> f32 {
pred - label
}
fn hessian(&self, _pred: f32, _label: f32) -> f32 {
1.0 // Constant curvature
}
fn loss(&self, pred: f32, label: f32) -> f32 {
let diff = pred - label;
0.5 * diff * diff // Matches L = ½(y − ŷ)², so that g = ŷ − y and h = 1
}
}
Logistic Loss (Binary Classification)
For $L = -[y \log \sigma(\hat{y}) + (1-y)\log(1-\sigma(\hat{y}))]$:
\[g = \sigma(\hat{y}) - y\] \[h = \sigma(\hat{y})(1 - \sigma(\hat{y}))\]
pub struct LogisticLoss;
impl Objective for LogisticLoss {
fn gradient(&self, pred: f32, label: f32) -> f32 {
sigmoid(pred) - label
}
fn hessian(&self, pred: f32, _label: f32) -> f32 {
let p = sigmoid(pred);
p * (1.0 - p) // High when uncertain, low when confident
}
fn transform_prediction(&self, raw: f32) -> f32 {
sigmoid(raw) // Convert logit to probability
}
fn loss(&self, pred: f32, label: f32) -> f32 {
// Cross-entropy; clamp avoids ln(0) for saturated predictions
let p = sigmoid(pred).clamp(1e-7, 1.0 - 1e-7);
-(label * p.ln() + (1.0 - label) * (1.0 - p).ln())
}
}
fn sigmoid(x: f32) -> f32 {
1.0 / (1.0 + (-x).exp())
}
The Hessian $p(1-p)$ is elegant: samples where the model is confident (p near 0 or 1) contribute less to split decisions. Uncertain samples matter more.
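To make that weighting concrete, here is a small standalone check mirroring the `sigmoid` and Hessian logic above: at a raw score of 0 (p = 0.5) the Hessian peaks at 0.25, while at ±4 (p near 0 or 1) it falls below 0.02.

```rust
// Standalone sketch of the logistic Hessian from the impl above.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// h = p(1 - p): maximal at p = 0.5 (uncertain), near zero when confident.
fn logistic_hessian(pred: f32) -> f32 {
    let p = sigmoid(pred);
    p * (1.0 - p)
}
```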
2. Tree Structure: Flattened for Performance
Instead of a recursive tree with heap-allocated children, we use a flat array:
graph TD
subgraph "Logical Tree"
R["Root (0)"] --> L["Left (1)"]
R --> R2["Right (2)"]
L --> LL["Leaf (3)"]
L --> LR["Leaf (4)"]
R2 --> RL["Leaf (5)"]
R2 --> RR["Leaf (6)"]
end
subgraph "Array Storage"
A["[0] [1] [2] [3] [4] [5] [6]"]
end
pub struct TreeNode {
pub split_feature: usize, // Which feature to split on
pub split_value: f32, // Threshold
pub left_child: usize, // Index of left child
pub right_child: usize, // Index of right child
pub is_leaf: bool,
pub weight: f32, // Leaf prediction value
pub gain: f32, // Split quality (for feature importance)
pub default_left: bool, // Direction for NaN values
}
pub struct Tree {
pub nodes: Vec<TreeNode>, // Flat array, root at index 0
pub n_features: usize,
}
Prediction is a simple traversal:
pub fn predict_single(&self, features: &[f32]) -> f32 {
let mut node = &self.nodes[0]; // Start at root
while !node.is_leaf {
let val = features[node.split_feature];
let go_left = val < node.split_value
|| (val.is_nan() && node.default_left);
node = if go_left {
&self.nodes[node.left_child]
} else {
&self.nodes[node.right_child]
};
}
node.weight
}
3. Finding the Best Split
This is the core algorithm. For each node, we find the (feature, threshold) pair that maximizes gain.
The Gain Formula (from Part 1)
\[\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right] - \gamma\]
pub fn calculate_split_gain(
&self,
left: &SplitStats,
right: &SplitStats,
parent: &SplitStats,
) -> f32 {
let score = |g: f32, h: f32| g * g / (h + self.reg_lambda);
0.5 * (score(left.sum_grad, left.sum_hess)
+ score(right.sum_grad, right.sum_hess)
- score(parent.sum_grad, parent.sum_hess))
- self.gamma
}
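A quick numeric check of the formula, with stats simplified to `(G, H)` pairs: splitting cancelling gradients apart (left G = −2, right G = +2, λ = 1, γ = 0) gives gain 2.0, since the parent score G²/(H+λ) is zero.

```rust
/// Gain = ½ [ G_L²/(H_L+λ) + G_R²/(H_R+λ) − G²/(H+λ) ] − γ
/// Stats are (sum_grad, sum_hess) pairs; parent = left + right.
fn split_gain(left: (f32, f32), right: (f32, f32), lambda: f32, gamma: f32) -> f32 {
    let score = |g: f32, h: f32| g * g / (h + lambda);
    let (gp, hp) = (left.0 + right.0, left.1 + right.1);
    0.5 * (score(left.0, left.1) + score(right.0, right.1) - score(gp, hp)) - gamma
}
```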
Leaf Weight with L1/L2 Regularization
\[w^* = -\frac{\text{sign}(G) \cdot \max(0, |G| - \alpha)}{H + \lambda}\]
pub fn calculate_leaf_weight(&self, sum_grad: f32, sum_hess: f32) -> f32 {
let abs_grad = sum_grad.abs();
if abs_grad <= self.reg_alpha {
0.0 // L1 shrinks small gradients to zero
} else {
-sum_grad.signum() * (abs_grad - self.reg_alpha)
/ (sum_hess + self.reg_lambda)
}
}
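The L1 soft-thresholding behaviour can be exercised in isolation (same formula, with α and λ passed explicitly): with α = 0 the weight is the plain Newton step −G/(H+λ); once α ≥ |G|, the leaf snaps to exactly zero.

```rust
/// w* = −sign(G) · max(0, |G| − α) / (H + λ)
fn leaf_weight(sum_grad: f32, sum_hess: f32, alpha: f32, lambda: f32) -> f32 {
    let abs_grad = sum_grad.abs();
    if abs_grad <= alpha {
        0.0 // L1 shrinks small gradient sums to exactly zero
    } else {
        -sum_grad.signum() * (abs_grad - alpha) / (sum_hess + lambda)
    }
}
```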
The Split-Finding Algorithm
For each feature, we sort samples and scan left-to-right, accumulating statistics:
sequenceDiagram
participant Samples as Sorted Samples
participant Left as Left Stats
participant Right as Right Stats
participant Best as Best Split
Note over Samples: Sort by feature value
Note over Right: Initialize: Right = All samples
loop For each sample (left to right)
Samples->>Left: Move sample to left
Samples->>Right: Remove from right
Left->>Best: Calculate gain
Note over Best: Update if better
end
pub fn find_best_split(
&self,
data: &DMatrix,
indices: &[usize],
gradients: &[f32],
hessians: &[f32],
feature_indices: &[usize], // For column subsampling
) -> Option<SplitCandidate> {
let parent_stats = SplitStats {
sum_grad: indices.iter().map(|&i| gradients[i]).sum(),
sum_hess: indices.iter().map(|&i| hessians[i]).sum(),
count: indices.len(),
};
let mut best_split: Option<SplitCandidate> = None;
for &col_idx in feature_indices {
// Collect (value, grad, hess) and sort by value
let mut tuples: Vec<(f32, f32, f32)> = indices
.iter()
.map(|&i| (data.get(i, col_idx), gradients[i], hessians[i]))
.collect();
tuples.sort_by(|a, b| a.0.total_cmp(&b.0));
// Scan left to right, accumulating left statistics
let mut left_stats = SplitStats::default();
for i in 0..tuples.len() - 1 {
let (value, grad, hess) = tuples[i];
left_stats.sum_grad += grad;
left_stats.sum_hess += hess;
left_stats.count += 1;
let right_stats = parent_stats - left_stats;
// Skip if same value (can't split between identical values)
if tuples[i + 1].0 == value {
continue;
}
// Check min_child_weight constraint
if left_stats.sum_hess < self.min_child_weight
|| right_stats.sum_hess < self.min_child_weight {
continue;
}
let gain = self.calculate_split_gain(
&left_stats, &right_stats, &parent_stats
);
if gain > best_split.as_ref().map_or(0.0, |s| s.gain) {
best_split = Some(SplitCandidate {
feature: col_idx,
threshold: (value + tuples[i + 1].0) / 2.0,
gain,
left_stats,
right_stats,
});
}
}
}
best_split
}
Why sum gradients? Each sample contributes independently to the loss. The total loss is $\sum_i L_i$, so the total gradient is $\sum_i g_i$. Samples in the same leaf get the same weight, so we optimize for their collective gradient.
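The entire scan can be demonstrated on a single feature. This toy version (λ = 1, γ = 0, no min_child_weight, `(value, grad, hess)` tuples assumed pre-sorted by value) finds the boundary between the two gradient clusters:

```rust
/// Best split over one sorted feature: returns (threshold, gain), if any.
fn best_split_1d(tuples: &[(f32, f32, f32)], lambda: f32) -> Option<(f32, f32)> {
    let score = |g: f32, h: f32| g * g / (h + lambda);
    // Parent statistics: sums over all samples
    let (pg, ph) = tuples.iter().fold((0.0, 0.0), |(g, h), t| (g + t.1, h + t.2));
    let (mut lg, mut lh) = (0.0f32, 0.0f32);
    let mut best: Option<(f32, f32)> = None;
    for i in 0..tuples.len().saturating_sub(1) {
        let (value, grad, hess) = tuples[i];
        lg += grad; // Move sample i into the left partition
        lh += hess;
        if tuples[i + 1].0 == value {
            continue; // Can't split between identical values
        }
        // Right stats are parent minus left; gain per the formula above
        let gain = 0.5 * (score(lg, lh) + score(pg - lg, ph - lh) - score(pg, ph));
        if gain > best.map_or(0.0, |b| b.1) {
            best = Some(((value + tuples[i + 1].0) / 2.0, gain));
        }
    }
    best
}
```

Four samples, negative gradients at values 1–2 and positive at 10–11, split cleanly at the midpoint 6.0.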
4. Building the Tree Recursively
fn build_tree_impl(
&self,
data: &DMatrix,
indices: &[usize],
gradients: &[f32],
hessians: &[f32],
feature_indices: &[usize],
depth: usize,
) -> Tree {
let mut tree = Tree::new(data.n_cols);
let sum_grad: f32 = indices.iter().map(|&i| gradients[i]).sum();
let sum_hess: f32 = indices.iter().map(|&i| hessians[i]).sum();
// Base case: max depth or too few samples
if depth >= self.max_depth || indices.len() < 2 {
tree.nodes.push(TreeNode {
is_leaf: true,
weight: self.calculate_leaf_weight(sum_grad, sum_hess),
..Default::default()
});
return tree;
}
// Try to find a split
let Some(split) = self.find_best_split(
data, indices, gradients, hessians, feature_indices
) else {
// No valid split → make leaf
tree.nodes.push(TreeNode {
is_leaf: true,
weight: self.calculate_leaf_weight(sum_grad, sum_hess),
..Default::default()
});
return tree;
};
// Partition samples (strict <, matching predict_single's go-left test)
let (left_idx, right_idx): (Vec<_>, Vec<_>) = indices
.iter()
.copied()
.partition(|&i| data.get(i, split.feature) < split.threshold);
// Recurse
let left_tree = self.build_tree_impl(
data, &left_idx, gradients, hessians, feature_indices, depth + 1
);
let right_tree = self.build_tree_impl(
data, &right_idx, gradients, hessians, feature_indices, depth + 1
);
// Merge subtrees into flat array
// [root] [left_subtree...] [right_subtree...]
let left_offset = 1;
let right_offset = 1 + left_tree.nodes.len();
tree.nodes.push(TreeNode {
is_leaf: false,
split_feature: split.feature,
split_value: split.threshold,
left_child: left_offset,
right_child: right_offset,
gain: split.gain,
..Default::default()
});
// Append children with adjusted indices
for mut node in left_tree.nodes {
if !node.is_leaf {
node.left_child += left_offset;
node.right_child += left_offset;
}
tree.nodes.push(node);
}
for mut node in right_tree.nodes {
if !node.is_leaf {
node.left_child += right_offset;
node.right_child += right_offset;
}
tree.nodes.push(node);
}
tree
}
5. The Training Loop
flowchart TD
Init["Initialize<br/>preds = [base_score; n_samples]<br/>Pre-allocate grad/hess buffers"]
Init --> Round
subgraph Round ["Each Boosting Round"]
Sample["Row Subsampling<br/>(if subsample < 1.0)"]
Sample --> Grad["Compute Gradients<br/>g[i] = objective.gradient(pred[i], y[i])"]
Grad --> Hess["Compute Hessians<br/>h[i] = objective.hessian(pred[i], y[i])"]
Hess --> Build["Build Tree<br/>(with column subsampling)"]
Build --> Update["Update Predictions<br/>pred[i] += lr × tree.predict(x[i])"]
end
Round --> |"Repeat n_estimators times"| Round
Round --> Done["Return trained model"]
pub fn train(&mut self, data: &DMatrix) -> Result<()> {
let labels = data.labels.as_ref()
.ok_or_else(|| Error::Training("No labels".into()))?;
let mut preds = vec![self.params.base_score; data.n_rows];
let builder = TreeBuilder::new(&self.params);
// Pre-allocate buffers (reused each round)
let mut rng = rand::rng();
let mut all_indices: Vec<usize> = (0..data.n_rows).collect();
let mut grads = vec![0.0f32; data.n_rows];
let mut hessians = vec![0.0f32; data.n_rows];
let n_samples = if self.params.subsample < 1.0 {
((data.n_rows as f32 * self.params.subsample) as usize).max(1)
} else {
data.n_rows
};
for round in 0..self.params.n_estimators {
// Row subsampling with partial shuffle - O(k) not O(n)
let indices: Option<&[usize]> = if self.params.subsample < 1.0 {
partial_shuffle(&mut all_indices, n_samples, &mut rng);
Some(&all_indices[..n_samples])
} else {
None
};
// Compute gradients/Hessians in-place
for ((&pred, &label), (g, h)) in preds.iter()
.zip(labels.iter())
.zip(grads.iter_mut().zip(hessians.iter_mut()))
{
*g = self.objective.gradient(pred, label);
*h = self.objective.hessian(pred, label);
}
// Build tree (column subsampling happens inside)
let tree = builder.build_tree(data, indices, &grads, &hessians);
// Update predictions
let tree_preds = tree.predict(data);
for (p, tp) in preds.iter_mut().zip(tree_preds.iter()) {
*p += self.params.learning_rate * tp;
}
self.trees.push(tree);
}
Ok(())
}
Optimization: Partial Shuffle
For row subsampling, we don’t need a full shuffle. If we want k samples, we only need to randomize k positions:
fn partial_shuffle<T, R: Rng>(slice: &mut [T], k: usize, rng: &mut R) {
let len = slice.len();
for i in 0..k.min(len) {
let j = rng.random_range(i..len);
slice.swap(i, j);
}
}
This is O(k) instead of O(n)—significant when subsampling 10% of a large dataset.
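The same routine with a tiny deterministic xorshift generator (a stand-in for `rand::Rng`, purely for illustration) shows the invariant: after the call, the first k slots hold k distinct elements of the original slice, and the slice as a whole is still a permutation.

```rust
/// Minimal xorshift64 generator; stands in for rand::Rng in this sketch.
struct XorShift64(u64); // Seed must be nonzero

impl XorShift64 {
    fn gen_range(&mut self, lo: usize, hi: usize) -> usize {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        lo + (self.0 % (hi - lo) as u64) as usize
    }
}

/// Fisher–Yates, stopped after k steps: only the first k slots are randomized.
fn partial_shuffle<T>(slice: &mut [T], k: usize, rng: &mut XorShift64) {
    let len = slice.len();
    for i in 0..k.min(len) {
        let j = rng.gen_range(i, len); // j ∈ [i, len)
        slice.swap(i, j);
    }
}
```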
6. XGBoost-Compatible Serialization
Our trees serialize to XGBoost’s exact JSON format:
fn tree_to_xgboost_json(tree: &Tree, tree_id: usize) -> Value {
let n_nodes = tree.nodes.len();
let mut left_children = Vec::with_capacity(n_nodes);
let mut right_children = Vec::with_capacity(n_nodes);
let mut split_indices = Vec::with_capacity(n_nodes);
let mut split_conditions = Vec::with_capacity(n_nodes);
// ... etc
for node in &tree.nodes {
if node.is_leaf {
left_children.push(-1); // XGBoost uses -1 for leaves
right_children.push(-1);
} else {
left_children.push(node.left_child as i32);
right_children.push(node.right_child as i32);
}
// ...
}
json!({
"left_children": left_children,
"right_children": right_children,
"split_indices": split_indices,
"split_conditions": split_conditions,
// ...
})
}
This enables perfect interoperability: train in Rust, load in Python’s XGBoost, and vice versa.
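To see the shape of the output without pulling in `serde_json`, the child arrays can be assembled by hand; the −1 sentinel for leaves is the key convention:

```rust
/// Build left/right child arrays in XGBoost's convention:
/// internal nodes store child indices, leaves store -1 in both slots.
fn child_arrays(is_leaf: &[bool], left: &[usize], right: &[usize]) -> (Vec<i32>, Vec<i32>) {
    let mut lc = Vec::with_capacity(is_leaf.len());
    let mut rc = Vec::with_capacity(is_leaf.len());
    for i in 0..is_leaf.len() {
        if is_leaf[i] {
            lc.push(-1);
            rc.push(-1);
        } else {
            lc.push(left[i] as i32);
            rc.push(right[i] as i32);
        }
    }
    (lc, rc)
}
```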
Putting It All Together
graph LR
subgraph Input
CSV["CSV Data"]
end
subgraph Training
DM["DMatrix"]
GB["GradientBooster"]
OBJ["Objective"]
TB["TreeBuilder"]
end
subgraph Output
JSON["XGBoost JSON"]
PRED["Predictions"]
end
CSV --> DM
DM --> GB
OBJ --> GB
GB --> TB
TB --> GB
GB --> JSON
GB --> PRED
Example Usage
use rgboost::{DMatrix, GradientBooster, TrainingParams};
fn main() -> Result<()> {
// Load data (`labels` is the target vector, prepared beforehand)
let data = DMatrix::from_csv("train.csv", true)?
.with_labels(labels)?;
// Configure
let params = TrainingParams {
n_estimators: 100,
max_depth: 6,
learning_rate: 0.3,
objective: "reg:squarederror".into(),
..Default::default()
};
// Train
let mut booster = GradientBooster::new(params);
booster.train(&data)?;
// Predict
let predictions = booster.predict(&test_data);
// Save (XGBoost-compatible)
booster.save_model("model.json")?;
Ok(())
}
Key Implementation Insights
1. Gradients Are Per-Sample, Summed Per-Node
Sample i: g_i = ∂L_i/∂ŷ_i (how wrong is this prediction?)
Node: G = Σ g_i (total wrongness in this node)
The sum works because loss is additive and all samples in a leaf get the same weight.
2. Hessians Act as Weights
For logistic loss, $h = p(1-p)$. Confident predictions (p near 0 or 1) have low Hessian, so they influence splits less. The algorithm naturally focuses on hard examples.
3. Regularization Everywhere
- λ (reg_lambda): L2 on leaf weights → prevents extreme values
- α (reg_alpha): L1 on leaf weights → encourages sparsity
- γ (gamma): Minimum gain → prevents trivial splits
- max_depth: Structural constraint
- min_child_weight: Minimum Hessian sum → prevents tiny leaves
4. Column Subsampling Per Tree
let feature_indices: Vec<usize> = if self.colsample_bytree < 1.0 {
// Random subset of features for this tree
let k = ((data.n_cols as f32 * self.colsample_bytree) as usize).max(1);
partial_shuffle(&mut all_features, k, &mut rng);
all_features[..k].to_vec()
} else {
(0..data.n_cols).collect()
};
Each tree sees only a random subset of features. This decorrelates trees and improves generalization.
What’s Missing (Future Work)
- Multiclass classification: Requires K trees per round for K classes
- Histogram-based splitting: O(n) instead of O(n log n) per feature
- GPU acceleration: Parallel split finding
- Early stopping: Validation-based termination
But the core algorithm—the part that actually makes XGBoost work—is complete and compatible.
Part 2 of the “XGBoost from Scratch” series. Part 1 covers theory, Part 3 covers scaling to terabytes.