Deep Learning from Scratch in Rust, Part 2 — Layers, Models, and Loss
In Part 1, we built tensor autodiff — gradients flow through multi-dimensional arrays with broadcasting and reductions handled correctly. But we still don’t have a neural network.
What’s missing? The building blocks: layers that encapsulate learnable parameters, models that compose layers, and loss functions that define what “correct” means.
Today we bridge the gap from “autodiff engine” to “trainable model.”
Variables vs Constants
Not all tensors are equal. Some hold input data, for which we never need gradients; others hold model weights, whose gradients drive learning. The distinction is simple:
// Variable: tracked for gradients
let weight = Tensor::var("weight", CpuBackend::from_vec(data, shape));
// Constant: not tracked
let input = Tensor::constant(CpuBackend::from_vec(data, shape));
Tensor::var() creates a named variable node in the computation graph. When we call backward(), we get gradients for all variables that influenced the output.
The Linear Layer
The most fundamental layer: a fully-connected (dense) layer.
\[y = xW^T + b\]

Where:

- $x$: input of shape [batch, in_features]
- $W$: weight matrix of shape [out_features, in_features]
- $b$: bias vector of shape [out_features]
- $y$: output of shape [batch, out_features]
graph LR
subgraph Input
x["x<br/>[batch, in_features]"]
end
subgraph Parameters
W["W<br/>[out, in]"]
b["b<br/>[out]"]
end
subgraph Operations
matmul["matmul<br/>x @ Wᵀ"]
add["add<br/>+ bias"]
end
subgraph Output
y["y<br/>[batch, out_features]"]
end
x --> matmul
W --> matmul
matmul --> add
b --> add
add --> y
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef param fill:none,stroke:#a78bfa,stroke-width:2px
classDef output fill:none,stroke:#34d399,stroke-width:2px
class x input
class W,b param
class y output
Each input connects to every output through learned weights — that’s why it’s called “fully connected.”
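To make the shape bookkeeping concrete before the tensor version, here is the same computation written against bare slices (an illustration only; `linear_forward` is an ad-hoc helper, not part of the library):

```rust
// x is [batch, in], W is row-major [out, in], so x @ Wᵀ is [batch, out].
fn linear_forward(
    x: &[f64], batch: usize, in_f: usize,
    w: &[f64], out_f: usize, // row-major [out_f, in_f]
    b: &[f64],               // [out_f]
) -> Vec<f64> {
    let mut y = vec![0.0; batch * out_f];
    for i in 0..batch {
        for o in 0..out_f {
            let mut acc = b[o];
            for k in 0..in_f {
                // Row o of W is column o of Wᵀ, so no explicit transpose needed
                acc += x[i * in_f + k] * w[o * in_f + k];
            }
            y[i * out_f + o] = acc;
        }
    }
    y
}

fn main() {
    // batch = 1, in = 2, out = 2, W = identity: y = x + b
    let y = linear_forward(&[1.0, 2.0], 1, 2, &[1.0, 0.0, 0.0, 1.0], 2, &[0.5, -0.5]);
    assert_eq!(y, vec![1.5, 1.5]);
}
```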
use ad_backend_cpu::CpuBackend;
use ad_tensor::prelude::*;
use rand::Rng;
pub struct Linear {
/// Weight matrix [out_features, in_features]
pub weight: Tensor<CpuBackend>,
/// Bias vector [out_features]
pub bias: Option<Tensor<CpuBackend>>,
}
impl Linear {
pub fn new(in_features: usize, out_features: usize, bias: bool) -> Self {
let mut rng = rand::thread_rng();
// Kaiming uniform init: bound = sqrt(6 / fan_in), so Var(W) = bound^2/3 = 2 / fan_in
let bound = (6.0 / in_features as f32).sqrt();
let weight_data: Vec<f32> = (0..out_features * in_features)
.map(|_| rng.gen::<f32>() * 2.0 * bound - bound)
.collect();
let weight = Tensor::var(
"weight",
CpuBackend::from_vec(weight_data, Shape::new(vec![out_features, in_features])),
);
let bias = if bias {
Some(Tensor::var(
"bias",
CpuBackend::from_vec(vec![0.0; out_features], Shape::new(vec![out_features])),
))
} else {
None
};
Linear { weight, bias }
}
pub fn forward(&self, x: &Tensor<CpuBackend>) -> Tensor<CpuBackend> {
// x @ W^T
let y = x.matmul(&self.weight.t());
// Add bias if present
match &self.bias {
Some(bias) => &y + bias,
None => y,
}
}
pub fn parameters(&self) -> Vec<&Tensor<CpuBackend>> {
let mut params = vec![&self.weight];
if let Some(ref b) = self.bias {
params.push(b);
}
params
}
}
Notice: the layer is tied to CpuBackend because initialization uses rand. The forward pass itself would work with any backend, but creating random weights requires CPU access. For GPU training, you’d initialize on CPU then transfer.
Why Kaiming Initialization?
Bad initialization kills training. If weights are too large, activations explode. Too small, gradients vanish.
Kaiming (He) initialization is designed for ReLU networks:
\[W \sim \mathcal{N}\left(0,\ \sigma^2\right), \quad \sigma = \sqrt{\frac{2}{n_{in}}}\]

The factor of $2$ in the variance ($\sqrt{2}$ in $\sigma$) accounts for ReLU zeroing half the values. This keeps variance stable as signals propagate through layers.
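A quick sanity check that the scale comes out right, in standalone Rust (a tiny LCG stands in for the rand crate, and `kaiming_uniform`/`variance` are ad-hoc helpers, not part of the library): sampling uniformly from $[-b, b]$ with $b = \sqrt{6/n_{in}}$ yields the Kaiming target variance $2/n_{in}$.

```rust
fn lcg_uniform(state: &mut u64) -> f64 {
    // Minimal LCG (MMIX constants); returns a value in [0, 1)
    *state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    (*state >> 11) as f64 / (1u64 << 53) as f64
}

/// Kaiming-uniform sample: b = sqrt(6 / fan_in) makes Var(W) = b^2/3 = 2 / fan_in
fn kaiming_uniform(fan_in: usize, n: usize, seed: u64) -> Vec<f64> {
    let b = (6.0 / fan_in as f64).sqrt();
    let mut state = seed;
    (0..n).map(|_| lcg_uniform(&mut state) * 2.0 * b - b).collect()
}

fn variance(xs: &[f64]) -> f64 {
    let mean = xs.iter().sum::<f64>() / xs.len() as f64;
    xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / xs.len() as f64
}

fn main() {
    let fan_in = 256;
    let w = kaiming_uniform(fan_in, 200_000, 42);
    let target = 2.0 / fan_in as f64;
    println!("target variance {:.6}, empirical {:.6}", target, variance(&w));
    assert!((variance(&w) - target).abs() / target < 0.05);
}
```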
Activation Functions
Activations introduce non-linearity. Without them, stacking linear layers is pointless — the composition of linear functions is linear.
Notice how each activation squashes or clips the input differently:
- ReLU: Zero for negatives, linear for positives. Simple, fast, but “dead neurons” can occur.
- Sigmoid: Squashes to (0, 1). Good for probabilities, but gradients vanish at extremes.
- Tanh: Squashes to (-1, 1). Zero-centered, but same vanishing gradient problem.
Activations are pure tensor operations — they work with any backend. We implement them as both tensor methods and standalone functions:
// In your tensor implementation
impl<B: Backend> Tensor<B> {
pub fn relu(&self) -> Tensor<B> {
// max(0, x)
self.maximum(&Tensor::zeros(self.shape()))
}
pub fn sigmoid(&self) -> Tensor<B> {
// 1 / (1 + exp(-x))
let one = Tensor::ones(self.shape());
&one / &(&one + (-self).exp())
}
pub fn tanh(&self) -> Tensor<B> {
// Use the backend's native tanh if it has one; otherwise:
// (exp(x) - exp(-x)) / (exp(x) + exp(-x))
let e_pos = self.exp();
let e_neg = (-self).exp();
&(&e_pos - &e_neg) / &(&e_pos + &e_neg)
}
}
Each activation needs a backward implementation in the autodiff engine:
TensorOp::ReLU => {
// d/dx ReLU(x) = 1 if x > 0, else 0
let mask = B::gt(children[0].data(), &B::scalar(0.0));
vec![Some(B::mul(upstream_grad, &mask))]
}
TensorOp::Sigmoid => {
// d/dx σ(x) = σ(x)(1 - σ(x))
let s = output.data();
let one_minus_s = B::sub(&B::ones(s.shape()), s);
let local_grad = B::mul(s, &one_minus_s);
vec![Some(B::mul(upstream_grad, &local_grad))]
}
TensorOp::Tanh => {
// d/dx tanh(x) = 1 - tanh²(x)
let t = output.data();
let t_sq = B::mul(t, t);
let local_grad = B::sub(&B::ones(t.shape()), &t_sq);
vec![Some(B::mul(upstream_grad, &local_grad))]
}
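These local gradients are easy to verify numerically. A scalar sketch in plain Rust, checking each formula against a central finite difference (`num_deriv` is an ad-hoc helper, not part of the library):

```rust
fn num_deriv(f: impl Fn(f64) -> f64, x: f64) -> f64 {
    let eps = 1e-6;
    (f(x + eps) - f(x - eps)) / (2.0 * eps) // central difference
}

fn main() {
    let sigmoid = |x: f64| 1.0 / (1.0 + (-x).exp());
    // Points chosen away from ReLU's kink at 0, where the derivative is undefined
    for &x in &[-2.0, -0.5, 0.3, 1.7] {
        // ReLU: derivative is the 0/1 mask
        let relu_grad = if x > 0.0 { 1.0 } else { 0.0 };
        assert!((num_deriv(|v: f64| v.max(0.0), x) - relu_grad).abs() < 1e-6);
        // Sigmoid: sigma'(x) = sigma(x)(1 - sigma(x))
        assert!((num_deriv(sigmoid, x) - sigmoid(x) * (1.0 - sigmoid(x))).abs() < 1e-6);
        // Tanh: 1 - tanh^2(x)
        assert!((num_deriv(f64::tanh, x) - (1.0 - x.tanh().powi(2))).abs() < 1e-6);
    }
    println!("all three local gradients match finite differences");
}
```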
Log-Softmax for Classification
Softmax converts logits to probabilities. But we rarely use softmax directly — we use log-softmax for numerical stability:
pub fn log_softmax<B: Backend>(logits: &Tensor<B>) -> Tensor<B> {
// log(softmax(x)) = x - log(sum(exp(x)))
// With max-subtraction for stability:
let ndim = logits.ndim();
let axis = if ndim > 0 { ndim - 1 } else { 0 };
let max_logits = logits.max(Some(&[axis]), true);
let shifted = logits - &max_logits;
let log_sum_exp = shifted.exp().sum(Some(&[axis]), true).log();
shifted - log_sum_exp
}
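The max-subtraction is what makes this safe. A scalar sketch in plain Rust (standalone helpers, not the tensor version above) shows the failure mode: with logits around 1000, the naive form overflows, while the shifted form returns the exact answer.

```rust
// Naive: log(sum(exp(x))) computed directly; exp overflows for large logits
fn log_softmax_naive(logits: &[f64]) -> Vec<f64> {
    let log_sum = logits.iter().map(|x| x.exp()).sum::<f64>().ln();
    logits.iter().map(|x| x - log_sum).collect()
}

// Stable: subtract the max before exponentiating
fn log_softmax_stable(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let log_sum = logits.iter().map(|x| (x - max).exp()).sum::<f64>().ln();
    logits.iter().map(|x| x - max - log_sum).collect()
}

fn main() {
    let logits = [1000.0, 1000.0];
    // exp(1000) overflows to infinity, so every naive entry collapses to -inf
    println!("naive:  {:?}", log_softmax_naive(&logits));
    // Shifting by the max gives the exact answer, -ln 2 for each entry
    println!("stable: {:?}", log_softmax_stable(&logits));
    assert_eq!(log_softmax_naive(&logits)[0], f64::NEG_INFINITY);
    assert!((log_softmax_stable(&logits)[0] + std::f64::consts::LN_2).abs() < 1e-12);
}
```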
Building Models
Without a formal Module trait, we compose layers manually. This is actually clearer:
graph LR
subgraph "Input"
x["x<br/>[batch, input_dim]"]
end
subgraph "Hidden Layer"
l1["Linear<br/>(input → hidden)"]
relu["ReLU"]
end
subgraph "Output Layer"
l2["Linear<br/>(hidden → output)"]
end
subgraph "Output"
y["y<br/>[batch, output_dim]"]
end
x --> l1 --> relu --> l2 --> y
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef layer fill:none,stroke:#a78bfa,stroke-width:2px
classDef activation fill:none,stroke:#f472b6,stroke-width:2px
classDef output fill:none,stroke:#34d399,stroke-width:2px
class x input
class l1,l2 layer
class relu activation
class y output
Data flows left-to-right: input → linear transform → non-linearity → linear transform → output. This is the simplest multi-layer perceptron (MLP).
pub struct MLP {
l1: Linear,
l2: Linear,
}
impl MLP {
pub fn new(input_dim: usize, hidden_dim: usize, output_dim: usize) -> Self {
MLP {
l1: Linear::new(input_dim, hidden_dim, true),
l2: Linear::new(hidden_dim, output_dim, true),
}
}
pub fn forward(&self, x: &Tensor<CpuBackend>) -> Tensor<CpuBackend> {
let h = self.l1.forward(x).relu();
self.l2.forward(&h)
}
pub fn parameters(&self) -> Vec<&Tensor<CpuBackend>> {
let mut params = self.l1.parameters();
params.extend(self.l2.parameters());
params
}
}
No traits, no dynamic dispatch, no Box<dyn Module>. Just structs and methods. The Rust compiler can inline everything.
Loss Functions
Loss functions measure how wrong predictions are. They’re the starting point of backpropagation.
Unlike layers (which need random initialization), loss functions are pure tensor operations — they work with any backend.
Mean Squared Error (MSE)
For regression tasks:
\[\mathcal{L}_{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

pub fn mse_loss<B: Backend>(pred: &Tensor<B>, target: &Tensor<B>) -> Tensor<B> {
let diff = pred - target;
(&diff * &diff).mean(None, false)
}
The gradient is straightforward: $\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$
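That gradient is easy to check with finite differences; a scalar sketch in plain Rust (`mse` is a standalone helper mirroring `mse_loss`):

```rust
fn mse(pred: &[f64], target: &[f64]) -> f64 {
    let n = pred.len() as f64;
    pred.iter().zip(target).map(|(p, t)| (p - t).powi(2)).sum::<f64>() / n
}

fn main() {
    let target = [1.0, 0.0, 0.5];
    let pred = [0.8, 0.3, 0.9];
    let n = pred.len() as f64;
    let eps = 1e-6;
    for i in 0..pred.len() {
        let mut bumped = pred;
        bumped[i] += eps;
        // Forward difference vs the analytic gradient 2/n * (pred_i - target_i)
        let numeric = (mse(&bumped, &target) - mse(&pred, &target)) / eps;
        let analytic = 2.0 / n * (pred[i] - target[i]);
        assert!((numeric - analytic).abs() < 1e-4);
    }
    println!("MSE gradient check passed");
}
```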
Binary Cross-Entropy with Logits
For binary classification, we take raw logits (pre-sigmoid) for numerical stability:
pub fn binary_cross_entropy_with_logits<B: Backend>(
logits: &Tensor<B>,
targets: &Tensor<B>,
) -> Tensor<B> {
// Numerically stable: max(logits, 0) - logits * targets + log(1 + exp(-|logits|))
let relu_logits = logits.relu();
let logits_targets = logits * targets;
let abs_logits = logits.maximum(&(-logits));
let one = Tensor::<B>::ones(logits.shape());
let log_term = (&one + (-&abs_logits).exp()).log();
let loss = relu_logits - logits_targets + log_term;
loss.mean(None, false)
}
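A scalar sketch in plain Rust (with ad-hoc `bce_stable`/`bce_naive` helpers) confirms that this form equals the textbook $-[y \ln \sigma(z) + (1-y) \ln(1-\sigma(z))]$ for moderate logits, and stays finite where the naive form does not:

```rust
fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

// Stable form: max(z, 0) - z*y + ln(1 + exp(-|z|))
fn bce_stable(z: f64, y: f64) -> f64 {
    z.max(0.0) - z * y + (1.0 + (-z.abs()).exp()).ln()
}

// Textbook form: -[y ln(sigmoid(z)) + (1 - y) ln(1 - sigmoid(z))]
fn bce_naive(z: f64, y: f64) -> f64 {
    let s = sigmoid(z);
    -(y * s.ln() + (1.0 - y) * (1.0 - s).ln())
}

fn main() {
    // The two forms agree for moderate logits...
    for &(z, y) in &[(0.5, 1.0), (-1.3, 0.0), (2.0, 1.0)] {
        assert!((bce_stable(z, y) - bce_naive(z, y)).abs() < 1e-12);
    }
    // ...but at z = 100, y = 0 the naive form hits ln(1 - sigmoid(100)) = ln(0)
    assert!(bce_naive(100.0, 0.0).is_infinite());
    assert!(bce_stable(100.0, 0.0).is_finite()); // equals 100 + ln(1 + e^-100), ~100
    println!("stable BCE matches the textbook form and survives large logits");
}
```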
Soft Cross-Entropy Loss
For multi-class classification with soft labels (probabilities or one-hot):
pub fn soft_cross_entropy_loss<B: Backend>(
logits: &Tensor<B>, // [batch, num_classes]
targets: &Tensor<B>, // [batch, num_classes] probabilities
) -> Tensor<B> {
let log_probs = log_softmax(logits);
// -sum(targets * log_probs) over classes, mean over batch
let neg_log_probs = -(targets * &log_probs);
neg_log_probs.sum(Some(&[1]), false).mean(None, false)
}
The beautiful gradient: for softmax + cross-entropy, $\frac{\partial \mathcal{L}}{\partial z_i} = \hat{y}_i - y_i$ (prediction minus target).
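The claim is easy to verify with finite differences. A scalar sketch in plain Rust (standalone `softmax`/`cross_entropy` helpers, not the tensor versions above):

```rust
fn softmax(z: &[f64]) -> Vec<f64> {
    let max = z.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = z.iter().map(|x| (x - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

// L = -sum_i y_i * log(softmax(z)_i)
fn cross_entropy(z: &[f64], y: &[f64]) -> f64 {
    softmax(z).iter().zip(y).map(|(p, t)| -t * p.ln()).sum()
}

fn main() {
    let z = [2.0, -1.0, 0.5];
    let y = [0.0, 1.0, 0.0]; // one-hot target
    let p = softmax(&z);
    let eps = 1e-6;
    for i in 0..z.len() {
        let mut bumped = z;
        bumped[i] += eps;
        // Numeric dL/dz_i should match softmax(z)_i - y_i
        let numeric = (cross_entropy(&bumped, &y) - cross_entropy(&z, &y)) / eps;
        assert!((numeric - (p[i] - y[i])).abs() < 1e-5);
    }
    println!("softmax + cross-entropy gradient is softmax(z) - y");
}
```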
Visualizing Loss Landscapes
Different losses create different optimization landscapes. Cross-entropy has steeper gradients for wrong predictions, driving faster learning.
Key difference:
- MSE (through a sigmoid output): the chain-rule factor $\sigma'(z)$ vanishes as the output saturates, so a confident-but-wrong prediction near 0 receives almost no gradient. Training stalls.
- Cross-entropy: the gradient with respect to the predicted probability is $-1/p$, which grows without bound as the prediction approaches 0 for a positive target. Strong signal to correct mistakes.
This is why cross-entropy is preferred for classification.
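To put numbers on this, compare the gradients with respect to the logit for a confidently wrong sigmoid prediction (a scalar sketch in plain Rust; the helper names are ad hoc):

```rust
fn mse_logit_grad(p: f64, y: f64) -> f64 {
    // d/dz of (p - y)^2 with p = sigmoid(z): the chain rule pulls in p(1 - p)
    2.0 * (p - y) * p * (1.0 - p)
}

fn bce_logit_grad(p: f64, y: f64) -> f64 {
    // d/dz of binary cross-entropy with p = sigmoid(z): the p(1 - p) cancels
    p - y
}

fn main() {
    let (p, y) = (0.01, 1.0); // confidently wrong prediction
    println!("MSE grad: {:.4}", mse_logit_grad(p, y)); // tiny: almost no signal
    println!("BCE grad: {:.4}", bce_logit_grad(p, y)); // large: strong correction
    assert!(bce_logit_grad(p, y).abs() > 40.0 * mse_logit_grad(p, y).abs());
}
```

At $p = 0.01$ the cross-entropy logit gradient is roughly 50 times larger than the MSE one, which is the "training stalls" vs "strong signal" difference in concrete terms.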
When to Use Which Loss
| Task | Output | Activation | Loss |
|---|---|---|---|
| Regression | Continuous values | None (linear) | mse_loss |
| Binary classification | Probability | Sigmoid | binary_cross_entropy_with_logits |
| Multi-class (single label) | Class probabilities | Softmax | soft_cross_entropy_loss |
Putting It Together
A complete training step:
use ad_tensor::prelude::*;
use ad_backend_cpu::CpuBackend;
use ad_nn::{Linear, mse_loss, Adam};
// Create a simple network
let mut l1 = Linear::new(2, 8, true);
let mut l2 = Linear::new(8, 1, true);
let mut opt = Adam::new(0.01);
// Training data: XOR problem
let inputs = vec![
vec![0.0, 0.0], vec![0.0, 1.0],
vec![1.0, 0.0], vec![1.0, 1.0],
];
let targets = vec![0.0, 1.0, 1.0, 0.0];
for epoch in 0..1000 {
for (input, &target) in inputs.iter().zip(&targets) {
// Forward pass
let x = Tensor::constant(CpuBackend::from_vec(input.clone(), Shape::new(vec![1, 2])));
let y = Tensor::constant(CpuBackend::from_vec(vec![target], Shape::new(vec![1, 1])));
let h = l1.forward(&x).relu();
let pred = l2.forward(&h);
let loss = mse_loss(&pred, &y);
// Backward pass
let grads = loss.backward();
// Update parameters
opt.step(&mut l1.weight, grads.wrt(&l1.weight).unwrap());
if let Some(ref mut bias) = l1.bias {
opt.step(bias, grads.wrt(bias).unwrap());
}
opt.step(&mut l2.weight, grads.wrt(&l2.weight).unwrap());
if let Some(ref mut bias) = l2.bias {
opt.step(bias, grads.wrt(bias).unwrap());
}
}
}
The gradient for every parameter flows automatically through the computation graph — from loss, through the layers, to the weights.
graph LR
subgraph "Forward Pass →"
direction LR
x1["Input"] --> h1["Hidden"] --> o1["Output"] --> loss1["Loss"]
end
subgraph "← Backward Pass"
direction RL
loss2["∂L/∂L = 1"] --> o2["∂L/∂output"] --> h2["∂L/∂hidden"] --> w2["∂L/∂weights"]
end
loss1 -.-> loss2
classDef forward fill:none,stroke:#60a5fa,stroke-width:2px
classDef backward fill:none,stroke:#f472b6,stroke-width:2px
class x1,h1,o1,loss1 forward
class loss2,o2,h2,w2 backward
What’s Next
We have models with parameters and loss functions that produce gradients. But gradients alone don’t train anything. We need optimizers to turn gradients into parameter updates.
Part 3 implements SGD, Momentum, and Adam — the algorithms that make learning happen.
Part 2 of the “Deep Learning from Scratch in Rust” series. Part 1 covers tensor gradients, Part 3 covers optimizers.