In Part 1, we built tensor autodiff — gradients flow through multi-dimensional arrays with broadcasting and reductions handled correctly. But we still don’t have a neural network.

What’s missing? The building blocks: layers that encapsulate learnable parameters, models that compose layers, and loss functions that define what “correct” means.

Today we bridge the gap from “autodiff engine” to “trainable model.”

Variables vs Constants

Not all tensors are equal. Some hold input data (fixed during backward pass), others hold model weights (we need their gradients). The distinction is simple:

// Variable: tracked for gradients
let weight = Tensor::var("weight", CpuBackend::from_vec(data, shape));

// Constant: not tracked
let input = Tensor::constant(CpuBackend::from_vec(data, shape));

Tensor::var() creates a named variable node in the computation graph. When we call backward(), we get gradients for all variables that influenced the output.

The Linear Layer

The most fundamental layer: a fully-connected (dense) layer.

\[y = xW^T + b\]

Where:

  • $x$: input of shape [batch, in_features]
  • $W$: weight matrix of shape [out_features, in_features]
  • $b$: bias vector of shape [out_features]
  • $y$: output of shape [batch, out_features]
graph LR
    subgraph Input
        x["x<br/>[batch, in_features]"]
    end

    subgraph Parameters
        W["W<br/>[out, in]"]
        b["b<br/>[out]"]
    end

    subgraph Operations
        matmul["matmul<br/>x @ Wᵀ"]
        add["add<br/>+ bias"]
    end

    subgraph Output
        y["y<br/>[batch, out_features]"]
    end

    x --> matmul
    W --> matmul
    matmul --> add
    b --> add
    add --> y

    classDef input fill:none,stroke:#60a5fa,stroke-width:2px
    classDef param fill:none,stroke:#a78bfa,stroke-width:2px
    classDef output fill:none,stroke:#34d399,stroke-width:2px
    class x input
    class W,b param
    class y output

Each input connects to every output through learned weights — that’s why it’s called “fully connected.”

use ad_backend_cpu::CpuBackend;
use ad_tensor::prelude::*;
use rand::Rng;

pub struct Linear {
    /// Weight matrix [out_features, in_features]
    pub weight: Tensor<CpuBackend>,
    /// Bias vector [out_features]
    pub bias: Option<Tensor<CpuBackend>>,
}

impl Linear {
    pub fn new(in_features: usize, out_features: usize, bias: bool) -> Self {
        let mut rng = rand::thread_rng();

        // Kaiming-style init: uniform on [-std, std] with std = sqrt(2 / fan_in)
        let std = (2.0 / in_features as f32).sqrt();

        let weight_data: Vec<f32> = (0..out_features * in_features)
            .map(|_| rng.gen::<f32>() * std * 2.0 - std)
            .collect();

        let weight = Tensor::var(
            "weight",
            CpuBackend::from_vec(weight_data, Shape::new(vec![out_features, in_features])),
        );

        let bias = if bias {
            Some(Tensor::var(
                "bias",
                CpuBackend::from_vec(vec![0.0; out_features], Shape::new(vec![out_features])),
            ))
        } else {
            None
        };

        Linear { weight, bias }
    }

    pub fn forward(&self, x: &Tensor<CpuBackend>) -> Tensor<CpuBackend> {
        // x @ W^T
        let y = x.matmul(&self.weight.t());

        // Add bias if present
        match &self.bias {
            Some(bias) => &y + bias,
            None => y,
        }
    }

    pub fn parameters(&self) -> Vec<&Tensor<CpuBackend>> {
        let mut params = vec![&self.weight];
        if let Some(ref b) = self.bias {
            params.push(b);
        }
        params
    }
}

Notice: the layer is tied to CpuBackend because initialization uses rand. The forward pass itself would work with any backend, but creating random weights requires CPU access. For GPU training, you’d initialize on CPU then transfer.

Why Kaiming Initialization?

Bad initialization kills training. If weights are too large, activations explode. Too small, gradients vanish.

Kaiming (He) initialization is designed for ReLU networks:

\[W \sim \mathcal{N}\left(0,\; \frac{2}{n_{in}}\right)\]

The variance $2/n_{in}$ carries a factor of 2 to compensate for ReLU zeroing half the values. This keeps the signal variance stable as it propagates through layers.
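To see this numerically, here is a standalone sketch in plain Rust, outside the tensor framework. The `xorshift`/`randn` helpers are just dependency-free stand-ins for a real RNG: pushing a unit-variance input through one Kaiming-initialized ReLU layer should leave the mean-square activation roughly unchanged.

```rust
fn xorshift(state: &mut u64) -> f32 {
    // xorshift64: pseudo-random f32 in [0, 1), no external crates needed.
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state >> 40) as f32 / (1u64 << 24) as f32
}

/// Approximate standard normal: sum of 12 uniforms minus 6 (Irwin-Hall).
fn randn(state: &mut u64) -> f32 {
    (0..12).map(|_| xorshift(state)).sum::<f32>() - 6.0
}

fn mean_sq(xs: &[f32]) -> f32 {
    xs.iter().map(|x| x * x).sum::<f32>() / xs.len() as f32
}

/// Push a unit-variance input through one Kaiming-initialized ReLU layer;
/// returns (input mean-square, output mean-square).
fn relu_layer_mean_sq(fan_in: usize, fan_out: usize) -> (f32, f32) {
    let mut rng = 0x2545F491u64;
    let std = (2.0 / fan_in as f32).sqrt(); // Kaiming: std = sqrt(2 / fan_in)

    let x: Vec<f32> = (0..fan_in).map(|_| randn(&mut rng)).collect();
    let w: Vec<f32> = (0..fan_in * fan_out).map(|_| randn(&mut rng) * std).collect();

    let y: Vec<f32> = (0..fan_out)
        .map(|o| {
            let dot: f32 = (0..fan_in).map(|i| w[o * fan_in + i] * x[i]).sum();
            dot.max(0.0) // ReLU
        })
        .collect();

    (mean_sq(&x), mean_sq(&y))
}

fn main() {
    let (mx, my) = relu_layer_mean_sq(512, 512);
    // With std = sqrt(2 / fan_in), this ratio stays close to 1 layer after layer.
    println!("input mean-square ≈ {mx:.3}, output mean-square ≈ {my:.3}");
}
```

With a smaller factor (Xavier's $1/n_{in}$, say), the output mean-square would shrink by half at every ReLU layer — exactly the slow decay that kills deep networks.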

Activation Functions

Activations introduce non-linearity. Without them, stacking linear layers is pointless — the composition of linear functions is linear.

Notice how each activation squashes or clips the input differently:

  • ReLU: Zero for negatives, linear for positives. Simple, fast, but “dead neurons” can occur.
  • Sigmoid: Squashes to (0, 1). Good for probabilities, but gradients vanish at extremes.
  • Tanh: Squashes to (-1, 1). Zero-centered, but same vanishing gradient problem.
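A quick numeric look at those ranges, in plain f32 using only the standard library's `exp` and `tanh` (independent of the tensor types in this series):

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

fn main() {
    for &x in &[-10.0f32, -1.0, 0.0, 1.0, 10.0] {
        println!(
            "x = {x:>5}: relu = {:>4}, sigmoid = {:.4}, tanh = {:.4}",
            x.max(0.0),
            sigmoid(x),
            x.tanh()
        );
    }
    // sigmoid and tanh are already saturated at ±10: their slopes there are
    // effectively zero — the vanishing-gradient problem described above.
}
```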

Activations are pure tensor operations — they work with any backend. We implement them as both tensor methods and standalone functions:

// In your tensor implementation
impl<B: Backend> Tensor<B> {
    pub fn relu(&self) -> Tensor<B> {
        // max(0, x)
        self.maximum(&Tensor::zeros(self.shape()))
    }

    pub fn sigmoid(&self) -> Tensor<B> {
        // 1 / (1 + exp(-x))
        let one = Tensor::ones(self.shape());
        &one / &(&one + (-self).exp())
    }

    pub fn tanh(&self) -> Tensor<B> {
        // (exp(x) - exp(-x)) / (exp(x) + exp(-x))
        let e_pos = self.exp();
        let e_neg = (-self).exp();
        &(&e_pos - &e_neg) / &(&e_pos + &e_neg)
    }
}

Each activation needs a backward implementation in the autodiff engine:

TensorOp::ReLU => {
    // d/dx ReLU(x) = 1 if x > 0, else 0
    let mask = B::gt(children[0].data(), &B::scalar(0.0));
    vec![Some(B::mul(upstream_grad, &mask))]
}

TensorOp::Sigmoid => {
    // d/dx σ(x) = σ(x)(1 - σ(x))
    let s = output.data();
    let one_minus_s = B::sub(&B::ones(s.shape()), s);
    let local_grad = B::mul(s, &one_minus_s);
    vec![Some(B::mul(upstream_grad, &local_grad))]
}

TensorOp::Tanh => {
    // d/dx tanh(x) = 1 - tanh²(x)
    let t = output.data();
    let t_sq = B::mul(t, t);
    let local_grad = B::sub(&B::ones(t.shape()), &t_sq);
    vec![Some(B::mul(upstream_grad, &local_grad))]
}
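These closed-form local gradients are easy to sanity-check against central finite differences. A standalone scalar version in plain f32 — no autodiff involved, the `finite_diff` helper is just for this check:

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Central finite difference: f'(x) ≈ (f(x+h) - f(x-h)) / 2h.
fn finite_diff(f: impl Fn(f32) -> f32, x: f32) -> f32 {
    let h = 1e-3;
    (f(x + h) - f(x - h)) / (2.0 * h)
}

fn main() {
    for &x in &[-2.0f32, -0.5, 0.5, 2.0] {
        let s = sigmoid(x);
        // σ'(x) = σ(x)(1 - σ(x))
        assert!((s * (1.0 - s) - finite_diff(sigmoid, x)).abs() < 1e-3);

        let t = x.tanh();
        // tanh'(x) = 1 - tanh²(x)
        assert!((1.0 - t * t - finite_diff(|v: f32| v.tanh(), x)).abs() < 1e-3);
    }
    println!("closed-form gradients match finite differences");
}
```

The same trick, lifted to whole tensors, is how you gradient-check the real backward implementations.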

Log-Softmax for Classification

Softmax converts logits to probabilities. But we rarely use softmax directly — we use log-softmax for numerical stability:

pub fn log_softmax<B: Backend>(logits: &Tensor<B>) -> Tensor<B> {
    // log(softmax(x)) = x - log(sum(exp(x)))
    // With max-subtraction for stability:
    let ndim = logits.ndim();
    let axis = if ndim > 0 { ndim - 1 } else { 0 };

    let max_logits = logits.max(Some(&[axis]), true);
    let shifted = logits - &max_logits;
    let log_sum_exp = shifted.exp().sum(Some(&[axis]), true).log();
    shifted - log_sum_exp
}
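The same computation on a plain slice makes the stability trick concrete: with max-subtraction, the exponentials of the log-probabilities sum to 1 even for logits that would overflow a naive `exp(x)`. (The `log_softmax_1d` helper below is illustrative only, not part of the library.)

```rust
fn log_softmax_1d(logits: &[f32]) -> Vec<f32> {
    // Subtract the max so no exp can overflow: exp(x - max) ≤ 1.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let log_sum_exp = logits.iter().map(|&x| (x - max).exp()).sum::<f32>().ln();
    logits.iter().map(|&x| x - max - log_sum_exp).collect()
}

fn main() {
    // These logits would overflow a naive exp(x): exp(1000) = inf in f32.
    let lp = log_softmax_1d(&[1000.0, 1001.0, 1002.0]);
    let prob_sum: f32 = lp.iter().map(|&x| x.exp()).sum();
    println!("log-probs: {lp:?}, probs sum to {prob_sum}");
    assert!(lp.iter().all(|x| x.is_finite()));
}
```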

Building Models

Without a formal Module trait, we compose layers manually. This is actually clearer:

graph LR
    subgraph "Input"
        x["x<br/>[batch, input_dim]"]
    end

    subgraph "Hidden Layer"
        l1["Linear<br/>(input → hidden)"]
        relu["ReLU"]
    end

    subgraph "Output Layer"
        l2["Linear<br/>(hidden → output)"]
    end

    subgraph "Output"
        y["y<br/>[batch, output_dim]"]
    end

    x --> l1 --> relu --> l2 --> y

    classDef input fill:none,stroke:#60a5fa,stroke-width:2px
    classDef layer fill:none,stroke:#a78bfa,stroke-width:2px
    classDef activation fill:none,stroke:#f472b6,stroke-width:2px
    classDef output fill:none,stroke:#34d399,stroke-width:2px
    class x input
    class l1,l2 layer
    class relu activation
    class y output

Data flows left-to-right: input → linear transform → non-linearity → linear transform → output. This is the simplest multi-layer perceptron (MLP).

pub struct MLP {
    l1: Linear,
    l2: Linear,
}

impl MLP {
    pub fn new(input_dim: usize, hidden_dim: usize, output_dim: usize) -> Self {
        MLP {
            l1: Linear::new(input_dim, hidden_dim, true),
            l2: Linear::new(hidden_dim, output_dim, true),
        }
    }

    pub fn forward(&self, x: &Tensor<CpuBackend>) -> Tensor<CpuBackend> {
        let h = self.l1.forward(x).relu();
        self.l2.forward(&h)
    }

    pub fn parameters(&self) -> Vec<&Tensor<CpuBackend>> {
        let mut params = self.l1.parameters();
        params.extend(self.l2.parameters());
        params
    }
}

No traits, no dynamic dispatch, no Box<dyn Module>. Just structs and methods. The Rust compiler can inline everything.

Loss Functions

Loss functions measure how wrong predictions are. They’re the starting point of backpropagation.

Unlike layers (which need random initialization), loss functions are pure tensor operations — they work with any backend.

Mean Squared Error (MSE)

For regression tasks:

\[\mathcal{L}_{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
pub fn mse_loss<B: Backend>(pred: &Tensor<B>, target: &Tensor<B>) -> Tensor<B> {
    let diff = pred - target;
    (&diff * &diff).mean(None, false)
}

The gradient is straightforward: $\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$
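That formula is easy to verify numerically: perturb one prediction at a time and compare the central finite difference against $\frac{2}{n}(\hat{y}_i - y_i)$. A plain-f32 sketch, separate from the tensor version above:

```rust
fn mse(pred: &[f32], target: &[f32]) -> f32 {
    let n = pred.len() as f32;
    pred.iter().zip(target).map(|(p, t)| (p - t).powi(2)).sum::<f32>() / n
}

fn main() {
    let target = [1.0f32, 0.0, 1.0, 0.0];
    let pred = [0.8f32, 0.3, 0.6, 0.1];
    let n = pred.len() as f32;

    for i in 0..pred.len() {
        let analytic = 2.0 / n * (pred[i] - target[i]);

        // Central finite difference in coordinate i.
        let h = 1e-3;
        let mut plus = pred;
        plus[i] += h;
        let mut minus = pred;
        minus[i] -= h;
        let numeric = (mse(&plus, &target) - mse(&minus, &target)) / (2.0 * h);

        assert!((analytic - numeric).abs() < 1e-3);
    }
    println!("MSE gradient matches 2/n * (pred - target)");
}
```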

Binary Cross-Entropy with Logits

For binary classification, we take raw logits (pre-sigmoid) for numerical stability:

pub fn binary_cross_entropy_with_logits<B: Backend>(
    logits: &Tensor<B>,
    targets: &Tensor<B>,
) -> Tensor<B> {
    // Numerically stable: max(logits, 0) - logits * targets + log(1 + exp(-|logits|))
    let relu_logits = logits.relu();
    let logits_targets = logits * targets;
    let abs_logits = logits.maximum(&(-logits));
    let one = Tensor::<B>::ones(logits.shape());
    let log_term = (&one + (-&abs_logits).exp()).log();

    let loss = relu_logits - logits_targets + log_term;
    loss.mean(None, false)
}
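To see why the reformulation matters, compare it per element against the naive $-[t \log \sigma(z) + (1-t)\log(1-\sigma(z))]$. A plain-f32 sketch with illustrative helper names (`bce_naive`, `bce_stable`):

```rust
fn sigmoid(z: f32) -> f32 {
    1.0 / (1.0 + (-z).exp())
}

/// Textbook form: computes sigmoid first, then takes logs.
fn bce_naive(z: f32, t: f32) -> f32 {
    let p = sigmoid(z);
    -(t * p.ln() + (1.0 - t) * (1.0 - p).ln())
}

/// Stable form: max(z, 0) - z·t + log(1 + exp(-|z|)), never takes log(0).
fn bce_stable(z: f32, t: f32) -> f32 {
    z.max(0.0) - z * t + (1.0 + (-z.abs()).exp()).ln()
}

fn main() {
    // For moderate logits the two forms agree...
    for &z in &[-3.0f32, -0.5, 0.0, 0.5, 3.0] {
        assert!((bce_naive(z, 1.0) - bce_stable(z, 1.0)).abs() < 1e-5);
    }
    // ...but for extreme logits the naive form hits log(1 - σ(z)) = log(0).
    assert!(bce_naive(100.0, 0.0).is_infinite());
    assert!(bce_stable(100.0, 0.0).is_finite()); // ≈ 100.0
    println!("stable formulation agrees on moderate logits and never produces inf");
}
```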

Soft Cross-Entropy Loss

For multi-class classification with soft labels (probabilities or one-hot):

pub fn soft_cross_entropy_loss<B: Backend>(
    logits: &Tensor<B>,  // [batch, num_classes]
    targets: &Tensor<B>, // [batch, num_classes] probabilities
) -> Tensor<B> {
    let log_probs = log_softmax(logits);

    // -sum(targets * log_probs) over classes, mean over batch
    let neg_log_probs = -(targets * &log_probs);
    neg_log_probs.sum(Some(&[1]), false).mean(None, false)
}

The beautiful gradient: for softmax + cross-entropy, $\frac{\partial \mathcal{L}}{\partial z_i} = \hat{y}_i - y_i$ (prediction minus target).
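That identity can be checked numerically on plain slices: perturb each logit and compare the finite-difference gradient of $-\sum_i y_i \log \text{softmax}(z)_i$ against $\text{softmax}(z) - y$. The helpers here are standalone, not the tensor versions:

```rust
fn softmax(z: &[f32]) -> Vec<f32> {
    let max = z.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = z.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// -Σ yᵢ log softmax(z)ᵢ
fn cross_entropy(z: &[f32], y: &[f32]) -> f32 {
    let p = softmax(z);
    -y.iter().zip(&p).map(|(yi, pi)| yi * pi.ln()).sum::<f32>()
}

fn main() {
    let z = [1.0f32, -0.5, 2.0];
    let y = [0.0f32, 0.0, 1.0]; // one-hot target
    let p = softmax(&z);

    for i in 0..z.len() {
        let analytic = p[i] - y[i]; // ∂L/∂zᵢ = ŷᵢ - yᵢ

        let h = 1e-2;
        let mut plus = z;
        plus[i] += h;
        let mut minus = z;
        minus[i] -= h;
        let numeric = (cross_entropy(&plus, &y) - cross_entropy(&minus, &y)) / (2.0 * h);

        assert!((analytic - numeric).abs() < 1e-3);
    }
    println!("softmax + cross-entropy gradient is prediction minus target");
}
```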

Visualizing Loss Landscapes

Different losses create different optimization landscapes:

Cross-entropy has steeper gradients for wrong predictions, driving faster learning.

Key difference:

  • MSE (behind a sigmoid): the gradient through the sigmoid vanishes as the output saturates near 0 or 1 — even when the prediction is confidently wrong. Training stalls.
  • Cross-entropy: the gradient with respect to the prediction grows like $1/\hat{y}$ as a wrong prediction approaches 0. Strong signal to correct mistakes.

This is why cross-entropy is preferred for classification.
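A small numeric illustration of the contrast, in plain f32: both losses are taken through a sigmoid at logit $z$ with target 1, and the `grads_when_wrong` helper is hypothetical, not library code.

```rust
/// Gradients w.r.t. the logit z when the target is 1:
/// returns (d/dz of (σ(z) - 1)², d/dz of -log σ(z)).
fn grads_when_wrong(z: f32) -> (f32, f32) {
    let s = 1.0 / (1.0 + (-z).exp()); // σ(z)
    let mse_grad = 2.0 * (s - 1.0) * s * (1.0 - s); // vanishes as σ saturates
    let ce_grad = s - 1.0;                          // stays near -1
    (mse_grad, ce_grad)
}

fn main() {
    // Confident and wrong: target = 1 but the logit is strongly negative.
    let (mse_grad, ce_grad) = grads_when_wrong(-6.0);
    println!("MSE gradient ≈ {mse_grad:.4}, cross-entropy gradient ≈ {ce_grad:.4}");
}
```

The MSE path delivers a gradient of a few thousandths while cross-entropy still pushes with nearly full strength — the stalling behavior described above, in two numbers.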

When to Use Which Loss

| Task | Output | Activation | Loss |
| --- | --- | --- | --- |
| Regression | Continuous values | None (linear) | mse_loss |
| Binary classification | Probability | Sigmoid | binary_cross_entropy_with_logits |
| Multi-class (single label) | Class probabilities | Softmax | soft_cross_entropy_loss |

Putting It Together

A complete training step:

use ad_tensor::prelude::*;
use ad_backend_cpu::CpuBackend;
use ad_nn::{Linear, mse_loss, Adam};

// Create a simple network
let mut l1 = Linear::new(2, 8, true);
let mut l2 = Linear::new(8, 1, true);
let mut opt = Adam::new(0.01);

// Training data: XOR problem
let inputs = vec![
    vec![0.0, 0.0], vec![0.0, 1.0],
    vec![1.0, 0.0], vec![1.0, 1.0],
];
let targets = vec![0.0, 1.0, 1.0, 0.0];

for _epoch in 0..1000 {
    for (input, &target) in inputs.iter().zip(&targets) {
        // Forward pass
        let x = Tensor::constant(CpuBackend::from_vec(input.clone(), Shape::new(vec![1, 2])));
        let y = Tensor::constant(CpuBackend::from_vec(vec![target], Shape::new(vec![1, 1])));

        let h = l1.forward(&x).relu();
        let pred = l2.forward(&h);
        let loss = mse_loss(&pred, &y);

        // Backward pass
        let grads = loss.backward();

        // Update parameters
        opt.step(&mut l1.weight, grads.wrt(&l1.weight).unwrap());
        if let Some(ref mut bias) = l1.bias {
            opt.step(bias, grads.wrt(bias).unwrap());
        }
        opt.step(&mut l2.weight, grads.wrt(&l2.weight).unwrap());
        if let Some(ref mut bias) = l2.bias {
            opt.step(bias, grads.wrt(bias).unwrap());
        }
    }
}

The gradient for every parameter flows automatically through the computation graph — from loss, through the layers, to the weights.

graph LR
    subgraph "Forward Pass →"
        direction LR
        x1["Input"] --> h1["Hidden"] --> o1["Output"] --> loss1["Loss"]
    end

    subgraph "← Backward Pass"
        direction RL
        loss2["∂L/∂L = 1"] --> o2["∂L/∂output"] --> h2["∂L/∂hidden"] --> w2["∂L/∂weights"]
    end

    loss1 -.-> loss2

    classDef forward fill:none,stroke:#60a5fa,stroke-width:2px
    classDef backward fill:none,stroke:#f472b6,stroke-width:2px
    class x1,h1,o1,loss1 forward
    class loss2,o2,h2,w2 backward
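As a complement, here is the same kind of training step written out by hand in plain Rust — no autodiff, every gradient derived manually via the chain rule — so the backward flow in the diagram is fully visible. This is an illustrative sketch, not the library code: it uses tanh hidden units instead of ReLU (they solve XOR more reliably at this tiny scale) and a dependency-free xorshift RNG for initialization.

```rust
fn xorshift(state: &mut u64) -> f32 {
    // Pseudo-random f32 in [-0.5, 0.5), no external crates.
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state >> 40) as f32 / (1u64 << 24) as f32 - 0.5
}

/// Train a 2-H-1 MLP on XOR with hand-written backprop.
/// Returns (first-epoch summed loss, last-epoch summed loss).
fn train_xor(epochs: usize) -> (f32, f32) {
    const H: usize = 8;
    let mut rng = 0x9E3779B9u64;
    let mut w1 = [[0.0f32; 2]; H];
    let mut b1 = [0.0f32; H];
    let mut w2 = [0.0f32; H];
    let mut b2 = 0.0f32;
    for j in 0..H {
        w1[j][0] = xorshift(&mut rng);
        w1[j][1] = xorshift(&mut rng);
        w2[j] = xorshift(&mut rng);
    }

    let data: [([f32; 2], f32); 4] =
        [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)];
    let lr = 0.1;
    let (mut first, mut last) = (0.0, 0.0);

    for epoch in 0..epochs {
        let mut epoch_loss = 0.0;
        for &(x, t) in &data {
            // Forward: h = tanh(W1·x + b1), o = w2·h + b2, loss = (o - t)²
            let mut h = [0.0f32; H];
            for j in 0..H {
                h[j] = (w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j]).tanh();
            }
            let o: f32 = w2.iter().zip(&h).map(|(w, h)| w * h).sum::<f32>() + b2;
            epoch_loss += (o - t) * (o - t);

            // Backward: chain rule, from the loss back to each parameter.
            let d_o = 2.0 * (o - t); // ∂loss/∂o
            for j in 0..H {
                let d_pre = d_o * w2[j] * (1.0 - h[j] * h[j]); // through tanh
                w2[j] -= lr * d_o * h[j];
                w1[j][0] -= lr * d_pre * x[0];
                w1[j][1] -= lr * d_pre * x[1];
                b1[j] -= lr * d_pre;
            }
            b2 -= lr * d_o;
        }
        if epoch == 0 {
            first = epoch_loss;
        }
        last = epoch_loss;
    }
    (first, last)
}

fn main() {
    let (first, last) = train_xor(5000);
    println!("summed squared error: {first:.3} -> {last:.5}");
}
```

Every line in the backward loop corresponds to one edge in the diagram; the autodiff engine simply generates these updates for you.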

What’s Next

We have models with parameters and loss functions that produce gradients. But gradients alone don’t train anything. We need optimizers to turn gradients into parameter updates.

Part 3 implements SGD, Momentum, and Adam — the algorithms that make learning happen.


Part 2 of the “Deep Learning from Scratch in Rust” series. Part 1 covers tensor gradients, Part 3 covers optimizers.