Deep Learning from Scratch in Rust, Part 1 — Tensor Gradients

In the Autodiff series, we built a working autodiff engine for scalar functions. Clean, elegant, and… completely impractical. But building it was so much fun that I decided to take it all the way — from toy scalar engine to a real deep learning framework.

Real neural networks don’t operate on individual numbers. They process tensors — multi-dimensional arrays where a single forward pass might involve millions of values. Today we’ll generalize our scalar engine to tensors and discover the new problems that emerge.

Spoiler: broadcasting is where the elegance gets messy.

The Scalar-to-Tensor Leap

Our scalar engine had a simple mental model:

struct Node {
    value: f64,          // One number
    children: Vec<Expr>,
    op: Op,
}

The tensor version looks similar:

struct TensorNode<B: Backend> {
    data: B::Tensor,     // Many numbers with a shape
    children: Vec<Tensor<B>>,
    op: TensorOp,
}

At first glance, we just swap f64 for a tensor type. All the ops become element-wise. Easy, right?

Not quite. But first — what’s with the B: Backend generic?

The Backend Abstraction

Our scalar engine had one implementation. Tensor operations have many: CPU loops, SIMD vectorization, GPU shaders, Apple’s Metal, CUDA. We don’t want to rewrite the autodiff logic for each.

The solution is a backend trait that abstracts tensor operations:

pub trait Backend: Clone + 'static {
    type Tensor: TensorData;

    // Creation
    fn from_vec(data: Vec<f32>, shape: Shape) -> Self::Tensor;
    fn zeros(shape: &Shape) -> Self::Tensor;
    fn ones(shape: &Shape) -> Self::Tensor;

    // Element-wise ops
    fn add(a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn mul(a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn exp(a: &Self::Tensor) -> Self::Tensor;

    // Reductions
    fn sum(a: &Self::Tensor, axes: Option<&[usize]>, keepdims: bool) -> Self::Tensor;

    // Linear algebra
    fn matmul(a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn transpose(a: &Self::Tensor, axes: Option<&[usize]>) -> Self::Tensor;

    // ... more ops
}

A CPU backend implements these with standard loops. A Metal backend dispatches to GPU shaders. The autodiff engine doesn’t care — it just calls B::add() and the right thing happens.

pub struct CpuBackend;
pub struct MetalBackend;

impl Backend for CpuBackend {
    type Tensor = CpuTensor;  // Vec<f32> + Shape

    fn add(a: &CpuTensor, b: &CpuTensor) -> CpuTensor {
        // Element-wise loop on CPU
    }
}

impl Backend for MetalBackend {
    type Tensor = MetalTensor;  // GPU buffer handle + Shape

    fn add(a: &MetalTensor, b: &MetalTensor) -> MetalTensor {
        // Dispatch Metal compute shader
    }
}

This pattern lets us write the autodiff logic once and run it anywhere. The gradient computation, topological sort, chain rule — all backend-agnostic. Only the raw tensor math changes.

Now, back to the new problems that tensors introduce.

Problem 1: Shape Tracking

Scalars don’t have shape. Tensors do, and shapes must be compatible:

let a = tensor([2, 3]);  // 2×3 matrix
let b = tensor([3, 4]);  // 3×4 matrix

let c = a + b;  // ERROR: shapes don't match
let d = a.matmul(&b);  // OK: [2, 3] @ [3, 4] → [2, 4]

Every operation needs shape logic:

Element-wise: shapes must match (or be broadcastable)
MatMul: inner dimensions must match
Reductions: output shape depends on which axes are reduced

The computation graph now carries shape metadata:

#[derive(Debug, Clone)]
pub struct Shape {
    dims: Vec<usize>,
}

pub trait TensorData {
    fn shape(&self) -> &Shape;
    fn numel(&self) -> usize { self.shape().iter().product() }
}

Problem 2: Broadcasting Semantics

NumPy popularized broadcasting — automatic shape expansion for binary ops:

a = np.array([1, 2, 3])           # [3]
b = np.array([[1], [2], [3]])     # [3, 1]
c = a + b                          # [3, 3]!

The rules:

Align shapes from the right
Each dimension must match or be 1
Size-1 dims get “stretched” to match

    [3]    pads to  [1, 3]
 [3, 1]             [3, 1]
 ------             ------
 result:            [3, 3]

This is convenient for users but creates a gradient problem.

See Broadcasting in Action

Here’s an interactive visualization showing the complete broadcasting flow. Hover over each step to highlight it:

Dashed cells show "virtual" copies from broadcasting. Visualization by Claude Opus 4.5

The Broadcast Gradient Problem

Forward pass:

a: [3]     broadcasts to [3, 3]
b: [3, 1]  broadcasts to [3, 3]
c = a + b: [3, 3]

Backward pass:

dc: [3, 3]  ← upstream gradient
da: [???]   ← need shape [3]
db: [???]   ← need shape [3, 1]

The gradient dc has shape [3, 3], but a had shape [3]. How do we get back?

The answer: sum over the broadcast dimensions.

When a was broadcast from [3] to [3, 3], each element of a contributed to 3 elements of c. In the backward pass, we sum those 3 gradient contributions:

// During forward: a [3] → broadcast → [3, 3]
// During backward: dc [3, 3] → sum axis 0 → da [3]

fn sum_to(grad: &Tensor, target_shape: &Shape) -> Tensor {
    // Find axes that were broadcast (added or stretched)
    // Sum over those axes to reduce back to target_shape
    let mut result = grad.clone();
    for axis in broadcast_axes(grad.shape(), target_shape) {
        result = result.sum(Some(&[axis]), false);
    }
    result.reshape(target_shape)
}

This is the key insight: broadcasting in forward = reduction in backward.

Implementing sum_to

fn compute_broadcast_axes(from: &Shape, to: &Shape) -> Vec<usize> {
    let mut axes = vec![];
    let offset = to.ndim() - from.ndim();

    // Axes that were added (leading dims)
    for i in 0..offset {
        axes.push(i);
    }

    // Axes that were stretched (size 1 → size N)
    for i in 0..from.ndim() {
        if from.dim(i) == 1 && to.dim(i + offset) > 1 {
            axes.push(i + offset);
        }
    }

    axes
}

fn sum_to<B: Backend>(grad: &B::Tensor, target: &Shape) -> B::Tensor {
    let axes = compute_broadcast_axes(target, grad.shape());
    if axes.is_empty() {
        return B::clone_tensor(grad);
    }

    let mut result = B::clone_tensor(grad);
    // Sum in reverse order to keep axis indices valid
    for &axis in axes.iter().rev() {
        result = B::sum(&result, Some(&[axis]), false);
    }

    // Reshape in case we summed to scalar-like shape
    B::reshape(&result, target)
}

Every binary operation that broadcasts must use sum_to in its backward pass:

TensorOp::Add => {
    // Forward: broadcast both to output shape, add
    // Backward: sum gradients back to input shapes
    vec![
        Some(B::sum_to(upstream_grad, children[0].shape())),
        Some(B::sum_to(upstream_grad, children[1].shape())),
    ]
}

TensorOp::Mul => {
    // d(a*b)/da = b, d(a*b)/db = a
    // But we need to broadcast, multiply, then sum_to
    let b_expanded = B::broadcast_to(children[1].data(), output_shape);
    let a_expanded = B::broadcast_to(children[0].data(), output_shape);

    let local_a = B::mul(upstream_grad, &b_expanded);
    let local_b = B::mul(upstream_grad, &a_expanded);

    vec![
        Some(B::sum_to(&local_a, children[0].shape())),
        Some(B::sum_to(&local_b, children[1].shape())),
    ]
}

Problem 3: Reduction Operations

Reductions (sum, mean, max) collapse dimensions:

let x = tensor([[1, 2, 3],
                [4, 5, 6]]);  // [2, 3]

x.sum(None, false)         // → scalar (sum all)
x.sum(Some(&[0]), false)   // → [3] (sum rows)
x.sum(Some(&[1]), false)   // → [2] (sum cols)
x.sum(Some(&[1]), true)    // → [2, 1] (keepdims)

Reduction and Gradient Flow

Reductions collapse dimensions going forward. The gradient does the reverse — it expands back to the original shape:

Reduction collapses axes; gradient broadcast restores them. Visualization by Claude Opus 4.5

The gradient for sum is straightforward — broadcast back to input shape:

TensorOp::Sum { axes, keepdims } => {
    let input_shape = children[0].shape();

    let expanded = if *keepdims {
        upstream_grad.clone()
    } else {
        // Need to unsqueeze removed axes first
        expand_for_broadcast(upstream_grad, input_shape, axes)
    };

    vec![Some(B::broadcast_to(&expanded, input_shape))]
}

Mean divides by the count:

TensorOp::Mean { axes, keepdims } => {
    let count: usize = match axes {
        Some(ax) => ax.iter().map(|&i| input_shape.dim(i)).product(),
        None => input_shape.numel(),
    };

    let sum_grad = /* same as Sum */;
    vec![Some(B::div(&sum_grad, &B::scalar(count as f32)))]
}

Max/Min are trickier — gradient only flows to the maximum element(s):

TensorOp::Max { axes, keepdims } => {
    // Create mask: 1 where input == max value, 0 elsewhere
    let expanded_max = expand_and_broadcast(output, input_shape, axes);
    let mask = B::eq(children[0].data(), &expanded_max);

    let expanded_grad = expand_and_broadcast(upstream_grad, input_shape, axes);
    vec![Some(B::mul(&expanded_grad, &mask))]
}

Problem 4: Matrix Multiplication Gradients

MatMul is the workhorse of deep learning. For C = A @ B:

A: [M, K]
B: [K, N]
C: [M, N]

The gradients involve transposes:

\[\frac{\partial L}{\partial A} = \frac{\partial L}{\partial C} \cdot B^T\] \[\frac{\partial L}{\partial B} = A^T \cdot \frac{\partial L}{\partial C}\]

TensorOp::MatMul => {
    let a = children[0].data();
    let b = children[1].data();

    // dC is [M, N]
    // dA = dC @ B^T → [M, N] @ [N, K] → [M, K] ✓
    // dB = A^T @ dC → [K, M] @ [M, N] → [K, N] ✓
    let grad_a = B::matmul(upstream_grad, &B::transpose(b, None));
    let grad_b = B::matmul(&B::transpose(a, None), upstream_grad);

    vec![Some(grad_a), Some(grad_b)]
}

Batched matmul (e.g., [batch, M, K] @ [batch, K, N]) follows the same pattern but operates on the last two dimensions.

The Complete Backward Pass

With these pieces, the backward algorithm is the same as scalars:

pub fn backward<B: Backend>(output: &Tensor<B>) -> Gradients<B> {
    let topo_order = topological_sort(output);

    // Key difference: initialize with ones(shape) not 1.0
    let mut adjoints: HashMap<NodeId, B::Tensor> = HashMap::new();
    adjoints.insert(output.id(), B::ones(output.shape()));

    for expr in topo_order.iter().rev() {
        let Some(upstream) = adjoints.get(&expr.id()) else { continue };
        let upstream = B::clone_tensor(upstream);

        let child_grads = compute_local_gradients::<B>(expr, &upstream);

        for (child, grad) in expr.children().iter().zip(child_grads) {
            if let Some(grad) = grad {
                adjoints
                    .entry(child.id())
                    .and_modify(|acc| B::accumulate_grad(acc, &grad))
                    .or_insert(grad);
            }
        }
    }

    // ...
}

The differences from scalar autodiff:

B::ones(shape) instead of 1.0
compute_local_gradients returns tensors, handles broadcasting
accumulate_grad adds tensors element-wise

Testing Tensor Gradients

Numerical gradient checking still works, just with more points:

fn numerical_gradient<B, F>(f: F, x: &Tensor<B>, eps: f32) -> B::Tensor
where
    F: Fn(&Tensor<B>) -> Tensor<B>,
{
    let mut grad_data = vec![0.0; x.numel()];

    for i in 0..x.numel() {
        // Perturb element i by ±eps
        let mut plus_data = x.as_slice().to_vec();
        let mut minus_data = x.as_slice().to_vec();
        plus_data[i] += eps;
        minus_data[i] -= eps;

        let plus = Tensor::from_vec(plus_data, x.shape().clone());
        let minus = Tensor::from_vec(minus_data, x.shape().clone());

        let plus_out = f(&plus).sum(None, false).item();
        let minus_out = f(&minus).sum(None, false).item();

        grad_data[i] = (plus_out - minus_out) / (2.0 * eps);
    }

    B::from_vec(grad_data, x.shape().clone())
}

This is O(n) forward passes for n elements — slow, but good for testing.

Summary: What’s New with Tensors

Scalar Autodiff	Tensor Autodiff
Values are `f64`	Values are `Tensor` with shape
Ops take scalars	Ops handle broadcasting
Gradients are scalars	Gradients are tensors, same shape as inputs
—	Need `sum_to` for broadcast gradients
—	Need `expand` for reduction gradients
—	MatMul gradients involve transposes

The core algorithm (topo sort, reverse traversal, chain rule) is unchanged. The bookkeeping around shapes is what makes tensor autodiff more complex.

What’s Next

We have tensors with gradients. But an autodiff engine isn’t a neural network yet. We need:

Layers: Encapsulate parameters and forward computation
Models: Compose layers into trainable architectures
Loss functions: Define what “correct” means

Part 2 will build these building blocks.

Part 1 of the “Deep Learning from Scratch in Rust” series. This builds on the Autodiff in Rust series.