• Deep Learning from Scratch in Rust, Part 3 — Optimizers

    We have gradients. Now what?

    In Part 2, we built layers, models, and loss functions. Given a model and a loss, autodiff computes ∂loss/∂θ for every parameter θ. But gradients alone don’t train a model. We need an optimizer to turn gradients into parameter updates.

    Today we’ll implement the three most important optimizers: SGD, SGD with Momentum, and Adam. Along the way, we’ll see why Adam became the default choice.
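
    To make the three update rules concrete, here is a minimal sketch. It is written in Python for brevity rather than the Rust used in the post, and it works on plain lists of floats instead of tensors. The function names (`sgd_step`, `momentum_step`, `adam_step`) and the default hyperparameters are illustrative assumptions, not the post's API.

```python
import math

def sgd_step(params, grads, lr=0.1):
    """Plain SGD: move each parameter against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

def momentum_step(params, grads, velocity, lr=0.1, beta=0.9):
    """SGD with momentum: accumulate a running velocity of past gradients."""
    velocity = [beta * v + g for v, g in zip(velocity, grads)]
    params = [p - lr * v for p, v in zip(params, velocity)]
    return params, velocity

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: per-parameter step sizes from bias-corrected moment estimates.

    `t` is the 1-based step count, needed for the bias correction.
    """
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    params = [
        p - lr * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(params, m_hat, v_hat)
    ]
    return params, m, v
```

    On the first Adam step the bias correction makes the update roughly `lr * sign(gradient)`, regardless of the gradient's magnitude. That insensitivity to gradient scale is one commonly cited reason for Adam's popularity.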

  • Deep Learning from Scratch in Rust, Part 2 — Layers, Models, and Loss

    In Part 1, we built tensor autodiff — gradients flow through multi-dimensional arrays with broadcasting and reductions handled correctly. But we still don’t have a neural network.

    What’s missing? The building blocks: layers that encapsulate learnable parameters, models that compose layers, and loss functions that define what “correct” means.

    Today we bridge the gap from “autodiff engine” to “trainable model.”

  • Deep Learning from Scratch in Rust, Part 1 — Tensor Gradients

    In the Autodiff series, we built a working autodiff engine for scalar functions. Clean, elegant, and… completely impractical. But building it was so much fun that I decided to take it all the way — from toy scalar engine to a real deep learning framework.

    Real neural networks don’t operate on individual numbers. They process tensors — multi-dimensional arrays where a single forward pass might involve millions of values. Today we’ll generalize our scalar engine to tensors and discover the new problems that emerge.

    Spoiler: broadcasting is where the elegance gets messy.

  • Building XGBoost from Scratch in Rust, Part 3 — Scaling to Terabytes

    In Part 2, we built a working gradient boosted tree implementation. It produces correct, XGBoost-compatible models. But try running it on 100 million rows and you’ll be waiting a while. Let’s understand why, and how production systems solve it.

  • Building XGBoost from Scratch in Rust, Part 2 — Implementation

    In Part 1, we covered the theory behind gradient boosting. Now let’s implement it. We’ll build a gradient boosted tree library in Rust that produces XGBoost-compatible models.

  • From Decision Trees to XGBoost: A Visual Guide to Gradient Boosting, Part 1 — Theory

    You’ve probably heard of XGBoost—it’s won countless Kaggle competitions and powers prediction systems everywhere. But how does it actually work? In this post, we’ll build up the intuition from simple decision trees to the full gradient boosting algorithm.

  • Autodiff in Rust, Part 2 — A Scalar Autodiff Engine

    In Part 1, we built intuition for autodiff: computation as graphs, derivatives as sensitivity flowing backward, the chain rule as path multiplication.

    Now let’s make it real. We’ll implement a working autodiff engine in Rust — something you can actually use to compute gradients.

  • Autodiff in Rust, Part 1 — Thinking in Graphs

    I’ve always found automatic differentiation a bit magical. You write some math, call .backward(), and somehow the computer figures out all the derivatives. For years I used it without really understanding it.

    Back in grad school, when I was writing DAE/ODE solvers using RK45 integrators, Professor Michael Baldea would mention that automatic differentiation was the “hot new research area.” It sounded like magic to me at the time. Fast forward to today, and it’s everywhere — PyTorch, JAX, tinygrad, anywhere there’s backprop and gradients. What was once cutting-edge research is now something we take for granted.

    Then I sat down and implemented one from scratch. Turns out it’s not magic at all — it’s actually a beautiful idea that clicks once you see it the right way.

  • What I Learned Using Claude to Rewrite a Legacy C/C++ Codebase

    I recently completed a two-month project rewriting a large cross-platform (Windows/macOS) legacy C/C++ codebase to modern C++17 using Claude Opus 4.5. The codebase had all the hallmarks of decades-old code: an arcane build system (Perl scripts + Visual Studio projects + Xcode projects), raw pointers everywhere, malloc and new intermixed, custom container libraries, and global state scattered throughout.

    This post shares the key learnings from that experience.

  • Building High-Performance UIs with ImGui: Lessons from Tracy Profiler

    Tracy is a real-time, nanosecond resolution profiler used by game developers and performance engineers worldwide. What makes it remarkable isn’t just its profiling capabilities - it’s the fact that the entire UI, handling millions of data points with buttery-smooth 60fps rendering, is built with Dear ImGui. I recently extracted the UI boilerplate from Tracy into a standalone starter project, and the patterns I found are worth sharing.

  • `std::ref` and `std::reference_wrapper` in C++

    When refactoring legacy C++ codebases, we often have to deal with functions or class methods that take a pointer as an argument and then do a bunch of null checks. This is a common pattern in C++ codebases that have not been modernized yet.

    Modern C++ has introduced a few utilities to help with this pattern. Among them are std::ref and std::reference_wrapper. In this post, I wanted to talk about these tools and how they can improve the safety and readability of modern C++ code.

  • Let's build an asyncio runtime from scratch in Python

    asyncio in Python is a library that provides a way to write concurrent code using the async and await syntax. It is built on top of the asyncio event loop, which is a single-threaded event loop that runs tasks concurrently. Inspired by a similar post by Jacob, we will explore how asyncio works from scratch by implementing our own event loop runtime with Python generators.
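
    The core idea can be sketched in a few lines: a generator pauses at each `yield`, and a loop decides which paused generator to resume next. The names below (`countdown`, `run`) are hypothetical and are not taken from the post or from asyncio itself. This is a round-robin toy that ignores I/O, timers, and futures.

```python
from collections import deque

def countdown(name, n, log):
    """A generator-based task: each `yield` hands control back to the loop."""
    while n > 0:
        log.append(f"{name}:{n}")
        yield
        n -= 1

def run(tasks):
    """Round-robin scheduler: resume each task until its next `yield`."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task up to its next yield
        except StopIteration:
            continue            # task finished; do not reschedule it
        ready.append(task)      # still alive; put it at the back of the queue

log = []
run([countdown("a", 2, log), countdown("b", 1, log)])
print(log)
```

    The two tasks interleave on a single thread. A real event loop adds a way to park tasks until a socket or timer is ready instead of resuming them unconditionally.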

  • jthread in C++20

    std::jthread, introduced in C++20, is a new thread class that supports cooperative cancellation and joins automatically on destruction. It is a wrapper around std::thread that provides a few additional features. In this post, I wanted to talk about std::jthread and how it can be used in modern C++ codebases.

    Advantages over C++11 std::thread:

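    • cancellable: it supports cooperative cancellation through std::stop_token, so the thread function can be asked to stop and exit early, whereas std::thread has no built-in way to request a stop
    • works better with the RAII pattern, since its destructor requests a stop and joins the thread automatically, whereas destroying a still-joinable std::thread calls std::terminate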
  • Build a strong type system via Python typehints

    Python's type hinting system gets more powerful with each Python version. Projects I’m involved with are now enforcing type hints on all new code. This has been great for a variety of reasons:

    • Improves IDE support in terms of linting, autocompletion, and refactoring
    • Makes the codebase more readable and maintainable
    • Helps catch bugs early in the development cycle

    In this post, I’ll share some of the additional features we’ve been able to enable now that most of our codebases are typehinted.

  • Get the Python GIL play nice with C++

    It is no surprise that the GIL is one of the biggest drawbacks of using Python in performance-oriented applications. The GIL, or Global Interpreter Lock, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecode at once. This means that even if you have multiple threads, only one of them can execute Python code at a time. This can be a major bottleneck for applications that require high performance, as it limits the amount of parallelism that can be achieved.

    To defeat the GIL, there are two commonly taken paths:

    • Opt for multiprocessing instead of threads.
    • Rewrite the core performance-critical code in a lower-level language such as C++ or Rust.

    Today, let’s talk about the second approach. With excellent next-generation binding libraries such as pybind11 and pyo3, it has become a lot simpler to support Rust/C++ code in a Python project.

    However, porting existing application code to C++ / Rust often does not happen overnight. In the beginning, only a few performance-critical functions are ported. As a result, it is common to see a mix of Python and C++ / Rust code in the same project, where the threading architecture and parallelism code is still in Python while the performance-critical code is in C++ / Rust.

    I’ve personally dealt with such systems where the GIL became a major performance bottleneck because of a poor understanding of how it works. As a result, I’m sharing my findings here.
