Building XGBoost from Scratch in Rust, Part 2 — Implementation
In Part 1, we covered the theory behind gradient boosting. Now let’s implement it. We’ll build a gradient boosted tree library in Rust that produces XGBoost-compatible models.
Read more
From Decision Trees to XGBoost: A Visual Guide to Gradient Boosting, Part 1 — Theory
You’ve probably heard of XGBoost—it’s won countless Kaggle competitions and powers prediction systems everywhere. But how does it actually work? In this post, we’ll build up the intuition from simple decision trees to the full gradient boosting algorithm.
Read more
Autodiff in Rust, Part 2 — A Scalar Autodiff Engine
Autodiff in Rust, Part 1 — Thinking in Graphs
I’ve always found automatic differentiation a bit magical. You write some math, call `.backward()`, and somehow the computer figures out all the derivatives. For years I used it without really understanding it. Back in grad school, when I was writing DAE/ODE solvers using RK45 integrators, Professor Michael Baldea would mention that automatic differentiation was the “hot new research area.” It sounded like magic to me at the time. Fast forward to today, and it’s everywhere — PyTorch, JAX, tinygrad, anywhere there’s backprop and gradients. What was once cutting-edge research is now something we take for granted.
Then I sat down and implemented one from scratch. Turns out it’s not magic at all — it’s actually a beautiful idea that clicks once you see it the right way.
Read more
What I Learned Using Claude to Rewrite a Legacy C/C++ Codebase
I recently completed a two-month project rewriting a large cross-platform (Windows/macOS) legacy C/C++ codebase to modern C++17 using Claude Opus 4.5. The codebase had all the hallmarks of decades-old code: an arcane build system (Perl scripts + Visual Studio projects + Xcode projects), raw pointers everywhere, `malloc` and `new` intermixed, custom container libraries, and global state scattered throughout. This post shares the key learnings from that experience.
Read more
Building High-Performance UIs with ImGui: Lessons from Tracy Profiler
Tracy is a real-time, nanosecond resolution profiler used by game developers and performance engineers worldwide. What makes it remarkable isn’t just its profiling capabilities - it’s the fact that the entire UI, handling millions of data points with buttery-smooth 60fps rendering, is built with Dear ImGui. I recently extracted the UI boilerplate from Tracy into a standalone starter project, and the patterns I found are worth sharing.
Read more
`std::ref` and `std::reference_wrapper` in C++
In refactoring legacy C++ codebases, we often have to deal with a lot of functions or class methods that take a pointer as an argument and then do a bunch of null checks. This is a common pattern in C++ codebases that have not been modernized yet.
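To make the pattern concrete, here is a small hypothetical sketch (the function names are mine, not from the post): the raw-pointer version with its defensive null check, next to a `std::reference_wrapper` alternative.

```cpp
#include <functional>
#include <iostream>
#include <vector>

// Legacy style: a pointer parameter forces a null check on every call path.
void scale_legacy(std::vector<double>* values, double factor) {
    if (values == nullptr) return;   // the defensive check this pattern requires
    for (double& v : *values) v *= factor;
}

// With std::reference_wrapper there is nothing to null-check, and unlike a
// plain reference it can still be re-seated or stored in containers.
void scale(std::reference_wrapper<std::vector<double>> values, double factor) {
    for (double& v : values.get()) v *= factor;
}

int main() {
    std::vector<double> data{1.0, 2.0, 3.0};
    scale_legacy(&data, 10.0);     // old call style: pass an address
    scale(std::ref(data), 0.5);    // new call style: std::ref builds the wrapper
    for (double v : data) std::cout << v << ' ';   // prints: 5 10 15
}
```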
Modern C++ has introduced a few utilities to help with this pattern. Among them are `std::ref` and `std::reference_wrapper`. In this post, I wanted to talk about these tools and how they can improve the safety and readability of modern C++ code.
Read more
Let's build an asyncio runtime from scratch in Python
`asyncio` in Python is a library that provides a way to write concurrent code using the `async` and `await` syntax. It is built on top of the `asyncio` event loop, a single-threaded event loop that runs tasks concurrently. Inspired by a similar post by Jacob, we will explore how `asyncio` works from scratch by implementing our own event loop runtime with Python generators.
Read more
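For a flavor of the idea before diving into the post, here is a tiny, hypothetical sketch (much simpler than a real implementation): generators act as tasks, and a loop resumes them round-robin.

```python
from collections import deque

def run(tasks):
    # Extremely simplified "event loop": each task is a generator that yields
    # whenever it wants to hand control back to the loop.
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task until its next yield
        except StopIteration:
            continue            # task finished; drop it
        ready.append(task)      # otherwise, reschedule it

def countdown(name, n):
    while n:
        print(name, n)
        n -= 1
        yield                   # cooperative "await": give other tasks a turn

run([countdown("a", 3), countdown("b", 2)])
```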
jthread in C++20
`std::jthread`, introduced in C++20, is a new thread class that is cancellable and joinable. It is a wrapper around `std::thread` that provides a few additional features. In this post, I wanted to talk about `std::jthread` and how it can be used in modern C++ codebases.
Advantages over C++11 `std::thread` (a short sketch follows the list):
- cancellable: it supports cooperative cancellation via `std::stop_token`, unlike `std::thread`, which can only run to the end of its thread function
- works better with the RAII pattern, since its destructor automatically requests a stop and then joins
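To make those two points concrete, here is a minimal sketch (mine, not from the post) of cooperative cancellation with `std::jthread` and `std::stop_token`:

```cpp
#include <chrono>
#include <iostream>
#include <stop_token>
#include <thread>

int main() {
    // The lambda takes a std::stop_token, so the jthread passes one in
    // automatically; the worker polls it and exits cooperatively.
    std::jthread worker([](std::stop_token st) {
        while (!st.stop_requested()) {
            std::cout << "working...\n";
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
        std::cout << "stop requested, cleaning up\n";
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(350));
    // No explicit request_stop()/join() needed: the destructor does both,
    // which is what makes std::jthread RAII-friendly.
}
```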
Build a strong type system via Python typehints
Python’s typehinting system is getting more powerful with each Python version. Projects I’m involved with are now enforcing typehints on all new code. This has been great for a variety of reasons:
- Improves IDE support in terms of linting, autocompletion, and refactoring
- Makes the codebase more readable and maintainable
- Helps catch bugs early in the development cycle
In this post, I’ll share some of the additional features we’ve been able to enable now that most of our codebases are typehinted.
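As one illustrative possibility (not necessarily a feature the post covers), `typing.NewType` lets a type checker tell apart values that are all plain strings at runtime:

```python
from typing import NewType

UserId = NewType("UserId", str)
OrderId = NewType("OrderId", str)

def cancel_order(order_id: OrderId) -> None:
    print(f"cancelling {order_id}")

user_id = UserId("u-123")
order_id = OrderId("o-456")

cancel_order(order_id)   # OK
cancel_order(user_id)    # a type checker (mypy/pyright) flags this, even though
                         # both values are plain strings at runtime
```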
Read more
Get the Python GIL to play nice with C++
It is no surprise that the GIL is one of the biggest drawbacks of using Python in performance-oriented applications. The GIL, or Global Interpreter Lock, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecode at once. This means that even if you have multiple threads running, only one of them can execute Python code at a time, which limits the amount of parallelism that can be achieved.
To defeat the GIL, there are two commonly taken paths:
- opt for multiprocessing instead of threads
- rewrite the core performance-critical code in a lower-level language such as C++ or Rust
Today, let’s talk about the second approach. With excellent next-generation binding libraries such as `pybind11` and `pyo3`, it has become a lot simpler to use C++/Rust code in a Python project. However, porting existing application code to C++/Rust does not happen overnight. In the beginning, it is usually just a few performance-critical functions that get ported. In such cases, it is common to see a mix of Python and C++/Rust code in the same project: the threading architecture and parallelism logic may still live in Python, while the performance-critical code is in C++/Rust.
I’ve personally dealt with such systems where the GIL became a major performance bottleneck because of a poor understanding of how it worked. As a result, I’m sharing my findings here.
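As a minimal sketch of the kind of setup involved (the module and function names here are hypothetical, not from the post), a pybind11 binding can release the GIL while a pure-C++ kernel runs, so Python threads calling it can actually run in parallel:

```cpp
#include <cstddef>
#include <pybind11/pybind11.h>

namespace py = pybind11;

// A pure-C++ kernel that never touches Python objects, so it is safe to run
// without holding the GIL.
double heavy_sum(std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) total += 0.5 * static_cast<double>(i);
    return total;
}

PYBIND11_MODULE(fastmod, m) {
    // call_guard releases the GIL on entry and re-acquires it on return, so
    // Python threads calling heavy_sum() are not serialized by the interpreter.
    m.def("heavy_sum", &heavy_sum, py::call_guard<py::gil_scoped_release>());
}
```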
Read more
Stack optimization for small sized objects in modern C++
I came across a popular technique: a handle that stores small objects inside the handle itself and larger ones on the heap. Using modern C++, this can be implemented quite nicely at compile time. Here is a simple example:
```cpp
#include <array>          // for std::array in the example below
#include <type_traits>    // for std::conditional

// max bytes to store on the stack
constexpr int on_stack_max = 20;

template<typename T>
struct Scoped {     // store a T in Scoped
    // ...
    T obj;
};

template<typename T>
struct OnHeap {     // store a T on the free store
    // ...
    T* objp;
};

template<typename T>
using Handle = typename std::conditional<(sizeof(T) <= on_stack_max),
                                         Scoped<T>,   // first alternative
                                         OnHeap<T>    // second alternative
                                        >::type;

void f() {
    Handle<double> v1;                    // the double goes on the stack
    Handle<std::array<double, 200>> v2;   // the array goes on the free store
}
```

Let’s break this down:
- `constexpr int on_stack_max = 20;`: defines a constant expression for the maximum number of bytes that can be stored on the stack.
- `template<typename T> struct Scoped { T obj; };`: a template struct that stores an object of type `T` inline.
- `template<typename T> struct OnHeap { T* objp; };`: a template struct that stores a pointer to an object of type `T` allocated on the free store.
- `template<typename T> using Handle = typename std::conditional<(sizeof(T) <= on_stack_max), Scoped<T>, OnHeap<T>>::type;`: a template alias that uses `std::conditional` to choose between `Scoped<T>` and `OnHeap<T>`. If `sizeof(T)` is less than or equal to `on_stack_max`, it uses `Scoped<T>`; otherwise it uses `OnHeap<T>`.
- `void f() { Handle<double> v1; Handle<std::array<double, 200>> v2; }`: demonstrates how to use the `Handle` alias. `v1` stores a `double` on the stack, because `sizeof(double)` is less than `on_stack_max`; `v2` stores a `std::array<double, 200>` on the free store, because its size is greater than `on_stack_max`.
Of course, this assumes that `T` can be copied and moved around, and that it has a finite size. If `T` is not copyable or movable, you will need to adjust the implementation accordingly.
This shows how powerful modern C++ can be in terms of compile-time programming. It allows you to make decisions at compile time based on the properties of types, which can lead to more efficient and flexible code.
Read more
Dive into Python asyncio - part 2
In the second part of this series on deep diving into `asyncio` and `async`/`await` in Python, we will be looking at the following topics (a short task-group sketch follows the list):
- tasks, task groups, task cancellation
- async queues
- async locks and semaphores
- async context managers
- async error handling
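As a small taste of the first topic, here is a hypothetical `asyncio.TaskGroup` sketch (requires Python 3.11+), not taken from the post itself:

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    await asyncio.sleep(delay)   # stand-in for real I/O
    return f"{name} done"

async def main() -> None:
    # TaskGroup starts tasks and waits for all of them; if one of them
    # raises, the remaining tasks are cancelled automatically.
    async with asyncio.TaskGroup() as tg:
        t1 = tg.create_task(fetch("a", 0.1))
        t2 = tg.create_task(fetch("b", 0.2))
    print(t1.result(), t2.result())

asyncio.run(main())
```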
Dive into Python asyncio - part 1
For as long as I have worked in Python land, I never had to touch the async part of the language. I know that the `asyncio` library has gotten a lot of love in the past few years. Recently I came across an opportunity to do a lot of IO-bound, non-CPU-bound work in Python, so I decided to take a deep dive into the `asyncio` library and see what it has to offer.
In part 1 of this series (I originally just wanted to write one post and realized the scope is way too big), we’ll cover:
- How async code interfaces with synchronous code in Python
- How to convert synchronous code to asynchronous code, including how to prevent blocking the event loop via a custom `ThreadPoolExecutor`
- How to use `asyncio` to run multiple tasks concurrently
Basic example: async hello world
```python
import asyncio

async def hello_world():
    await asyncio.sleep(1)   # must be awaited; a bare asyncio.sleep(1) does nothing
    print("Hello world")

asyncio.run(hello_world())
# >>> Hello world
```

Running two async functions in parallel
```python
import asyncio

async def foo():
    while True:
        await asyncio.sleep(1)
        print("foo")

async def bar():
    while True:
        await asyncio.sleep(1)
        print("bar")

async def main():
    # asyncio.run() expects a coroutine, so wrap gather() in a small main()
    await asyncio.gather(foo(), bar())

asyncio.run(main())
```

What if I have existing synchronous methods?
We can wrap a synchronous function in an async function; an example implementation would be a decorator (I love decorators, btw):
```python
import asyncio
from concurrent.futures import Executor
from functools import partial, wraps
from typing import Callable, Optional

def async_wrap(
    loop: Optional[asyncio.BaseEventLoop] = None,
    executor: Optional[Executor] = None,
) -> Callable:
    def _async_wrap(func: Callable) -> Callable:
        @wraps(func)
        async def run(*args, loop=loop, executor=executor, **kwargs):
            if loop is None:
                loop = asyncio.get_event_loop()
            pfunc = partial(func, *args, **kwargs)
            return await loop.run_in_executor(executor, pfunc)
        return run
    return _async_wrap
```

The above decorator is a higher-order decorator (it takes arguments and then generates another decorator). Example usage is the following:
```python
import asyncio
import time

@async_wrap()
def foo():
    while True:
        time.sleep(1)          # blocking call, but it runs in an executor thread
        print("foo from sync")

async def bar():
    while True:
        await asyncio.sleep(1)
        print("bar from async")

async def main():
    await asyncio.gather(foo(), bar())

asyncio.run(main())
```

Read more
What is copiable?
What is copiable anyway?
Python is garbage collected and has a reference counting system. This means that when you create an object, it is stored in memory and a reference to it is stored in a variable. When you assign a variable to another variable, the reference count for the object is incremented. When you delete a variable, the reference count is decremented. When the reference count reaches zero, the object is deleted from memory.
This is a very simple explanation of how Python works. There are many more details that I will not go into here. The point is that when you assign a variable to another variable, you are not creating a copy of the object. You are creating a new reference to the same object. This is important to understand because it can lead to some unexpected behavior.
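A tiny illustration of that point (the variable names are arbitrary):

```python
import copy

a = [1, 2, 3]
b = a                # no copy: b is just another name for the same list object
b.append(4)

print(a)             # [1, 2, 3, 4] -- the change is visible through both names
print(a is b)        # True: both variables reference the same object

c = copy.copy(a)     # shallow copy: a new outer list (inner objects still shared)
c.append(5)
print(a)             # [1, 2, 3, 4] -- unaffected this time
print(a is c)        # False
```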
Questions I had:
- What happens when you assign a variable to another variable?
- What happens when you return a complex object (e.g., an instance of a class) as part of a tuple from a function?
- What happens when you spin up a subprocess, call a method you defined in one class, and give it an object as an argument?