Parse, Don't Validate: Type-Driven Design for ML Pipelines in Python
It’s 2 AM. Your XGBoost training job has been churning through 200GB of data on an 8-GPU cluster for the last four hours. You get paged. The job crashed with a cryptic C++ stack trace from somewhere deep in XGBoost internals. After 45 minutes of squinting at logs, you find the culprit: someone wrote "binary_logistic" instead of "binary:logistic" in the training config YAML. A single misplaced underscore, and four hours of GPU time went up in smoke.
Your first instinct is to add a validation check. Maybe an if objective not in VALID_OBJECTIVES check somewhere early in the pipeline. But here’s the thing – that’s playing whack-a-mole. There are hundreds of config keys, each with its own constraints. The real fix is to make it structurally impossible for a bad config to reach your training code in the first place.
This is the core idea behind “Parse, Don’t Validate” – a philosophy from the typed functional programming world that translates beautifully to Python ML pipelines.
The Problem: Death by Dictionary
If you’ve worked on any ML training pipeline of moderate complexity, this pattern will look painfully familiar:
import yaml
import xgboost as xgb

def load_config(path: str) -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

def setup_data(config: dict):
    source = config.get("data", {}).get("source", "local")
    if source == "s3":
        bucket = config["data"]["s3_bucket"]  # KeyError if missing
        prefix = config["data"].get("s3_prefix", "")
        # ... download from S3
    elif source == "local":
        path = config["data"]["local_path"]  # KeyError if missing
        # ... load from disk
    else:
        raise ValueError(f"Unknown source: {source}")
    columns = config["data"].get("feature_columns")
    if columns is None:
        raise ValueError("feature_columns is required")
    label = config["data"].get("label_column", "target")
    return load_dataframe(path, columns, label)

def train_model(config: dict, data):
    params = {
        "objective": config["model"]["objective"],
        "max_depth": config["model"].get("max_depth", 6),
        "learning_rate": config["model"].get("learning_rate", 0.3),
        "n_estimators": config["model"].get("n_estimators", 100),
        "subsample": config["model"].get("subsample", 1.0),
        "colsample_bytree": config["model"].get("colsample_bytree", 1.0),
        "eval_metric": config["model"].get("eval_metric", "logloss"),
        # ... 20 more parameters
    }
    # Somewhere deep in here, XGBoost will throw a C++ error
    # if objective is "binary_logistic" instead of "binary:logistic"
    model = xgb.XGBClassifier(**params)
    model.fit(data.X, data.y)
    return model

def main():
    config = load_config("config.yaml")
    data = setup_data(config)
    model = train_model(config, data)
    save_model(model, config["output"]["path"])
The dict is doing all the heavy lifting. Every function receives the same blob and picks out the keys it needs with .get() calls. There is no single place that declares what the config looks like. Validation is scattered across files and functions, and some of it doesn’t happen at all – it just crashes eventually.
The error you get at 2 AM looks something like:
XGBoostError: [10:43:21] /workspace/src/learner.cc:1115:
Unknown objective function: `binary_logistic`
Good luck tracing that back to a YAML key typo from your CI/CD pipeline config. The dict passed every check you wrote, because you didn’t write a check for that particular key’s value. You can’t write checks for everything – and even if you could, the checks themselves become a maintenance burden that drifts out of sync with reality.
Parse, Don’t Validate
The phrase “Parse, Don’t Validate” comes from Alexis King’s excellent 2019 blog post written in the context of Haskell. The core idea is deceptively simple:
- Validation checks that data satisfies some property, then passes through the original untyped data. The type system forgets you ever checked.
- Parsing checks the same properties, but produces a new, typed value as output. The type system remembers.
- The difference: after parsing, invalid states are unrepresentable. You cannot construct a TrainingConfig with an invalid objective, because the parser won’t let you.
This is not a Haskell thing. Between Pydantic, dataclasses, and typing.Literal, Python gives you the same superpower. I’ve written before about building a strong type system with Python typehints – this post takes those ideas and applies them to a concrete ML problem.
graph LR
subgraph Validate
direction LR
D1["raw dict"] --> V["validate()"] --> D2["raw dict"]
V --> B1["bool: ok?"]
end
subgraph Parse
direction LR
D3["raw dict"] --> P["parse()"] --> T["TypedConfig"]
P --> E["ParseError"]
end
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef output fill:none,stroke:#34d399,stroke-width:2px
classDef negative fill:none,stroke:#f87171,stroke-width:2px
classDef highlight fill:none,stroke:#f472b6,stroke-width:2px
class D1,D3 input
class D2 input
class T output
class B1 highlight
class E negative
With validation, you get a dict in and a dict out. The function returns a boolean (or throws), but the data itself is unchanged. Every downstream consumer still works with the raw dict and has to trust that someone, somewhere, already validated the keys it cares about.
With parsing, you get a dict in and a typed object out. The function either succeeds and hands you a TrainingConfig that is structurally guaranteed to be valid, or it fails with a clear error explaining exactly what’s wrong. There is no in-between state.
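To make the contrast concrete, here is a minimal standard-library sketch of the two styles side by side. MiniConfig, ParseError, and the two-entry VALID_OBJECTIVES set are hypothetical names chosen for illustration, not part of the pipeline above. Note that a plain dataclass only gives the guarantee by convention: it holds as long as parse() is the only place that constructs MiniConfig.

```python
from dataclasses import dataclass

# Deliberately tiny set, for illustration only.
VALID_OBJECTIVES = frozenset({"binary:logistic", "reg:squarederror"})


def validate(raw: dict) -> bool:
    """Validation: answers yes/no, and the caller keeps the raw dict."""
    return raw.get("objective") in VALID_OBJECTIVES


@dataclass(frozen=True)
class MiniConfig:
    """Hypothetical typed result; holding one means parsing succeeded."""
    objective: str


class ParseError(ValueError):
    """Raised when the raw input cannot be turned into a MiniConfig."""


def parse(raw: dict) -> MiniConfig:
    """Parsing: the same check, but the result is a new typed value."""
    objective = raw.get("objective")
    if objective not in VALID_OBJECTIVES:
        raise ParseError(
            f"Invalid objective {objective!r}; "
            f"expected one of {sorted(VALID_OBJECTIVES)}"
        )
    return MiniConfig(objective=objective)


raw = {"objective": "binary:logistic"}
assert validate(raw)   # raw is still an untyped dict afterwards
config = parse(raw)    # config is a MiniConfig from here on
```

A function that accepts MiniConfig instead of dict no longer has to ask whether someone upstream remembered to call validate().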
Boundary 1: Parsing Training Parameters
The Pydantic Approach
Pydantic is the go-to choice for parsing in modern Python. Here’s what our XGBoost config looks like as a Pydantic model:
from pydantic import BaseModel, Field
from typing import Literal, Annotated

class XGBoostParams(BaseModel):
    model_config = {"extra": "forbid"}

    objective: Literal[
        "binary:logistic",
        "binary:logitraw",
        "multi:softmax",
        "multi:softprob",
        "reg:squarederror",
        "reg:logistic",
        "reg:pseudohubererror",
    ]
    max_depth: Annotated[int, Field(ge=1, le=20)] = 6
    learning_rate: Annotated[float, Field(gt=0, le=1)] = 0.3
    n_estimators: Annotated[int, Field(ge=1, le=50_000)] = 100
    subsample: Annotated[float, Field(gt=0, le=1)] = 1.0
    colsample_bytree: Annotated[float, Field(gt=0, le=1)] = 1.0
    min_child_weight: Annotated[float, Field(ge=0)] = 1.0
    gamma: Annotated[float, Field(ge=0)] = 0.0
    reg_alpha: Annotated[float, Field(ge=0)] = 0.0
    reg_lambda: Annotated[float, Field(ge=0)] = 1.0
    eval_metric: Literal[
        "logloss", "mlogloss", "auc", "aucpr", "rmse", "mae", "error"
    ] = "logloss"
    early_stopping_rounds: Annotated[int, Field(ge=1)] | None = None
    seed: int = 42
Now try passing "binary_logistic":
ValidationError: 1 validation error for XGBoostParams
objective
Input should be 'binary:logistic', 'binary:logitraw', 'multi:softmax',
'multi:softprob', 'reg:squarederror', 'reg:logistic' or
'reg:pseudohubererror' [type=literal_error]
Compare that to the C++ stack trace from XGBoost internals. The Pydantic error tells you exactly which field is wrong, what the valid values are, and what you passed. This fires immediately at config load time, before any data is touched or any GPU is allocated.
The extra = "forbid" setting is doing critical work here: if someone types leraning_rate instead of learning_rate, Pydantic will reject it instead of silently ignoring the typo and using the default value. This catches an entire class of bugs that validation-style approaches miss entirely.
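The typo case is easy to try in isolation. The sketch below assumes Pydantic v2 and uses MiniParams, a trimmed-down hypothetical stand-in for XGBoostParams; parse_errors is likewise an illustrative helper, not a Pydantic API.

```python
from typing import Annotated, Literal

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class MiniParams(BaseModel):
    """Trimmed-down, hypothetical stand-in for XGBoostParams."""
    model_config = ConfigDict(extra="forbid")

    objective: Literal["binary:logistic", "reg:squarederror"]
    learning_rate: Annotated[float, Field(gt=0, le=1)] = 0.3


def parse_errors(raw: dict) -> list[tuple[str, str]]:
    """Return (field, error type) pairs, or an empty list if raw parses."""
    try:
        MiniParams.model_validate(raw)
    except ValidationError as exc:
        return [(".".join(map(str, e["loc"])), e["type"]) for e in exc.errors()]
    return []


# The misspelled key is rejected instead of silently falling back to 0.3.
typo = {"objective": "binary:logistic", "leraning_rate": 0.01}
print(parse_errors(typo))
```

Without extra="forbid", the same input would parse successfully and train with the default learning rate, which is exactly the silent failure the setting is there to prevent.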
The Dataclass Alternative
Not every project needs Pydantic’s full power. If you want something lighter, dataclasses with __post_init__ works well for simpler cases:
from dataclasses import dataclass

VALID_OBJECTIVES = frozenset({
    "binary:logistic", "binary:logitraw",
    "multi:softmax", "multi:softprob",
    "reg:squarederror", "reg:logistic",
})

@dataclass(frozen=True)
class XGBoostParams:
    objective: str
    max_depth: int = 6
    learning_rate: float = 0.3
    n_estimators: int = 100
    subsample: float = 1.0

    def __post_init__(self):
        if self.objective not in VALID_OBJECTIVES:
            raise ValueError(
                f"Invalid objective '{self.objective}'. "
                f"Valid options: {sorted(VALID_OBJECTIVES)}"
            )
        if not (0 < self.learning_rate <= 1):
            raise ValueError(
                f"learning_rate must be in (0, 1], got {self.learning_rate}"
            )
        if not (1 <= self.max_depth <= 20):
            raise ValueError(
                f"max_depth must be in [1, 20], got {self.max_depth}"
            )
The dataclass approach is more verbose for validation logic, but has zero dependencies and is easier to reason about for smaller models. frozen=True gives you immutability for free – once parsed, the config can’t be accidentally mutated downstream.
Use Pydantic when you have complex nested configs, discriminated unions, or want schema generation. Use dataclasses when you have flat configs and want minimal dependencies. In both cases, the key insight is the same: the type signature is the documentation. Anyone reading def train(params: XGBoostParams) knows exactly what shape of data this function expects without reading a single docstring.
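If you go the dataclass route, you still need a small function at the boundary that turns the raw dict into the typed object. The following is one possible sketch, using a hypothetical cut-down MiniParams and a hypothetical parse_params helper. It rejects undeclared keys (a rough equivalent of extra="forbid") and shows what frozen=True buys you afterwards.

```python
from dataclasses import FrozenInstanceError, dataclass, fields


@dataclass(frozen=True)
class MiniParams:
    """Hypothetical cut-down params class, not the full XGBoostParams."""
    objective: str
    max_depth: int = 6

    def __post_init__(self):
        if not (1 <= self.max_depth <= 20):
            raise ValueError(f"max_depth must be in [1, 20], got {self.max_depth}")


def parse_params(raw: dict) -> MiniParams:
    """Build MiniParams from a raw dict, rejecting keys the class does not declare."""
    known = {f.name for f in fields(MiniParams)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"Unknown config keys: {unknown}")
    return MiniParams(**raw)


params = parse_params({"objective": "binary:logistic", "max_depth": 8})

try:
    params.max_depth = 50  # frozen=True blocks mutation after parsing
except FrozenInstanceError:
    pass  # the parsed value is unchanged
```

One caveat with this approach: dataclasses do not check or coerce field types, so a YAML value like max_depth: "8" would need its own handling. That is part of what Pydantic does for you.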
Boundary 2: Parsing Dataset Sources
Training pipelines typically support multiple data sources. A common pattern in YAML configs:
# S3 source
data:
  source_type: s3
  bucket: my-ml-datasets
  prefix: training/v3/
  feature_columns: [age, income, score]
  label_column: target

# Local source
data:
  source_type: local
  path: /data/training/dataset.parquet
  feature_columns: [age, income, score]
  label_column: target
This is a natural fit for Pydantic’s discriminated unions:
import re
from pathlib import Path
from pydantic import BaseModel, Field, field_validator
from typing import Literal, Annotated

class S3DatasetConfig(BaseModel):
    source_type: Literal["s3"]
    bucket: str
    prefix: str = ""
    feature_columns: list[str] = Field(min_length=1)
    label_column: str = "target"

    @field_validator("bucket")
    @classmethod
    def validate_bucket_name(cls, v: str) -> str:
        if not re.match(r"^[a-z0-9][a-z0-9.\-]{1,61}[a-z0-9]$", v):
            raise ValueError(
                f"Invalid S3 bucket name: '{v}'. "
                "Must be 3-63 chars, lowercase alphanumeric, hyphens, or dots."
            )
        return v

class LocalDatasetConfig(BaseModel):
    source_type: Literal["local"]
    path: Path
    feature_columns: list[str] = Field(min_length=1)
    label_column: str = "target"

    @field_validator("path")
    @classmethod
    def validate_path_exists(cls, v: Path) -> Path:
        if not v.exists():
            raise ValueError(f"Dataset path does not exist: {v}")
        return v

DatasetConfig = Annotated[
    S3DatasetConfig | LocalDatasetConfig,
    Field(discriminator="source_type"),
]
The source_type field acts as the discriminator – Pydantic inspects its value to decide which model to parse into. If source_type is "s3", you get an S3DatasetConfig with bucket validation. If it’s "local", you get a LocalDatasetConfig with path existence checks. If it’s anything else, you get a clear error.
Both branches validate feature_columns – requiring at least one column. This catches the “oops, I forgot to list any features” mistake before your pipeline silently trains on zero features (yes, XGBoost will happily do this and produce a model that predicts the prior for every input).
graph TD
Y["YAML config"] --> P["parse source_type"]
P -->|"s3"| S3["S3DatasetConfig<br/>bucket + prefix"]
P -->|"local"| LC["LocalDatasetConfig<br/>path validation"]
P -->|"other"| ERR["ParseError:<br/>invalid source_type"]
S3 --> V["ValidatedDataset<br/>columns + label"]
LC --> V
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef output fill:none,stroke:#34d399,stroke-width:2px
classDef negative fill:none,stroke:#f87171,stroke-width:2px
classDef highlight fill:none,stroke:#f472b6,stroke-width:2px
class Y input
class P highlight
class S3,LC result
class V output
class ERR negative
classDef result fill:none,stroke:#a78bfa,stroke-width:2px
After parsing, downstream code works with DatasetConfig – it doesn’t need to know or care whether the data came from S3 or local disk. The union type gives you exhaustive pattern matching if you need it, or you can just call common methods that both variants share.
Boundary 3: Composing the Training Pipeline
Now we have typed configs for model parameters and dataset sources. Let’s compose them into a full training pipeline. This is where dependency injection meets parsing – the factory takes only typed, validated inputs.
graph LR
XP["XGBoostParams"] --> TF["TrainerFactory"]
DC["DatasetConfig"] --> TF
IC["InfraConfig"] --> TF
TF --> TR["XGBoostTrainer"]
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef output fill:none,stroke:#34d399,stroke-width:2px
classDef highlight fill:none,stroke:#f472b6,stroke-width:2px
class XP,DC,IC input
class TF highlight
class TR output
The factory step is deliberately boring. In this sketch the XGBoostTrainer constructor plays the TrainerFactory role from the diagram: it takes typed configs and wires them together. No validation, no .get() calls, no if guards – just construction:
from pathlib import Path
from typing import Annotated

import xgboost as xgb
from pydantic import BaseModel, Field

class InfraConfig(BaseModel):
    model_config = {"extra": "forbid"}

    n_gpus: Annotated[int, Field(ge=0)] = 0
    n_workers: Annotated[int, Field(ge=1)] = 1
    output_dir: Path = Path("./output")
    checkpoint_every_n: Annotated[int, Field(ge=1)] = 10

class XGBoostTrainer:
    def __init__(
        self,
        params: XGBoostParams,
        dataset: DatasetConfig,
        infra: InfraConfig,
    ):
        self.params = params
        self.dataset = dataset
        self.infra = infra

    def run(self):
        data = self._load_data()
        model = self._train(data)
        self._save(model)

    def _load_data(self):
        match self.dataset:
            case S3DatasetConfig(bucket=bucket, prefix=prefix):
                return download_and_load(bucket, prefix, self.dataset.feature_columns)
            case LocalDatasetConfig(path=path):
                return load_parquet(path, self.dataset.feature_columns)

    def _train(self, data):
        # self.params is guaranteed valid -- just use it
        model = xgb.XGBClassifier(
            objective=self.params.objective,
            max_depth=self.params.max_depth,
            learning_rate=self.params.learning_rate,
            n_estimators=self.params.n_estimators,
            subsample=self.params.subsample,
            colsample_bytree=self.params.colsample_bytree,
            eval_metric=self.params.eval_metric,
            early_stopping_rounds=self.params.early_stopping_rounds,
            random_state=self.params.seed,
            n_jobs=self.infra.n_workers,
            device="cuda" if self.infra.n_gpus > 0 else "cpu",
        )
        model.fit(data.X, data.y)
        return model

    def _save(self, model):
        self.infra.output_dir.mkdir(parents=True, exist_ok=True)
        model.save_model(self.infra.output_dir / "model.json")
Notice how _train has zero defensive checks. It doesn’t verify that objective is valid, or that learning_rate is positive, or that max_depth is reasonable. It doesn’t need to – the type system guarantees these properties. The XGBoostParams object cannot exist in an invalid state.
Now the main() entrypoint ties it all together with a clean two-phase structure:
import sys
import yaml
from pydantic import ValidationError

class TrainingConfig(BaseModel):
    model_config = {"extra": "forbid"}

    model: XGBoostParams
    data: DatasetConfig
    infra: InfraConfig = InfraConfig()

def main():
    # === Phase 1: Parse (fast, fails clearly) ===
    try:
        # Use a context manager so the config file handle is always closed
        with open(sys.argv[1]) as f:
            raw = yaml.safe_load(f)
        config = TrainingConfig.model_validate(raw)
    except ValidationError as e:
        print(f"Config validation failed:\n{e}", file=sys.stderr)
        sys.exit(1)
    except (FileNotFoundError, yaml.YAMLError) as e:
        print(f"Could not load config: {e}", file=sys.stderr)
        sys.exit(1)

    # === Phase 2: Execute (expensive, safe) ===
    trainer = XGBoostTrainer(
        params=config.model,
        dataset=config.data,
        infra=config.infra,
    )
    trainer.run()
The two-phase pattern is the key architectural insight:
graph LR
subgraph Phase1["Phase 1: Parse"]
direction LR
Y2["YAML"] --> P2["Pydantic<br/>models"]
P2 --> TC["TrainingConfig"]
P2 --> FE["Fail Early<br/>(clear errors)"]
end
subgraph Phase2["Phase 2: Execute"]
direction LR
TC2["TrainingConfig"] --> TR2["Train"] --> SV["Save"]
end
TC --> TC2
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef output fill:none,stroke:#34d399,stroke-width:2px
classDef negative fill:none,stroke:#f87171,stroke-width:2px
classDef highlight fill:none,stroke:#f472b6,stroke-width:2px
classDef progress fill:none,stroke:#fbbf24,stroke-width:2px
class Y2 input
class P2 highlight
class TC,TC2 output
class FE negative
class TR2,SV progress
Phase 1 takes milliseconds and catches every structural error in your config. Phase 2 takes hours and costs real money on GPU clusters. By the time you enter Phase 2, every config value is typed, bounded, and guaranteed valid. The 2 AM page never happens.
Pushing Parse to CI/CD: The Compile Step Python Doesn’t Have
Statically typed languages have a built-in safety net: if your types are wrong, the compiler catches it before your code ever runs. Python has no compile step that enforces types, but we can simulate one using pytest and our Pydantic models.
The idea is simple: glob all your training config YAML files and parse them through the models in a test. Bad config = failed build, not failed training run.
from pathlib import Path

import pytest
import yaml

# TrainingConfig must be imported from the module that defines your config models.

CONFIG_DIR = Path("configs/training")

def discover_configs():
    return sorted(CONFIG_DIR.glob("**/*.yaml"))

@pytest.mark.parametrize(
    "config_path",
    discover_configs(),
    ids=lambda p: str(p.relative_to(CONFIG_DIR)),
)
def test_training_config_parses(config_path: Path):
    """Every training config YAML must parse into a valid TrainingConfig."""
    raw = yaml.safe_load(config_path.read_text())
    config = TrainingConfig.model_validate(raw)
    # Sanity: the fields the trainer depends on are populated
    assert config.model.objective is not None
    assert len(config.data.feature_columns) > 0
Now your CI pipeline catches the binary_logistic typo before it ever reaches a training cluster. The test output is clear:
FAILED configs/training/prod/classifier_v3.yaml -
ValidationError: objective - Input should be 'binary:logistic', ...
You can go further and add this as a pre-commit hook:
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: validate-training-configs
        name: Validate training configs
        entry: pytest tests/test_config_parse.py -x -q
        language: system
        files: 'configs/training/.*\.yaml$'
        pass_filenames: false
This means a bad training config cannot be committed unless someone deliberately bypasses the hook. The parse boundary has moved from “2 AM on a GPU cluster” all the way to “before the code leaves your laptop.”
Runtime Pre-flight Checks
CI catches structural errors – wrong types, missing keys, invalid values. But some things can only be checked at runtime: Does the S3 bucket exist and do we have read permissions? Are GPUs actually available? Is the dataset fresh enough?
These checks belong in a pre-flight phase that runs after parsing but before any expensive work:
from dataclasses import dataclass

@dataclass
class PreflightError:
    check: str
    message: str

def preflight_checks(config: TrainingConfig) -> list[PreflightError]:
    """Run all runtime pre-flight checks. Returns accumulated errors."""
    errors: list[PreflightError] = []

    # Check GPU availability
    if config.infra.n_gpus > 0:
        import torch
        available = torch.cuda.device_count()
        if available < config.infra.n_gpus:
            errors.append(PreflightError(
                check="gpu_availability",
                message=f"Requested {config.infra.n_gpus} GPUs, "
                        f"but only {available} available",
            ))

    # Check data source accessibility
    # (check_s3_access is a project-specific helper, not shown here)
    match config.data:
        case S3DatasetConfig(bucket=bucket, prefix=prefix):
            if not check_s3_access(bucket, prefix):
                errors.append(PreflightError(
                    check="s3_access",
                    message=f"Cannot read s3://{bucket}/{prefix}",
                ))
        case LocalDatasetConfig(path=path):
            if not path.exists():
                errors.append(PreflightError(
                    check="local_data",
                    message=f"Dataset not found: {path}",
                ))

    # Check output directory is writable
    try:
        config.infra.output_dir.mkdir(parents=True, exist_ok=True)
        test_file = config.infra.output_dir / ".preflight_check"
        test_file.touch()
        test_file.unlink()
    except OSError as e:
        errors.append(PreflightError(
            check="output_writable",
            message=f"Cannot write to {config.infra.output_dir}: {e}",
        ))

    return errors
The key detail here is error accumulation. Instead of failing on the first problem, we collect all failures and report them at once. If your S3 credentials are wrong and you requested too many GPUs, you want to know about both in a single pass – not fix one, re-run, wait 30 seconds, and discover the other.
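The accumulation idea can also be factored into a small runner that takes the checks as plain callables, which makes each check easy to unit test. This is a sketch under stated assumptions, not the pipeline's actual code: run_checks and the Check alias are hypothetical names, and the lambdas stand in for real probes such as S3 access or GPU counts.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PreflightError:
    check: str
    message: str


# A check returns an error message, or None when it passes.
Check = Callable[[], Optional[str]]


def run_checks(checks: dict[str, Check]) -> list[PreflightError]:
    """Run every check, collecting failures instead of stopping at the first."""
    errors: list[PreflightError] = []
    for name, check in checks.items():
        try:
            message = check()
        except Exception as exc:  # a crashing check is reported as a failure too
            message = f"check raised {type(exc).__name__}: {exc}"
        if message is not None:
            errors.append(PreflightError(check=name, message=message))
    return errors


# Hypothetical stand-ins for real probes.
errors = run_checks({
    "gpu_availability": lambda: "Requested 8 GPUs, but only 4 available",
    "s3_access": lambda: None,       # passes
    "output_writable": lambda: 1 / 0,  # simulates a check that crashes
})
```

Wrapping each check in its own try/except matters here: one probe that raises unexpectedly should not hide the results of the others.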
The updated main() becomes:
def main():
    # Phase 1: Parse
    try:
        with open(sys.argv[1]) as f:
            raw = yaml.safe_load(f)
        config = TrainingConfig.model_validate(raw)
    except ValidationError as e:
        print(f"Config validation failed:\n{e}", file=sys.stderr)
        sys.exit(1)
    except (FileNotFoundError, yaml.YAMLError) as e:
        print(f"Could not load config: {e}", file=sys.stderr)
        sys.exit(1)

    # Phase 1.5: Pre-flight
    errors = preflight_checks(config)
    if errors:
        print("Pre-flight checks failed:", file=sys.stderr)
        for err in errors:
            print(f"  [{err.check}] {err.message}", file=sys.stderr)
        sys.exit(1)

    # Phase 2: Execute
    trainer = XGBoostTrainer(
        params=config.model,
        dataset=config.data,
        infra=config.infra,
    )
    trainer.run()
Practical Notes
A few battle-tested observations from applying this pattern across multiple ML projects:
- Start at the edges, not the middle. CLI arguments, config files, API requests – these are your system boundaries. Parse them into typed objects on entry. Don’t try to retrofit types into the middle of an existing pipeline all at once.
- extra="forbid" is your best friend. It catches YAML key typos that would otherwise silently fall back to defaults. A typo like leraning_rate: 0.01 gets rejected instead of training at the default 0.3 for 12 hours.
- Use Annotated[float, Field(gt=0, le=1)] for range constraints. This is cleaner than __post_init__ validation and shows up in generated JSON schemas – useful if your configs are also served via an API.
- Parse feature schemas early. If your pipeline does column renaming or feature engineering, encode the expected column names in the config model. A missing column should fail at parse time, not inside model.fit() after two hours of preprocessing.
- Don’t parse in hot loops. Pydantic model construction has overhead. Parse once at the boundary, then pass the typed object everywhere. This is the natural outcome of boundary-focused design anyway.
- Compose models, don’t nest dicts. TrainingConfig contains XGBoostParams, DatasetConfig, and InfraConfig as typed fields. Each can be tested independently, reused across pipelines, and documented by its own schema.
- Use model_validate for external data, direct construction for internal data. When building a config in test code, just call XGBoostParams(objective="binary:logistic", ...) directly – Pydantic still validates. Reserve model_validate for untrusted inputs like YAML files and API payloads.
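Two of the notes above are easy to check for yourself: that range constraints surface in the generated JSON schema, and that direct construction is validated just like model_validate. The sketch below assumes Pydantic v2 and uses a hypothetical one-field MiniParams model.

```python
from typing import Annotated

from pydantic import BaseModel, Field, ValidationError


class MiniParams(BaseModel):
    """Hypothetical one-field model used only for this illustration."""
    learning_rate: Annotated[float, Field(gt=0, le=1)] = 0.3


# The gt/le constraints are carried into the JSON schema for this field.
schema = MiniParams.model_json_schema()
lr_schema = schema["properties"]["learning_rate"]
print(lr_schema)

# Direct construction goes through the same validation as model_validate.
try:
    MiniParams(learning_rate=5)
    constructed_invalid = True
except ValidationError:
    constructed_invalid = False
```

The schema entry is what an API client or a config editor would see, so the bounds are documented in the same place they are enforced.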
The Full Picture
Here’s the end-to-end flow from raw YAML to trained model, with every parse boundary marked:
graph TD
YAML["YAML config file"] --> PARSE["TrainingConfig.model_validate()"]
PARSE --> |"ParseError"| FAIL1["Exit with clear error message"]
PARSE --> |"Success"| TC["TrainingConfig"]
TC --> XP["XGBoostParams<br/>(validated)"]
TC --> DC["DatasetConfig<br/>(S3 or Local)"]
TC --> IC["InfraConfig<br/>(validated)"]
XP --> PF["Pre-flight Checks"]
DC --> PF
IC --> PF
PF --> |"Errors"| FAIL2["Exit with all failures"]
PF --> |"Clear"| TRAIN["XGBoostTrainer.run()"]
TRAIN --> DATA["Load Data"]
DATA --> FIT["model.fit()"]
FIT --> SAVE["Save Model"]
classDef input fill:none,stroke:#60a5fa,stroke-width:2px
classDef output fill:none,stroke:#34d399,stroke-width:2px
classDef negative fill:none,stroke:#f87171,stroke-width:2px
classDef highlight fill:none,stroke:#f472b6,stroke-width:2px
classDef progress fill:none,stroke:#fbbf24,stroke-width:2px
classDef result fill:none,stroke:#a78bfa,stroke-width:2px
class YAML input
class PARSE highlight
class FAIL1,FAIL2 negative
class TC result
class XP,DC,IC result
class PF highlight
class TRAIN,DATA,FIT progress
class SAVE output
The raw YAML enters the system exactly once at the top. By the time we reach the expensive operations – downloading data, allocating GPUs, training models – every config value has been parsed, typed, range-checked, and pre-flight verified. The dict is gone. The training code works exclusively with typed objects that cannot represent invalid states.
Back to our 2 AM scenario. With this architecture in place, the developer who writes binary_logistic gets stopped at one of three checkpoints:
- Pre-commit hook rejects the config before it’s committed
- CI test catches it if the hook was bypassed
- Parse phase catches it in the first 100ms of the training job, before any GPU is touched
You sleep through the night. The config typo becomes a Slack message you see over coffee in the morning: “CI failed on your config change, here’s what to fix.” That’s the power of pushing validation to the boundary and letting the type system carry the proof forward.