Akhilesh Warty

Training an ML model is one task; managing the infrastructure it runs on is another. A 10-epoch MNIST classifier can run locally on most GPUs, but that approach doesn't scale to a 200-epoch object detector. The alternative is cloud infrastructure that can handle the model's training requests. To that end, I implemented an AWS cloud architecture that takes care of these needs.

The system I architected is built around five key needs:

Configuration & Fingerprinting
Instance Provisioning
Experiment Management
Cloud Storage
Containerization & Orchestration

Cloud Architecture Overview

Cloud training for ML can run on two types of instances:

Dedicated Instance: An on-demand instance that can be provisioned without worrying about pre-emption by AWS.
Spot Instance: An instance that runs on spare EC2 capacity at a discount; it can be requisitioned for training but can also be pre-empted at short notice.

The infrastructure I implemented in the training framework handles both cases. The key was to use Terraform as the infrastructure-as-code (IaC) tool for provisioning the instances.

For dedicated instances, the framework sends a provisioning request to AWS and uses the resulting Amazon EC2 instance for uninterrupted training of the model. For spot instances, the framework requests spot capacity on EC2 and triggers the training of the ML model once an instance is granted.

Spot instances add complexity: the status of the ML experiment needs to be tracked and stored somewhere so that training can be triggered again once another instance is provisioned. To do this I implemented a DynamoDB ledger for the framework. I chose DynamoDB for two main reasons:

Atomic Checks: DynamoDB supports conditional writes that are evaluated atomically, so checking and updating an experiment's status happens in a single operation.
Data Formatting: DynamoDB is a NoSQL key-value store that scales to large amounts of data and does not force a fixed schema on each record.

On its own, however, the ledger only tracks state; the training pipeline still needs cloud storage for the data itself. I implemented dedicated AWS S3 buckets that store four aspects of the framework:

Config Files: Configuration files for the training framework that set the hyperparameters for the models.
Datasets: Dataset files used to train the models.
Checkpoints: Checkpoints that save the state of the training cycle.
Artifacts: Training metrics, weights, and metrics history.

Cloud Training Architecture

Terraform: IaC provisioning
EC2 Spot Instance: g5.xlarge GPU
S3 Dataset Bucket: VOC data on startup
Docker Container: TF 2.17 + CUDA
DynamoDB Ledger: Experiment state
Training Loop: Per-epoch checkpoint
S3 Checkpoint: Upload after each epoch
SIGTERM Handler: 2-min preemption warning
New Spot Instance: Resumes from S3

Each EC2 instance boots from a machine image and then starts a Docker container that bundles all the dependencies, so training runs in an isolated environment regardless of the machine. The Docker container also has access to the GPU, so the model's computation graph executes on the GPU instead of leaning heavily on the CPU, which would slow down the training cycle.

The training loop is designed to stop when there is an error or interruption, update the DynamoDB ledger, and keep the records up to date through the transition. This allows the experiment to be resumed from its last recorded checkpoint once Terraform provisions a replacement instance.

Spot Instance Economics

A g5.xlarge (NVIDIA A10G GPU, 24 GB VRAM) costs roughly $1.30/hr on-demand on AWS. The same instance type on spot pricing runs around $0.16–0.22/hr depending on availability zone and time of day, which results in savings of roughly 85%.
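To make the arithmetic behind that percentage explicit, here is a small sketch using only the prices quoted above. The function name spot_savings is made up for illustration.

```python
def spot_savings(on_demand_hourly: float, spot_hourly: float) -> float:
    """Fraction saved by paying the spot rate instead of the on-demand rate."""
    return 1.0 - spot_hourly / on_demand_hourly


ON_DEMAND = 1.30  # $/hr, the on-demand figure quoted above

for spot in (0.16, 0.22):  # the two ends of the quoted spot range
    print(f"${spot:.2f}/hr spot -> {spot_savings(ON_DEMAND, spot):.0%} cheaper")
# prints:
# $0.16/hr spot -> 88% cheaper
# $0.22/hr spot -> 83% cheaper
```

The two ends of the range bracket the "roughly 85%" figure.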

Spot Interruption Rates

g5.xlarge instances in us-east-1 have a historical interruption frequency of roughly 5–15% per hour. In practice, most training runs complete multiple epochs between interruptions, making checkpointing per epoch sufficient to avoid significant rework.

Graceful Preemption Handling

When the SIGTERM arrives, the training process has approximately 120 seconds to respond. The GracefulShutdownHandler catches the signal and triggers an ordered shutdown:

The current training step completes normally
The current epoch checkpoint is saved to disk
The checkpoint is immediately uploaded to the experiment's S3 path
The DynamoDB ledger entry is marked failed with the S3 checkpoint path recorded
The process exits cleanly
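The ordered shutdown above could look roughly like the following. This is a minimal sketch, not the framework's actual GracefulShutdownHandler: the class body, the run_training loop, and the four callbacks (train_step, save_checkpoint, upload_checkpoint, mark_failed) are all hypothetical stand-ins. The one design point it tries to show is that the signal handler only sets a flag, and the training loop does the real work between steps.

```python
import signal
import threading


class GracefulShutdownHandler:
    """Turns SIGTERM into a flag that the training loop polls between steps."""

    def __init__(self):
        self._stop = threading.Event()

    def install(self):
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, self._on_signal)

    def _on_signal(self, signum, frame):
        # Keep the handler trivial; the loop performs the actual shutdown.
        self._stop.set()

    @property
    def should_stop(self):
        return self._stop.is_set()


def run_training(handler, steps_per_epoch, epochs,
                 train_step, save_checkpoint, upload_checkpoint, mark_failed):
    """Hypothetical loop; the callbacks stand in for real framework code."""
    for epoch in range(epochs):
        for step in range(steps_per_epoch):
            train_step(epoch, step)                # 1. current step completes
            if handler.should_stop:
                local_path = save_checkpoint(epoch)         # 2. save to disk
                s3_path = upload_checkpoint(local_path)     # 3. upload to S3
                mark_failed(s3_path)                        # 4. ledger -> failed
                return "preempted"                          # 5. exit cleanly
        # Normal per-epoch checkpoint when no signal has arrived.
        upload_checkpoint(save_checkpoint(epoch))
    return "completed"
```

In a real entrypoint, handler.install() would be called once at startup, before the loop begins.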

The 'Failed' State Is Intentional

Marking the experiment as failed rather than some intermediate "paused" state is deliberate. The DynamoDB ledger's state machine treats failed as a recoverable state: any new spot instance that claims the experiment will find the checkpoint_s3_path pointer, download the checkpoint, and resume training automatically. The ledger's conditional writes ensure only one instance can claim a given experiment at a time.

DynamoDB Experiment Ledger

The core concurrency problem with distributed spot training is preventing two instances from running the same experiment simultaneously. DynamoDB's conditional expressions solve this cleanly: claim_experiment() uses a ConditionExpression that only succeeds if the current status is pending or failed — if two instances race to claim the same experiment, exactly one wins and the other gets a ConditionalCheckFailedException.

src/infrastructure/dynamodb_ledger.py

from datetime import datetime, timezone

from botocore.exceptions import ClientError


# Method of the ledger class; self._table is assumed to be a boto3 DynamoDB Table resource.
def claim_experiment(self, experiment_id, fingerprint, timestamp, instance_id):
    """Atomically claim an experiment; return True if this instance won the claim."""
    try:
        self._table.update_item(
            Key={'experiment_id': experiment_id, 'fingerprint': fingerprint},
            UpdateExpression=(
                "SET #s = :running, claimed = :now, ec2_instance = :instance_id, "
                "run_timestamp = :timestamp REMOVE failure_reason"
            ),
            # Atomic guard: only succeeds if status is pending or failed
            ConditionExpression="#s IN (:pending, :failed)",
            # 'status' is a DynamoDB reserved word, so it is aliased as #s
            ExpressionAttributeNames={'#s': 'status'},
            ExpressionAttributeValues={
                ':running':     'running',
                ':pending':     'pending',
                ':failed':      'failed',
                ':now':         datetime.now(timezone.utc).isoformat(),
                ':instance_id': instance_id,
                ':timestamp':   timestamp,
            }
        )
        return True  # This instance claimed the experiment
    except ClientError as err:
        if err.response['Error']['Code'] == "ConditionalCheckFailedException":
            return False  # Another instance claimed it first
        raise  # Any other error (throttling, permissions, ...) is a real failure

The full state machine transitions look like this:

Experiment Ledger State Machine

pending: register_experiment()
running: claim_experiment()
success: mark_success()
failed: mark_failure()
pending: reset_failed()
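The transitions can also be written down as a small lookup table. This is only an in-memory sketch of the rules, not the ledger itself. The source states for claim_experiment() come from the ConditionExpression shown earlier; the source states for mark_success(), mark_failure(), and reset_failed() are assumptions inferred from their names, and apply_transition is a hypothetical helper.

```python
# action -> (states the action may start from, state it moves to)
TRANSITIONS = {
    "register_experiment": ({None}, "pending"),           # None = not yet registered
    "claim_experiment": ({"pending", "failed"}, "running"),
    "mark_success": ({"running"}, "success"),             # assumed source state
    "mark_failure": ({"running"}, "failed"),              # assumed source state
    "reset_failed": ({"failed"}, "pending"),              # assumed source state
}


def apply_transition(current, action):
    """Return the new status, or raise if the action is not allowed from `current`."""
    allowed_from, target = TRANSITIONS[action]
    if current not in allowed_from:
        raise ValueError(f"{action} is not allowed from status {current!r}")
    return target


# A preempted run followed by a resume on a new instance:
status = apply_transition(None, "register_experiment")   # pending
status = apply_transition(status, "claim_experiment")    # running
status = apply_transition(status, "mark_failure")        # failed (preempted)
status = apply_transition(status, "claim_experiment")    # running again
status = apply_transition(status, "mark_success")        # success
```

In the real ledger the "allowed from" check is what the DynamoDB condition enforces, so an illegal transition surfaces as a ConditionalCheckFailedException rather than a ValueError.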

Per-epoch, the training loop calls update_checkpoint_pointer() to record the latest S3 checkpoint path. When a new instance resumes, it reads this pointer and downloads the checkpoint before starting the training loop.

S3 Checkpoint Sync

Checkpoints are stored in a structured S3 path that includes the experiment ID and its configuration fingerprint:

s3://ml-checkpoints/
└── exp002_a1b2c3d4/
    ├── epoch_045/
    │   ├── model.weights.h5
    │   └── optimizer.pkl
    └── epoch_046/
        ├── model.weights.h5
        └── optimizer.pkl

The fingerprint (a1b2c3d4) is a SHA-256 hash of the architecture-defining config parameters, truncated to a short hex prefix for readability in directory and bucket names. A training run that resumes from a checkpoint first validates that the fingerprint matches before loading weights. This prevents silent failures when, for example, the number of classes or anchor configuration was changed between runs.
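As an illustration of the path layout and the resume check, here is a minimal sketch. The helper names (checkpoint_prefix, fingerprint_from_path, validate_resume) and the FingerprintMismatchError class are hypothetical, and reading the fingerprint back out of the path is an assumption about how the comparison could be done; the bucket and directory names are the ones from the tree above.

```python
from urllib.parse import urlparse


class FingerprintMismatchError(RuntimeError):
    """Raised when a checkpoint was written under a different config fingerprint."""


def checkpoint_prefix(bucket: str, experiment_id: str, fingerprint: str, epoch: int) -> str:
    """Build the S3 prefix for one epoch, e.g. .../exp002_a1b2c3d4/epoch_046/."""
    return f"s3://{bucket}/{experiment_id}_{fingerprint}/epoch_{epoch:03d}/"


def fingerprint_from_path(checkpoint_path: str) -> str:
    """Extract the fingerprint from the run directory name in the S3 path."""
    run_dir = urlparse(checkpoint_path).path.strip("/").split("/")[0]
    return run_dir.rsplit("_", 1)[-1]


def validate_resume(checkpoint_path: str, current_fingerprint: str) -> None:
    """Refuse to resume when the checkpoint belongs to a different architecture."""
    stored = fingerprint_from_path(checkpoint_path)
    if stored != current_fingerprint:
        raise FingerprintMismatchError(
            f"checkpoint fingerprint {stored} != current config {current_fingerprint}"
        )


path = checkpoint_prefix("ml-checkpoints", "exp002", "a1b2c3d4", 46)
validate_resume(path, "a1b2c3d4")  # same config: passes silently
```

Raising a dedicated exception before any weights are loaded is what turns a silent shape or class-count mismatch into an explicit failure.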

Checkpoint Retention Policy

By default, the checkpoint manager keeps only the last k checkpoints plus the single best-performing one (by mAP). This bounds S3 storage costs while ensuring the best model is never overwritten by a later checkpoint that happens to have a worse validation metric.
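The "last k plus best" rule is easy to express as a pure function. The sketch below is an assumption about how such a policy might be implemented; Checkpoint and select_checkpoints_to_keep are illustrative names, not the checkpoint manager's real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    epoch: int
    map_score: float  # validation mAP recorded for this epoch


def select_checkpoints_to_keep(checkpoints, k):
    """Keep the k most recent checkpoints plus the single best one by mAP."""
    if not checkpoints:
        return []
    by_epoch = sorted(checkpoints, key=lambda c: c.epoch)
    keep = set(by_epoch[-k:]) if k > 0 else set()
    keep.add(max(checkpoints, key=lambda c: c.map_score))  # never drop the best
    return sorted(keep, key=lambda c: c.epoch)


history = [Checkpoint(e, m) for e, m in zip(range(1, 6), [0.2, 0.6, 0.4, 0.5, 0.3])]
kept = select_checkpoints_to_keep(history, k=2)
print([c.epoch for c in kept])  # prints [2, 4, 5]
```

Everything not returned by the function is a candidate for deletion from S3, so storage stays bounded at k + 1 checkpoints per experiment.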

Configuration System

None of this infrastructure is useful if every experiment requires editing code to change a hyperparameter. To avoid that, the framework uses a hierarchical YAML configuration system, where an experiment config declares its defaults and only overrides the values it actually needs to change.

An experiment config points at reusable base configs for each component, then layers its own overrides on top:

configs/experiments/exp001_baseline.yaml

experiment:
  id: exp001
  name: mobilenetv2_ssd_baseline
  tags: [baseline, mobilenetv2, voc]

defaults:
  backbone: base/backbones/mobilenetv2.yaml
  train: base/train/default.yaml
  optimizer: base/optimizers/adamw_cosine.yaml
  losses: base/losses/ssd_loss.yaml

overrides:
  train:
    epochs: 50
    batch_size: 3

This means the backbone, optimizer, and loss configs are shared across every experiment that uses them, and a new experiment only needs to state what is different about it. Changing the batch size for one run does not require touching the backbone or optimizer files at all.

The merge happens in a fixed order, with each layer able to override the one before it:

Base defaults — the component configs referenced under defaults:
Experiment YAML — the overrides: block in the experiment file itself
CLI overrides — key.path=value arguments passed at launch
Environment variables — ${VAR:-default} substitutions resolved last
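The four layers above can be sketched with plain dictionaries. This is a simplified illustration under stated assumptions, not the framework's loader: deep_merge, apply_cli_override, and resolve_env are hypothetical names, YAML loading is left out, and CLI values are parsed as JSON scalars with a fallback to plain strings.

```python
import copy
import json
import os
import re

_ENV_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")  # matches ${VAR} and ${VAR:-default}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where values in `override` win, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def apply_cli_override(config: dict, assignment: str) -> dict:
    """Apply one 'key.path=value' argument on top of an already merged config."""
    path, _, raw = assignment.partition("=")
    try:
        value = json.loads(raw)   # "16" -> 16, "true" -> True
    except json.JSONDecodeError:
        value = raw               # anything else stays a string
    result = copy.deepcopy(config)
    node = result
    *parents, leaf = path.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return result


def resolve_env(node, env=None):
    """Resolve ${VAR:-default} placeholders in every string value, last of all."""
    env = os.environ if env is None else env
    if isinstance(node, dict):
        return {k: resolve_env(v, env) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_env(v, env) for v in node]
    if isinstance(node, str):
        return _ENV_PATTERN.sub(lambda m: env.get(m.group(1)) or (m.group(2) or ""), node)
    return node


base = {"train": {"epochs": 100, "batch_size": 8, "data_root": "${DATA_ROOT:-/data}"}}
experiment = {"train": {"epochs": 50, "batch_size": 3}}
config = deep_merge(base, experiment)                       # layers 1 and 2
config = apply_cli_override(config, "train.batch_size=16")  # layer 3
config = resolve_env(config, env={})                        # layer 4
```

Each function returns a new dictionary rather than mutating its input, so the shared base configs stay untouched no matter how many experiments are layered on top of them.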

Verifying a Config Before Spending GPU Time

The CLI exposes a --print_config flag that prints the fully merged config and exits, and a --dry_run flag that initializes the model, data pipeline, and optimizer without actually training. Both exist for the same reason: it is much cheaper to catch a config mistake before an EC2 instance has already started billing.
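A CLI with these two flags might be wired up along the following lines. Only --print_config and --dry_run come from the description above; the --config argument, the positional overrides, and the decide_action helper are assumptions made for this sketch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Training CLI sketch")
    # Hypothetical argument: path to the experiment YAML.
    parser.add_argument("--config", required=True, help="Path to the experiment YAML")
    parser.add_argument("--print_config", action="store_true",
                        help="Print the fully merged config and exit")
    parser.add_argument("--dry_run", action="store_true",
                        help="Build model, data pipeline and optimizer, but do not train")
    # Hypothetical positional arguments for key.path=value overrides.
    parser.add_argument("overrides", nargs="*", help="key.path=value overrides")
    return parser


def decide_action(args: argparse.Namespace) -> str:
    """Pick the cheapest action requested; training is the default."""
    if args.print_config:
        return "print_config"
    if args.dry_run:
        return "dry_run"
    return "train"


args = build_parser().parse_args(
    ["--config", "configs/experiments/exp001_baseline.yaml", "--dry_run", "train.epochs=1"]
)
print(decide_action(args))  # prints dry_run
```

Both checks run in seconds on a laptop, which is the point: a typo in an override is caught before any instance is requested.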

Every run is also fingerprinted, by hashing the architecture-defining keys in the merged config into a SHA-256 hash, truncated to a short hex prefix for readability. The keys that go into this hash are:

Model architecture — backbone, classification heads, localization heads, priors
Dataset metadata — which dataset and split the run uses
Augmentation metadata — the augmentation chain applied during training
Input size — the model's input resolution
Optimizer — optimizer type and hyperparameters
Learning schedule — warmup and decay configuration
Training config — batch size, epoch count, and related training options
Evaluation config — metric and evaluation protocol settings

This fingerprint becomes part of the run directory name, for example exp001_a1b2c3d4, and is saved alongside the experiment in both the run directory and the DynamoDB ledger. It is checked before resuming from any checkpoint — if the config changed in a way that would make the checkpoint incompatible, the resume fails loudly instead of silently loading mismatched weights.

Fingerprints Have to Survive a Move to the Cloud

Path-specific keys, like dataset roots or a local classes file, are stripped out of the config before hashing. Without this, the same experiment would produce a different fingerprint on a laptop than it would running inside a Docker container on EC2, since the file paths are different in each environment even though the actual architecture is identical.
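A minimal sketch of path-independent fingerprinting is shown below, assuming the merged config is a plain nested dictionary. The key names in PATH_KEYS, the eight-character prefix length, and the use of canonical JSON as the hash input are illustrative assumptions; only the SHA-256 hash, the truncation, and the path stripping come from the description above.

```python
import hashlib
import json

# Hypothetical names for environment-specific keys that must not affect the hash.
PATH_KEYS = {"dataset_root", "classes_file"}


def strip_path_keys(node):
    """Recursively drop path-specific keys from a nested config."""
    if isinstance(node, dict):
        return {k: strip_path_keys(v) for k, v in node.items() if k not in PATH_KEYS}
    if isinstance(node, list):
        return [strip_path_keys(v) for v in node]
    return node


def fingerprint(config: dict, length: int = 8) -> str:
    """SHA-256 over a canonical JSON dump, truncated to a short hex prefix."""
    canonical = json.dumps(strip_path_keys(config), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


laptop = {"model": {"backbone": "mobilenetv2", "num_classes": 20},
          "data": {"dataset_root": "/home/me/voc"}}
ec2 = {"model": {"backbone": "mobilenetv2", "num_classes": 20},
       "data": {"dataset_root": "/data/voc"}}
changed = {"model": {"backbone": "mobilenetv2", "num_classes": 21},
           "data": {"dataset_root": "/data/voc"}}

same_across_machines = fingerprint(laptop) == fingerprint(ec2)
differs_on_architecture = fingerprint(laptop) != fingerprint(changed)
```

Sorting the keys before hashing matters as much as stripping the paths: without it, two configs with identical content but different key order would produce different fingerprints.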

Infrastructure as Code (Terraform)

The entire AWS environment — EC2 spot request, S3 buckets, DynamoDB table, IAM role and policies — is defined in Terraform under infrastructure/. This means:

The full training environment can be reproduced from scratch with terraform apply
Infrastructure changes are version-controlled alongside the model code
Multiple independent training runs can be launched by parameterizing the experiment ID

The EC2 user data script runs automatically on instance boot and handles the full setup: installing the NVIDIA container toolkit, pulling the training Docker image from ECR, downloading the VOC dataset from S3, and launching the training container with the correct environment variables.

Infrastructure as Code for ML

Treating cloud infrastructure as code makes ML experiments genuinely reproducible: not just the model weights, but also the exact compute environment, hardware configuration, and data download procedures. The Terraform state and Docker image tag provide a complete specification of the training environment.

Docker Setup

Five container images are used across the full system, though only three of them are relevant to the training infrastructure covered in this post:

Image	Base	Purpose
`Dockerfile`	`tensorflow/tensorflow:2.17.0-gpu`	Training with GPU support
`Dockerfile.tensorboard`	—	TensorBoard syncing logs from S3
`docker/Dockerfile.dashboard`	Node 22 + Python 3.12	Vite frontend build + FastAPI dashboard server
`docker/Dockerfile.etl`	PyTorch + ultralytics	ETL worker (YOLOv8, RT-DETR, Grounding DINO, Ray)
`docker/Dockerfile.airflow`	Airflow	Airflow Orchestration

Scope of This Post

Dockerfile.dashboard and Dockerfile.etl belong to the MLOps control plane and the auto-labeling ETL pipeline respectively, both covered in their own dedicated posts. This section focuses on the three images that make up the training infrastructure itself.

Dockerfile builds the training image on top of tensorflow/tensorflow:2.17.0-gpu. It installs Python dependencies, copies the src/ directory, and sets the entrypoint to src/cli/train.py. The image is built once and pushed to ECR; every spot instance pulls the same image, so the training environment is identical regardless of which instance happens to pick up the job.

Dockerfile.tensorboard is a lightweight sidecar that polls S3 for new TensorBoard event logs and serves the TensorBoard UI. This is what lets you watch a training run live in a browser even as the underlying spot instance gets interrupted and replaced underneath it.

Dockerfile.airflow builds a sidecar that hosts Airflow, which handles the scheduled runs for the training framework as well as the ETL pipeline for creating annotated datasets.

Parallel Experiments with Docker Compose

docker-compose.yml extends this to running multiple experiments side by side on a single multi-GPU host. Each training container is pinned to its own GPU and reads its own experiment config, with TensorBoard syncing from S3 independently of any individual container's lifecycle:

docker-compose.yml (topology)

       ┌────────────────────┐
       │   TensorBoard      │ ◄── syncs from S3 every 60s
       │   localhost:6006   │
       └─────────┬──────────┘
                 │
            ┌────┴────┐
            │   S3    │
            └────┬────┘
                 │ uploads after each epoch
    ┌────────────┼────────────┐
    │            │            │
┌───┴────┐   ┌───┴────┐   ┌───┴────┐
│ exp001 │   │ exp002 │   │ exp003 │   ← one GPU each
│ GPU 0  │   │ GPU 1  │   │ GPU 2  │
└────────┘   └────────┘   └────────┘

Running parallel experiments

# Set environment
cp .env.example .env    # fill in AWS creds + dataset path

# Launch everything
docker-compose up -d

# Watch a specific experiment
docker-compose logs -f training-exp001

# Monitor in browser at http://localhost:6006

# Tear down
docker-compose down

Adding another parallel experiment is just a matter of duplicating a service block in docker-compose.yml with a different GPU id and experiment config; no changes to the Dockerfile itself are needed.

Orchestration with Airflow

Provisioning a spot instance and updating the DynamoDB ledger by hand works, but it does not scale past one experiment at a time. To go from "register a config" to "a GPU spins up, trains, tears itself down, and emails a report" without anyone touching a terminal, the system wraps the spot instance lifecycle in an Airflow DAG.

training_pipeline DAG

check_experiment: Ledger lookup, fails fast if not pending
launch_training_job: terraform apply on the EC2 fleet
wait_for_completion: Polls the ledger every 2 minutes
teardown_ec2: terraform destroy, runs on any outcome
email_report: Final status, best metric, checkpoint path

The detail worth calling out here is that the DAG never marks an experiment as running itself. The EC2 instance claims that status on boot, using the same conditional write from the DynamoDB ledger covered earlier. Because the teardown step runs regardless of whether training succeeded, failed, or simply timed out, and only the instance itself ever writes the running status, the ledger cannot get stuck in a state that does not reflect what's actually happening on the instance.
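Stripped of Airflow, the control flow of the polling and teardown steps can be sketched in plain Python. This is an illustration of the logic only, not the DAG's code: wait_for_completion and run_pipeline here are hypothetical functions, the callbacks stand in for the ledger lookup and the Terraform calls, and the six-hour timeout and the choice of success and failed as terminal states are assumptions. In Airflow itself, "runs on any outcome" is typically expressed with a task trigger rule such as all_done rather than a try/finally.

```python
import time

TERMINAL_STATES = {"success", "failed"}  # assumed terminal ledger states


def wait_for_completion(get_status, poll_seconds=120, timeout_seconds=6 * 3600,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll the ledger until the experiment finishes or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            return "timeout"
        sleep(poll_seconds)


def run_pipeline(launch, get_status, teardown, **wait_kwargs):
    """Launch, wait, and always tear down, whatever happened in between."""
    try:
        launch()  # stands in for `terraform apply`
        return wait_for_completion(get_status, **wait_kwargs)
    finally:
        teardown()  # stands in for `terraform destroy`; runs on any outcome
```

Note that the sketch only reads the status and never writes running itself, mirroring the rule that the instance is the sole writer of that state.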

More on This Soon

The full control plane, the FastAPI layer that registers experiments and triggers this DAG, the API endpoints, and the dashboard that visualizes all of it, is covered in a dedicated post on the MLOps system built around this training pipeline.

Conclusion

Spot instances only make sense as a strategy if everything around them is built to expect interruption rather than treat it as an edge case. The configuration system and fingerprinting make every run reproducible regardless of which machine resolves it, the DynamoDB ledger turns "did this experiment finish" into a single atomic check instead of a guessing game, and the SIGTERM handler means a preemption costs at most a few minutes of training rather than the whole run. Docker keeps the environment identical across every instance that picks up a job, and the Airflow layer above all of it is what eventually lets this run without anyone babysitting a terminal.

Key Takeaways

Spot instance training at roughly 85% discount is entirely practical once the infrastructure is built around interruption rather than against it: a hierarchical config system with fingerprinting keeps every run reproducible, per-epoch S3 checkpoints record progress, SIGTERM handling ensures clean shutdown, and the DynamoDB ledger's atomic conditional writes guarantee exactly-once experiment claiming across any number of concurrent instances. Terraform and Docker make the whole stack reproducible and version-controlled, and Airflow turns the entire lifecycle into something that runs unattended.