Blog · 2026.04.10
Training at Scale: Spot Instances, S3 & a DynamoDB Experiment Ledger
How to train a 200-epoch object detection model on AWS spot instances for a fraction of on-demand cost — with automatic recovery when they get reclaimed
Akhilesh Warty
BACKEND · 16 MIN
Training an ML model is one problem; managing the infrastructure it runs on is another. A 10-epoch MNIST classifier can run locally on whatever GPU you have, but that approach doesn't scale to a 200-epoch object detector. The alternative is cloud infrastructure built to handle those training workloads, so I implemented an AWS architecture that takes care of them.
The system is built around five core components:
- Configuration & Fingerprinting
- Instance Provisioning
- Experiment Management
- Cloud Storage
- Containerization & Orchestration
Cloud Architecture Overview
Compute for ML training on AWS comes in two kinds of instances:
- Dedicated (On-Demand) Instance — An instance that stays provisioned for as long as it is needed, with no risk of pre-emption.
- Spot Instance — Spare EC2 capacity offered at a steep discount that can be used for training but can be reclaimed by AWS at short notice.
The training framework handles both cases, using Terraform as the infrastructure-as-code (IaC) layer that provisions the instances.
For dedicated instances, the framework sends a provisioning request to AWS and uses the resulting Amazon EC2 instance for uninterrupted training. For spot instances, the framework places a spot request with EC2 and starts training once the request is fulfilled.
Spot instances add a complication: the status of each experiment has to be tracked and stored somewhere so that training can be triggered again once another instance is provisioned. To do this I implemented a DynamoDB ledger for the framework, a choice made for two main reasons:
- Atomic Checks — DynamoDB conditional writes check an item's current state and update it in a single atomic operation, so status tracking stays race-free without any external locking.
- Data Formatting — DynamoDB is a NoSQL key-value store, so each experiment is a single item looked up by its key, and the table scales to large amounts of data.
On its own, though, the ledger only tracks state; the data for the training pipeline itself still needs cloud storage. I implemented dedicated AWS S3 buckets that store four kinds of data:
- Config Files — Configuration files that set the hyperparameters for each training run.
- Datasets — The dataset files used to train the models.
- Checkpoints — Saved training state that lets an interrupted run pick up where it left off.
- Artifacts — Training metrics, model weights, and metrics history.
Cloud Training Architecture
Each EC2 instance runs training inside a Docker container built from an image that bundles all the dependencies, so the environment is isolated and identical regardless of the machine. The container also has access to the instance's GPU, so the model's computation graph runs on the GPU instead of leaning heavily on the CPU, which would slow down the training cycle.
The training loop is designed to stop cleanly on an error or interruption, update the DynamoDB ledger, and keep its records up to date. This allows the experiment to be resumed from that saved state, without loss of data, once a replacement instance is provisioned through Terraform.
Spot Instance Economics
A g5.xlarge (NVIDIA A10G GPU, 24GB VRAM) costs roughly $1.30/hr on-demand on AWS. The same instance type on spot pricing runs around $0.16–0.22/hr depending on availability zone and time of day, which works out to savings of roughly 85%.
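As a quick sanity check on that figure, the arithmetic can be worked through with the prices quoted above. This is only a back-of-the-envelope sketch; the constant and function names are made up for illustration, and real spot prices move over time.

```python
ON_DEMAND_PER_HR = 1.30            # on-demand price quoted above (USD/hr)
SPOT_RANGE_PER_HR = (0.16, 0.22)   # spot price range quoted above (USD/hr)


def savings_fraction(on_demand: float, spot: float) -> float:
    """Fraction of the on-demand price saved by paying the spot price instead."""
    return (on_demand - spot) / on_demand


best_case = savings_fraction(ON_DEMAND_PER_HR, SPOT_RANGE_PER_HR[0])
worst_case = savings_fraction(ON_DEMAND_PER_HR, SPOT_RANGE_PER_HR[1])
midpoint = savings_fraction(ON_DEMAND_PER_HR, sum(SPOT_RANGE_PER_HR) / 2)

print(f"savings range: {worst_case:.1%} to {best_case:.1%}, midpoint {midpoint:.1%}")
```

The midpoint of the quoted spot range lands close to the 85% headline number, with the two ends of the range a few points either side of it.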
Spot Interruption Rates
g5.xlarge instances in us-east-1 have a historical interruption frequency of roughly 5–15% per hour. In practice, most training runs complete multiple epochs between interruptions, making checkpointing per epoch sufficient to avoid significant rework.
Graceful Preemption Handling
AWS gives roughly two minutes of warning before reclaiming a spot instance, so when the SIGTERM arrives the training process has approximately 120 seconds to respond. The GracefulShutdownHandler catches the signal and triggers an ordered shutdown:
- The current training step completes normally
- The current epoch checkpoint is saved to disk
- The checkpoint is immediately uploaded to the experiment's S3 path
- The DynamoDB ledger entry is marked failed with the S3 checkpoint path recorded
- The process exits cleanly
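The ordered shutdown above can be sketched roughly as follows. This is a minimal illustration and not the framework's actual GracefulShutdownHandler: the stop_requested flag, the train function, and the run_step, save_checkpoint, upload_checkpoint, and mark_failed callbacks are all hypothetical names standing in for the real training, S3, and ledger code.

```python
import signal


class GracefulShutdownHandler:
    """Turns SIGTERM into a flag that the training loop polls between steps."""

    def __init__(self):
        self.stop_requested = False

    def install(self):
        # signal.signal() has to be called from the main thread.
        signal.signal(signal.SIGTERM, self._on_signal)

    def _on_signal(self, signum, frame):
        # Keep the handler tiny: record the request and let the loop react.
        self.stop_requested = True


def train(handler, epochs, steps_per_epoch, run_step,
          save_checkpoint, upload_checkpoint, mark_failed):
    """Hypothetical loop: all callbacks are placeholders supplied by the caller."""
    for epoch in range(epochs):
        for step in range(steps_per_epoch):
            run_step(epoch, step)                        # current step completes normally
            if handler.stop_requested:
                local_path = save_checkpoint(epoch)      # checkpoint saved to disk
                s3_path = upload_checkpoint(local_path)  # uploaded to the experiment's S3 path
                mark_failed(s3_path)                     # ledger marked failed, pointer recorded
                return "interrupted"                     # process can now exit cleanly
        upload_checkpoint(save_checkpoint(epoch))        # regular per-epoch checkpoint
    return "completed"
```

Polling a flag instead of doing the upload inside the signal handler keeps the handler safe to run at any point, and guarantees the in-flight training step finishes before anything is written.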
The 'Failed' State Is Intentional
Marking the experiment as failed rather than some intermediate "paused" state is deliberate. The DynamoDB ledger's state machine treats failed as a recoverable state: any new spot instance that claims the experiment will find the checkpoint_s3_path pointer, download the checkpoint, and resume training automatically. The ledger's conditional writes ensure only one instance can claim a given experiment at a time.
DynamoDB Experiment Ledger
The core concurrency problem with distributed spot training is preventing two instances from running the same experiment simultaneously. DynamoDB's conditional expressions solve this cleanly: claim_experiment() uses a ConditionExpression that only succeeds if the current status is pending or failed — if two instances race to claim the same experiment, exactly one wins and the other gets a ConditionalCheckFailedException.
src/infrastructure/dynamodb_ledger.py
```python
from datetime import datetime, timezone

from botocore.exceptions import ClientError


# Method of the ledger class (excerpt)
def claim_experiment(self, experiment_id, fingerprint, timestamp, instance_id):
    try:
        self._table.update_item(
            Key={'experiment_id': experiment_id, 'fingerprint': fingerprint},
            UpdateExpression=(
                "SET #s = :running, claimed = :now, ec2_instance = :instance_id, "
                "run_timestamp = :timestamp REMOVE failure_reason"
            ),
            # Atomic guard: only succeeds if status is pending or failed
            ConditionExpression="#s IN (:pending, :failed)",
            # 'status' is a DynamoDB reserved word, so it is aliased as #s
            ExpressionAttributeNames={'#s': 'status'},
            ExpressionAttributeValues={
                ':running': 'running',
                ':pending': 'pending',
                ':failed': 'failed',
                ':now': datetime.now(timezone.utc).isoformat(),
                ':instance_id': instance_id,
                ':timestamp': timestamp,
            }
        )
        return True  # This instance claimed the experiment
    except ClientError as err:
        if err.response['Error']['Code'] == "ConditionalCheckFailedException":
            return False  # Another instance claimed it first
        raise
```

The full state machine transitions look like this:
Experiment Ledger State Machine
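To make the claiming rule concrete without needing an AWS account, here is a small in-memory model of it. It is only a toy: InMemoryLedger and race are hypothetical names, and a threading.Lock stands in for the atomicity that DynamoDB's conditional write provides on the real table. It covers only the transitions described in this post (pending or failed to running on a claim, running to failed on a preemption).

```python
import threading

CLAIMABLE = {"pending", "failed"}  # statuses a new instance is allowed to claim


class InMemoryLedger:
    """Toy stand-in for the DynamoDB table; the lock plays the conditional write."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def register(self, experiment_id):
        with self._lock:
            self._status[experiment_id] = "pending"

    def claim(self, experiment_id):
        # Check-and-set happens under one lock, mirroring the atomic guard.
        with self._lock:
            if self._status.get(experiment_id) not in CLAIMABLE:
                return False
            self._status[experiment_id] = "running"
            return True

    def mark_failed(self, experiment_id):
        with self._lock:
            if self._status.get(experiment_id) == "running":
                self._status[experiment_id] = "failed"


def race(ledger, experiment_id, contenders=8):
    """Let several threads try to claim the same experiment at once."""
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(ledger.claim(experiment_id)))
        for _ in range(contenders)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

However the threads interleave, only one claim can return True for a pending experiment, and the experiment only becomes claimable again after it has been marked failed.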
Per-epoch, the training loop calls update_checkpoint_pointer() to record the latest S3 checkpoint path. When a new instance resumes, it reads this pointer and downloads the checkpoint before starting the training loop.
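A sketch of what that pointer update and the matching read might look like is below. The function bodies are my assumption, not the framework's code: table is assumed to be a boto3 DynamoDB Table resource (only its update_item and get_item methods are used), last_epoch is a hypothetical attribute, and read_checkpoint_pointer is a hypothetical helper name. The checkpoint_s3_path attribute and the composite key come from the post itself.

```python
def update_checkpoint_pointer(table, experiment_id, fingerprint, checkpoint_s3_path, epoch):
    """Record the newest checkpoint location on the experiment's ledger item."""
    table.update_item(
        Key={"experiment_id": experiment_id, "fingerprint": fingerprint},
        # last_epoch is an illustrative extra attribute, not from the original.
        UpdateExpression="SET checkpoint_s3_path = :path, last_epoch = :epoch",
        ExpressionAttributeValues={":path": checkpoint_s3_path, ":epoch": epoch},
    )


def read_checkpoint_pointer(table, experiment_id, fingerprint):
    """Return the stored S3 checkpoint path, or None if nothing was saved yet."""
    response = table.get_item(
        Key={"experiment_id": experiment_id, "fingerprint": fingerprint},
        ConsistentRead=True,  # avoid reading a stale pointer right after a write
    )
    return response.get("Item", {}).get("checkpoint_s3_path")
```

A resuming instance would call the read helper right after a successful claim and start from scratch when it returns None.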
S3 Checkpoint Sync
Checkpoints are stored in a structured S3 path that includes the experiment ID and its configuration fingerprint:
s3://ml-checkpoints/
└── exp002_a1b2c3d4/
├── epoch_045/
│ ├── model.weights.h5
│ └── optimizer.pkl
└── epoch_046/
├── model.weights.h5
└── optimizer.pkl
The fingerprint (a1b2c3d4) is a SHA-256 hash of the architecture-defining config parameters, truncated to a short hex prefix for readability in directory and bucket names.
A training run that resumes from a checkpoint first validates that the fingerprint matches before loading weights.
This prevents silent failures when, for example, the number of classes or anchor configuration was changed between runs.
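The path layout and the pre-load check could be sketched like this. The helper names (checkpoint_prefix, validate_fingerprint, FingerprintMismatch) are hypothetical, and the zero-padded three-digit epoch is inferred from the directory listing above rather than confirmed by the framework.

```python
class FingerprintMismatch(RuntimeError):
    """Raised when a checkpoint was produced by an incompatible config."""


def checkpoint_prefix(bucket, experiment_id, fingerprint, epoch):
    """Build the S3 prefix for one epoch, following the layout shown above."""
    return f"s3://{bucket}/{experiment_id}_{fingerprint}/epoch_{epoch:03d}/"


def validate_fingerprint(checkpoint_fingerprint, current_fingerprint):
    """Refuse to resume when the checkpoint and the current config disagree."""
    if checkpoint_fingerprint != current_fingerprint:
        raise FingerprintMismatch(
            f"checkpoint was written with fingerprint {checkpoint_fingerprint}, "
            f"but the current config hashes to {current_fingerprint}"
        )


prefix = checkpoint_prefix("ml-checkpoints", "exp002", "a1b2c3d4", 45)
validate_fingerprint("a1b2c3d4", "a1b2c3d4")  # matching fingerprints pass silently
```

Raising a dedicated exception, instead of returning a boolean, makes it hard for a caller to ignore the mismatch and load the weights anyway.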
Checkpoint Retention Policy
By default, the checkpoint manager keeps only the last k checkpoints plus the single best-performing one (by mAP). This bounds S3 storage costs while ensuring the best model is never overwritten by a later checkpoint that happens to have a worse validation metric.
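One way to express that policy is as a pure function that decides which epochs are safe to delete. This is a sketch under assumptions: checkpoints_to_delete is a hypothetical name, and checkpoints are represented as simple (epoch, mAP) pairs rather than whatever the checkpoint manager actually stores.

```python
def checkpoints_to_delete(checkpoints, keep_last):
    """Return the epochs that fall outside 'last k plus best by mAP'.

    checkpoints: iterable of (epoch, mAP) pairs.
    keep_last:   how many of the most recent epochs to retain (k).
    """
    by_epoch = sorted(checkpoints)
    if not by_epoch:
        return []
    # Guard k == 0 explicitly, because by_epoch[-0:] would select everything.
    keep = {epoch for epoch, _ in by_epoch[-keep_last:]} if keep_last > 0 else set()
    # On a tie in mAP, max() keeps the earliest epoch with that score.
    best_epoch, _ = max(by_epoch, key=lambda pair: pair[1])
    keep.add(best_epoch)
    return [epoch for epoch, _ in by_epoch if epoch not in keep]


history = [(1, 0.30), (2, 0.52), (3, 0.48), (4, 0.50), (5, 0.49)]
stale = checkpoints_to_delete(history, keep_last=2)
```

With this history, epochs 4 and 5 survive as the two most recent and epoch 2 survives as the best by mAP, so only epochs 1 and 3 are returned for deletion.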
Configuration System
None of this infrastructure is useful if every experiment requires editing code to change a hyperparameter. To avoid that, the framework uses a hierarchical YAML configuration system, where an experiment config declares its defaults and only overrides the values it actually needs to change.
An experiment config points at reusable base configs for each component, then layers its own overrides on top:
configs/experiments/exp001_baseline.yaml
```yaml
experiment:
  id: exp001
  name: mobilenetv2_ssd_baseline
  tags: [baseline, mobilenetv2, voc]

defaults:
  backbone: base/backbones/mobilenetv2.yaml
  train: base/train/default.yaml
  optimizer: base/optimizers/adamw_cosine.yaml
  losses: base/losses/ssd_loss.yaml

overrides:
  train:
    epochs: 50
    batch_size: 3
```

This means the backbone, optimizer, and loss configs are shared across every experiment that uses them, and a new experiment only needs to state what is different about it. Changing the batch size for one run does not require touching the backbone or optimizer files at all.
The merge happens in a fixed order, with each layer able to override the one before it:
- Base defaults — the component configs referenced under defaults:
- Experiment YAML — the overrides: block in the experiment file itself
- CLI overrides — key.path=value arguments passed at launch
- Environment variables — ${VAR:-default} substitutions resolved last
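The four layers above can be sketched with plain dictionaries. This is an illustration of the merge order only, not the framework's loader: deep_merge, apply_cli_override, resolve_env, and build_config are hypothetical names, YAML loading is left out, and CLI values are kept as strings here, whereas a real loader would likely parse them into numbers and booleans.

```python
import copy
import os
import re

_ENV_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")  # ${VAR} or ${VAR:-default}


def deep_merge(base, override):
    """Return a new dict where `override` wins, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def apply_cli_override(config, assignment):
    """Apply one key.path=value argument in place (value stays a string)."""
    path, value = assignment.split("=", 1)
    *parents, leaf = path.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def _substitute(match, env):
    # Like the shell's ${VAR:-default}: fall back when VAR is unset or empty.
    value = env.get(match.group(1))
    return value if value else (match.group(2) or "")


def resolve_env(node, env=None):
    """Substitute ${VAR:-default} in every string value, recursively."""
    env = os.environ if env is None else env
    if isinstance(node, dict):
        return {key: resolve_env(value, env) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_env(value, env) for value in node]
    if isinstance(node, str):
        return _ENV_PATTERN.sub(lambda m: _substitute(m, env), node)
    return node


def build_config(base_defaults, experiment_overrides, cli_args, env=None):
    config = deep_merge(base_defaults, experiment_overrides)  # layers 1 and 2
    for assignment in cli_args:                               # layer 3
        apply_cli_override(config, assignment)
    return resolve_env(config, env)                           # layer 4, resolved last
```

Because every layer returns or edits a fresh copy, the shared base configs are never mutated by an individual experiment.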
Verifying a Config Before Spending GPU Time
The CLI exposes a --print_config flag that prints the fully merged config and exits, and a --dry_run flag that initializes the model, data pipeline, and optimizer without actually training. Both exist for the same reason: it is much cheaper to catch a config mistake before an EC2 instance has already started billing.
Every run is also fingerprinted, by hashing the architecture-defining keys in the merged config into a SHA-256 hash, truncated to a short hex prefix for readability. The keys that go into this hash are:
- Model architecture — backbone, classification heads, localization heads, priors
- Dataset metadata — which dataset and split the run uses
- Augmentation metadata — the augmentation chain applied during training
- Input size — the model's input resolution
- Optimizer — optimizer type and hyperparameters
- Learning schedule — warmup and decay configuration
- Training config — batch size, epoch count, and related training options
- Evaluation config — metric and evaluation protocol settings
This fingerprint becomes part of the run directory name, for example exp001_a1b2c3d4, and is saved alongside the experiment in both the run directory and the DynamoDB ledger. It is checked before resuming from any checkpoint — if the config changed in a way that would make the checkpoint incompatible, the resume fails loudly instead of silently loading mismatched weights.
Fingerprints Have to Survive a Move to the Cloud
Path-specific keys, like dataset roots or a local classes file, are stripped out of the config before hashing. Without this, the same experiment would produce a different fingerprint on a laptop than it would running inside a Docker container on EC2, since the file paths are different in each environment even though the actual architecture is identical.
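A minimal version of this path-independent fingerprint might look like the following. The key names in PATH_KEYS and the canonical-JSON serialization are my assumptions for illustration; the post only states that path-specific keys are stripped and that the result is a truncated SHA-256 hash.

```python
import hashlib
import json

# Hypothetical names for environment-specific keys; the real list may differ.
PATH_KEYS = {"dataset_root", "classes_file"}


def strip_path_keys(node):
    """Drop environment-specific keys at any nesting depth."""
    if isinstance(node, dict):
        return {k: strip_path_keys(v) for k, v in node.items() if k not in PATH_KEYS}
    if isinstance(node, list):
        return [strip_path_keys(v) for v in node]
    return node


def fingerprint(config, length=8):
    """Hash the path-free config; sorted keys make the result order-independent."""
    canonical = json.dumps(strip_path_keys(config), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


laptop = {"model": {"backbone": "mobilenetv2", "num_classes": 20},
          "data": {"dataset_root": "/home/me/voc"}}
container = {"data": {"dataset_root": "/data/voc"},
             "model": {"num_classes": 20, "backbone": "mobilenetv2"}}

same = fingerprint(laptop) == fingerprint(container)
```

Here same is True even though the dataset roots and key order differ, while changing num_classes in either config would produce a different fingerprint.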
Infrastructure as Code (Terraform)
The entire AWS environment — EC2 spot request, S3 buckets, DynamoDB table, IAM role and policies — is defined in Terraform under infrastructure/. This means:
- The full training environment can be reproduced from scratch with terraform apply
- Infrastructure changes are version-controlled alongside the model code
- Multiple independent training runs can be launched by parameterizing the experiment ID
The EC2 user data script runs automatically on instance boot and handles the full setup: installing the NVIDIA container toolkit, pulling the training Docker image from ECR, downloading the VOC dataset from S3, and launching the training container with the correct environment variables.
Infrastructure as Code for ML
Treating cloud infrastructure as code makes ML experiments genuinely reproducible: not just the model weights, but also the exact compute environment, hardware configuration, and data download procedures. Together, the Terraform state and Docker image tag provide a complete specification of the training environment.
Docker Setup
Five container images are used across the full system, though only three of them are relevant to the training infrastructure covered in this post:
| Image | Base | Purpose |
|---|---|---|
| Dockerfile | tensorflow/tensorflow:2.17.0-gpu | Training with GPU support |
| Dockerfile.tensorboard | — | TensorBoard syncing logs from S3 |
| docker/Dockerfile.dashboard | Node 22 + Python 3.12 | Vite frontend build + FastAPI dashboard server |
| docker/Dockerfile.etl | PyTorch + ultralytics | ETL worker (YOLOv8, RT-DETR, Grounding DINO, Ray) |
| docker/Dockerfile.airflow | Airflow | Airflow orchestration |
Scope of This Post
Dockerfile.dashboard and Dockerfile.etl belong to the MLOps control plane and the auto-labeling ETL pipeline respectively, both covered in their own dedicated posts. This section focuses on the three images that make up the training infrastructure itself.
Dockerfile builds the training image on top of tensorflow/tensorflow:2.17.0-gpu. It installs Python dependencies, copies the src/ directory, and sets the entrypoint to src/cli/train.py. The image is built once and pushed to ECR; every spot instance pulls the same image, so the training environment is identical regardless of which instance happens to pick up the job.
Dockerfile.tensorboard is a lightweight sidecar that polls S3 for new TensorBoard event logs and serves the TensorBoard UI. This is what lets you watch a training run live in a browser even as the underlying spot instance gets interrupted and replaced underneath it.
Dockerfile.airflow builds the sidecar that hosts Airflow, which handles the scheduled runs for the training framework as well as the ETL pipeline that creates annotated datasets.
Parallel Experiments with Docker Compose
docker-compose.yml extends this to running multiple experiments side by side on a single multi-GPU host. Each training container is pinned to its own GPU and reads its own experiment config, with TensorBoard syncing from S3 independently of any individual container's lifecycle:
docker-compose.yml (topology)
```text
┌────────────────────┐
│    TensorBoard     │ ◄── syncs from S3 every 60s
│   localhost:6006   │
└─────────┬──────────┘
          │
     ┌────┴────┐
     │   S3    │
     └────┬────┘
          │ uploads after each epoch
   ┌──────┴──────┬───────────┐
   │             │           │
┌──┴─────┐  ┌────┴───┐   ┌───┴────┐
│ exp001 │  │ exp002 │   │ exp003 │ ← one GPU each
│ GPU 0  │  │ GPU 1  │   │ GPU 2  │
└────────┘  └────────┘   └────────┘
```

Running parallel experiments
```shell
# Set environment
cp .env.example .env   # fill in AWS creds + dataset path

# Launch everything
docker-compose up -d

# Watch a specific experiment
docker-compose logs -f training-exp001

# Monitor in browser at http://localhost:6006

# Tear down
docker-compose down
```

Adding another parallel experiment is just a matter of duplicating a service block in docker-compose.yml with a different GPU id and experiment config; the Dockerfile itself does not change.
Orchestration with Airflow
Provisioning a spot instance and updating the DynamoDB ledger by hand works, but it does not scale past one experiment at a time. To go from "register a config" to "a GPU spins up, trains, tears itself down, and emails a report" without anyone touching a terminal, the system wraps the spot instance lifecycle in an Airflow DAG.
training_pipeline DAG
The detail worth calling out here is that the DAG never marks an experiment as running itself. The EC2 instance claims that status on boot, using the same conditional write from the DynamoDB ledger covered earlier. Because the status is only ever written by the instance doing the work, the ledger can never get stuck in a state that does not reflect what's actually happening on the instance, and the teardown step can always run regardless of whether training succeeded, failed, or simply timed out.
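The control flow can be illustrated in plain Python, independent of Airflow. This is only a sketch of the ordering guarantee, not the DAG itself: run_pipeline and its four callbacks are hypothetical, and how the real DAG expresses "always run teardown" inside Airflow is not shown here.

```python
def run_pipeline(provision, wait_for_training, teardown, send_report):
    """Provision, wait, always tear down, then report.

    The pipeline never writes the ledger status itself; the instance does
    that on boot through its conditional claim.
    """
    instance_id = provision()
    try:
        outcome = wait_for_training(instance_id)  # whatever status the ledger ended on
    except TimeoutError:
        outcome = "timed_out"
    finally:
        teardown(instance_id)  # runs on success, failure, and timeout alike
    send_report(outcome)
    return outcome
```

Putting the teardown in the finally block means an instance is never left running just because the step before it raised.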
The full control plane (the FastAPI layer that registers experiments and triggers this DAG, the API endpoints, and the dashboard that visualizes all of it) is covered in a dedicated post on the MLOps system built around this training pipeline.
Conclusion
Spot instances only make sense as a strategy if everything around them is built to expect interruption rather than treat it as an edge case. The configuration system and fingerprinting make every run reproducible regardless of which machine resolves it, the DynamoDB ledger turns "did this experiment finish" into a single atomic check instead of a guessing game, and the SIGTERM handler means a preemption costs at most a few minutes of training rather than the whole run. Docker keeps the environment identical across every instance that picks up a job, and the Airflow layer above all of it is what eventually lets this run without anyone babysitting a terminal.
Key Takeaways
Spot instance training at roughly 85% discount is entirely practical once the infrastructure is built around interruption rather than against it: a hierarchical config system with fingerprinting keeps every run reproducible, per-epoch S3 checkpoints record progress, SIGTERM handling ensures clean shutdown, and the DynamoDB ledger's atomic conditional writes guarantee exactly-once experiment claiming across any number of concurrent instances. Terraform and Docker make the whole stack reproducible and version-controlled, and Airflow turns the entire lifecycle into something that runs unattended.