Akhilesh Warty

Getting a neural network to converge isn't just about the architecture — it's about the training loop decisions that determine whether gradients move in the right direction at the right speed.

SSD training in particular requires several non-obvious engineering choices to work well: the model sees thousands of "background" anchors for every real object, the loss function must balance regression and classification objectives, and numerical stability requires careful precision management. This post walks through each of these pieces in the MobileNetV2-SSD training pipeline.

Training Pipeline Overview

The full training loop involves multiple subsystems that operate in sequence on each batch.

Figure: MobileNetV2-SSD Training Loop. Stages shown in the diagram:

Training Batch: images + GT boxes
Prior Anchors: 8,732 box candidates
Forward Pass: AMP autocast
LR Schedule: warmup + cosine
Target Assignment: IoU matching
Multibox Loss: loc + cls
EMA Update: weight smoothing
Hard Neg Mining: 3:1 ratio
Optimizer Step: AdamW / SGD
Checkpoint: best mAP tracking

Target Assignment (IoU Matching)

Before computing any loss, each of the model's ~8,732 prior anchor boxes must be assigned a label — either a ground-truth object class or "background." This is done through IoU-based bipartite matching.

For each prior and each ground-truth box in the image, we compute the Intersection over Union:

$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

Intersection over Union

A prior is marked positive (matched to a ground-truth box) if its IoU exceeds 0.5, and negative (background) if its IoU falls below 0.4. Priors with IoU in the range [0.4, 0.5] are ignored during training — they're ambiguous enough that including them would introduce noisy gradients.
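The IoU formula and the two thresholds can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's code: the corner-format boxes `(x_min, y_min, x_max, y_max)` and the function names `iou` and `label_for_iou` are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes get an empty intersection
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_for_iou(value, pos_threshold=0.5, neg_threshold=0.4):
    """Map an IoU value to 'positive', 'negative', or 'ignore' using the thresholds above."""
    if value > pos_threshold:
        return "positive"
    if value < neg_threshold:
        return "negative"
    return "ignore"  # ambiguous band [0.4, 0.5]
```

For example, two 2×2 boxes offset by one unit in each direction share an intersection of area 1 and a union of area 7, so their IoU is 1/7 and the prior would be labeled as background.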

Matching Strategy

The matching resolves conflicts so that each prior carries at most one label: if a prior overlaps several ground-truth boxes above the threshold, it is assigned to the one with the highest IoU. Additionally, for each ground-truth box, the single best-matching prior is always assigned positively, even if its IoU is below the threshold, to guarantee every ground truth has at least one positive anchor.
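A minimal sketch of threshold matching plus the forced best-prior assignment, in plain Python. It assumes each prior takes the ground-truth box it overlaps most, which is a common SSD convention; the pipeline's actual conflict-resolution rules may differ. The function name `match_priors` and the list-of-lists IoU matrix are hypothetical.

```python
def match_priors(iou_matrix, pos_threshold=0.5, neg_threshold=0.4):
    """Assign each prior a ground-truth index, -1 (background), or None (ignored).

    iou_matrix[p][g] is the IoU between prior p and ground-truth box g.
    """
    num_priors = len(iou_matrix)
    num_gt = len(iou_matrix[0]) if num_priors else 0
    if num_gt == 0:
        return [-1] * num_priors  # no objects: every prior is background

    assignment = []
    for row in iou_matrix:
        # Each prior considers only the ground-truth box it overlaps most
        best_gt = max(range(num_gt), key=lambda g: row[g])
        best_iou = row[best_gt]
        if best_iou > pos_threshold:
            assignment.append(best_gt)
        elif best_iou < neg_threshold:
            assignment.append(-1)
        else:
            assignment.append(None)

    # Guarantee every ground-truth box at least one positive prior,
    # even when its best IoU is below the positive threshold
    for g in range(num_gt):
        best_prior = max(range(num_priors), key=lambda p: iou_matrix[p][g])
        assignment[best_prior] = g
    return assignment
```

In this sketch the forced assignment runs last, so it overrides a threshold-based label if the two disagree.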

Hard Negative Mining

After matching, the class imbalance problem becomes stark: a typical training image with 3–5 objects will produce 10–20 positive anchors out of 8,732 total. Training on all negatives simultaneously would cause the loss to be completely dominated by "background" predictions.

Hard negative mining addresses this by selecting only the hardest negatives (the background anchors with the highest classification loss) rather than all negatives. The number of selected negatives is capped at 3× the number of positives.
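The selection rule can be sketched as a sort-and-truncate over per-anchor classification losses. This is a plain-Python illustration under assumed inputs (a list of losses and a parallel list of string labels); the name `select_hard_negatives` is hypothetical and not taken from the pipeline.

```python
def select_hard_negatives(cls_losses, labels, neg_pos_ratio=3):
    """Return indices of the negatives kept for the classification loss.

    cls_losses: per-anchor classification loss values.
    labels: per-anchor 'positive', 'negative', or 'ignore'.
    """
    num_pos = sum(1 for label in labels if label == "positive")
    negatives = [i for i, label in enumerate(labels) if label == "negative"]
    # Hardest first: background anchors the model is most wrong about
    negatives.sort(key=lambda i: cls_losses[i], reverse=True)
    # Keep at most neg_pos_ratio negatives per positive
    return sorted(negatives[: neg_pos_ratio * num_pos])
```

Note that ignored anchors never enter the candidate pool, and that an image with no positives keeps no negatives under this rule.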

Why 3:1 and Not More?

A higher negative ratio provides more signal, but empirically a 3:1 ratio strikes the best balance. Too many negatives cause the model to over-optimize for background suppression at the cost of recall. The key constraint is that negatives are selected by loss magnitude, not randomly — this forces the model to focus on the genuinely difficult background regions.

Loss Functions

SSD uses a multi-task loss that jointly optimizes for box localization and class prediction. The total loss is a weighted sum of the two components:

$$L = \alpha \cdot L_{loc} + \beta \cdot L_{cls}$$

Total Multibox Loss

Localization loss is computed only over positive anchors (those matched to a ground-truth box), using Smooth L1 loss on the encoded box offset targets:

$$L_{loc} = \frac{1}{N_{pos}} \sum_{i \in pos} \sum_{c \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\left(\hat{t}_i^c - t_i^c\right) \tag{1}$$

Smooth L1 Regression Loss

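Here the smooth L1 function is quadratic for small residuals and linear for large ones, which reduces sensitivity to outliers. The transition point β in this formula is the beta argument of the implementation, not the classification weight β in the total loss: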

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} \dfrac{0.5\, x^2}{\beta} & \text{if } |x| < \beta \\ |x| - 0.5\, \beta & \text{otherwise} \end{cases}$$

Classification loss applies softmax cross-entropy over both positive and hard-negative anchors:

$$L_{cls} = -\frac{1}{N_{pos}} \sum_{i \in pos \cup neg} \log \frac{\exp(z_{y_i})}{\sum_j \exp(z_j)} \tag{2}$$

Softmax Cross-Entropy Classification Loss
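Equation (2) and the weighted total can be sketched in plain Python as below. The function names, the default weights of 1.0 for α and β, and the guard against zero positives are all assumptions made for the illustration; they are not taken from the pipeline.

```python
import math


def softmax_cross_entropy(logits, target):
    """-log softmax(logits)[target], using the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def classification_loss(logits_per_anchor, targets, selected, num_pos):
    """Sum cross-entropy over the selected (positive + hard-negative) anchors,
    normalized by the number of positives as in equation (2)."""
    total = sum(
        softmax_cross_entropy(logits_per_anchor[i], targets[i]) for i in selected
    )
    return total / max(1, num_pos)  # assumed guard for images with no positives


def multibox_loss(loc_loss, cls_loss, alpha=1.0, beta=1.0):
    """Weighted sum of the localization and classification terms."""
    return alpha * loc_loss + beta * cls_loss
```

Normalizing by the number of positives rather than by the number of selected anchors means the 3:1 negatives add to the numerator without diluting the per-object scale of the loss.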

The box regression targets use a normalized encoding so the model predicts offsets relative to each prior's center and dimensions:

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log \frac{w}{w_a}, \quad t_h = \log \frac{h}{h_a}$$

Box Encoding
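The encoding and its inverse can be sketched as a round trip in plain Python. The center-format tuples `(cx, cy, w, h)` and the names `encode_box` and `decode_box` are assumptions for the example. Many SSD implementations additionally scale these offsets by variance constants; that step is not part of the formula above and is omitted here.

```python
import math


def encode_box(gt, anchor):
    """Encode a ground-truth box relative to an anchor; both are (cx, cy, w, h)."""
    x, y, w, h = gt
    xa, ya, wa, ha = anchor
    return (
        (x - xa) / wa,       # t_x: center shift in units of anchor width
        (y - ya) / ha,       # t_y: center shift in units of anchor height
        math.log(w / wa),    # t_w: log width ratio
        math.log(h / ha),    # t_h: log height ratio
    )


def decode_box(t, anchor):
    """Invert encode_box: recover (cx, cy, w, h) from offsets and the anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))
```

A ground-truth box identical to its anchor encodes to all zeros, which is why a well-matched prior only has to predict small corrections.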

Here is the core smooth L1 implementation, which includes a selectable reduction strategy to support both per-image and per-batch normalization modes:

src/mobilenetv2ssd/models/ssd/ops/loss_ops_tf.py

import tensorflow as tf


def smooth_l1_loss(predicted_values, target, beta, reduction="sum"):
    difference = predicted_values - target
    absolute_difference = tf.math.abs(difference)

    # Quadratic branch for |x| < beta, linear branch otherwise
    errors = tf.where(
        absolute_difference < beta,
        0.5 * tf.square(difference) / beta,
        absolute_difference - 0.5 * beta,
    )

    # Sum over the four box coordinates
    errors = tf.reduce_sum(errors, axis=-1)

    if reduction == "sum":
        return tf.reduce_sum(errors)
    elif reduction == "mean":
        return tf.reduce_mean(errors)
    else:
        return errors  # per-anchor losses (no reduction)

Mixed Precision Training (AMP)

Training in FP16 can roughly double throughput on modern GPUs by fitting more data into VRAM and using the GPU's tensor cores. The challenge is FP16's limited dynamic range: the largest finite value is 65,504 and the smallest positive normal value is about 6.1e-5, so small gradients can underflow to zero during backprop.

The AMPContext class handles this by wrapping the optimizer in TensorFlow's LossScaleOptimizer, which dynamically scales the loss upward before backprop and unscales gradients before the weight update. If scaling causes overflow, the update is skipped and the scale factor is reduced.

src/training/amp.py

import tensorflow as tf


class AMPContext:
    # Excerpt: _enabled, _policy, _loss_scale, and _base_optimizer are set elsewhere in the class.

    def setup_policy(self):
        if self._enabled:
            policy = tf.keras.mixed_precision.Policy(self._policy)  # "mixed_float16"
            tf.keras.mixed_precision.set_global_policy(policy)
        else:
            tf.keras.mixed_precision.set_global_policy("float32")

    def wrap_optimizer(self):
        if not self._enabled:
            return self._base_optimizer

        if self._loss_scale == "dynamic":
            # Dynamic loss scaling: automatically adjusts the scale factor
            self.optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
                self._base_optimizer
            )
        else:
            # Fixed loss scale for reproducibility. dynamic=False assumes the
            # Keras 2 signature (TF <= 2.15), where dynamic defaults to True
            # and initial_scale alone would only set the starting scale.
            self.optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
                self._base_optimizer,
                dynamic=False,
                initial_scale=float(self._loss_scale),
            )
        return self.optimizer
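The underflow problem and the effect of loss scaling can be reproduced without a GPU. The sketch below uses Python's `struct` module, whose `"e"` format is IEEE 754 half precision, to round values to FP16. The gradient value 1e-8 and the scale 1024 are arbitrary choices for illustration, and `to_fp16` is a hypothetical helper.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


tiny_gradient = 1e-8   # below the smallest positive FP16 value (about 6e-8)
loss_scale = 1024.0

# Stored directly in FP16, the gradient is flushed to zero
unscaled = to_fp16(tiny_gradient)

# Scaling the loss scales every gradient, lifting it into FP16's range
scaled = to_fp16(tiny_gradient * loss_scale)

# Unscaling happens in higher precision before the weight update
recovered = scaled / loss_scale

print(unscaled, scaled, recovered)
```

The recovered value is close to the original gradient rather than exact, because the scaled value still lands in FP16's low-precision subnormal range.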

Selective FP32 for Numerical Stability

Not all operations are safe in FP16. Loss reduction (summing thousands of small values), IoU computation, and NMS all run in forced FP32 through a PrecisionConfig mechanism. Each operation checks should_force_fp32(op_name, precision_config) before casting. This gives fine-grained control over where the FP32 overhead is worth paying for stability.
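To see why loss reduction is one of the operations worth forcing to higher precision, the sketch below accumulates 10,000 copies of 0.01 with the running total rounded to FP16 after every addition. It uses Python's `struct` half-precision format as a stand-in for FP16 arithmetic and 64-bit Python floats as the reference; `to_fp16` and `fp16_sum` are hypothetical helpers, not part of the PrecisionConfig mechanism.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fp16_sum(values):
    """Sum values while keeping the accumulator in half precision."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total


values = [0.01] * 10_000
low_precision_total = fp16_sum(values)   # accumulator rounded to FP16 each step
reference_total = sum(values)            # 64-bit accumulation, close to 100

print(low_precision_total, reference_total)
```

Once the FP16 accumulator grows large enough that 0.01 is less than half the spacing between adjacent representable values, each addition rounds back to the same total and the sum stops growing, far short of the true result.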

Exponential Moving Average (EMA)

Training loss is noisy — weights oscillate around a good solution rather than converging smoothly. EMA maintains a shadow copy of the weights as a running average, which tends to generalize better than the raw training weights:

$$\theta_{ema} \leftarrow d \cdot \theta_{ema} + (1 - d) \cdot \theta_{train}$$

EMA Weight Update

Where $d$ is the decay rate (typically 0.999). Early in training when weights are still far from optimal, a fixed decay rate would give too much weight to bad early estimates. The implementation uses an adjusted decay ramp that starts low and approaches the configured value as updates accumulate:

src/training/ema.py

from contextlib import contextmanager

import tensorflow as tf


# Methods of the EMA helper class (excerpt)

def update(self, step: int):
    if not self.should_update(step):
        return

    decay = tf.constant(self._decay, tf.float32)
    num_updates = tf.cast(self._num_updates, tf.float32)

    # Ramp up slowly at the start to avoid averaging in bad early weights
    adjusted_decay = (1 + num_updates) / (10 + num_updates)
    decay_rate = tf.minimum(decay, adjusted_decay)

    for ema_var, model_var in zip(self._ema_vars, self._model_vars):
        d = tf.cast(decay_rate, ema_var.dtype)
        ema_var.assign(d * ema_var + (1.0 - d) * model_var)

    self._num_updates.assign_add(1)


@contextmanager
def eval_context(self, model=None):
    # Temporarily swap in EMA weights for evaluation, then restore training weights
    use_ema = self.should_apply_during_eval()
    if use_ema:
        self.apply_to(model)
    try:
        yield
    finally:
        if use_ema:
            self.restore(model)

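The eval_context() context manager is the key interface: it swaps EMA weights into the model before evaluation and restores the training weights afterward. Because the restore sits in a finally block, it still runs if evaluation raises an exception, including the KeyboardInterrupt produced by Ctrl+C. A SIGTERM reaches that path only if the process installs a signal handler that raises an exception; Python's default SIGTERM handling terminates the process without running finally blocks.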
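Returning to the decay ramp in update(): the expression min(decay, (1 + n) / (10 + n)) can be evaluated in plain Python to see how long the ramp lasts. The sketch below mirrors that expression; the function names are hypothetical and the default of 0.999 is the typical decay mentioned above.

```python
def effective_decay(num_updates, decay=0.999):
    """Decay applied on a given update: min(decay, (1 + n) / (10 + n))."""
    return min(decay, (1 + num_updates) / (10 + num_updates))


def updates_until_full_decay(decay=0.999):
    """Smallest update count at which the ramp reaches the configured decay."""
    n = 0
    while (1 + n) / (10 + n) < decay:
        n += 1
    return n


for n in (0, 90, 990):
    print(n, effective_decay(n))
print(updates_until_full_decay(0.999))
```

The first update uses a decay of 0.1, so the shadow weights track the model almost directly, and with a configured decay of 0.999 the ramp takes roughly 9,000 updates before the configured value applies.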

Learning Rate Scheduling

Training with a flat learning rate throughout is rarely optimal — too high early causes instability, too low throughout causes slow convergence. The pipeline uses a linear warmup + cosine annealing schedule:

$$\eta(t) = \begin{cases} \eta_{max} \cdot \dfrac{t}{t_{warm}} & t < t_{warm} \\ \eta_{min} + \dfrac{\eta_{max} - \eta_{min}}{2} \left(1 + \cos\left(\pi \cdot \dfrac{t - t_{warm}}{T - t_{warm}}\right)\right) & t \geq t_{warm} \end{cases}$$

Cosine Annealing with Linear Warmup

The warmup phase ramps learning rate from zero to base_lr over the first warmup_steps steps, which avoids large gradient updates at the start when batch statistics are unstable. After warmup, cosine decay gradually reduces the rate to min_lr over the remaining steps.

src/training/schedule.py

import tensorflow as tf


class CosineWarmupSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    # Excerpt: base_learning_rate, minimum_learning_rate, warmup_steps,
    # total_steps, and pi are set elsewhere in the class.

    def __call__(self, step: tf.Tensor):
        step = tf.cast(step, dtype=tf.float32)
        warmup_steps = tf.cast(self.warmup_steps, dtype=tf.float32)
        total_steps = tf.cast(self.total_steps, dtype=tf.float32)

        # Phase 1: linear warmup
        warmup_lr = self.base_learning_rate * tf.minimum(
            1.0, step / tf.maximum(1.0, warmup_steps)
        )

        # Phase 2: cosine decay (denominator guarded against total_steps == warmup_steps)
        decay_steps = tf.maximum(1.0, total_steps - warmup_steps)
        progress = tf.clip_by_value((step - warmup_steps) / decay_steps, 0.0, 1.0)
        cosine_lr = self.minimum_learning_rate + 0.5 * (
            self.base_learning_rate - self.minimum_learning_rate
        ) * (1 + tf.math.cos(self.pi * progress))

        return tf.where(step < warmup_steps, warmup_lr, cosine_lr)
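The same schedule can be written as a scalar function in plain Python, which makes it easy to spot-check values at the phase boundaries. The function name and the example settings (base rate 1e-3, minimum 1e-5, 1,000 warmup steps, 10,000 total steps) are assumptions for illustration, not the pipeline's configuration.

```python
import math


def cosine_warmup_lr(step, base_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, clipped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1000, 5500, 10000):
    print(step, cosine_warmup_lr(step, 1e-3, 1e-5, 1000, 10000))
```

With these settings the rate starts at zero, reaches the base rate exactly at the end of warmup, passes the midpoint between the two rates halfway through the decay phase, and ends at the minimum rate.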

Conclusion

Key Takeaways

SSD training requires several pieces working together: IoU-based matching assigns targets to anchors, hard negative mining keeps the class imbalance in check, smooth L1 and softmax CE provide stable gradients for box regression and classification, AMP can roughly double throughput while selective FP32 preserves numerical stability, EMA produces smoother generalization, and cosine warmup scheduling keeps the optimization trajectory healthy throughout a 200-epoch run. Each of these is independently configurable through the YAML experiment system.