Akhilesh Warty

Object detection is a well-established application of Computer Vision (CV): locating and classifying objects in an image or a video. Each model family comes with its own constraints and use cases, and those constraints help decide which family fits a particular scenario.

The main requirements that made me choose the MobileNetV2-SSD architecture are:

Edge Deployment — The target hardware for the model is the Jetson Orin Nano
Real-Time Inference — The refresh rate for object detection is real time (60 FPS+)
CUDA Optimized — The model needs to be optimized for FP16 computations and quantization

This led to the choice of a single-stage detector, specifically the Single Shot MultiBox Detector (SSD). Two-stage detectors such as Faster R-CNN, which I have implemented before, are a poor fit for these constraints: they are designed for accuracy rather than speed, which makes them slower than an SSD. The other common option, Transformer-based architectures, needs too much memory and CUDA compute to justify on constrained edge hardware such as the Jetson Orin Nano.

At a high level, the SSD architecture has three major components:

Backbone: A feature extractor that pulls meaningful features out of the image.
Extra Feature Pyramid: Three additional stride-2 layers that extend coverage to larger receptive fields.
Prediction Heads: A localization head and a classification head that take the multi-scale features and produce box offsets and class scores for every anchor.

Note

Pretrained models such as YOLO 11 have variants that run on edge hardware. The goal of this project, however, was to build my own model so that I could modify its architecture, deploy it on edge hardware, and improve my machine learning skills along the way.

End-to-End Model Pipeline

The full pipeline can be understood as a sequence of transformations from raw pixels to final bounding boxes.

MobileNetV2-SSD Detection Pipeline

Input Image: [B, 300, 300, 3]
MobileNetV2 Backbone: 17 Inverted Residual Blocks
Multi-Scale Features: C2 / C3 / C4 / C5
Extra Feature Pyramid: P6 / P7 / P8
Prediction Heads: Loc [B,N,4] + Cls [B,N,21]
NMS + Box Decoding: Post-processing
Final Detections: Boxes + Scores + Labels

The pipeline produces N predictions per image, where N is the total number of prior anchor boxes tiled across all six feature map levels. For a 300×300 input this comes out to roughly 8,732 candidates, which NMS then filters down to the final detections.
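As a sanity check on that figure, the short sketch below counts anchors for the canonical SSD300 layout from the original SSD paper (feature maps with 38, 19, 10, 5, 3, and 1 cells per side, and 4 or 6 anchors per cell). That layout is an assumption used purely for illustration; the grid sizes and anchor counts in this project's config may differ, so its exact total can differ too.

```python
# Canonical SSD300 layout, assumed for illustration (not read from this project's config).
# Each entry is (feature map side length, anchors per cell).
SSD300_LEVELS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def count_anchors(levels):
    """Total anchors = sum over levels of H * W * anchors_per_cell."""
    return sum(side * side * anchors for side, anchors in levels)


total = count_anchors(SSD300_LEVELS)
print(total)  # 8732
```

The two finest grids contribute most of the total, which is why the choice of the earliest feature level has such a large effect on N.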

MobileNetV2 Backbone

Standard 3×3 convolutions, which appear in almost every model, are computationally expensive: a layer with $C_{in}$ input channels and $C_{out}$ output channels costs roughly $H \cdot W \cdot C_{in} \cdot C_{out} \cdot 9$ multiply-accumulate operations. The cost grows with the image size, so it is an important factor when choosing an architecture, especially for constrained edge environments.

MobileNetV2 replaces this with depthwise-separable convolutions — a depthwise 3×3 that processes each channel independently, followed by a pointwise 1×1 that mixes channels:

$$\text{Standard: } C_{in} \times C_{out} \times 9 \quad \text{vs} \quad \text{Depthwise-Sep: } C_{in} \times 9 + C_{in} \times C_{out}$$

Parameter Count: Standard vs Depthwise-Separable

Take two images, one at 300×300 and the other at 600×600, and a layer with 32 input channels and 64 output channels. Counting the multiply-accumulate operations (MACs) for both techniques at both resolutions shows how much computation is saved:

$$\text{MACs} = H \cdot W \cdot C_{in} \cdot C_{out} \cdot K^{2} = 300 \cdot 300 \cdot 32 \cdot 64 \cdot 9 = 1{,}658{,}880{,}000$$

Standard Conv @ 300×300

$$\text{MACs} = H \cdot W \cdot C_{in} \cdot C_{out} \cdot K^{2} = 600 \cdot 600 \cdot 32 \cdot 64 \cdot 9 = 6{,}635{,}520{,}000$$

Standard Conv @ 600×600

$$\begin{aligned}
\text{DW} &= H W C_{in} K^{2} = 300^{2} \cdot 32 \cdot 9 = 25{,}920{,}000 \\
\text{PW} &= H W C_{in} C_{out} = 300^{2} \cdot 32 \cdot 64 = 184{,}320{,}000 \\
\text{DW} + \text{PW} &= 210{,}240{,}000
\end{aligned}$$

Depthwise-Separable @ 300×300

$$\begin{aligned}
\text{DW} &= H W C_{in} K^{2} = 600^{2} \cdot 32 \cdot 9 = 103{,}680{,}000 \\
\text{PW} &= H W C_{in} C_{out} = 600^{2} \cdot 32 \cdot 64 = 737{,}280{,}000 \\
\text{DW} + \text{PW} &= 840{,}960{,}000
\end{aligned}$$

Depthwise-Separable @ 600×600

Dividing the standard cost by the depthwise-separable cost at each resolution shows the size of the reduction:

$$\frac{1{,}658{,}880{,}000}{210{,}240{,}000} = \frac{6{,}635{,}520{,}000}{840{,}960{,}000} \approx 7.89\times$$

Compute Ratio — Constant Across Both Resolutions
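The arithmetic above is easy to reproduce. The sketch below recomputes the MAC counts for the same 32-to-64 channel layer at both resolutions; the function names are mine and are not part of the project code.

```python
def standard_conv_macs(h, w, c_in, c_out, k=3):
    """MACs for a standard k x k convolution over an h x w output."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k=3):
    """MACs for a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


for size in (300, 600):
    std = standard_conv_macs(size, size, 32, 64)
    sep = depthwise_separable_macs(size, size, 32, 64)
    print(f"{size}x{size}: standard={std:,} separable={sep:,} ratio={std / sep:.2f}x")
# 300x300: standard=1,658,880,000 separable=210,240,000 ratio=7.89x
# 600x600: standard=6,635,520,000 separable=840,960,000 ratio=7.89x
```

The ratio stays constant because H and W cancel out, leaving $C_{out} K^{2} / (K^{2} + C_{out})$, which depends only on the kernel size and the output channel count.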

Like many deep networks, MobileNetV2 has to deal with a problem that appears when many layers are stacked: the gradient reaching the early layers is a product of per-layer terms, so it tends to either grow or shrink with depth. This is the exploding and vanishing gradient problem. MobileNetV2 handles it the same way ResNet does, with skip connections.

Why the Skip Connection Matters

Stacking many layers on top of each other introduces a subtle problem during training. Gradients flow backward from the loss through every layer in sequence, and at each layer the gradient gets multiplied by that layer's weights and activation derivative. By the time the signal reaches the earliest layers, it has been multiplied dozens of times over.

If those per-layer terms are consistently smaller than one, the gradient shrinks at every layer it passes through. By the time it reaches the first few layers, there is barely any signal left to learn from. This is the vanishing gradient problem, and it's a big part of why very deep networks used to be difficult to train. The opposite failure mode, exploding gradients, happens when those terms are consistently larger than one. It is just as damaging, since the weight updates become very large and training diverges.

A skip connection fixes this by giving the gradient a second path back to earlier layers. It bypasses the weight multiplications entirely. Instead of a block computing output = F(x), it computes output = F(x) + x. During the backward pass, this addition means the gradient has a direct route straight through the identity term, in addition to the route through the block's weights. Even if the weight path's contribution shrinks toward zero, the identity path guarantees a baseline gradient still gets through. The network can't fully lose its training signal just because it's deep.
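A toy calculation makes the effect visible. The sketch below treats every layer as a scalar function with the same local derivative, which is a deliberate simplification and not how the real backbone behaves: without a skip the gradient is multiplied by dF/dx at each layer, and with output = F(x) + x it is multiplied by dF/dx + 1.

```python
def chain_gradient(per_layer_term, depth, skip=False):
    """Gradient magnitude after passing back through `depth` identical scalar layers.

    Without a skip, each layer scales the gradient by dF/dx.
    With output = F(x) + x, each layer scales it by (dF/dx + 1).
    """
    factor = per_layer_term + 1.0 if skip else per_layer_term
    return factor ** depth


# A weak weight path (dF/dx = 0.01) across 17 stacked layers
plain = chain_gradient(0.01, depth=17)
residual = chain_gradient(0.01, depth=17, skip=True)
print(f"plain={plain:.3e} residual={residual:.3f}")
```

With the weight path contributing almost nothing, the plain chain collapses to effectively zero while the residual chain stays close to one, which is the baseline gradient the identity path provides.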

A Conditional Skip

Unlike a standard ResNet block, the inverted residual's skip connection isn't applied unconditionally; it only fires when the block's stride is 1 and its input and output channel counts match. When a block downsamples or changes channel width, there's no shortcut to add, because the input and output tensors have different shapes. The network instead relies on the surrounding stride-1 blocks to keep gradient flow healthy across the depth of the backbone.

Why "Inverted"

A standard ResNet bottleneck compresses its input to a smaller channel count, does its processing, then expands back out. This makes the block wide at both ends and narrow in the middle. MobileNetV2 flips this arrangement entirely: the inverted residual block expands the channel count first, runs the depthwise convolution in that wider space, then projects back down before the skip connection is added. The heavy computation happens where the representation is wide.

This inversion exists because depthwise convolutions are weak when they only have a few channels to work with: each channel is filtered independently, so there's no cross-channel mixing to compensate for a narrow representation. Expanding first gives the depthwise step a richer space to operate in, while the block's "ends" stay lightweight.

src/mobilenetv2ssd/models/mobilenet_v2/blocks.py

import tensorflow as tf
from tensorflow.keras.layers import BatchNormalization, Conv2D, DepthwiseConv2D, ReLU


class InvertedResidualBlock(tf.keras.layers.Layer):
    # Excerpt: __init__ (which stores output_channel, alpha, expansion_factor
    # and stride) and _make_divisible are omitted here.

    def build(self, input_shape):
        input_channel = int(input_shape[-1])
        self.output_channel = self._make_divisible(
            int(round(self.output_channel * self.alpha)), 8
        )
        self.expansion_channel = int(input_channel * self.expansion_factor)

        # Default to None so call() can skip the expansion when the factor is 1
        self.expansion_conv = None
        if self.expansion_factor != 1:
            # Expand to high-dimensional space
            self.expansion_conv = Conv2D(
                self.expansion_channel, kernel_size=1, use_bias=False
            )
            self.expand_batch_norm = BatchNormalization()
            self.expand_activation_function = ReLU(max_value=6.0)

        # Depthwise conv: one filter per channel
        self.depthwise_conv = DepthwiseConv2D(
            kernel_size=3, strides=self.stride, padding="same", use_bias=False
        )
        self.depthwise_batch_norm = BatchNormalization()
        self.depthwise_activation_function = ReLU(max_value=6.0)

        # Project back to low-dimensional space (no activation)
        self.projection_conv = Conv2D(self.output_channel, kernel_size=1, use_bias=False)
        self.project_batch_norm = BatchNormalization()

        # Conditional skip, as described above: only when the shapes match
        self.use_residual = self.stride == 1 and input_channel == self.output_channel

    def call(self, x, training=False):
        inputs = x
        if self.expansion_conv is not None:
            x = self.expansion_conv(x)
            x = self.expand_batch_norm(x, training=training)
            x = self.expand_activation_function(x)

        x = self.depthwise_conv(x)
        x = self.depthwise_batch_norm(x, training=training)
        x = self.depthwise_activation_function(x)

        x = self.projection_conv(x)
        x = self.project_batch_norm(x, training=training)

        if self.use_residual:
            x = x + inputs
        return x

The full backbone stacks 17 of these blocks with increasing channel widths and strategic stride-2 layers. Four intermediate feature maps — C2, C3, C4, C5 — are extracted as skip connections at strides 4, 8, 16, and 32 respectively. These feed directly into the detection heads.

Width Multiplier

The backbone accepts an alpha parameter that scales every channel count proportionally. Setting alpha=0.75 cuts the model size by roughly 40%, enabling deployment on tighter hardware like the Hailo-8 with minimal accuracy regression. The _make_divisible() call ensures all channel counts stay divisible by 8, which is required for efficient tensor operations on most GPU architectures.
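The project's `_make_divisible()` is not shown in this post. The sketch below follows the helper that is widely used in reference MobileNet implementations, so treat it as an assumption about how the rounding works; the project's version may differ in its details.

```python
def make_divisible(value, divisor=8, min_value=None):
    """Round `value` to the nearest multiple of `divisor`, never below `min_value`."""
    if min_value is None:
        min_value = divisor
    new_value = max(min_value, int(value + divisor / 2) // divisor * divisor)
    # Guard against rounding down by more than 10% of the requested width
    if new_value < 0.9 * value:
        new_value += divisor
    return new_value


# Scaling a few channel widths with alpha = 0.75
alpha = 0.75
scaled = {channels: make_divisible(channels * alpha) for channels in (16, 24, 32, 96, 320)}
print(scaled)
```

The 10% guard matters for narrow layers: 24 channels scaled by 0.75 would round down from 18 to 16, which is a larger cut than requested, so the helper bumps it back up to 24.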

Extra Feature Pyramid

Once the MobileNetV2 backbone has produced its features, they go through one more processing stage, the Extra Feature Pyramid. This part of the model takes the feature-rich tensors from the backbone and applies further convolutions to them, producing feature maps at additional scales so that objects of different sizes can be detected.

This allows multiple objects in the frame to be detected accurately even when their scales differ because of their position or their nature. The implemented model takes the last backbone feature map (C5) and stacks three extra levels on top of it.

As a result, the SSD also looks at feature maps of size 5×5, 3×3, and 2×2, which detect progressively larger objects as the grid gets coarser. This pyramid is deliberately thin compared to the Feature Pyramid Network (FPN) variants used in RetinaNet or in the paper that introduced the algorithm: it is meant to compute meaningful features without adding much compute, and experiments on these layers showed that this configuration gives a good middle ground between accuracy and computation.

src/mobilenetv2ssd/models/ssd/fpn.py

import tensorflow as tf
from tensorflow.keras.layers import Conv2D


class ExtraFeaturePyramid(tf.keras.layers.Layer):
    # Excerpt: __init__ (not shown) is assumed to store extra_heads_config
    # and to create the empty self.extra_heads list.

    def build(self, input_shape):
        # One strided convolution per extra pyramid level
        for level, config in enumerate(self.extra_heads_config):
            block = Conv2D(
                filters=config['out_channels'],
                strides=config.get('stride', 2),
                kernel_size=config.get('kernel_size', 3),
                padding="same",
                activation="relu",
                name=f"extra_{level}_conv"
            )
            self.extra_heads.append(block)

    def call(self, base_feature, training=False):
        # Each level consumes the previous level's output
        x = base_feature
        extra_features = []
        for block in self.extra_heads:
            x = block(x, training=training)
            extra_features.append(x)
        return extra_features

Implementation of FPN

The layers are defined through a modular config system, so more levels can be stacked when needed without changing the code.
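The feature map sizes quoted for the extra levels follow from how stride-2 convolutions with "same" padding shrink a grid: the output side is the input side divided by the stride, rounded up. The sketch below is a hedged illustration of that arithmetic; it assumes C5 sits at stride 32 for a 300×300 input, and the helper names are mine.

```python
import math


def same_padding_output_size(size, stride):
    """Spatial size after a convolution with padding="same": ceil(size / stride)."""
    return math.ceil(size / stride)


def extra_pyramid_sizes(base_size, num_levels=3, stride=2):
    """Side lengths of the extra levels stacked on top of a base feature map."""
    sizes = []
    size = base_size
    for _ in range(num_levels):
        size = same_padding_output_size(size, stride)
        sizes.append(size)
    return sizes


c5_size = same_padding_output_size(300, 32)  # assumes C5 is the stride-32 map
print(c5_size, extra_pyramid_sizes(c5_size))  # 10 [5, 3, 2]
```

Adding a fourth level to the config would shrink the grid to 1×1, which is as coarse as the pyramid can usefully get at this input size.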

Priors (Anchor Boxes)

The prior boxes define where the model looks for objects. At each cell in each of the six feature maps, a set of boxes is tiled with different scales and aspect ratios. The width and height of each prior are computed as:

$$w_{k}^{a} = s_{k} \sqrt{a_{r}}, \qquad h_{k}^{a} = \frac{s_{k}}{\sqrt{a_{r}}}$$

Prior Box Dimensions

Here $s_{k}$ is the scale at feature level $k$ (linearly spaced between s_min and s_max) and $a_{r}$ is the aspect ratio. An extra "square" prior is added at the geometric mean scale $s_{k}' = \sqrt{s_{k} \cdot s_{k+1}}$, giving 4–6 anchors per cell depending on the feature level.

src/mobilenetv2ssd/models/ssd/ops/anchor_ops_tf.py

import tensorflow as tf

# Excerpt: calculate_feature_map_shapes, compute_scales_per_layer,
# standardize_aspect_ratios, build_layer_priors, concatenate_priors and
# compute_meta live in the same module and are not shown here.


def anchors_per_cell(scales_for_layer, ratios_for_layer):
    scales = tf.reshape(scales_for_layer, [-1])
    ratios = tf.reshape(ratios_for_layer, [-1])

    # Cartesian product of scales × ratios
    scales, ratios = tf.meshgrid(scales, ratios, indexing="xy")
    ratio_sqrt = tf.math.sqrt(ratios)

    # w = s * sqrt(r), h = s / sqrt(r), so the box area stays s^2
    width = scales * ratio_sqrt
    height = scales / ratio_sqrt

    return tf.stack([tf.reshape(width, [-1]), tf.reshape(height, [-1])], axis=1)


def build_priors(image_size, strides=None, feature_map_shapes=None,
                 scales=None, aspect_ratios=None, s_min=None, s_max=None,
                 include_extra=True, clip=True):
    if feature_map_shapes is None:
        feature_map_shapes = calculate_feature_map_shapes(image_size, strides)

    scales = compute_scales_per_layer(scales, len(feature_map_shapes), s_min, s_max, include_extra)
    ratios = standardize_aspect_ratios(aspect_ratios, len(feature_map_shapes))

    # Tile the per-cell anchors over every cell of every feature level
    prior_layers = [
        build_layer_priors(feature_map_shapes[level], image_size, scales[level], ratios[level])
        for level in range(len(feature_map_shapes))
    ]

    priors = concatenate_priors(prior_layers, clip)
    meta = compute_meta(prior_layers, image_size, strides, feature_map_shapes, scales, ratios)
    return priors, meta
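For intuition, the scale and shape arithmetic can be reproduced in plain Python without TensorFlow. This is a minimal sketch, not the project's implementation: the s_min = 0.2 and s_max = 0.9 defaults are the values from the original SSD paper and are assumed here, and the aspect ratios are illustrative.

```python
import math


def linear_scales(num_levels, s_min=0.2, s_max=0.9):
    """Scales linearly spaced from s_min to s_max, one per feature level."""
    if num_levels == 1:
        return [s_min]
    step = (s_max - s_min) / (num_levels - 1)
    return [s_min + step * k for k in range(num_levels)]


def prior_shapes(scale, next_scale, aspect_ratios):
    """(width, height) pairs for one cell, as fractions of the image size."""
    shapes = [(scale * math.sqrt(r), scale / math.sqrt(r)) for r in aspect_ratios]
    # Extra square prior at the geometric mean of this scale and the next one
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))
    return shapes


scales = linear_scales(6)
shapes = prior_shapes(scales[0], scales[1], aspect_ratios=[1.0, 2.0, 0.5])
for width, height in shapes:
    print(f"w={width:.3f} h={height:.3f}")
```

Using the square root of the ratio keeps every anchor at a given level at the same area, so changing the aspect ratio changes only the shape and not the size.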

Anchor Fingerprinting

compute_meta() generates an MD5 hash of the anchor configuration using image size, feature map shapes, scales, and aspect ratios. This fingerprint is embedded in every checkpoint directory name (e.g., exp001_a1b2c3d4/). When resuming training from a saved checkpoint, the framework validates the fingerprint before loading weights, preventing silent mismatches when config parameters are changed between runs.
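The idea behind the fingerprint can be sketched in a few lines. This is an illustration under assumptions and not the body of compute_meta(): the function name, the JSON serialization, and the 8-character truncation (chosen to resemble the example directory name) are all mine.

```python
import hashlib
import json


def anchor_fingerprint(image_size, feature_map_shapes, scales, aspect_ratios, length=8):
    """Short, deterministic hash of the settings that define the anchor set."""
    payload = {
        "image_size": image_size,
        "feature_map_shapes": feature_map_shapes,
        "scales": scales,
        "aspect_ratios": aspect_ratios,
    }
    # Sorted keys and fixed separators keep the serialization stable across runs
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()[:length]


base = anchor_fingerprint(300, [[5, 5], [3, 3], [2, 2]], [0.62, 0.76, 0.9], [1.0, 2.0, 0.5])
changed = anchor_fingerprint(300, [[5, 5], [3, 3], [2, 2]], [0.62, 0.76, 0.9], [1.0, 3.0, 0.5])
print(base, changed, base == changed)
```

Comparing the stored fingerprint with a freshly computed one before loading weights turns a silent anchor mismatch into an explicit error.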

Prediction Heads

The SSD architecture looks at every anchor from two perspectives: a classification head (cls) and a localization head (loc). Each head answers one question about the anchor:

Which class, if any, does this anchor contain?
How should this anchor box be adjusted to fit the object?

Each level's head output has shape $[B, H, W, A \cdot 4]$, where the channel axis packs every anchor's 4 box values together. This gets reshaped to separate them out, $[B, H, W, A, 4]$, then flattened into a single anchor list, $[B, H \cdot W \cdot A, 4]$. Once every level produces this same flat shape, they can be concatenated into one tensor, $[B, N_{total}, 4]$, so the loss function and NMS can operate on all anchors uniformly regardless of which feature map resolution they came from.

$$[B, H, W, A \cdot 4] \longrightarrow [B, H, W, A, 4] \longrightarrow [B, H \cdot W \cdot A, 4]$$

Per-Level Reshape and Flatten
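To make the ordering concrete, the sketch below performs the same reshape and concatenation on nested Python lists for a single image (the batch axis is dropped for brevity). It is only an illustration of the index bookkeeping; the real heads do this with tf.reshape and tf.concat.

```python
def flatten_level(level_output, num_anchors, values_per_anchor=4):
    """Turn one image's [H][W][A * values] nested list into [H * W * A][values]."""
    flat = []
    for row in level_output:                # H
        for cell in row:                    # W
            for a in range(num_anchors):    # A
                start = a * values_per_anchor
                flat.append(cell[start:start + values_per_anchor])
    return flat


def concat_levels(levels):
    """levels: list of (level_output, num_anchors) pairs. Returns [N_total][values]."""
    merged = []
    for level_output, num_anchors in levels:
        merged.extend(flatten_level(level_output, num_anchors))
    return merged


# Toy example: a 2x2 level with 3 anchors per cell and a 1x1 level with 2 anchors per cell
level_a = [[[float(i) for i in range(12)] for _ in range(2)] for _ in range(2)]
level_b = [[[float(i) for i in range(8)]]]
boxes = concat_levels([(level_a, 3), (level_b, 2)])
print(len(boxes))  # 14
```

The anchor order produced here (row, then column, then anchor index) has to match the order in which the priors are generated, otherwise each prediction would be decoded against the wrong prior.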

The model shares these feature maps from the FPN for the two prediction heads. This allows for them to use the same pools of information and streamline the execution chain.

The classification head predicts a matrix of class probabilities for the anchors in the image using a softmax classifier. Softmax converts the raw outputs of the classification head (the logits) into probabilities, which are easy for a person to interpret.

$$P(y_{i} = c \mid z_{i}) = \frac{\exp(z_{i,c})}{\sum_{j=1}^{C} \exp(z_{i,j})}$$

Softmax — Converting Logits to Class Probabilities

In practical terms, every anchor gets a probability for each label the model is trained on, so the most probable label can be picked for each region.

$$z_{i} = (z_{i,1}, z_{i,2}, \dots, z_{i,C}) \in \mathbb{R}^{C}, \qquad i = 1, \dots, N$$

Classification Head Output, per Anchor
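The softmax step for a single anchor can be written out directly. The sketch below is a plain Python version for illustration; subtracting the largest logit first is a standard numerical stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Convert one anchor's logits into class probabilities."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
best_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], best_class)
```

In the full model the same operation runs over the last axis of the [B, N, num_classes] tensor, once per anchor.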

The localization head predicts four offset values per anchor: $(t_{x}, t_{y}, t_{w}, t_{h})$. The first two are normalized displacements from the prior's center and the last two are log-space scale adjustments. The model uses them to refine its anchor boxes (priors) until they lock onto the object being detected. This matters because objects differ in size, and a big box over a small object defeats the purpose of a detection system when most of the region is background. The priors defined in the model provide the starting boxes; what is learned is how to refine them to capture the object.

(t_{x}, t_{y}, t_{w}, t_{h}) — offsets from the matched prior, not absolute coordinates

Localization Head Output per Anchor

The offsets on their own are hard to interpret; they have to be converted into box coordinates before a person can inspect what the model is predicting.

$$x = t_{x} \cdot w_{a} + x_{a}, \qquad y = t_{y} \cdot h_{a} + y_{a}$$

Decoding Predicted Offsets into Box Coordinates

$$w = w_{a} \cdot e^{t_{w}}, \qquad h = h_{a} \cdot e^{t_{h}}$$

Decoding Width and Height
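The two decoding equations translate directly into code. The sketch below decodes a single anchor in center form and converts it to corner form; it follows the equations exactly as written here, so it assumes no extra variance scaling of the offsets, which some SSD implementations add.

```python
import math


def decode_box(offsets, prior):
    """offsets: (t_x, t_y, t_w, t_h); prior: (x_a, y_a, w_a, h_a) in center form."""
    t_x, t_y, t_w, t_h = offsets
    x_a, y_a, w_a, h_a = prior
    x = t_x * w_a + x_a
    y = t_y * h_a + y_a
    w = w_a * math.exp(t_w)
    h = h_a * math.exp(t_h)
    return x, y, w, h


def center_to_corners(box):
    """(x, y, w, h) center form to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2


prior = (0.5, 0.5, 0.2, 0.4)
decoded = decode_box((0.5, 0.0, math.log(2.0), 0.0), prior)
print(decoded, center_to_corners(decoded))
```

All-zero offsets return the prior unchanged, which is why an untrained head starts out predicting the anchor boxes themselves.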

The code below shows how the classification and localization heads are implemented.

src/mobilenetv2ssd/models/ssd/ops/heads_tf.py

import tensorflow as tf

# Excerpt: the __init__ and build methods that create the per-level
# convolution layers are not shown.


class LocalizationHead(tf.keras.layers.Layer):
    def call(self, feature_maps, training=False):
        outputs = []
        for layer, feature_map in enumerate(feature_maps):
            num_anchors = self.num_anchors_per_layer[layer]
            x = self.heads[layer](feature_map, training=training)

            # [B, H, W, A * 4] -> [B, H * W * A, 4]
            B = tf.shape(x)[0]
            H, W = tf.shape(x)[1], tf.shape(x)[2]
            x = tf.reshape(x, [B, H * W * num_anchors, 4])
            outputs.append(x)

        return tf.concat(outputs, axis=1)  # [B, N_total, 4]


class ClassificationHead(tf.keras.layers.Layer):
    def call(self, feature_maps, training=False):
        outputs = []
        for layer, feature_map in enumerate(feature_maps):
            num_anchors = self.num_anchors_per_location[layer]
            C = self.number_of_classes
            x = self.final_heads[layer](feature_map, training=training)

            # [B, H, W, A * C] -> [B, H * W * A, C]
            B = tf.shape(x)[0]
            H, W = tf.shape(x)[1], tf.shape(x)[2]
            x = tf.reshape(x, [B, H * W * num_anchors, C])
            outputs.append(x)

        return tf.concat(outputs, axis=1)  # [B, N_total, num_classes]

Data Pipeline

A model is only as good as what it is shown during training, and object detection has a few data handling problems that do not show up in simpler classification tasks. The pipeline that feeds this model is built around PASCAL VOC 2012, a 20-class benchmark with roughly 17,000 images, each one annotated with bounding boxes stored as XML files alongside the JPEGs. A fixed split file lists which image ids belong to training versus validation, so every run is evaluated against the same held out set.

VOC to GPU Data Pipeline

VOC Split File: train.txt / val.txt
VOC Parser: JPEG + XML → tensors
Augmentation: Photometric + Geometric
Clip & Filter Boxes: Remove degenerate boxes
Padded Batch: Uniform tensor shapes + mask
Prefetch: GPU pipeline overlap
GPU Training Step: Forward + backward pass

Before any image reaches the model, it passes through an augmentation chain split into two categories. Photometric transforms, such as random brightness, contrast, saturation, and hue jitter, change pixel values without touching the boxes at all. Geometric transforms are the harder case: anything that changes the image layout, such as a horizontal flip or a resize, has to update the box coordinates in lockstep with the image. Getting this wrong does not throw an error; it just silently teaches the model the wrong boxes.
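A horizontal flip is the simplest geometric transform to get wrong. The sketch below shows the box update that has to accompany the flipped pixels, assuming boxes are stored as (x_min, y_min, x_max, y_max) in pixel coordinates; the function name is illustrative and not taken from the project.

```python
def flip_boxes_horizontally(boxes, image_width):
    """Mirror (x_min, y_min, x_max, y_max) pixel boxes across the vertical axis."""
    flipped = []
    for x_min, y_min, x_max, y_max in boxes:
        # The old right edge becomes the new left edge, and vice versa
        flipped.append((image_width - x_max, y_min, image_width - x_min, y_max))
    return flipped


original = [(10, 20, 50, 80)]
flipped = flip_boxes_horizontally(original, image_width=300)
print(flipped)  # [(250, 20, 290, 80)]
```

Swapping the two x coordinates is the important detail: mirroring each edge separately without swapping would leave x_min larger than x_max.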

Why Boxes Need Clipping

After a crop or resize, a box can shrink down to a sliver of its original area or get pushed outside the image entirely. If that box is left in the batch, it still contributes to the loss with a target that no longer makes sense. A clip and filter step runs after every geometric transform to clamp coordinates to the image boundaries and drop any box that has shrunk below a minimum pixel area.
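A minimal sketch of such a clip and filter step is shown below. The (x_min, y_min, x_max, y_max) pixel layout and the min_area default are assumptions for illustration; the project's actual threshold is not stated in this post.

```python
def clip_and_filter_boxes(boxes, labels, image_width, image_height, min_area=1.0):
    """Clamp boxes to the image and drop any whose clipped area is below min_area."""
    kept_boxes, kept_labels = [], []
    for (x_min, y_min, x_max, y_max), label in zip(boxes, labels):
        x_min = min(max(x_min, 0.0), image_width)
        x_max = min(max(x_max, 0.0), image_width)
        y_min = min(max(y_min, 0.0), image_height)
        y_max = min(max(y_max, 0.0), image_height)
        area = max(x_max - x_min, 0.0) * max(y_max - y_min, 0.0)
        if area >= min_area:
            kept_boxes.append((x_min, y_min, x_max, y_max))
            kept_labels.append(label)
    return kept_boxes, kept_labels


boxes = [(-10, 20, 50, 80), (310, 10, 350, 40), (100, 100, 100.5, 100.5)]
labels = [3, 7, 12]
kept_boxes, kept_labels = clip_and_filter_boxes(boxes, labels, 300, 300, min_area=4.0)
print(kept_boxes, kept_labels)
```

Labels are filtered together with the boxes so the two lists never drift out of alignment.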

Object detection batches are also awkward to build in the first place, since every image can have a different number of ground truth boxes. The pipeline solves this with padded batching, where each sample's boxes and labels are padded out to a fixed maximum length using sentinel values, and a validity mask is computed alongside the batch so that target assignment and the loss function both know which entries are real objects and which ones are just padding.
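The padding logic can be sketched without any framework. In the example below the sentinel values (an all-zero box and a label of -1) and the truncation of images with more than max_boxes objects are assumptions made for illustration; in a tf.data pipeline the same idea is usually expressed with padded batching.

```python
def pad_batch(batch_boxes, batch_labels, max_boxes,
              pad_box=(0.0, 0.0, 0.0, 0.0), pad_label=-1):
    """Pad every sample to max_boxes entries and return a validity mask."""
    padded_boxes, padded_labels, masks = [], [], []
    for boxes, labels in zip(batch_boxes, batch_labels):
        # Samples with more than max_boxes objects are truncated in this sketch
        boxes, labels = list(boxes)[:max_boxes], list(labels)[:max_boxes]
        num_pad = max_boxes - len(boxes)
        padded_boxes.append(boxes + [pad_box] * num_pad)
        padded_labels.append(labels + [pad_label] * num_pad)
        masks.append([True] * len(boxes) + [False] * num_pad)
    return padded_boxes, padded_labels, masks


batch_boxes = [[(1, 2, 3, 4)], [(0, 0, 5, 5), (1, 1, 2, 2), (3, 3, 4, 4)]]
batch_labels = [[7], [1, 2, 3]]
padded_boxes, padded_labels, masks = pad_batch(batch_boxes, batch_labels, max_boxes=2)
print(padded_labels, masks)
```

Downstream code should rely on the mask and not on the sentinel values, since a padded box is indistinguishable from a real degenerate box by coordinates alone.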

TFRecords for I/O Bound Training

When training on cloud instances with slower network storage, the data pipeline becomes the bottleneck before the GPU does. The pipeline supports pre-serializing the dataset into sharded TFRecord files as a drop in replacement for the generator path, which cuts down filesystem overhead and keeps the GPU fed.

Training

The architecture defined above does not know how to detect anything on its own. Before any of these weights are useful, the model needs to be trained against a defined objective, and that starts with deciding what every single anchor is actually looking at.

Each of the roughly 8,732 anchors gets matched against the ground truth boxes in an image using Intersection over Union (IoU). Anchors with a high enough IoU against a ground truth box are labeled as positive matches for that object, and anchors that fall below a lower threshold are labeled as background. To avoid leaving any object without a match, the single best anchor for each ground truth box is always assigned positively, even if its IoU happens to fall below the usual threshold.
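The matching rule described above can be sketched as follows. The thresholds (0.5 for positives, 0.4 for background) and the use of -2 for anchors that fall between them are assumptions for illustration; the project's actual values are covered in the training post and are not given here.

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_boxes, pos_threshold=0.5, neg_threshold=0.4):
    """Per anchor: a ground truth index, -1 for background, or -2 for ignored."""
    if not anchors or not gt_boxes:
        return [-1] * len(anchors)
    matches = []
    for anchor in anchors:
        ious = [iou(anchor, gt) for gt in gt_boxes]
        best_gt = max(range(len(gt_boxes)), key=ious.__getitem__)
        if ious[best_gt] >= pos_threshold:
            matches.append(best_gt)
        elif ious[best_gt] < neg_threshold:
            matches.append(-1)
        else:
            matches.append(-2)
    # Force-match: every ground truth box claims its single best anchor
    for gt_index, gt in enumerate(gt_boxes):
        best_anchor = max(range(len(anchors)), key=lambda i: iou(anchors[i], gt))
        matches[best_anchor] = gt_index
    return matches


anchors = [(0, 0, 10, 10), (0, 0, 4, 4), (50, 50, 60, 60)]
gt_boxes = [(0, 0, 10, 10), (48, 48, 70, 70)]
print(match_anchors(anchors, gt_boxes))  # [0, -1, 1]
```

The third anchor only overlaps its object with an IoU of about 0.21, yet it is still assigned to it because it is the best available anchor for that ground truth box.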

This matching step immediately creates an imbalance problem. A typical training image might have a handful of real objects, but thousands of anchors covering the image, so the overwhelming majority end up labeled background. Left unchecked, this would let the loss get dominated by background predictions instead of the objects that actually matter. This is corrected with hard negative mining, which is covered in detail in the dedicated training post along with the rest of this process.
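As a preview of that idea, the sketch below keeps every positive anchor plus only the highest-loss negatives, capped at three negatives per positive (the 3:1 ratio shown in the training loop). The function name and list-based interface are illustrative, not the project's API.

```python
def hard_negative_mask(cls_losses, is_positive, neg_pos_ratio=3):
    """Keep all positives plus the hardest negatives, up to neg_pos_ratio per positive."""
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives are the ones the classifier currently gets most wrong
    negatives.sort(key=lambda i: cls_losses[i], reverse=True)
    kept_negatives = set(negatives[: neg_pos_ratio * num_pos])
    return [pos or (i in kept_negatives) for i, pos in enumerate(is_positive)]


cls_losses = [0.1, 2.0, 0.5, 0.05, 1.5, 0.3, 0.9]
is_positive = [True, False, False, False, False, False, False]
mask = hard_negative_mask(cls_losses, is_positive)
print(mask)  # [True, True, False, False, True, False, True]
```

Only the anchors selected by the mask contribute to the classification loss, so easy background anchors no longer drown out the objects.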

Once every anchor has a label, training optimizes a single combined loss that scores the model on two separate jobs at once: how well it classifies each anchor, and how well it predicts the box offsets for the anchors that matched a real object.

$$L = \alpha \cdot L_{loc} + \beta \cdot L_{cls}$$

Total Multi-Task Loss
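A compact sketch of how the two terms combine is shown below, using Smooth L1 for localization and cross-entropy for classification. Normalizing by the number of positive anchors follows the original SSD paper and is an assumption here, as are the default weights of 1.0; the inputs are assumed to be already filtered to positives (for loc) and the mined set (for cls).

```python
import math


def smooth_l1(error, delta=1.0):
    """Quadratic near zero, linear for large errors."""
    abs_error = abs(error)
    if abs_error < delta:
        return 0.5 * abs_error ** 2 / delta
    return abs_error - 0.5 * delta


def cross_entropy(probs, label):
    """Negative log-probability of the true class."""
    return -math.log(max(probs[label], 1e-12))


def multi_task_loss(pred_offsets, target_offsets, pred_probs, labels, alpha=1.0, beta=1.0):
    """L = alpha * L_loc + beta * L_cls, normalized by the number of positives."""
    num_pos = max(len(pred_offsets), 1)
    loc = sum(
        smooth_l1(p - t)
        for pred, target in zip(pred_offsets, target_offsets)
        for p, t in zip(pred, target)
    )
    cls = sum(cross_entropy(p, y) for p, y in zip(pred_probs, labels))
    return (alpha * loc + beta * cls) / num_pos


loss = multi_task_loss(
    pred_offsets=[(0.5, 0.0, 0.0, 2.0)],
    target_offsets=[(0.0, 0.0, 0.0, 0.0)],
    pred_probs=[[0.5, 0.5]],
    labels=[0],
)
print(round(loss, 4))
```

Smooth L1 keeps large box errors from dominating the gradient, which matters early in training when the predicted offsets are still far from their targets.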

MobileNetV2-SSD Training Loop (High-Level)

Training Batch: Images + GT Boxes
Forward Pass: Loc + Cls Predictions
Target Assignment: IoU Matching vs Priors
Hard Negative Mining: 3:1 Negative:Positive
Multi-Task Loss: Smooth L1 + Cross-Entropy
Backpropagation: Gradient Computation
Optimizer Step: Updated Weights

Trained this way on PASCAL VOC 2012, the model reached 77% mAP at an IoU threshold of 0.5. The full mechanics behind target assignment, the loss derivation, mixed precision training, and weight averaging are each covered in their own dedicated posts.

Deployment

The entire point of the design decisions made throughout this post, the lightweight backbone, the thin feature pyramid, the FP16 friendly operations, comes down to this step. A model that hits 77% mAP in a notebook is not useful if it cannot actually run on the Jetson it was built for.

Getting there is a short chain of conversions rather than a single step. A trained checkpoint is exported to a TensorFlow SavedModel, the SavedModel is converted to ONNX, and the ONNX model is then quantized down to INT8 using static calibration against a set of real images, producing a model that TensorRT can compile and run efficiently on the Jetson's hardware.

Checkpoint to Jetson Export Pipeline

Trained Checkpoint: TensorFlow weights
SavedModel Export: Serve wrapper (normalize + decode)
ONNX Conversion: tf2onnx
Shape Validation: ONNX output assertion
INT8 Quantization: Static calibration on real images
TensorRT Compile: FP16 / INT8 engine
Jetson Inference: Real-time detection

The INT8 Tradeoff

Quantizing down to INT8 shrinks the model and speeds up inference substantially on edge hardware, at the cost of some numerical precision compared to FP32 or FP16. Calibrating against real images during quantization keeps the accuracy loss small, rather than quantizing blindly and hoping the model still performs.
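To show where the precision loss comes from, the sketch below implements the simplest possible scheme: symmetric per-tensor quantization with the scale taken from the largest absolute value seen during calibration. This is an illustration only; production toolchains use more refined calibration methods and per-channel scales, and none of these function names come from the project or from any quantization library.

```python
def calibrate_scale(calibration_values, num_bits=8):
    """Pick a scale so the largest calibration value maps to the top integer level."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer level and clamp to the representable range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    return [q * scale for q in q_values]


scale = calibrate_scale([-1.0, 0.5, 2.54])
values = [0.013, -1.0, 2.54, 10.0]
q_values = quantize(values, scale)
restored = dequantize(q_values, scale)
print(q_values, [round(v, 3) for v in restored])
```

Values inside the calibrated range come back within half a quantization step, while the out-of-range 10.0 is clamped to the largest level. That clamping is why calibration images need to be representative of real inputs.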

The exported model is also built to be self contained. Normalization and box decoding are baked directly into the serving wrapper, so the deployed model takes a raw image in and returns decoded boxes and class scores directly, with no extra postprocessing required on the device itself.

The full export pipeline, the validation step that checks ONNX output shapes before quantization, and the dual environment setup needed since the TensorFlow and ONNX tooling do not coexist cleanly, are covered in a dedicated deployment post.

Conclusion

Putting these pieces together creates not just a model, but a full pipeline that trains it and deploys it onto the Jetson in ONNX format to make use of its CUDA cores. I also implemented a fully fledged MLOps system. The other pieces are explained in my separate articles on the blog.

Results at a Glance

77% mAP@0.5 on PASCAL VOC 2012 · ~8× parameter reduction vs standard convolutions · Deploys to TensorRT FP16 for real-time inference on Jetson Orin Nano