Blog · 2026.03.05
Building MobileNetV2-SSD: An Edge-Optimized Object Detection Architecture
How I built a MobileNetV2-SSD object detector from scratch in TensorFlow, reaching 77% mAP on VOC while running in real time on Jetson edge hardware
Akhilesh Warty
ML · 22 min read
Object detection is a well-established application of computer vision (CV): locating and classifying objects in an image or a video. Each family of detection models comes with its own constraints and use cases, and those constraints decide which family fits a particular scenario.
The main constraints that led me to choose the MobileNetV2-SSD architecture are:
- Edge Deployment: The target hardware for the model is the Jetson Orin Nano.
- Real-Time Inference: Detection has to run in real time (60+ FPS).
- CUDA Optimized: The model needs to be optimized for FP16 computation and quantization.
These constraints led to a single-stage design, the Single Shot MultiBox Detector (SSD). Two-stage detectors such as Faster R-CNN, which I have implemented before, are built for accuracy rather than speed, which makes them slower than an SSD. The other common option, Transformer-based architectures, needs more memory and CUDA compute than is justifiable on constrained edge hardware such as the Jetson Orin Nano.
At a high level, the SSD architecture has three major components:
- Backbone: A feature extractor that pulls meaningful features out of the image.
- Extra Feature Pyramid: Three additional stride-2 layers that extend coverage to larger receptive fields.
- Classification & Localization Heads: A classification head and a localization head that take these feature maps and predict a class and a bounding box for each object.
Note
Pretrained models such as YOLO 11 have variants that run on edge hardware, but the goal of this project was to build my own model so that I could modify its architecture, deploy it on edge hardware, and improve my machine learning skills along the way.
End-to-End Model Pipeline
The full pipeline can be understood as a sequence of transformations from raw pixels to final bounding boxes.
Figure: MobileNetV2-SSD Detection Pipeline
The pipeline produces N predictions per image, where N is the total number of prior anchor boxes tiled across all six feature map levels. For a 300×300 input this comes out to roughly 8,732 candidates, which NMS then filters down to the final detections.
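As a rough check on that figure, the count can be reproduced by summing cells times anchors over the levels. The feature map sizes and anchors per cell below are the original SSD300 layout, used here as an assumption for illustration; this project's own maps may differ slightly, which is why the count above is quoted as approximate.

```python
# Feature map side lengths and anchors per cell for the original SSD300 layout.
# These values are an assumption for illustration, not this project's config.
feature_map_sizes = [38, 19, 10, 5, 3, 1]
anchors_per_cell = [4, 6, 6, 6, 4, 4]


def count_priors(sizes, per_cell):
    """Total priors N = sum over levels of (H * W * anchors per cell)."""
    return sum(size * size * a for size, a in zip(sizes, per_cell))


per_level = [s * s * a for s, a in zip(feature_map_sizes, anchors_per_cell)]
total_priors = count_priors(feature_map_sizes, anchors_per_cell)
print(per_level)     # [5776, 2166, 600, 150, 36, 4]
print(total_priors)  # 8732
```

The first level alone contributes about two thirds of all candidates, which is one reason the highest-resolution map dominates the cost of the prediction heads.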
MobileNetV2 Backbone
Standard 3×3 convolutions, which are used in almost every model, are computationally expensive: on an $H \times W$ feature map, a layer with $C_{in}$ input channels and $C_{out}$ output channels costs roughly $H \cdot W \cdot C_{in} \cdot C_{out} \cdot 9$ multiply-accumulate operations (MACs). The cost grows with the image size, so it has to be kept in mind when choosing an architecture, especially in edge-constrained environments.
MobileNetV2 replaces this with depthwise-separable convolutions: a depthwise 3×3 that processes each channel independently, followed by a pointwise 1×1 that mixes channels:
$$\text{MACs}_{\text{separable}} = \underbrace{H \cdot W \cdot C_{in} \cdot 9}_{\text{depthwise}} + \underbrace{H \cdot W \cdot C_{in} \cdot C_{out}}_{\text{pointwise}}$$
Comparing the two techniques on a 300x300 image and a 600x600 image shows how much computation is saved, measured in MACs. Both costs are proportional to $H \cdot W$, so doubling the resolution quadruples each of them.
The size of the reduction is clearest as a ratio, which does not depend on the resolution:
$$\frac{\text{MACs}_{\text{separable}}}{\text{MACs}_{\text{standard}}} = \frac{1}{C_{out}} + \frac{1}{9}$$
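To put numbers on this, here is a small worked example. The channel counts (32 in, 64 out) and the stride-1, same-size output are assumptions chosen for illustration, not values taken from the model.

```python
def standard_conv_macs(h, w, c_in, c_out, k=3):
    """MACs for a standard k x k convolution with same-size output."""
    return h * w * c_in * c_out * k * k


def separable_conv_macs(h, w, c_in, c_out, k=3):
    """MACs for a depthwise k x k followed by a pointwise 1 x 1."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


C_IN, C_OUT = 32, 64  # assumed channel counts
mac_table = {}
for size in (300, 600):
    std = standard_conv_macs(size, size, C_IN, C_OUT)
    sep = separable_conv_macs(size, size, C_IN, C_OUT)
    mac_table[size] = (std, sep)
    print(f"{size}x{size}: standard={std:,} separable={sep:,} reduction={std / sep:.2f}x")
```

Under these assumptions the separable layer needs roughly 7.9× fewer MACs at both resolutions. The absolute saving is four times larger at 600x600, because both costs scale with the number of pixels.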
Like many deep networks, MobileNetV2 has to deal with a problem that appears when many layers are stacked: the gradient is a product of per-layer terms, so it tends to either grow or shrink with depth. This is the exploding and vanishing gradient problem. MobileNetV2 addresses it by borrowing the skip connections of the ResNet architecture, along with the math behind them.
Why the Skip Connection Matters
Stacking many layers on top of each other introduces a subtle problem during training. Gradients flow backward from the loss through every layer in sequence, and at each layer the gradient gets multiplied by that layer's weights and activation derivative. By the time the signal reaches the earliest layers, it has been multiplied dozens of times over.
If those per-layer terms are consistently smaller than one, the gradient shrinks at every layer it passes through. By the time it reaches the first few layers, there's barely anything left to learn from. This is the vanishing gradient problem, and it's a big part of why very deep networks used to be difficult to train. The opposite failure mode, exploding gradients, happens when those terms are consistently larger than one. It is just as damaging, since the weight updates become very large and training diverges.
A skip connection fixes this by giving the gradient a second path back to earlier layers. It bypasses the weight multiplications entirely. Instead of a block computing output = F(x), it computes output = F(x) + x.
During the backward pass, this addition means the gradient has a direct route straight through the identity term, in addition to the route through the block's weights. Even if the weight path's contribution shrinks toward zero, the identity path guarantees a baseline gradient still gets through. The network can't fully lose its training signal just because it's deep.
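A toy calculation makes the effect concrete. This is a sketch with scalar "blocks", not the real network: each block's weight path is assumed to pass only 1% of the gradient, and the skip adds the identity term, so the per-block factor goes from dF/dx to dF/dx + 1.

```python
def gradient_through_stack(local_grads, use_skip):
    """Multiply per-block derivatives along the backward path.

    Without a skip, a block y = F(x) contributes dF/dx.
    With a skip, y = F(x) + x contributes dF/dx + 1.
    """
    total = 1.0
    for g in local_grads:
        total *= (g + 1.0) if use_skip else g
    return total


# 20 scalar blocks whose weight path passes 1% of the gradient (assumed toy value)
local_grads = [0.01] * 20
plain = gradient_through_stack(local_grads, use_skip=False)
residual = gradient_through_stack(local_grads, use_skip=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {residual:.3f}")
```

The plain stack's gradient collapses to a value far too small to train with, while the residual stack stays close to one because every block contributes at least the identity term.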
A Conditional Skip
Unlike a standard ResNet block, the inverted residual's skip connection isn't applied unconditionally; it only fires when the block's stride is 1 and its input and output channel counts match. When a block downsamples or changes channel width, there's no shortcut to add, because the input and output tensors have different shapes. The network instead relies on the surrounding stride-1 blocks to keep gradient flow healthy across the depth of the backbone.
Why "Inverted"
A standard ResNet bottleneck compresses its input to a smaller channel count, does its processing, then expands back out. The result is shaped like a sandwich with a narrow middle and wide ends. MobileNetV2 flips this arrangement entirely: the inverted residual block expands the channel count first, runs the depthwise convolution in that wider space, then projects back down before the skip connection is added. The heavy computation happens where the representation is wide.
This inversion exists because depthwise convolutions are weak when they only have a few channels to work with. Each channel is filtered independently, so there's no cross-channel mixing to compensate for a narrow representation. Expanding first gives the depthwise step a richer space to operate in, even though the block's "ends" stay lightweight.
src/mobilenetv2ssd/models/mobilenet_v2/blocks.py
```python
import tensorflow as tf
from tensorflow.keras.layers import BatchNormalization, Conv2D, DepthwiseConv2D, ReLU


class InvertedResidualBlock(tf.keras.layers.Layer):
    # Excerpt: build() relies on attributes set in the constructor and on the
    # _make_divisible() helper, neither of which is shown here.

    def build(self, input_shape):
        input_channel = int(input_shape[-1])
        self.output_channel = self._make_divisible(
            int(round(self.output_channel * self.alpha)), 8
        )
        self.expansion_channel = int(input_channel * self.expansion_factor)

        # The skip is only valid when the block keeps both resolution and width
        self.use_residual = self.stride == 1 and input_channel == self.output_channel

        # Stays None when expansion_factor == 1 so call() can skip the expansion
        self.expansion_conv = None
        if self.expansion_factor != 1:
            # Expand to high-dimensional space
            self.expansion_conv = Conv2D(
                self.expansion_channel, kernel_size=1, use_bias=False
            )
            self.expand_batch_norm = BatchNormalization()
            self.expand_activation_function = ReLU(max_value=6.0)

        # Depthwise conv: one filter per channel
        self.depthwise_conv = DepthwiseConv2D(
            kernel_size=3, strides=self.stride, padding="same", use_bias=False
        )
        self.depthwise_batch_norm = BatchNormalization()
        self.depthwise_activation_function = ReLU(max_value=6.0)

        # Project back to low-dimensional space (no activation)
        self.projection_conv = Conv2D(self.output_channel, kernel_size=1, use_bias=False)
        self.project_batch_norm = BatchNormalization()

    def call(self, x, training=False):
        inputs = x

        if self.expansion_conv is not None:
            x = self.expansion_conv(x)
            x = self.expand_batch_norm(x, training=training)
            x = self.expand_activation_function(x)

        x = self.depthwise_conv(x)
        x = self.depthwise_batch_norm(x, training=training)
        x = self.depthwise_activation_function(x)

        x = self.projection_conv(x)
        x = self.project_batch_norm(x, training=training)

        # Conditional skip connection: output = F(x) + x
        if self.use_residual:
            x = x + inputs
        return x
```
The full backbone stacks 17 of these blocks with increasing channel widths and strategically placed stride-2 layers. Four intermediate feature maps (C2, C3, C4, and C5) are extracted at strides 4, 8, 16, and 32 respectively. These feed directly into the detection heads.
Width Multiplier
The backbone accepts an alpha parameter that scales every channel count proportionally.
Setting alpha=0.75 cuts the model size by roughly 40%, enabling deployment on tighter hardware like the Hailo-8 with minimal accuracy regression.
The _make_divisible() call keeps every channel count divisible by 8, which helps tensor operations run efficiently on most GPU architectures.
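A minimal sketch of how the width multiplier and the rounding interact. The `make_divisible` function below is modeled on the helper commonly used in MobileNet reference implementations, so it is an assumption about what the project's own `_make_divisible()` does. The base widths are the per-stage output channels from the MobileNetV2 paper, and `scale_channels` is a hypothetical name.

```python
def make_divisible(value, divisor=8, min_value=None):
    """Round value to the nearest multiple of divisor, never dropping more than 10%."""
    if min_value is None:
        min_value = divisor
    rounded = max(min_value, int(value + divisor / 2) // divisor * divisor)
    # If rounding down removed more than 10% of the channels, round up instead
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


def scale_channels(base_channels, alpha):
    """Apply the width multiplier, then snap each width to a multiple of 8."""
    return [make_divisible(int(round(c * alpha))) for c in base_channels]


# Per-stage output widths from the MobileNetV2 paper
base_channels = [16, 24, 32, 64, 96, 160, 320]
scaled = scale_channels(base_channels, alpha=0.75)
print(scaled)  # [16, 24, 24, 48, 72, 120, 240]
```

Notice that the first two stages do not shrink at all: 12 rounds up to 16, and 18 would round down to 16 but is bumped back to 24 by the 10% guard. The savings from a smaller alpha come mostly from the wider, later stages.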
Extra Feature Pyramid
Once the MobileNetV2 backbone has produced its features, they go through one more processing step: the Extra Feature Pyramid. This part of the model takes the feature-rich tensors from the backbone and applies further convolutions to them, producing feature maps at several resolutions so that objects of different sizes can be detected.
This allows multiple objects in a frame to be detected accurately even when their scales differ because of their position or their nature. The implemented model takes the last backbone feature map (C5) and computes three extra levels on top of it.
As a result, the SSD also looks at feature maps of size 5x5, 3x3, and 2x2, each with a larger receptive field than the last, to detect progressively larger objects. This Feature Pyramid Network (FPN) is deliberately thin compared to the variants used in RetinaNet or in the paper that introduced the algorithm, since it is meant to compute meaningful features without adding much compute. Experiments on these layers showed that this configuration gave the best middle ground between accuracy and computation.
src/mobilenetv2ssd/models/ssd/fpn.py
```python
import tensorflow as tf
from tensorflow.keras.layers import Conv2D


class ExtraFeaturePyramid(tf.keras.layers.Layer):
    # Excerpt: extra_heads_config is set in the constructor, which is not shown.

    def build(self, input_shape):
        # One convolution per extra level; each config entry describes one level
        self.extra_heads = []
        for level, config in enumerate(self.extra_heads_config):
            block = Conv2D(
                filters=config['out_channels'],
                strides=config.get('stride', 2),
                kernel_size=config.get('kernel_size', 3),
                padding="same",
                activation="relu",
                name=f"extra_{level}_conv"
            )
            self.extra_heads.append(block)

    def call(self, base_feature, training=False):
        x = base_feature
        extra_features = []
        # Each level consumes the previous level's output
        for block in self.extra_heads:
            x = block(x, training=training)
            extra_features.append(x)
        return extra_features
```
Implementation of FPN
The layers are defined using a modular config system allowing for easy stacking of the layers if needed without changing any aspect of the code.
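As an illustration of what such a config might look like, here is a hypothetical three-level example together with the spatial sizes it produces. The keys match the ones read in `build()`, but the channel counts are assumed, and `extra_map_sizes` is a helper written for this sketch only.

```python
import math

# Hypothetical config: keys match the ones read in build(); channel counts are assumed.
extra_heads_config = [
    {"out_channels": 512, "stride": 2, "kernel_size": 3},
    {"out_channels": 256, "stride": 2, "kernel_size": 3},
    {"out_channels": 256},  # stride and kernel_size fall back to 2 and 3
]


def extra_map_sizes(base_size, config):
    """Spatial size after each level; padding="same" gives ceil(size / stride)."""
    sizes = []
    size = base_size
    for level in config:
        size = math.ceil(size / level.get("stride", 2))
        sizes.append(size)
    return sizes


c5_size = math.ceil(300 / 32)  # C5 sits at stride 32, so a 300x300 input gives 10x10
sizes = extra_map_sizes(c5_size, extra_heads_config)
print(c5_size, sizes)  # 10 [5, 3, 2]
```

Adding a fourth level would only take one more dictionary in the list, which is the point of keeping the layer definition in config rather than in code.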
Priors (Anchor Boxes)
The prior boxes define where the model looks for objects. At each cell in each of the six feature maps, a set of boxes is tiled with different scales and aspect ratios. The width and height of each prior are computed as:
$$w = s_k \sqrt{a_r}, \qquad h = \frac{s_k}{\sqrt{a_r}}$$
where $s_k$ is the scale at feature level $k$ (linearly spaced between s_min and s_max) and $a_r$ is the aspect ratio. An extra "square" prior is added at the geometric mean scale $\sqrt{s_k \, s_{k+1}}$, giving 4–6 anchors per cell depending on the feature level.
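The same arithmetic in plain Python, as a sketch rather than the project's implementation. The defaults of 0.2 and 0.9 for s_min and s_max are the SSD paper's values and are assumed here, and `linear_scales` and `priors_for_cell` are hypothetical names.

```python
import math


def linear_scales(num_levels, s_min=0.2, s_max=0.9):
    """Scales spaced evenly from s_min to s_max (defaults are the SSD paper's values)."""
    step = (s_max - s_min) / (num_levels - 1)
    return [s_min + step * k for k in range(num_levels)]


def priors_for_cell(scale, next_scale, aspect_ratios):
    """(width, height) pairs for one cell, as fractions of the image size."""
    boxes = [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]
    extra = math.sqrt(scale * next_scale)  # extra square prior at the geometric mean scale
    boxes.append((extra, extra))
    return boxes


scales = linear_scales(6)
small_set = priors_for_cell(scales[0], scales[1], [1.0, 2.0, 0.5])
large_set = priors_for_cell(scales[1], scales[2], [1.0, 2.0, 0.5, 3.0, 1 / 3])
print(len(small_set), len(large_set))  # 4 6
```

Multiplying width by $\sqrt{a_r}$ and dividing height by it keeps the area of every prior at a given level equal to $s_k^2$, so changing the aspect ratio changes the shape of the box without changing how much of the image it covers.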
src/mobilenetv2ssd/models/ssd/ops/anchor_ops_tf.py
```python
import tensorflow as tf


def anchors_per_cell(scales_for_layer, ratios_for_layer):
    scales = tf.reshape(scales_for_layer, [-1])
    ratios = tf.reshape(ratios_for_layer, [-1])

    # Cartesian product of scales × ratios
    scales, ratios = tf.meshgrid(scales, ratios, indexing="xy")
    ratio_sqrt = tf.math.sqrt(ratios)

    width = scales * ratio_sqrt
    height = scales / ratio_sqrt

    # One (width, height) row per scale/ratio pair
    return tf.stack([tf.reshape(width, [-1]), tf.reshape(height, [-1])], axis=1)


# The helper functions called below are defined elsewhere in the same module.
def build_priors(image_size, strides=None, feature_map_shapes=None,
                 scales=None, aspect_ratios=None, s_min=None, s_max=None,
                 include_extra=True, clip=True):
    if feature_map_shapes is None:
        feature_map_shapes = calculate_feature_map_shapes(image_size, strides)

    scales = compute_scales_per_layer(scales, len(feature_map_shapes), s_min, s_max, include_extra)
    ratios = standardize_aspect_ratios(aspect_ratios, len(feature_map_shapes))

    # Tile the priors for each feature map level
    prior_layers = [
        build_layer_priors(feature_map_shapes[level], image_size, scales[level], ratios[level])
        for level in range(len(feature_map_shapes))
    ]

    priors = concatenate_priors(prior_layers, clip)
    meta = compute_meta(prior_layers, image_size, strides, feature_map_shapes, scales, ratios)
    return priors, meta
```
Anchor Fingerprinting
compute_meta() generates an MD5 hash of the anchor configuration using image size, feature map shapes, scales, and aspect ratios.
This fingerprint is embedded in every checkpoint directory name (e.g., exp001_a1b2c3d4/).
When resuming training from a saved checkpoint, the framework validates the fingerprint before loading weights, preventing silent mismatches when config parameters are changed between runs.
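A minimal sketch of the idea using only the standard library. The function names, the JSON serialization, and the 8-character truncation are assumptions made for illustration; the real compute_meta() may build its hash differently.

```python
import hashlib
import json


def anchor_fingerprint(image_size, feature_map_shapes, scales, aspect_ratios, length=8):
    """Short, deterministic MD5 digest of the settings that define the anchors."""
    payload = {
        "image_size": image_size,
        "feature_map_shapes": feature_map_shapes,
        "scales": scales,
        "aspect_ratios": aspect_ratios,
    }
    # sort_keys keeps the serialization stable across runs
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.md5(encoded).hexdigest()[:length]


def check_fingerprint(checkpoint_dir_name, expected):
    """Refuse to resume when the directory suffix does not match the current config."""
    saved = checkpoint_dir_name.rstrip("/").rsplit("_", 1)[-1]
    if saved != expected:
        raise ValueError(f"Anchor config mismatch: checkpoint {saved}, current {expected}")


config = dict(
    image_size=[300, 300],
    feature_map_shapes=[[19, 19], [10, 10]],
    scales=[0.2, 0.9],
    aspect_ratios=[[1.0, 2.0, 0.5]] * 2,
)
fp = anchor_fingerprint(**config)
check_fingerprint(f"exp001_{fp}/", fp)  # matching config: no exception
```

Because the digest depends only on the anchor settings, two runs with different learning rates share a fingerprint, while changing a single scale produces a different directory name and a failed check on resume.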
Prediction Heads
The SSD architecture looks at each anchor from two perspectives, a classification head (cls) and a localization head (loc). Between them they answer two questions:
- Which class, if any, does this anchor contain?
- How should this anchor's box be adjusted to fit the object?
Each level's head output has shape $[B, H, W, A \cdot 4]$ for batch size $B$ and $A$ anchors per cell, where the channel axis packs every anchor's 4 box values together. This gets reshaped to separate them out, $[B, H, W, A, 4]$, then flattened into a single anchor list, $[B, H \cdot W \cdot A, 4]$. Once every level produces this same flat shape, they can be concatenated into one tensor, $[B, N_{total}, 4]$, so the loss function and NMS can operate on all anchors uniformly regardless of which feature map resolution they came from.
The two prediction heads share the feature maps coming from the FPN. Both heads draw on the same pool of information, which keeps the execution chain short.
The classification head predicts a matrix of class probabilities, one row per anchor, using a softmax classifier. Softmax converts the raw outputs of the classification head (the logits) into probabilities that are easy for a person to interpret.
In practical terms, $p_c = e^{z_c} / \sum_j e^{z_j}$ says that every label the model is trained on gets a probability, computed from the logit $z_c$ for class $c$, and that the probabilities for one anchor sum to 1. This allows the most probable label to be picked for each region.
The localization head predicts four offset values per anchor: $(t_x, t_y, t_w, t_h)$. The first two are normalized displacements from the prior center and the last two are log-space scale adjustments. They let the model refine its anchor boxes (priors) so that each one locks onto the object being detected. This is crucial because objects differ in size, and putting a big box over a small object defeats the purpose of an object detection system if most of the region is background. The priors provide the starting boxes; what the model learns is how to refine them to capture the object.
The offsets on their own are not enough to see what the model is doing; they have to be converted into box coordinates before a person can inspect them.
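A sketch of that conversion for a single anchor, assuming priors stored as (cx, cy, w, h) in normalized coordinates. The variance scaling of 0.1 and 0.2 follows the common SSD convention and is an assumption here; `decode_box` is a hypothetical name, not the project's function.

```python
import math


def decode_box(offsets, prior, variances=(0.1, 0.2)):
    """Turn four predicted offsets into corner coordinates.

    prior is (cx, cy, w, h) in normalized image coordinates. The variance
    scaling follows the common SSD convention and is an assumption here.
    """
    tx, ty, tw, th = offsets
    p_cx, p_cy, p_w, p_h = prior
    center_var, size_var = variances

    cx = p_cx + tx * center_var * p_w  # shift the center, relative to the prior size
    cy = p_cy + ty * center_var * p_h
    w = p_w * math.exp(tw * size_var)  # log-space scale adjustment
    h = p_h * math.exp(th * size_var)

    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


prior = (0.5, 0.5, 0.2, 0.4)
unchanged = decode_box((0.0, 0.0, 0.0, 0.0), prior)  # zero offsets return the prior itself
shifted = decode_box((1.0, 0.0, 0.0, 0.0), prior)    # center moves right by 0.1 * 0.2
```

Predicting the size in log space means the decoded width and height are always positive, whatever value the network outputs.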
The code underneath shows the way the classification and localization head are implemented.
src/mobilenetv2ssd/models/ssd/ops/heads_tf.py
```python
import tensorflow as tf


class LocalizationHead(tf.keras.layers.Layer):
    # Excerpt: self.heads and self.num_anchors_per_layer are created outside call().

    def call(self, feature_maps, training=False):
        outputs = []
        for layer, feature_map in enumerate(feature_maps):
            num_anchors = self.num_anchors_per_layer[layer]
            x = self.heads[layer](feature_map, training=training)

            # [B, H, W, A * 4] -> [B, H * W * A, 4]
            B = tf.shape(x)[0]
            H, W = tf.shape(x)[1], tf.shape(x)[2]
            x = tf.reshape(x, [B, H * W * num_anchors, 4])
            outputs.append(x)

        return tf.concat(outputs, axis=1)  # [B, N_total, 4]


class ClassificationHead(tf.keras.layers.Layer):
    # Excerpt: self.final_heads, self.num_anchors_per_location and
    # self.number_of_classes are created outside call().

    def call(self, feature_maps, training=False):
        outputs = []
        for layer, feature_map in enumerate(feature_maps):
            num_anchors = self.num_anchors_per_location[layer]
            C = self.number_of_classes
            x = self.final_heads[layer](feature_map, training=training)

            # [B, H, W, A * C] -> [B, H * W * A, C]
            B = tf.shape(x)[0]
            H, W = tf.shape(x)[1], tf.shape(x)[2]
            x = tf.reshape(x, [B, H * W * num_anchors, C])
            outputs.append(x)

        return tf.concat(outputs, axis=1)  # [B, N_total, num_classes]
```
Data Pipeline
A model is only as good as what it is shown during training, and object detection has a few data handling problems that do not show up in simpler classification tasks. The pipeline that feeds this model is built around PASCAL VOC 2012, a 20-class benchmark with roughly 17,000 images, each one annotated with bounding boxes stored as XML files alongside the JPEGs. A fixed split file lists which image ids belong to training versus validation, so every run is evaluated against the same held out set.
Figure: VOC to GPU Data Pipeline
Before any image reaches the model, it passes through an augmentation chain split into two categories. Photometric transforms, such as random brightness, contrast, saturation, and hue jitter, change pixel values without touching the boxes at all. Geometric transforms are the harder case, since anything that changes the image layout, such as a horizontal flip or a resize, has to update the box coordinates in lockstep with the image. Getting this wrong does not throw an error; it just silently teaches the model the wrong boxes.
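The horizontal flip is the simplest geometric case to illustrate. This sketch assumes boxes stored as normalized [xmin, ymin, xmax, ymax], which may not match the pipeline's actual layout, and `flip_boxes_horizontal` is a hypothetical name.

```python
def flip_boxes_horizontal(boxes):
    """Mirror normalized [xmin, ymin, xmax, ymax] boxes to match a left-right image flip."""
    flipped = []
    for xmin, ymin, xmax, ymax in boxes:
        # The old right edge becomes the new left edge, and vice versa
        flipped.append([1.0 - xmax, ymin, 1.0 - xmin, ymax])
    return flipped


boxes = [[0.1, 0.2, 0.4, 0.6]]
flipped = flip_boxes_horizontal(boxes)
restored = flip_boxes_horizontal(flipped)  # flipping twice returns the original box
```

The easy mistake is to write `1.0 - xmin` for the new left edge, which produces a box with xmin greater than xmax. Nothing crashes, but the IoU of that box with every anchor is zero, so the object silently disappears from training.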
Why Boxes Need Clipping
After a crop or resize, a box can shrink down to a sliver of its original area or get pushed outside the image entirely. If that box is left in the batch, it still contributes to the loss with a target that no longer makes sense. A clip and filter step runs after every geometric transform to clamp coordinates to the image boundaries and drop any box that has shrunk below a minimum pixel area.
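A sketch of such a clip and filter step, assuming pixel-space [xmin, ymin, xmax, ymax] boxes. The function name and the `min_area` default are hypothetical, chosen only for the example.

```python
def clip_and_filter(boxes, labels, width, height, min_area=16.0):
    """Clamp pixel-space [xmin, ymin, xmax, ymax] boxes to the image and drop tiny ones."""
    kept_boxes, kept_labels = [], []
    for (xmin, ymin, xmax, ymax), label in zip(boxes, labels):
        xmin, xmax = max(0.0, min(xmin, width)), max(0.0, min(xmax, width))
        ymin, ymax = max(0.0, min(ymin, height)), max(0.0, min(ymax, height))
        area = max(0.0, xmax - xmin) * max(0.0, ymax - ymin)
        # Labels are filtered together with boxes so the two lists stay aligned
        if area >= min_area:
            kept_boxes.append([xmin, ymin, xmax, ymax])
            kept_labels.append(label)
    return kept_boxes, kept_labels


boxes = [[-20, 50, 120, 200], [290, 290, 310, 310], [400, 10, 450, 60]]
labels = [7, 12, 3]
kept_boxes, kept_labels = clip_and_filter(boxes, labels, width=300, height=300, min_area=150.0)
print(kept_boxes, kept_labels)
```

In this example the first box is clamped and kept, the second shrinks to a 10x10 sliver in the corner and is dropped, and the third lies entirely outside the image, so its clamped area is zero.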
Object detection batches are also awkward to build in the first place, since every image can have a different number of ground truth boxes. The pipeline solves this with padded batching, where each sample's boxes and labels are padded out to a fixed maximum length using sentinel values, and a validity mask is computed alongside the batch so that target assignment and the loss function both know which entries are real objects and which ones are just padding.
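A minimal sketch of padding one sample and building its mask. The sentinel values (an all-zero box and a label of -1) and the name `pad_sample` are assumptions for illustration, not necessarily what the pipeline uses.

```python
def pad_sample(boxes, labels, max_boxes, pad_box=(0.0, 0.0, 0.0, 0.0), pad_label=-1):
    """Pad one sample to max_boxes entries and return a validity mask alongside it."""
    boxes, labels = boxes[:max_boxes], labels[:max_boxes]  # truncate if over the limit
    n_real = len(boxes)
    n_pad = max_boxes - n_real
    padded_boxes = [list(b) for b in boxes] + [list(pad_box) for _ in range(n_pad)]
    padded_labels = list(labels) + [pad_label] * n_pad
    mask = [True] * n_real + [False] * n_pad
    return padded_boxes, padded_labels, mask


sample_a = pad_sample([[0.1, 0.1, 0.5, 0.5]], [3], max_boxes=3)
sample_b = pad_sample(
    [[0.2, 0.2, 0.4, 0.9], [0.0, 0.0, 1.0, 1.0], [0.3, 0.3, 0.6, 0.6]],
    [1, 7, 7],
    max_boxes=3,
)
batch_boxes = [sample_a[0], sample_b[0]]  # every sample now has shape [3, 4]
batch_masks = [sample_a[2], sample_b[2]]
```

The mask matters more than the sentinel itself: downstream code should consult the mask instead of testing for the sentinel value, so a legitimate box can never be mistaken for padding.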
TFRecords for I/O Bound Training
When training on cloud instances with slower network storage, the data pipeline becomes the bottleneck before the GPU does. The pipeline supports pre-serializing the dataset into sharded TFRecord files as a drop in replacement for the generator path, which cuts down filesystem overhead and keeps the GPU fed.
Training
The architecture defined above does not know how to detect anything on its own. Before any of these weights are useful, the model needs to be trained against a defined objective, and that starts with deciding what every single anchor is actually looking at.
Each of the roughly 8,732 anchors gets matched against the ground truth boxes in an image using Intersection over Union (IoU). Anchors with a high enough IoU against a ground truth box are labeled as positive matches for that object, and anchors that fall below a lower threshold are labeled as background. To avoid leaving any object without a match, the single best anchor for each ground truth box is always assigned positively, even if its IoU happens to fall below the usual threshold.
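The matching rule can be sketched in plain Python as follows. The thresholds of 0.5 and 0.4, the choice to ignore anchors that fall between them, and the names used here are assumptions for illustration; the project's actual target assignment is covered in its own post.

```python
def iou(a, b):
    """Intersection over Union of two [xmin, ymin, xmax, ymax] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


BACKGROUND, IGNORED = -1, -2


def match_anchors(anchors, gt_boxes, pos_thresh=0.5, neg_thresh=0.4):
    """Per anchor: index of its matched ground truth box, BACKGROUND, or IGNORED."""
    if not gt_boxes:
        return [BACKGROUND] * len(anchors)
    matches = []
    for anchor in anchors:
        overlaps = [iou(anchor, gt) for gt in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda g: overlaps[g])
        if overlaps[best] >= pos_thresh:
            matches.append(best)
        elif overlaps[best] < neg_thresh:
            matches.append(BACKGROUND)
        else:
            matches.append(IGNORED)
    # Force-match: every ground truth box claims its single best anchor
    for g, gt in enumerate(gt_boxes):
        best_anchor = max(range(len(anchors)), key=lambda i: iou(anchors[i], gt))
        matches[best_anchor] = g
    return matches


anchors = [[0, 0, 10, 10], [5, 5, 15, 15], [20, 20, 30, 30], [0, 0, 4, 4]]
gt_boxes = [[0, 0, 10, 10], [22, 22, 26, 26]]
matches = match_anchors(anchors, gt_boxes)
print(matches)  # [0, -1, 1, -1]
```

The third anchor overlaps the small second object with an IoU of only 0.16, well under the positive threshold, yet it is still assigned to it by the force-match pass. Without that pass the small object would contribute nothing to the loss.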
This matching step immediately creates an imbalance problem. A typical training image might have a handful of real objects, but thousands of anchors covering the image, so the overwhelming majority end up labeled background. Left unchecked, this would let the loss get dominated by background predictions instead of the objects that actually matter. This is corrected with hard negative mining, which is covered in detail in the dedicated training post along with the rest of this process.
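For orientation, here is a sketch of the selection step in hard negative mining. The 3:1 ratio of negatives to positives is the SSD paper's default and is assumed here, as is the function name; the project's own version is described in the training post and may differ.

```python
def hard_negative_mask(class_losses, is_positive, neg_pos_ratio=3):
    """Keep every positive anchor plus only the highest-loss negatives."""
    num_pos = sum(is_positive)
    num_neg = min(neg_pos_ratio * num_pos, len(is_positive) - num_pos)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # The hardest negatives are the background anchors the model gets most wrong
    hardest = sorted(negatives, key=lambda i: class_losses[i], reverse=True)[:num_neg]
    keep = set(hardest)
    return [pos or (i in keep) for i, pos in enumerate(is_positive)]


losses = [0.9, 0.1, 2.5, 0.05, 1.2, 0.3, 0.02, 0.7]
is_positive = [True, False, False, False, False, False, False, False]
mask = hard_negative_mask(losses, is_positive)
print(mask)  # [True, False, True, False, True, False, False, True]
```

Only anchors where the mask is true contribute to the classification loss, so one positive is balanced against its three most confusing background anchors instead of all seven.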
Once every anchor has a label, training optimizes a single combined loss that scores the model on two separate jobs at once: how well it classifies each anchor, and how well it predicts the box offsets for the anchors that matched a real object.
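A compact sketch of what such a combined loss can look like, following the form in the SSD paper: softmax cross-entropy for classification plus smooth L1 on the offsets of positive anchors, normalized by the number of positives. The weighting `alpha=1.0` and every name below are assumptions, and the project's own loss is derived in a separate post.

```python
import math


def smooth_l1(diff):
    """Quadratic near zero, linear for large errors."""
    a = abs(diff)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def cross_entropy(logits, target):
    """Softmax cross-entropy for one anchor, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def ssd_loss(cls_logits, cls_targets, loc_preds, loc_targets, is_positive, alpha=1.0):
    """(classification loss + alpha * localization loss) / number of positives."""
    num_pos = max(1, sum(is_positive))
    cls_loss = sum(cross_entropy(l, t) for l, t in zip(cls_logits, cls_targets))
    # Box regression is only scored on anchors matched to a real object
    loc_loss = sum(
        smooth_l1(p - t)
        for pred, tgt, pos in zip(loc_preds, loc_targets, is_positive) if pos
        for p, t in zip(pred, tgt)
    )
    return (cls_loss + alpha * loc_loss) / num_pos


loss = ssd_loss(
    cls_logits=[[0.0, 0.0], [0.0, 0.0]],
    cls_targets=[1, 0],
    loc_preds=[[0.1, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]],
    loc_targets=[[0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    is_positive=[True, False],
)
```

In this example the second anchor is background, so its wildly wrong box prediction adds nothing to the loss. Only its classification term counts.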
Figure: MobileNetV2-SSD Training Loop (High-Level)
Trained this way on PASCAL VOC 2012, the model reached 77% mAP at an IoU threshold of 0.5. The full mechanics behind target assignment, the loss derivation, mixed precision training, and weight averaging are each covered in their own dedicated posts.
Deployment
Every design decision made throughout this post (the lightweight backbone, the thin feature pyramid, the FP16-friendly operations) comes down to this step. A model that hits 77% mAP in a notebook is not useful if it cannot actually run on the Jetson it was built for.
Getting there is a short chain of conversions rather than a single step. A trained checkpoint is exported to a TensorFlow SavedModel, the SavedModel is converted to ONNX, and the ONNX model is then quantized down to INT8 using static calibration against a set of real images, producing a model that TensorRT can compile and run efficiently on the Jetson's hardware.
Figure: Checkpoint to Jetson Export Pipeline
The INT8 Tradeoff
Quantizing down to INT8 shrinks the model and speeds up inference substantially on edge hardware, at the cost of some numerical precision compared to FP32 or FP16. Calibrating against real images during quantization keeps the accuracy loss small, rather than quantizing blindly and hoping the model still performs.
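To show why calibration data matters, here is a toy symmetric per-tensor quantizer in plain Python. It is not the TensorRT or ONNX calibrator, and the function names are hypothetical. The scale is chosen from the largest magnitude seen in the calibration values, so anything outside that range is clipped.

```python
def calibrate_scale(calibration_values):
    """Symmetric per-tensor scale: map the largest observed magnitude to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Round to the nearest INT8 step and clamp to the representable range."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    return [q * scale for q in q_values]


calibration = [-1.8, -0.4, 0.0, 0.7, 2.54]  # stand-in for activations from real images
scale = calibrate_scale(calibration)        # 2.54 / 127, about 0.02
q = quantize([0.5, -1.0, 3.0], scale)
restored = dequantize(q, scale)
print(q)  # [25, -50, 127]
```

The value 3.0 lies outside the calibrated range and saturates at 127, coming back as about 2.54. If the calibration images do not represent what the camera actually sees, this kind of clipping is where the accuracy goes.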
The exported model is also built to be self contained. Normalization and box decoding are baked directly into the serving wrapper, so the deployed model takes a raw image in and returns decoded boxes and class scores directly, with no extra postprocessing required on the device itself.
The full export pipeline, the validation step that checks ONNX output shapes before quantization, and the dual environment setup needed since the TensorFlow and ONNX tooling do not coexist cleanly, are covered in a dedicated deployment post.
Conclusion
Putting these pieces together produces not just a model, but a full pipeline that trains it and deploys it onto the Jetson in ONNX format to make use of its CUDA cores. I also implemented a full MLOps system around it. The other pieces are explained in separate articles on this blog.
Results at a Glance
77% mAP@0.5 on PASCAL VOC 2012 · ~8× parameter reduction vs standard convolutions · Deploys to TensorRT FP16 for real-time inference on Jetson Orin Nano