Akhilesh Warty

Most object detection projects stop at "does it work." SkylarkOS had a harder constraint: the model's output drives a drone's flight controller in real time, so every stage of the pipeline has to fit inside a strict latency budget or the drone reacts to stale information.

This post covers the perception side of SkylarkOS, a ROS2-based autonomous UAV system I built running on an NVIDIA Jetson Orin Nano: the models chosen for detection, tracking, identity, and pose, why each one was picked over the more obvious alternative, and the actual latency numbers that came out of running all of them together on real hardware.

Results at a Glance

~11ms camera-to-velocity-setpoint latency · 62fps sustained throughput at 25W · 5 models running concurrently on a single Jetson Orin Nano

What This Demonstrates

Designed and profiled a multi-model real-time inference pipeline against a hard 16.7ms-per-frame latency budget
Made deliberate model and architecture tradeoffs (SORT vs. DeepSORT) backed by measured cost, not assumption
Deployed the same models through two different runtimes, ONNX Runtime for development and TensorRT FP16 for production, with no code changes to the calling nodes
Validated the full pipeline end-to-end in both simulation (SITL) and on physical edge hardware
Diagnosed and fixed production-grade reliability bugs (QoS misconfiguration causing silent staleness, buffer-ownership races in the streaming pipeline, GPU/library version mismatches) that only surfaced under sustained runtime, not short tests

System Architecture

The full system is a chain of ROS2 nodes, each one owning a single responsibility and passing its output downstream as a topic. This post focuses on the perception half, everything from the camera through tracking, identity, and gesture. The control and telemetry nodes at the bottom are covered in a future post.

SkylarkOS Node Graph

Camera / Video Sim: image_raw
skylark_perception: YOLO11n + NMS
skylark_streaming: GStreamer MJPEG
skylark_tracking: SORT (Kalman + Hungarian)
skylark_identity: Face lock + ReID enrollment
skylark_gesture: YOLO11n-pose + debounce
skylark_control: PID velocity control
uXRCE-DDS Agent: ROS2 ↔ PX4 bridge
PX4 Flight Controller: Position, attitude, motors
skylark_telemetry: WebSocket :8765
Ground Station: Phone / Laptop

Scope of This Post

skylark_control, the uXRCE-DDS bridge to PX4, and skylark_telemetry are covered in a dedicated post on the perception-to-control bridge. This post stays focused on everything above the skylark_control node: detection, tracking, identity, and gesture.

The Budget That Drives Every Decision

At 60fps, each frame has to be fully processed in 16.7ms or the pipeline falls behind. That number is the constraint behind every model and design choice in this post. The question is never "what's the most accurate model available," but "what's the most accurate model that still fits."

Measured End-to-End Latency

~11ms camera-to-velocity-setpoint · 62fps sustained throughput · measured on a Jetson Orin Nano Super at 25W, 720p input, TensorRT FP16

Per-Stage Latency Breakdown

Getting to ~11ms end to end, well inside the 16.7ms budget, only happens if every individual stage is measured, not assumed. Here's what each part of the pipeline actually costs:

Stage	Latency	Notes
YOLO11n detection	3.2ms	TensorRT FP16, 640×640
SORT tracking	0.8ms	CPU, Kalman + Hungarian
OSNet ReID	2.1ms	TensorRT FP16, 256×128 crop
YOLO11n-pose	3.8ms	TensorRT FP16, 640×640 bbox crop
End-to-end pipeline	~11ms	Camera → velocity setpoint

Why Measure Per-Stage at All?

A single end-to-end number tells you the system works. Per-stage numbers tell you where to optimize next, and which stages have headroom left if a future feature needs to borrow some of the budget.
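To make the headroom concrete, here is a small sketch that puts the table's numbers next to the 60fps budget. It assumes the four listed stages run back to back, so their sum is a meaningful lower bound; the variable names are mine, not from the SkylarkOS code.

```python
# Per-stage figures copied from the latency table in this post (milliseconds).
STAGE_MS = {
    "YOLO11n detection": 3.2,
    "SORT tracking": 0.8,
    "OSNet ReID": 2.1,
    "YOLO11n-pose": 3.8,
}
END_TO_END_MS = 11.0  # measured camera -> velocity setpoint (approximate)
TARGET_FPS = 60

frame_budget_ms = 1000.0 / TARGET_FPS          # 16.67 ms per frame at 60fps
model_total_ms = sum(STAGE_MS.values())        # 9.9 ms across the four stages
overhead_ms = END_TO_END_MS - model_total_ms   # ~1.1 ms not attributed to a listed stage
headroom_ms = frame_budget_ms - END_TO_END_MS  # ~5.7 ms left before the deadline

for name, ms in STAGE_MS.items():
    share = 100.0 * ms / frame_budget_ms
    print(f"{name:<20} {ms:4.1f} ms  ({share:4.1f}% of budget)")
print(f"unattributed overhead: {overhead_ms:.1f} ms")
print(f"headroom: {headroom_ms:.1f} ms")
```

On these numbers the two YOLO11n models account for most of the spend, so they are the first place to look if a new feature needs budget.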

The Models

Five separate models run in this pipeline, each picked for a specific job rather than reusing one general-purpose model everywhere:

Model	Purpose	Input	Output
YOLO11n	Person detection	640×640 NCHW	Bounding boxes + confidence
ArcFace MBF	Face recognition	112×112 face crop	512-D embedding
SCRFD	Face detection	Variable	Face bounding boxes
OSNet x0.25	Person re-identification	256×128 body crop	512-D appearance embedding
YOLO11n-pose	Pose estimation	640×640 NCHW	17 COCO keypoints + confidence

All five run through ONNX Runtime during development (CPU/CUDA) and switch to the TensorRT execution provider with FP16 precision on the Jetson, with no code change required to make that switch. The model and runtime are decoupled from the ROS2 nodes that call them.

Why These Models, Specifically

Each model in this table was picked against the same constraint as everything else in this post: it has to share the Jetson's compute budget with four other models and still leave room for the 16.7ms frame deadline.

YOLO11n was chosen over larger YOLO11 variants (s/m/l) because the perception stack isn't running detection alone: it shares the same hardware with pose estimation, ReID, and face recognition every frame. A bigger detector might score higher on accuracy benchmarks in isolation, but it would eat into the latency budget that YOLO11n-pose and OSNet also need. The right model here is the smallest one that still detects a person reliably at flight altitude and distance, not the most accurate one available.

SCRFD was chosen for face detection because face lock has to work across whatever lighting the owner happens to be standing in outdoors, not just controlled indoor conditions. SCRFD holds up across a wider range of lighting and pose variation than smaller, cheaper face detectors, which matters more here than shaving off another fraction of a millisecond, since a missed face detection means a failed owner lock at the start of a flight.

OSNet x0.25 was chosen over larger ReID backbones because re-identification only needs to run on a single locked track, not every object in the frame, which changes the cost-benefit calculation entirely. At x0.25 width, OSNet is light enough to re-enroll on the Jetson at the start of every flight without competing for the same compute the detector and pose model need, while still producing an appearance embedding that's accurate enough to keep tracking the right person through a flight.

These Are a Starting Point, Not the End State

All five models here are existing open architectures, chosen specifically because they're well understood and let the rest of the system (tracking, identity logic, gesture debounce, control) get built and validated quickly. The plan is to replace them with custom models trained from scratch on SkylarkOS-specific data, the same way the MobileNetV2-SSD detector covered in a separate post was built from the ground up rather than adopted off the shelf. Using proven models first made it possible to prove the full pipeline works end to end before sinking time into training models for a problem that wasn't fully defined yet.

Validated in Two Environments

The full perception pipeline runs at 25–30fps in SITL (WSL2, CPU-only ONNX Runtime on an Intel i5-9400) and 62fps on the Jetson with TensorRT FP16. Face lock, ReID enrollment, gesture detection, and owner following were all confirmed end-to-end in simulation before ever touching physical hardware.

Design Decisions

Why SORT Over DeepSORT?

DeepSORT bundles its own appearance embedding model into the tracker, running it on every tracked object every frame. SORT runs in under 1ms on CPU using just a Kalman filter and the Hungarian algorithm for detection-to-track association, with no learned appearance model at all. The appearance matching this pipeline actually needs is handled separately, by OSNet, and only for the one track that's locked as the owner, not for every track in the frame. Splitting tracking and re-identification into two stages, rather than using DeepSORT's combined approach, means the expensive appearance model runs once per frame, on the owner's crop, instead of once per tracked object.
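The association step is simple enough to sketch. The version below scores predicted track boxes against detections by IoU and picks the assignment with the highest total overlap. It is illustrative only: a real SORT implementation solves this with the Hungarian algorithm, while this sketch brute-forces permutations, which gives the same optimum but only scales to a handful of boxes. The function names and the 0.3 IoU threshold are assumptions, not values from SkylarkOS.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(predicted, detections, iou_threshold=0.3):
    """Match predicted track boxes to detections by maximum total IoU.

    Brute-force stand-in for the Hungarian algorithm; assumes
    len(predicted) <= len(detections). Returns {track_index: detection_index}
    for matched pairs whose IoU clears the threshold.
    """
    best, best_score = (), -1.0
    for perm in permutations(range(len(detections)), len(predicted)):
        score = sum(iou(predicted[t], detections[d]) for t, d in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return {
        t: d
        for t, d in enumerate(best)
        if iou(predicted[t], detections[d]) >= iou_threshold
    }


# Two tracks, three detections (the third is a newcomer with no track yet).
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)]
matches = associate(tracks, dets)
```

Nothing in this step looks at pixels, which is why it stays cheap: the only inputs are box coordinates.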

Why Re-Enroll the ReID Embedding Every Flight?

OSNet's appearance embeddings are sensitive to lighting and clothing. A face embedding can be enrolled once and reused indefinitely, but an appearance embedding can't. Instead of storing a fixed appearance embedding, the system re-enrolls it from live crops at the start of every flight, after the face-lock stage confirms identity. This trades a few seconds of setup time per flight for an appearance model that's accurate to what the owner is actually wearing that day, rather than slowly degrading in accuracy as it drifts from a stale stored embedding.
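One plausible way to build the per-flight reference is sketched below: average the embeddings from several live crops, normalize, and compare later crops by cosine similarity. Averaging, the function names, and the 0.6 threshold are all assumptions for illustration; the post does not specify how SkylarkOS combines crops. Toy 3-D vectors stand in for the 512-D OSNet output.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def enroll(embeddings):
    """Average per-crop embeddings into one unit-length reference vector."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def is_owner(reference, candidate, threshold=0.6):
    """True if the candidate crop's embedding is close enough to the reference."""
    return cosine_similarity(reference, candidate) >= threshold


# Enroll from three crops captured after face lock, then test two candidates.
reference = enroll([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0]])
same_person = is_owner(reference, [1.0, 0.05, 0.0])
other_person = is_owner(reference, [0.0, 1.0, 0.0])
```

Because the reference is rebuilt each flight, nothing here needs to be persisted between flights.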

Why Debounce Gesture Detection?

Running pose estimation on a moving bounding box crop produces noisy, frame-to-frame keypoint jitter. A single noisy frame misclassified as a gesture would be enough to trigger an unwanted flight command. The fix is a debounce filter: a gesture command is only published after N consecutive frames agree on the same detected gesture (default 3). At 30fps, that costs roughly 100ms of latency on gesture commands specifically (about 50ms at 60fps), which is acceptable since gestures are discrete commands, not a continuous control signal that needs the same 16.7ms budget as detection and tracking.
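A minimal sketch of such a filter follows. The class name is hypothetical, and publishing only once per continuous hold (rather than on every frame after the threshold) is my assumption, not something the post states.

```python
class GestureDebouncer:
    """Emit a gesture only after it has been seen on N consecutive frames."""

    def __init__(self, required_frames=3):
        self.required_frames = required_frames
        self._candidate = None  # gesture currently being counted
        self._count = 0         # consecutive frames that agreed on it
        self._fired = False     # whether this hold has already been published

    def update(self, gesture):
        """Feed one frame's classification (or None). Returns a gesture to publish, else None."""
        if gesture != self._candidate:
            # Any disagreement, including a frame with no gesture, restarts the count.
            self._candidate = gesture
            self._count = 0
            self._fired = False
        self._count += 1
        if gesture is not None and not self._fired and self._count >= self.required_frames:
            self._fired = True
            return gesture
        return None


debouncer = GestureDebouncer(required_frames=3)
frames = ["land", None, "land", "land", "land", "land"]
published = [debouncer.update(g) for g in frames]
```

The isolated first "land" is discarded when the following frame disagrees, and the command is published on the third consecutive agreeing frame.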

Engineering for a System That Has to Stay Up

A pipeline that produces the right bounding boxes in a notebook is a different problem from a pipeline that has to run unattended on a drone for the length of a flight. Most of the engineering effort in SkylarkOS went into the second problem.

Lifecycle-Managed Nodes, Not Just Running Processes

Every perception node (skylark_perception, skylark_tracking, skylark_identity) is built as a ROS2 managed lifecycle node rather than a plain node that starts inferring the moment it spins up. Each one has an explicit configure → activate → deactivate → cleanup state machine, so model loading, GPU memory allocation, and ONNX session creation happen during configuration, not at the first incoming frame. That means a node can be brought up, held in an inactive state while the rest of the system finishes initializing, and only flipped active once the flight controller is actually ready for perception data, instead of racing other nodes at boot and silently dropping the first several seconds of frames.
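The state machine can be sketched in plain Python. This is a stand-in to show where the expensive work lands, not the rclpy lifecycle API, and the class and callable names are hypothetical.

```python
class LifecycleSketch:
    """Plain-Python stand-in for a managed node's state machine (not rclpy)."""

    TRANSITIONS = {
        ("unconfigured", "configure"): "inactive",
        ("inactive", "activate"): "active",
        ("active", "deactivate"): "inactive",
        ("inactive", "cleanup"): "unconfigured",
    }

    def __init__(self, load_model):
        self.state = "unconfigured"
        self._load_model = load_model  # callable that does the expensive setup
        self.model = None

    def trigger(self, transition):
        key = (self.state, transition)
        if key not in self.TRANSITIONS:
            raise RuntimeError(f"cannot {transition} from {self.state}")
        if transition == "configure":
            # Model loading and memory allocation happen here, before any frame arrives.
            self.model = self._load_model()
        elif transition == "cleanup":
            self.model = None
        self.state = self.TRANSITIONS[key]

    def on_frame(self, frame):
        """Run inference only while active; frames are ignored in every other state."""
        if self.state != "active":
            return None
        return self.model(frame)


node = LifecycleSketch(load_model=lambda: (lambda frame: "det:" + frame))
node.trigger("configure")           # model is loaded, node is held inactive
ignored = node.on_frame("frame0")   # ignored: not active yet
node.trigger("activate")
result = node.on_frame("frame1")    # first frame processed with a warm model
```

The point of the split is that the first frame after activation pays no setup cost.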

QoS Tuned for a Camera Feed, Not a Database

ROS2's default QoS profile is built for reliability: every message gets delivered, and retried if needed. That is the wrong tradeoff for a 60fps video pipeline, where a frame delivered two cycles late is worse than useless because it is actively stale. Every image and detection topic in the system runs on a BEST_EFFORT, volatile QoS profile instead, so a slow subscriber drops old frames rather than backing up the publisher and introducing exactly the kind of latency the 16.7ms budget can't absorb.

One Pipeline, Two Runtimes, Zero Code Changes

The ONNX Runtime to TensorRT swap mentioned earlier wasn't luck; it was a deliberate boundary drawn at design time. Every model-calling node talks to a thin inference wrapper that takes a runtime/precision flag at startup; nothing above that wrapper knows or cares whether it's running CPU ONNX Runtime in WSL2 or the TensorRT FP16 execution provider on Jetson hardware. That decoupling is what made it possible to develop and debug the entire perception stack, including face lock, gesture recognition, and owner tracking, on a laptop in simulation, then deploy the exact same code to the Jetson and change only that flag.
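A sketch of what the flag-to-runtime mapping inside such a wrapper could look like. The function name and flag values are hypothetical; the provider names and the trt_fp16_enable option are ONNX Runtime's own. The function only builds the provider list, so it runs without onnxruntime installed.

```python
def select_providers(runtime, fp16=False):
    """Map a startup flag to an ONNX Runtime execution-provider list.

    Providers are listed in priority order; ONNX Runtime falls through to
    the next entry for anything the earlier provider cannot run.
    """
    if runtime == "tensorrt":
        return [
            ("TensorrtExecutionProvider", {"trt_fp16_enable": fp16}),
            "CUDAExecutionProvider",
            "CPUExecutionProvider",
        ]
    if runtime == "cuda":
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    if runtime == "cpu":
        return ["CPUExecutionProvider"]
    raise ValueError(f"unknown runtime: {runtime!r}")


# Intended use (needs the onnxruntime package and a model file; the path is a placeholder):
#   import onnxruntime as ort
#   session = ort.InferenceSession(
#       "model.onnx", providers=select_providers("tensorrt", fp16=True)
#   )
dev_providers = select_providers("cpu")
jetson_providers = select_providers("tensorrt", fp16=True)
```

Everything above this function sees only a session object, which is what keeps the calling nodes identical across both environments.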

Validating Before Trusting Real Hardware

Every behavior in this pipeline, face-lock acquisition, ReID re-enrollment, gesture debounce thresholds, owner-following logic, was exercised end-to-end against PX4 SITL before it ever ran against a physical flight controller. That meant catching a wrong threshold or a race condition between nodes was a five-minute fix in simulation instead of a re-flash-and-re-fly cycle, and it meant the first time the system touched real hardware, the only unknowns left were physical ones (vibration, lighting, latency over an actual radio link), not logic bugs.

Three Bugs That Were Worse Than They Looked

Clean architecture diagrams hide the actual cost of getting there. A few problems in this project looked like one-line fixes and turned out to be multi-day investigations.

The Stream That Worked, Until It Didn't

The first version of the GStreamer pipeline in skylark_streaming looked fine in every short test: start the node, watch the MJPEG feed in a browser, see annotated tracking boxes, call it done. It fell apart on longer runs: the stream would freeze or drop frames after a few minutes with no error in the node's own logs. The actual cause was a buffer ownership mismatch between the encode thread and the GStreamer appsrc element. Frames were being written into a buffer that the pipeline hadn't finished consuming yet, so under sustained load the queue backed up silently instead of failing loudly. The fix was switching to a bounded queue with explicit drop-oldest behavior between the annotation thread and the GStreamer source, the same "stale data is worse than no data" principle that later shaped the QoS decisions across the rest of the system.
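The drop-oldest behavior is easy to get from the standard library. Below is a minimal sketch, assuming a producer thread calling put() and a consumer thread calling get(); the class name and the default depth of 2 are hypothetical, and the GStreamer side is left out entirely.

```python
import threading
from collections import deque


class DropOldestQueue:
    """Bounded hand-off queue that discards the oldest frame when full."""

    def __init__(self, maxlen=2):
        self._frames = deque(maxlen=maxlen)  # a full deque evicts from the left on append
        self._cond = threading.Condition()
        self.dropped = 0  # count of evicted frames, useful as a health metric

    def put(self, frame):
        """Never blocks. The caller should pass a copy it will not write to again."""
        with self._cond:
            if len(self._frames) == self._frames.maxlen:
                self.dropped += 1
            self._frames.append(frame)
            self._cond.notify()

    def get(self, timeout=None):
        """Return the oldest queued frame, or None if nothing arrives in time."""
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._frames) > 0, timeout):
                return None
            return self._frames.popleft()


queue = DropOldestQueue(maxlen=2)
for frame_id in range(5):  # producer outruns the consumer
    queue.put(frame_id)
first, second = queue.get(), queue.get()
```

Counting drops instead of hiding them turns the original silent failure into something a log line or telemetry field can surface.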

A Container That Built, But Didn't Work

Getting JetPack's CUDA, cuDNN, and TensorRT versions to actually agree with ONNX Runtime's GPU build inside a Docker container was its own project. The container would build clean and run, but inference would silently fall back to CPU, or crash a layer deep into a TensorRT engine build with an error that pointed at the wrong library. Tracking that down meant pinning exact versions across four interdependent components (JetPack's base image, CUDA, cuDNN, TensorRT) instead of trusting "latest" tags, and verifying the GPU execution provider was actually active at runtime rather than just checking that the container started without errors. A container that builds is not the same as a container that runs the workload it was built for.
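The runtime check can be a few lines at startup. This sketch takes the provider list that an ONNX Runtime session reports through get_providers() and fails loudly if the expected one is missing; the function name is hypothetical.

```python
def check_gpu_provider(active_providers, expected="TensorrtExecutionProvider"):
    """Raise if the session did not actually bind the expected execution provider."""
    if expected not in active_providers:
        raise RuntimeError(
            f"expected {expected}, but the session is using {active_providers}; "
            "inference has fallen back to a different provider"
        )
    return True


# Intended use, right after creating the session:
#   check_gpu_provider(session.get_providers())
ok = check_gpu_provider(
    ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)
```

Running this as part of node startup makes a silent CPU fallback fail the launch instead of showing up later as a mysterious frame-rate drop.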

Frames Were Arriving, Detections Were Stale

The QoS mismatch was the most deceptive bug in the whole stack because every individual node looked correct in isolation. skylark_perception was publishing on a reliable QoS profile by default, and downstream subscribers were happily receiving every message, just later than they should have, because DDS was queuing and retrying delivery instead of dropping anything. The system never crashed and never logged an error; it just got progressively laggier the longer it ran, which made it look like a performance problem rather than a configuration one. Switching every image and detection topic to BEST_EFFORT, volatile QoS fixed it immediately, and it's the reason that QoS choice is called out explicitly earlier in this post: it wasn't a default left alone, it was a bug found the hard way.
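The "progressively laggier" symptom falls out of a toy model. The sketch below is not DDS: it treats the reliable path as an unbounded in-order queue and the best-effort path as "always take the newest published frame", and the 20ms subscriber cost is a made-up figure chosen to be slightly slower than the 60fps publish rate.

```python
def frame_age_ms(num_frames, publish_ms, service_ms, keep_all):
    """Age (ms) of each processed frame at the moment the subscriber finishes it."""
    ages = []
    now = 0.0     # subscriber clock
    next_idx = 0  # earliest frame the subscriber has not handled yet
    while True:
        if keep_all:
            idx = next_idx  # queue everything, process strictly in order
        else:
            idx = max(next_idx, int(now // publish_ms))  # skip ahead to the newest frame
        if idx >= num_frames:
            break
        published_at = idx * publish_ms
        start = max(now, published_at)  # wait if that frame is not published yet
        now = start + service_ms
        ages.append(now - published_at)
        next_idx = idx + 1
    return ages


PUBLISH_MS = 1000.0 / 60  # camera publishes at 60fps
SERVICE_MS = 20.0         # hypothetical subscriber that needs 20ms per frame

reliable = frame_age_ms(600, PUBLISH_MS, SERVICE_MS, keep_all=True)
best_effort = frame_age_ms(600, PUBLISH_MS, SERVICE_MS, keep_all=False)
print(f"queue everything: last frame is {reliable[-1]:.0f} ms old")
print(f"keep latest only: worst frame is {max(best_effort):.0f} ms old")
```

With a subscriber only about 3.3ms per frame too slow, the queued path falls roughly two seconds behind over ten seconds of frames, while the keep-latest path stays within one publish interval plus one service time and simply processes fewer frames.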

Holding 60fps, Not Just Hitting It Once

Getting a single frame through the pipeline fast was easy. Holding a consistent 60fps over a multi-minute run, with lifecycle transitions, ReID re-enrollment, and gesture debounce all running concurrently, was not. Early versions had visible frame pacing jitter: individually fast stages that still produced an uneven output rate because of contention between the inference thread and the encode/annotation thread sharing the same image buffer. Fixing this meant treating thread scheduling and buffer access patterns as seriously as the model latency numbers in the table above; a pipeline that averages 60fps with high variance behaves worse for a flight controller than one that holds a steady 50fps, because the control loop downstream is tuned for consistent timing, not just a good average.
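Pacing is measurable with nothing more than output timestamps. The sketch below compares a steady 50fps stream against a bursty one that averages slightly above 60fps; both sequences are synthetic, and the function name is hypothetical.

```python
import statistics
from itertools import accumulate


def pacing_stats(timestamps_ms):
    """Summarize frame pacing from a list of output timestamps in milliseconds."""
    intervals = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    return {
        "mean_fps": 1000.0 / statistics.fmean(intervals),
        "jitter_ms": statistics.pstdev(intervals),  # spread of frame-to-frame gaps
        "worst_interval_ms": max(intervals),        # longest wait for fresh data
    }


# Synthetic streams: a constant 20ms gap versus alternating 5ms and 28ms gaps.
steady = pacing_stats(list(accumulate([20.0] * 60, initial=0.0)))
bursty = pacing_stats(list(accumulate([5.0, 28.0] * 30, initial=0.0)))
print(steady)
print(bursty)
```

The bursty stream wins on average frame rate but makes the control loop wait longer for its worst-case update, which is the number a downstream controller actually feels.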

Conclusion

Key Takeaways

Getting the perception pipeline down to ~11ms on edge hardware wasn't the result of one clever optimization. It came from treating the 16.7ms frame budget as a hard constraint on every individual model and design choice: picking the cheaper tracker when the expensive one's extra cost bought nothing extra here, and measuring every stage individually rather than trusting an aggregate number. The same discipline that goes into choosing a model architecture has to extend to how it's deployed and measured, or the accuracy on paper never makes it into the system that ships.