This paper is currently under review

MITE-Net: SWaP-Optimized 4K Video Tiny Target Perception for Embodied Edge SAR

Mingshuo Xu1†, Mu Hua1†, Jigen Peng2, Qi Wang1*, and Shigang Yue1*, Senior Member, IEEE
1 School of Mathematics and Computing Science, University of Leicester, Leicester LE1 7RH, UK
2 Machine Life and Intelligence Research Center, Guangzhou University, Guangzhou 510006, China
†Equal contribution    *Corresponding author

Abstract

Real-time tiny target perception in high-resolution imagery is critical for embodied Search-and-Rescue (SAR) missions. However, strict Size, Weight, and Power (SWaP) constraints on edge devices like UAVs create a bottleneck: traditional image downsampling causes severe feature loss, while slice-based processing incurs prohibitive latency. To address this gap, this paper introduces a comprehensive framework encompassing a novel architecture, specialized datasets, and hardware-level benchmarks. First, we propose MITE-Net (Motion-Informed Tiny-target Edge Network), a SWaP-optimized cascaded architecture, which couples a bio-inspired, learning-free Tiny Target Motion-Based Region Proposal Network (TTM-RPN) with a sub-0.14M-parameter R-CNN-like head. Second, to standardize 4K tiny target evaluation, we construct the SAR-Tiny Datasets by relabeling two challenging UAV datasets: SeaDroneSee-Tiny (dynamic maritime scenes, tiny targets predominantly of 64–256 pixels) and UAVID-Tiny (cluttered urban scenes, extremely tiny targets, ≤ 64 pixels). Third, we benchmark against state-of-the-art YOLO models on an edge device, NVIDIA Jetson AGX Xavier, where MITE-Net directly processes 4K maritime imagery, achieving a 100% search success rate at 30.33 FPS. Consuming merely 3.19 W (9.51 FPS/W), MITE-Net vastly outperforms YOLO baselines in target recall and energy efficiency. Conversely, UAVID-Tiny evaluations expose a compound structural limitation: the learning-free bionic front-end struggles against urban backgrounds, while the ultra-lightweight head lacks representational capacity for complex features. Ultimately, this work delivers an efficient onboard perception paradigm and a rigorous baseline guiding future end-to-end SAR architectures.
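The cascade described in the abstract (a learning-free motion front-end that proposes regions, followed by a very small classification head that only sees those regions) can be illustrated with a minimal sketch. This is not the TTM-RPN: plain frame differencing on a coarse grid is used here as a stand-in, and the function names, the `cell` size, and the `thresh` value are all assumptions made for illustration.

```python
import numpy as np

def motion_proposals(prev, curr, cell=8, thresh=25.0):
    """Propose (x, y, w, h) boxes for grid cells with strong inter-frame change.

    Stand-in for a learning-free motion front-end; simple frame differencing,
    NOT the bio-inspired TTM-RPN from the paper.
    """
    diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32))
    h, w = diff.shape
    hc, wc = h // cell, w // cell
    # Peak absolute difference inside each cell (frame cropped to whole cells).
    cells = diff[:hc * cell, :wc * cell].reshape(hc, cell, wc, cell).max(axis=(1, 3))
    ys, xs = np.nonzero(cells > thresh)
    return [(int(x) * cell, int(y) * cell, cell, cell) for y, x in zip(ys, xs)]

def crop(frame, box):
    """Cut the proposed region out of the full-resolution frame."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w]

# Synthetic grayscale frames: a 3x3 bright target appears in the second frame.
prev = np.zeros((64, 64), dtype=np.uint8)
curr = prev.copy()
curr[18:21, 41:44] = 200

boxes = motion_proposals(prev, curr)
patches = [crop(curr, b) for b in boxes]  # only these crops would reach a small head
print(boxes)
```

The point of the design is that the expensive step (classification) runs on a handful of small crops rather than on the full 4K frame, so no global downsampling of the tiny targets is needed before the head sees them.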

MITE-Net Architecture

SAR-Tiny Datasets

Dataset Statistics

Dataset Summary

| Dataset | Split | Sequences | Frames | BBoxes | Target Density | Key Characteristics |
|---|---|---|---|---|---|---|
| SeaDroneSee-Tiny | Train | seq 2–8 | 3,858 | 7,178 | 1–5 / frame | Dynamic maritime backgrounds. |
| SeaDroneSee-Tiny | Val | seq 9 | 1,001 | 3,003 | 3 / frame | Baseline maritime background evaluation. |
| SeaDroneSee-Tiny | Test | seq 1 | 1,001 | 11,243 | 12 / frame | Density mimicking real-world SAR crises. |
| UAVID-Tiny | Train | seq 2, 5, 7, 8, 16, 17, 33 | 3,208 | 63,398 | 10–50 / frame | Massive urban clutter. |
| UAVID-Tiny | Val | seq 24 | 901 | 15,063 | ~16 / frame | Occlusion and varied lighting conditions. |
| UAVID-Tiny | Test | seq 23 | 701 | 9,839 | ~14 / frame | Early-stage target discovery in urban scenes. |
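As a quick sanity check on the summary, the mean number of boxes per frame can be derived from the Frames and BBoxes columns. The sketch below assumes only that mean density is BBoxes divided by Frames; the Target Density column appears to list typical per-frame counts or ranges, so the means need not match it exactly.

```python
# (frames, bboxes) per split, copied from the dataset summary.
splits = {
    ("SeaDroneSee-Tiny", "Train"): (3858, 7178),
    ("SeaDroneSee-Tiny", "Val"): (1001, 3003),
    ("SeaDroneSee-Tiny", "Test"): (1001, 11243),
    ("UAVID-Tiny", "Train"): (3208, 63398),
    ("UAVID-Tiny", "Val"): (901, 15063),
    ("UAVID-Tiny", "Test"): (701, 9839),
}

def mean_density(frames, bboxes):
    """Mean annotated boxes per frame (assumed definition, not from the source)."""
    return bboxes / frames

density = {key: mean_density(*counts) for key, counts in splits.items()}
for (dataset, split), value in density.items():
    print(f"{dataset:17s} {split:5s} {value:6.2f} boxes/frame")
```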

Experimental Results

SAR Mission-Level Benchmark

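| Dataset | Metric | YOLOv11n (Resize 1920×1088) | YOLOv8n-P2 (Resize 1920×1088) | YOLOv8n-P2 (SAHI, 1344×768×9) | MITE-Net (Ours, Raw 4K + 2ds) | MITE-Net (Ours, Raw 4K + 8ds) |
|---|---|---|---|---|---|---|
| SeaDroneSee-Tiny | Search Success Rate (%, ↑) | 83.33 | 83.33 | 100 | 100 | 83.33 |
| SeaDroneSee-Tiny | False Alarm Rate (%, ↓) | 6.83 | 16.77 | 3.12 | 67.72 | 82.29 |
| SeaDroneSee-Tiny | Max Search Time (Frame, ↓) | 126 | 74 | 385 | 299 | 119 |
| SeaDroneSee-Tiny | Avg. Search Time (Frame, ↓) | 17.90 | 11.20 | 62.58 | 69.00 | 36.80 |
| UAVID-Tiny | Search Success Rate (%, ↑) | 11.11 | 22.22 | 36.11 | 8.33 | 0.00 |
| UAVID-Tiny | False Alarm Rate (%, ↓) | 36.90 | 54.87 | 21.35 | 99.39 | 99.98 |

Gray values denote severe algorithmic degradation. MITE-Net excels in maritime SAR (100% SSR) but encounters structural limitations in hyper-cluttered urban scenes.

| SWaP Metric | YOLOv11n (Resize 1920×1088) | YOLOv8n-P2 (Resize 1920×1088) | YOLOv8n-P2 (SAHI, 1344×768×9) | MITE-Net (Ours, Raw 4K + 2ds) | MITE-Net (Ours, Raw 4K + 8ds) |
|---|---|---|---|---|---|
| Parameters (M, ↓) | 2.59 | 3.01 | 2.93 | 0.14 | 0.14 |
| Latency (ms, Batch=1, ↓) | 34.31 | 32.03 | 208.68 | 32.97 | 19.11 |
| Inference Power (W, ↓) | 9.91 | 10.15 | 13.53 | 3.19 | 1.41 |
| Efficiency (FPS/W, ↑) | 2.94 | 3.08 | 0.35 | 9.51 | 13.54 |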

Inference on NVIDIA Jetson AGX Xavier (Float16). Bold = best, underline = second best.
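The relation between the latency, power, and efficiency rows can be checked with a short calculation. The sketch below assumes throughput is 1000 divided by the batch-1 latency in milliseconds and efficiency is throughput divided by inference power; the page does not state the authors' exact formula. Under that assumption the reported figures are reproduced for the three YOLO configurations and for MITE-Net at 2ds (including the 30.33 FPS quoted in the abstract). The 8ds configuration is left out because this relation does not reproduce its listed value.

```python
def fps(latency_ms):
    """Frames per second implied by a batch-1 latency in milliseconds."""
    return 1000.0 / latency_ms

def fps_per_watt(latency_ms, power_w):
    """Energy efficiency as throughput divided by inference power (assumed formula)."""
    return fps(latency_ms) / power_w

# (latency in ms, inference power in W), copied from the SWaP rows.
configs = {
    "YOLOv11n (resize)": (34.31, 9.91),
    "YOLOv8n-P2 (resize)": (32.03, 10.15),
    "YOLOv8n-P2 (SAHI)": (208.68, 13.53),
    "MITE-Net (2ds)": (32.97, 3.19),
}

for name, (latency_ms, power_w) in configs.items():
    print(f"{name:20s} {fps(latency_ms):6.2f} FPS  "
          f"{fps_per_watt(latency_ms, power_w):5.2f} FPS/W")
```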

Qualitative Detection Results

Qualitative Detection Videos

Comparison of MITE-Net with the YOLOv11n and YOLOv8n-P2 baselines on the SeaDroneSee-Tiny (maritime) and UAVID-Tiny (urban) test sets.

Poster (Generated by Google NotebookLM with Gemini)

BibTeX

@article{xu2026mitenet,
  title={MITE-Net: SWaP-Optimized 4K Video Tiny Target Perception for Embodied Edge SAR},
  author={Xu, Mingshuo and Hua, Mu and Peng, Jigen and Wang, Qi and Yue, Shigang},
  journal={arXiv preprint arXiv:},
  year={2026}
}