MITE-Net: SWaP-Optimized 4K Video Tiny Target Perception for Embodied Edge SAR

Xu, Mingshuo; Hua, Mu; Peng, Jigen; Wang, Qi; Yue, Shigang

This paper is currently under review

The submission is under evaluation. Please check back for updates.

MITE-Net: SWaP-Optimized 4K Video Tiny Target Perception for Embodied Edge SAR

Mingshuo Xu^1†, Mu Hua^1†, Jigen Peng², Qi Wang^1*, and Shigang Yue^1*, Senior Member, IEEE

¹School of Mathematics and Computing Science, University of Leicester, Leicester LE1 7RH, UK ²Machine Life and Intelligence Research Center, Guangzhou University, Guangzhou 510006, China
^†Equal contribution ^*Corresponding author

Paper Supplementary Code arXiv

Abstract

Real-time tiny target perception in high-resolution imagery is critical for embodied Search-and-Rescue (SAR) missions. However, strict Size, Weight, and Power (SWaP) constraints on edge devices like UAVs create a bottleneck: traditional image downsampling causes severe feature loss, while slice-based processing incurs prohibitive latency. To address this gap, this paper introduces a comprehensive framework encompassing a novel architecture, specialized datasets, and hardware-level benchmarks. First, we propose MITE-Net (Motion-Informed Tiny-target Edge Network), a SWaP-optimized cascaded architecture, which couples a bio-inspired, learning-free Tiny Target Motion-Based Region Proposal Network (TTM-RPN) with a sub-0.14M-parameter R-CNN-like head. Second, to standardize 4K tiny target evaluation, we construct the SAR-Tiny Datasets by relabeling two challenging UAV datasets: SeaDroneSee-Tiny (dynamic maritime scenes, tiny targets predominantly of 64–256 pixels) and UAVID-Tiny (cluttered urban scenes, extremely tiny targets, ≤ 64 pixels). Third, we benchmark against state-of-the-art YOLO models on an edge device, NVIDIA Jetson AGX Xavier, where MITE-Net directly processes 4K maritime imagery, achieving a 100% search success rate at 30.33 FPS. Consuming merely 3.19 W (9.51 FPS/W), MITE-Net vastly outperforms YOLO baselines in target recall and energy efficiency. Conversely, UAVID-Tiny evaluations expose a compound structural limitation: the learning-free bionic front-end struggles against urban backgrounds, while the ultra-lightweight head lacks representational capacity for complex features. Ultimately, this work delivers an efficient onboard perception paradigm and a rigorous baseline guiding future end-to-end SAR architectures.

MITE-Net Architecture

Overall architecture of the proposed MITE-Net. The cascaded framework consists of two stages. Left: A learning-free Bionic-Driven TTM-RPN utilizes vSTMD to extract motion-salient RoIs, which are cropped and concatenated with spatial coordinates into multi-channel inputs. Right: An ultra-lightweight R-CNN-like Head (<0.14M parameters) processes these inputs using dilated and group convolutions, followed by a dual-branch channel attention structure for precise classification and bounding box regression.

Size selectivity curves of the TTM-RPN for target widths of (A) 4 pixels, (B) 12 pixels, and (C) 16 pixels. By configuring the suppression kernel size to correspond with the input target dimensions, the model consistently achieves its peak relative response when the target height matches the specified target width (pink shaded "Prefer Region"). The qualitative trend remains robust across various pooling parameters (ds ∈ {1, 2, 4}).

SAR-Tiny Datasets

Dataset Statistics

Statistical overview of the curated SeaDroneSee-Tiny dataset. (A) Object area frequency distribution showing a heavily right-skewed trend, where the majority of targets are Extremely Tiny (ET, Area ≤ 64 px²) and Tiny (64 < Area ≤ 256 px²). (B) Joint distribution of target width and height across train/val/test splits, partitioned by three scale regimes. The concentrated test-set contour in the ET and Tiny regions highlights the deliberate "stress test" design. (C) An example frame illustrating detection of multiple scattered targets: GT 1 (swimmer) and GT 2 (swimmer with life jacket).

Statistical overview of the curated UAVID-Tiny dataset. (A) Object area frequency distribution with a massive concentration strictly within the Extremely Tiny (ET, Area ≤ 64 px²) region. (B) Joint distribution showing deep penetration into the ET region, underscoring the dataset's complexity. (C) Example test frame of minute targets in cluttered urban background: GT 1 (pedestrians), GT 2 (moving cars), and GT 3 (static cars).

Dataset Summary

Dataset	Split	Sequences	Frames	BBoxes	Target Density	Key Characteristics
SeaDroneSee-Tiny	Train	seq 2–8	3,858	7,178	1–5 / frame	Dynamic maritime backgrounds.
	Val	seq 9	1,001	3,003	3 / frame	Baseline maritime background evaluation.
	Test	seq 1	1,001	11,243	12 / frame	Density mimicking real-world SAR crises.
UAVID-Tiny	Train	seq 2,5,7,8,16,17,33	3,208	63,398	10–50 / frame	Massive urban clutter.
	Val	seq 24	901	15,063	~16 / frame	Occlusion and varied lighting conditions.
	Test	seq 23	701	9,839	~14 / frame	Early-stage target discovery in urban scenes.

Dataset Demonstration & More Details — Video samples, annotation visualizations, and full dataset documentation are available at the dedicated dataset page: SAR-Tiny-HomePage

Experimental Results

SAR Mission-Level Benchmark

		YOLOv11n Resize 1920×1088	YOLOv8n-P2 Resize 1920×1088	YOLOv8n-P2 (SAHI) 1344×768×9	MITE-Net (Ours) Raw 4K + 2ds	MITE-Net (Ours) Raw 4K + 8ds
SeaDroneSee-Tiny	Search Success Rate (%, ↑)	83.33	83.33	100	100	83.33
	False Alarm Rate (%, ↓)	6.83	16.77	3.12	67.72	82.29
	Max Search Time (Frame, ↓)	126	74	385	299	119
	Avg. Search Time (Frame, ↓)	17.90	11.20	62.58	69.00	36.80
UAVID-Tiny	Search Success Rate (%, ↑)	11.11	22.22	36.11	8.33	0.00
	False Alarm Rate (%, ↓)	36.90	54.87	21.35	99.39	99.98
	Gray values denote severe algorithmic degradation. MITE-Net excels in maritime SAR (100% SSR) but encounters structural limitations in hyper-cluttered urban scenes.

SWaP	Parameters (M, ↓)	2.59	3.01	2.93	0.14	0.14
	Latency (ms, Batch=1, ↓)	34.31	32.03	208.68	32.97	19.11
	Inference Power (W, ↓)	9.91	10.15	13.53	3.19	1.41
	Efficiency (FPS/W, ↑)	2.94	3.08	0.35	9.51	13.54

Inference on NVIDIA Jetson AGX Xavier (Float16). Bold = best, underline = second best.

Qualitative Detection Results

Qualitative demonstration of 4K tiny target perception for the SAR mission using the proposed MITE-Net and baseline models. The top row displays the original 4K resolution images, while the bottom panels show the corresponding zoomed-in regions. Due to the extremely tiny scale of the targets, they appear merely as blurry spots even after significant magnification. The first column shows results on SeaDroneSee-Tiny, where MITE-Net demonstrates excellent detection performance. The second column shows a classification failure case on UAVID-Tiny. The third column shows a bounding box regression failure when detecting multiple tiny objects. Notably, when targets consist of merely a dozen pixels, the recall rate of mainstream YOLO baselines also drops to near zero.

Comparison of search success rate versus processing speed (FPS) among different models. The green shaded area denotes the real-time processing region (≥ 30 FPS), and the annotated numbers indicate the efficiency metric (FPS/W) for each tested configuration. Our proposed MITE-Net achieves an optimal trade-off, maintaining a 100% search success rate within the real-time processing region while demonstrating significantly higher efficiency (12.9×) over the best baseline.

Qualitative Detection Videos

Comparison of MITE-Net with YOLOv11n and YOLOv8n-p2 baselines on both SeaDroneSee-Tiny (maritime) and UAVID-Tiny (urban) test sets. Use the carousel navigation to explore different models and datasets.

SeaDroneSee-Tiny · MITE-Net

SeaDroneSee-Tiny · YOLOv11n

SeaDroneSee-Tiny · YOLOv8n-p2

UAVID-Tiny · MITE-Net

UAVID-Tiny · YOLOv11n

UAVID-Tiny · YOLOv8n-p2

Poster (Generlized by Google NotebookLLM with Gemini)

BibTeX

@article{xu2026mitenet,
  title={MITE-Net: SWaP-Optimized 4K Video Tiny Target Perception for Embodied Edge SAR},
  author={Xu, Mingshuo and Hua, Mu and Peng, Jigen and Wang, Qi and Yue, Shigang},
  journal={arXiv preprint arXiv:},
  year={2026}
}