Open-vocabulary perception pipeline for aerial asset inspection
Reference implementation of a two-stage perception stack applied across seven drone-captured inspection scenarios spanning telecom towers, photovoltaic arrays, high-voltage transmission infrastructure, wind turbine blades, and construction safety compliance. Stage 1 produces per-frame open-vocabulary bounding-box detections using a 3B-parameter vision-language model with Parallel Box Decoding (PBD); the detection vocabulary is set per-prompt at inference time, with no model retraining between domains. Stage 2 reconstructs a sparse 3D point cloud over the same frames using a feed-forward geometry transformer (no SfM, no explicit camera optimization).
The same model that locates antenna panels, mounting brackets, and corrosion on a telecom tower also locates soiled solar modules, flashover damage on insulators, leading-edge erosion on wind blades, and PPE (hard hats, safety vests, harnesses) on a construction site. Domain adaptation is a natural-language prompt, not a finetune. Model outputs are produced on AWS spot GPU instances and served as static JSON / PLY artifacts; bounding-box overlays are drawn client-side on a 2D Canvas, with track-id continuity interpolated between sampled detection frames.
#Methodology
Detection
Each input video is downsampled to 2 fps. Each sampled frame is passed through LocateAnything-3B with two prompts: a component-detection prompt over a fixed 8-class vocabulary, and a damage-grounding prompt iterated across 6 natural-language defect descriptions. The output tokens are parsed with a regular expression that matches the model's <box> protocol.
    # Domain vocabulary is set at inference time, not at training time.
    prompts = {
        "telecom": "antenna panel, mounting bracket, coaxial cable, RRU, cable tray, lightning rod, structural crossarm",
        "solar": "solar panel, junction box, MC4 connector, mounting rail, conduit",
        "utility": "suspension insulator, conductor, crossarm, transmission tower, vibration damper",
        "wind": "turbine blade, blade tip, leading edge, trailing edge, lightning receptor",
        "safety": "worker, hard hat, safety vest, harness, scaffolding, excavator",
    }
    # Each call to detect(img, prompts[vertical]) uses the same weights
    # and the same forward pass, with no finetune.
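The 2 fps downsampling step can be sketched as picking the native frame shown at each tick of a regular 0.5 s grid. This is a minimal illustration, not the pipeline's actual sampler; the function name `sample_frame_indices` and its parameters are hypothetical.

```python
def sample_frame_indices(n_frames: int, native_fps: float,
                         target_fps: float = 2.0) -> list[int]:
    """Return indices of native frames on a regular target_fps grid.

    Hypothetical helper: for each sample time k / target_fps, take the
    native frame that is on screen at that time (floor of the position).
    """
    if n_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    duration_s = n_frames / native_fps
    n_samples = int(duration_s * target_fps)  # whole samples that fit
    step = native_fps / target_fps            # native frames per sample
    return [min(int(k * step), n_frames - 1) for k in range(n_samples)]


if __name__ == "__main__":
    # A 10 s clip at 30 fps yields 20 samples, one every 15 native frames.
    print(sample_frame_indices(300, 30.0))
```

Each returned index would then be decoded and passed to the detector once per prompt.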
Tracking
Adjacent sampled frames are associated via Hungarian-optimal IoU matching restricted to same-label detections. A persistent track_id is assigned per object; tracks decay after 1.5 seconds without a match. The browser interpolates bounding-box positions linearly between samples using shared track_ids and tweens at the video render rate.
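    for frame_t in samples:
        # cost is built over same-label pairs only
        cost[i, j] = 1 - IoU(det[i].bbox, track[j].last_bbox)
        rows, cols = linear_sum_assignment(cost)
        # linear_sum_assignment takes no threshold; filter pairs with IoU < 0.3 afterwards
        matches = [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= 1 - 0.3]
        apply(matches); spawn new tracks for unmatched detections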
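A runnable sketch of the association and interpolation steps follows. It uses SciPy's `linear_sum_assignment` and assumes the 0.3 threshold is a minimum IoU for accepting a match. The dict fields (`label`, `bbox`), the `(x1, y1, x2, y2)` box layout, and the helper names are assumptions made for illustration, not the pipeline's actual schema. Track decay after 1.5 s is omitted.

```python
from scipy.optimize import linear_sum_assignment

FORBIDDEN = 1e6  # cost for pairs with different labels (never accepted)


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match(dets, tracks, iou_threshold=0.3):
    """Return (matches, unmatched_det_indices) for one sampled frame.

    dets / tracks: lists of dicts with hypothetical keys 'label', 'bbox'.
    """
    if not dets or not tracks:
        return [], list(range(len(dets)))
    cost = [[1.0 - iou(d["bbox"], t["bbox"]) if d["label"] == t["label"]
             else FORBIDDEN for t in tracks] for d in dets]
    rows, cols = linear_sum_assignment(cost)  # minimises total cost
    # Reject assigned pairs whose IoU falls below the threshold.
    matches = [(int(i), int(j)) for i, j in zip(rows, cols)
               if cost[i][j] <= 1.0 - iou_threshold]
    matched = {i for i, _ in matches}
    return matches, [i for i in range(len(dets)) if i not in matched]


def lerp_bbox(b0, b1, t0, t1, t):
    """Linearly interpolate a box between sample times t0 and t1."""
    alpha = min(max((t - t0) / (t1 - t0), 0.0), 1.0)
    return tuple(p + alpha * (q - p) for p, q in zip(b0, b1))
```

Unmatched detections would spawn new tracks. `lerp_bbox` mirrors what the browser does between two samples that share a track_id.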
3D reconstruction
Twenty-four keyframes are sampled uniformly across the video and passed through VGGT-1B in a single forward pass. The model outputs world-space point maps, camera intrinsics and extrinsics, and depth maps. We retain the top 50% of points by confidence, recenter the cloud on its centroid, and rescale it so that the 95th-percentile distance from the centroid equals 1. The cloud is then randomly downsampled to 80k points before serialization.
    out = VGGT(frames)
    pts = out.world_points.reshape(-1, 3)
    mask = out.confidence > p50(confidence)
    pts = (pts[mask] - mean) / p95(norm)
    write_ply(pts[random_sample(80_000)])
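The keyframe selection and point-cloud post-processing can be sketched in NumPy as below. The model call and PLY serialization are left out. The function names, the array shapes, and the fixed seed are assumptions for illustration, not the pipeline's actual code.

```python
import numpy as np


def keyframe_indices(n_frames: int, n_keyframes: int = 24) -> np.ndarray:
    """Uniformly spaced frame indices, including the first and last frame."""
    n = min(n_keyframes, n_frames)
    return np.linspace(0, n_frames - 1, num=n).round().astype(int)


def postprocess_points(points, confidence, max_points=80_000, seed=0):
    """Filter, recenter, rescale, and downsample a point map.

    points:     array of shape (..., 3), world-space points (assumed layout)
    confidence: array whose shape matches points without the last axis
    """
    pts = np.asarray(points, dtype=np.float64).reshape(-1, 3)
    conf = np.asarray(confidence, dtype=np.float64).reshape(-1)

    # Keep points whose confidence is above the median (top 50%).
    pts = pts[conf > np.percentile(conf, 50)]

    # Recenter on the centroid, then scale so the p95 distance is 1.
    pts = pts - pts.mean(axis=0)
    scale = np.percentile(np.linalg.norm(pts, axis=1), 95)
    if scale > 0:
        pts = pts / scale

    # Random downsample without replacement if the cloud is too large.
    if len(pts) > max_points:
        rng = np.random.default_rng(seed)
        pts = pts[rng.choice(len(pts), size=max_points, replace=False)]
    return pts
```

Scaling by the 95th percentile rather than the maximum keeps a few far-away outlier points from shrinking the whole cloud in the viewer.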
Serving
All outputs are written to S3 as static objects with a 24h cache TTL. The browser fetches detections.json on demand when the modal opens and renders bounding boxes on a 2D Canvas in a requestAnimationFrame loop synced to video.currentTime. PLY clouds are loaded lazily via the three.js PLYLoader.
s3://aerioai-demo-471176250120/
├── videos/{id}.mp4 # input
├── detections/{id}.json # 2 fps frames + tracks
└── 3d/{id}.ply # sparse point cloud
    {id}.poses.json # camera extrinsics
#Pipeline
#Runs
Seven input videos processed end-to-end across five inspection verticals using a single unchanged model and pipeline. Click a row to view artifacts.
| id | label | vertical | source | duration | resolution | notes |
|---|---|---|---|---|---|---|
| demo-001 | tower_cu_01 | telecom | youtube/TFCwRzQ5lVI | 3m19s | 1280×720 | close-orbit DJI Mavic Pro, single tower, antenna cluster + RRU |
| demo-002 | tower_wide_02 | telecom | youtube | 1m50s | 1280×720 | wide-shot monopole, sky + rural field background |
| demo-003 | tower_topdown_03 | telecom | youtube | 1m15s | 1280×720 | top-down view of antenna cluster; unique angle, radial panel layout |
| demo-004 | solar_pv_01 | solar | youtube | 1m47s | 1280×720 | PV farm aerial pass; soiling, hotspot, cracked-cell flags |
| demo-005 | trans_line_01 | utility | youtube | 1m03s | 1280×720 | HV transmission tower aerial; insulator string + structural members |
| demo-006 | wind_blade_01 | renewables | youtube/AV1isLII4TA | 0m27s | 720×1280 | DJI Matrice 4T 112x zoom; leading-edge erosion + lightning strike |
| demo-007 | demo_site_07 | construction | youtube | 1m16s | 1280×720 | demolition interior; excavator + structural columns + debris |
#Live inference
The model that produced every overlay above is also exposed as a hosted endpoint. Single-image inference takes 5-8 seconds; the same prompts that drive the pre-loaded runs work on any uploaded image. Production replaces the hosted endpoint with an owned AWS deployment at ~$0.001 per query.
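As a rough check on the ~$0.001 figure, the per-query cost can be estimated from GPU time alone. This sketch assumes a self-hosted query takes the same 5-8 seconds as the hosted endpoint, that the instance is billed at the $0.35/hr spot cap listed in the pipeline spec, and that storage, transfer, and idle time are ignored. None of these assumptions is confirmed by measurements on this page.

```python
SPOT_PRICE_PER_HOUR = 0.35  # USD, spot price cap from the pipeline spec


def cost_per_query(latency_s: float,
                   price_per_hour: float = SPOT_PRICE_PER_HOUR) -> float:
    """GPU cost in USD of one query that occupies the instance for latency_s."""
    return latency_s * price_per_hour / 3600.0


if __name__ == "__main__":
    for latency in (5.0, 8.0):  # assumed latency range, in seconds
        print(f"{latency:.0f} s -> ${cost_per_query(latency):.5f} per query")
```

Under these assumptions both ends of the latency range come out below $0.001 per query, which is consistent with the stated order of magnitude.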
Inference runs in a separate window because the hosted endpoint is a third-party service that we do not theme or proxy. The model accepts an image plus a free-text prompt and returns boxes / points, using the same JSON schema that is rendered on this page.
| task | prompt | use case |
|---|---|---|
| Detection | antenna panel, mounting bracket, coaxial cable, RRU | Multi-class component inventory. |
| Grounding | rust or corrosion on metal | Open-vocab damage flagging. |
| OCR | (no input — detects all visible text) | Extract every serial / asset tag. |
| Pointing | the bolt that needs to be retightened | Pixel-precise work-order target. |
| Referring | the antenna closest to the lightning rod | Relative spatial reference. |
| Detection | solar panel with visible soiling or hotspot | Solar PV inspection. |
Select Run Inference after uploading an image. Logs and inference history live on the hosted side and are not visible to this dashboard. A self-hosted deployment on AWS spot (g4dn.xlarge or larger) exposes logs, throughput metrics, and per-query cost in the existing pipeline.
#Model capability comparison
This stack uses LocateAnything-3B as the primary detector. The table below documents the capability surface of the model relative to other open vision-grounding models considered.
| capability | LocateAnything-3B | YOLO-World | Grounding DINO | SAM 2 | Florence-2 |
|---|---|---|---|---|---|
| Open-vocab detection | ✓ | ✓ | ✓ | ✗ | ✓ (prompt-based) |
| Bounding boxes | ✓ | ✓ | ✓ | ✗ | ✓ |
| Confidence scores | ✓ | ✓ | ✓ | ✓ (IoU) | partial |
| Segmentation masks | ✗ | ✗ | ✗ | ✓ pixel-exact | ✓ region |
| Video object tracking | external | external | external | ✓ memory module | ✗ |
| Throughput | 12.7 BPS H100 | ~74 FPS V100 | slow | real-time video | medium |
| OCR / text localization | ✓ | ✗ | ✗ | ✗ | ✓ full OCR |
#Pipeline spec
| key | value |
|---|---|
| model.detection | nvidia/LocateAnything-3B |
| model.detection.arch | MoonViT + Qwen2.5-3B + MLP |
| model.detection.dtype | bfloat16 |
| model.detection.attn | SDPA (max 4k tokens) |
| model.3d | facebook/VGGT-1B |
| model.3d.outputs | world_points, depth, poses, tracks |
| sampling.fps | 2.0 |
| sampling.keyframes_3d | 24 |
| tracking.algo | Hungarian IoU |
| tracking.iou_threshold | 0.30 |
| tracking.max_age_sec | 1.5 |
| compute.gpu | AWS EC2 g4dn.xlarge (NVIDIA T4 16GB) |
| compute.region | us-east-1f |
| compute.market | spot · one-time · max $0.35/hr |
| serving.host | S3 + CloudFront |
| serving.protocol | static JSON/PLY · 24h cache |
#References
- [1] LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding · NVIDIA Research, 2026
- [2] VGGT: Visual Geometry Grounded Transformer · Wang, Chen et al. CVPR 2025 (Best Paper Award)
- [3]
- [4] Qwen2.5: A Party of Foundation Models · Alibaba Cloud Qwen team
- [5] MoonViT: Vision encoder backbone · Moonshot AI