scanhawk/lab
v0.3.1 · commit a1b2c3d

Open-vocabulary perception pipeline for aerial asset inspection

nvidia/LocateAnything-3B · facebook/VGGT-1B · g4dn.xlarge spot · us-east-1

Reference implementation of a two-stage perception stack applied across seven drone-captured inspection scenarios spanning telecom towers, photovoltaic arrays, high-voltage transmission infrastructure, wind turbine blades, and construction safety compliance. Stage 1 produces per-frame open-vocabulary bounding-box detections using a 3B-parameter vision-language model with Parallel Box Decoding (PBD); the detection vocabulary is set per-prompt at inference time, with no model retraining between domains. Stage 2 reconstructs a sparse 3D point cloud over the same frames using a feed-forward geometry transformer (no SfM, no explicit camera optimization).

The same model that locates antenna panels, mounting brackets, and corrosion on a telecom tower also locates soiled solar modules, flashover damage on insulators, leading-edge erosion on wind blades, and PPE on a construction site. Domain adaptation is a natural-language prompt, not a finetune. Model outputs are produced on an AWS spot GPU instance and served as static JSON / PLY artifacts; bounding-box overlays are drawn client-side on a Canvas, with boxes interpolated between sampled detection frames by track id.

§01

#Methodology

Detection

Each input video is downsampled to 2 fps. Frames are passed through LocateAnything-3B with two prompts per frame: a component-detection prompt over a fixed 8-class vocabulary, and a damage-grounding prompt iterated across 6 natural-language defect descriptions. Output token format is parsed via regex against the <box> protocol.

# domain vocabulary is set at inference, not training
prompts = {
    "telecom": "antenna panel, mounting bracket, coaxial cable, "
               "RRU, cable tray, lightning rod, structural crossarm",
    "solar":   "solar panel, junction box, MC4 connector, "
               "mounting rail, conduit",
    "utility": "suspension insulator, conductor, crossarm, "
               "transmission tower, vibration damper",
    "wind":    "turbine blade, blade tip, leading edge, "
               "trailing edge, lightning receptor",
    "safety":  "worker, hard hat, safety vest, harness, "
               "scaffolding, excavator",
}

# each call to detect(img, prompts[vertical]) uses
# the same weights, the same forward pass, no finetune
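
The 2 fps downsampling described above amounts to picking one source frame every native_fps / 2 frames. The following is a minimal sketch of that index arithmetic only; the function name `sample_indices` and the round-to-nearest policy are assumptions made for illustration, not the pipeline's actual extraction code.

```python
def sample_indices(n_frames: int, native_fps: float, target_fps: float = 2.0) -> list[int]:
    """Return source-frame indices that approximate sampling at target_fps."""
    if n_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Source frames per sampled frame; never step by less than one frame,
    # so a target rate above the native rate cannot produce duplicates.
    step = max(1.0, native_fps / target_fps)
    indices = []
    k = 0
    while True:
        idx = int(round(k * step))
        if idx >= n_frames:
            break
        indices.append(idx)
        k += 1
    return indices
```

For example, assuming a 30 fps source, the 3m19s demo-001 clip (5,970 frames) would yield 398 sampled frames, each of which is then sent through the detector.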

Tracking

Adjacent sampled frames are associated via Hungarian-optimal IoU matching restricted to same-label detections. A persistent track_id is assigned per object; tracks decay after 1.5 seconds without a match. The browser interpolates bounding-box positions linearly between samples using shared track_ids and tweens at the video render rate.

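from scipy.optimize import linear_sum_assignment

for frame_t in samples:
    # cost[i, j] = 1 - IoU(det[i].bbox, track[j].last_bbox), same-label pairs only
    rows, cols = linear_sum_assignment(cost)
    # scipy's solver takes no threshold; drop pairs with IoU below 0.3 afterwards
    matches = [(i, j) for i, j in zip(rows, cols) if 1 - cost[i, j] >= 0.3]
    apply(matches)  # then spawn new tracks for unmatched detections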
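
A self-contained sketch of the association step may make the gating explicit. It assumes boxes in (x1, y1, x2, y2) pixel coordinates and a hypothetical dict schema with "label" and "bbox" keys; the helper names `iou` and `associate` are illustrative and not taken from the pipeline. Cross-label pairs receive a large cost so the solver never prefers them, and any pair below the IoU threshold is rejected after the assignment is solved.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(dets, tracks, iou_threshold=0.3):
    """Match detections to tracks.

    Returns (matches, unmatched), where matches is a list of
    (det_index, track_index) pairs and unmatched lists detection
    indices that should spawn new tracks.
    """
    if not dets or not tracks:
        return [], list(range(len(dets)))
    big = 1e6  # cost for pairs with different labels (never accepted)
    cost = np.full((len(dets), len(tracks)), big)
    for i, d in enumerate(dets):
        for j, t in enumerate(tracks):
            if d["label"] == t["label"]:
                cost[i, j] = 1.0 - iou(d["bbox"], t["bbox"])
    rows, cols = linear_sum_assignment(cost)  # Hungarian-optimal assignment
    matches = [(int(i), int(j)) for i, j in zip(rows, cols)
               if cost[i, j] <= 1.0 - iou_threshold]
    matched = {i for i, _ in matches}
    unmatched = [i for i in range(len(dets)) if i not in matched]
    return matches, unmatched
```

At 2 fps, the 1.5-second decay corresponds to three consecutive sampled frames without a match before a track is dropped.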

3D reconstruction

24 keyframes are sampled uniformly across the video and passed through VGGT-1B in a single forward pass. The model outputs world-space point maps, camera intrinsics and extrinsics, and depth maps. We retain the top 50% of points by confidence, recenter them on their centroid, and rescale so that the 95th-percentile distance from the centroid equals 1. The cloud is then randomly downsampled to 80k points before serialization.

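import numpy as np

out = VGGT(frames)                                   # one forward pass, 24 keyframes
pts = out.world_points.reshape(-1, 3)
conf = out.confidence.reshape(-1)
pts = pts[conf > np.percentile(conf, 50)]            # keep top 50% by confidence
pts = pts - pts.mean(axis=0)                         # recenter on the centroid
pts = pts / np.percentile(np.linalg.norm(pts, axis=1), 95)   # p95 distance -> 1
keep = np.random.choice(len(pts), size=min(80_000, len(pts)), replace=False)
write_ply(pts[keep])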
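
The post-processing can be exercised without the model by treating the point map and confidence map as plain NumPy arrays. The sketch below is written under that assumption; `normalize_cloud` and `ply_text` are hypothetical helper names, and the ASCII PLY layout (vertex positions only) is one simple choice rather than the format the pipeline necessarily writes.

```python
import numpy as np


def normalize_cloud(points, conf, max_points=80_000, seed=0):
    """Filter by confidence, recenter, rescale to p95 = 1, then subsample."""
    points = np.asarray(points, dtype=np.float64).reshape(-1, 3)
    conf = np.asarray(conf, dtype=np.float64).reshape(-1)
    pts = points[conf > np.percentile(conf, 50)]      # top half by confidence
    if len(pts) == 0:
        return pts                                    # nothing survived the filter
    pts = pts - pts.mean(axis=0)                      # centroid moves to the origin
    scale = np.percentile(np.linalg.norm(pts, axis=1), 95)
    if scale > 0:
        pts = pts / scale                             # 95th-percentile radius becomes 1
    if len(pts) > max_points:
        rng = np.random.default_rng(seed)
        pts = pts[rng.choice(len(pts), size=max_points, replace=False)]
    return pts


def ply_text(pts):
    """Serialize an (N, 3) array as an ASCII PLY string."""
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(pts)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ]
    body = [f"{x:.6f} {y:.6f} {z:.6f}" for x, y, z in pts]
    return "\n".join(header + body) + "\n"
```

Normalizing at the 95th percentile rather than the maximum keeps a few far-away outlier points from shrinking the rest of the cloud toward the origin.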

Serving

All outputs are written to S3 as static objects with a 24h cache TTL. The browser fetches detections.json on demand when a run's modal opens and renders bounding boxes on a 2D Canvas in a requestAnimationFrame loop synced to video.currentTime. PLY clouds are loaded lazily via the three.js PLYLoader.

s3://aerioai-demo-471176250120/
├── videos/{id}.mp4              # input
├── detections/{id}.json         # 2 fps frames + tracks
├── 3d/{id}.ply                  # sparse point cloud
└── 3d/{id}.poses.json           # camera extrinsics
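
The render loop has to turn 2 fps detections into a box for every displayed frame. The sketch below shows the interpolation arithmetic in Python for consistency with the other examples, although the page itself would do this in the browser. The in-memory shape of `samples` is a hypothetical stand-in for the contents of detections.json, and holding the last box when a track has no successor is an assumed policy.

```python
from bisect import bisect_right


def lerp_box(box_a, box_b, alpha):
    """Linearly blend two (x1, y1, x2, y2) boxes; alpha = 0 returns box_a."""
    return tuple(a + (b - a) * alpha for a, b in zip(box_a, box_b))


def boxes_at(samples, t):
    """Return {track_id: bbox} for video time t in seconds.

    samples: list of {"t": float, "dets": {track_id: bbox}} sorted by "t".
    """
    if not samples:
        return {}
    times = [s["t"] for s in samples]
    hi = bisect_right(times, t)          # index of the first sample after t
    if hi == 0:
        return {}                        # t is before the first sample
    prev = samples[hi - 1]
    if hi == len(samples):
        return dict(prev["dets"])        # past the last sample: hold it
    nxt = samples[hi]
    alpha = (t - prev["t"]) / (nxt["t"] - prev["t"])
    out = {}
    for track_id, box in prev["dets"].items():
        if track_id in nxt["dets"]:
            # Same track on both sides: tween between the two samples.
            out[track_id] = lerp_box(box, nxt["dets"][track_id], alpha)
        else:
            # Track has no successor: hold its last known box.
            out[track_id] = tuple(box)
    return out
```

Calling `boxes_at(samples, video_time)` once per rendered frame gives boxes that move smoothly between detections that are half a second apart.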
§02

#Pipeline

| stage | tooling |
|---|---|
| extract | decord · 2 fps |
| detect | LocateAnything-3B |
| track | IoU · Hungarian |
| 3d | VGGT-1B |
| emit | JSON · PLY · S3 |
§03

#Runs

Seven input videos processed end-to-end across five inspection verticals using a single unchanged model and pipeline.

| id | label | vertical | source | duration | resolution | notes |
|---|---|---|---|---|---|---|
| demo-001 | tower_cu_01 | telecom | youtube/TFCwRzQ5lVI | 3m19s | 1280×720 | close-orbit DJI Mavic Pro, single tower, antenna cluster + RRU |
| demo-002 | tower_wide_02 | telecom | youtube | 1m50s | 1280×720 | wide-shot monopole, sky + rural field background |
| demo-003 | tower_topdown_03 | telecom | youtube | 1m15s | 1280×720 | top-down view of antenna cluster: unique angle, radial panel layout |
| demo-004 | solar_pv_01 | solar | youtube | 1m47s | 1280×720 | PV farm aerial pass: soiling, hotspot, cracked-cell flags |
| demo-005 | trans_line_01 | utility | youtube | 1m03s | 1280×720 | HV transmission tower aerial: insulator string + structural members |
| demo-006 | wind_blade_01 | renewables | youtube/AV1isLII4TA | 0m27s | 720×1280 | DJI Matrice 4T 112x zoom: leading-edge erosion + lightning strike |
| demo-007 | demo_site_07 | construction | youtube | 1m16s | 1280×720 | demolition interior: excavator + structural columns + debris |
§04½

#Live inference

The model that produced every overlay above is also exposed as a hosted endpoint. Single-image inference takes 5-8 seconds, and the same prompts that drive the pre-loaded runs work on any uploaded image. In production, the hosted endpoint is replaced by a self-hosted AWS deployment at roughly $0.001 per query.
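
The per-query figure can be sanity-checked from numbers stated on this page: the $0.35/hr maximum spot price in the pipeline spec and the 5-8 second latency. The estimate below assumes the self-hosted T4 matches the hosted latency, that the GPU is kept busy serving one query at a time, and that storage, transfer, and idle time are excluded; it is a back-of-the-envelope check, not a billing model.

```python
SPOT_PRICE_PER_HOUR = 0.35      # USD, max spot price from the pipeline spec
LATENCY_RANGE_S = (5.0, 8.0)    # single-image latency quoted for the T4 runtime


def cost_per_query(price_per_hour: float, seconds_per_query: float) -> float:
    """USD cost of one query on a GPU that serves queries back to back."""
    return price_per_hour / 3600.0 * seconds_per_query


low, high = (cost_per_query(SPOT_PRICE_PER_HOUR, s) for s in LATENCY_RANGE_S)
print(f"${low:.5f} to ${high:.5f} per query")
# prints "$0.00049 to $0.00078 per query"
```

Both ends of the range sit below $0.001, so the quoted figure leaves some headroom for idle time and overhead.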

endpoint
backend nvidia/LocateAnything-3B · runtime T4 (zero-cost) · latency ~5-8s

Inference runs in a separate window because the hosted endpoint is a third-party service that we neither theme nor proxy. The model accepts an image plus a free-text prompt and returns boxes or points, in the same JSON schema we render on this page.

open inference endpoint
huggingface.co/spaces/nvidia/LocateAnything
Opens the hosted UI in a new tab. Drop an image, paste a prompt below, click Run Inference. Results render on the right pane there.
prompt cookbook · paste these into the "Categories" field
| task | prompt | use case |
|---|---|---|
| Detection | antenna panel, mounting bracket, coaxial cable, RRU | Multi-class component inventory. |
| Grounding | rust or corrosion on metal | Open-vocab damage flagging. |
| OCR | (no input; detects all visible text) | Extract every serial / asset tag. |
| Pointing | the bolt that needs to be retightened | Pixel-precise work-order target. |
| Referring | the antenna closest to the lightning rod | Relative spatial reference. |
| Detection | solar panel with visible soiling or hotspot | Solar PV inspection. |
note: the hosted endpoint requires clicking Run Inference after upload. Logs and inference history live on the hosted side and are not visible to this dashboard. A self-hosted deployment on AWS spot (g4dn.xlarge or larger) exposes logs, throughput metrics, and per-query cost in the existing pipeline.
§05

#Model capability comparison

This stack uses LocateAnything-3B as the primary detector. The table below documents the capability surface of the model relative to other open vision-grounding models considered.

capability · LocateAnything-3B · YOLO-World · Grounding DINO · SAM 2 · Florence-2
Open-vocab detection · ✓ (prompt-based)
Bounding boxes
Confidence scores · ✓ (IoU) · partial
Segmentation masks · ✓ pixel-exact · ✓ region
Video object tracking · external · external · external · ✓ memory module
Throughput · 12.7 BPS H100 · ~74 FPS V100 · slow · real-time video · medium
OCR / text localization · ✓ full OCR
§06

#Pipeline spec

| key | value |
|---|---|
| model.detection | nvidia/LocateAnything-3B |
| model.detection.arch | MoonViT + Qwen2.5-3B + MLP |
| model.detection.dtype | bfloat16 |
| model.detection.attn | SDPA (max 4k tokens) |
| model.3d | facebook/VGGT-1B |
| model.3d.outputs | world_points, depth, poses, tracks |
| sampling.fps | 2.0 |
| sampling.keyframes_3d | 24 |
| tracking.algo | Hungarian IoU |
| tracking.iou_threshold | 0.30 |
| tracking.max_age_sec | 1.5 |
| compute.gpu | AWS EC2 g4dn.xlarge (NVIDIA T4 16GB) |
| compute.region | us-east-1f |
| compute.market | spot · one-time · max $0.35/hr |
| serving.host | S3 + CloudFront |
| serving.protocol | static JSON/PLY · 24h cache |
§07

#References