Single-camera 3D hits 96% of lidar accuracy in real time — hardware costs in robotics just moved

Elena Park, Tomas Lindqvist, Ravi Menon | ~35s read | arXiv:2606.02114

Bottom line: a new vision model delivers 3D scene understanding within 4% of lidar-based accuracy from one standard camera, in real time (31 FPS) on a consumer GPU. That shifts hundreds to thousands of dollars per unit from sensors to software.
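To make "real time" concrete, the reported 31 FPS can be turned into a per-frame time budget and into the distance a robot covers between frames. This is a minimal sketch: the function names and the 1.5 m/s platform speed are illustrative assumptions, not figures from the paper.

```python
FPS = 31  # throughput reported in the summary above


def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def distance_per_frame_m(speed_m_s: float, fps: float) -> float:
    """Distance a platform travels between two consecutive frames, in meters."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return speed_m_s / fps


if __name__ == "__main__":
    print(f"per-frame budget: {frame_budget_ms(FPS):.1f} ms")
    # 1.5 m/s is a hypothetical robot speed, chosen only for illustration.
    travel_cm = distance_per_frame_m(1.5, FPS) * 100
    print(f"travel per frame at 1.5 m/s: {travel_cm:.1f} cm")
```

At 31 FPS the budget is roughly 32 ms per frame, so a slow-moving platform advances only a few centimeters between depth updates.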

The model, RetinaFormer, outputs depth, 3D object boxes, and spatial relations in a single pass, and generalizes well thanks to pretraining on 40 million unlabeled video frames. Documented weaknesses are adverse weather, glass, and mirrors, so it augments rather than replaces sensor suites in safety-critical systems.

For products where a misread scene is recoverable — warehouse robots, delivery carts, inventory drones — the bill of materials just changed.

Recommended action: if your roadmap includes anything that navigates, ask your hardware team what drops from the BOM when camera-only depth is good enough.
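One way to frame that conversation is a per-unit BOM delta: mark each line item as still needed or not under camera-only depth, then total what drops out. The sketch below is only an illustration. The `BomLine` type, the part names, and every price are hypothetical placeholders to be replaced with your own quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BomLine:
    part: str
    unit_cost_usd: float
    needed_with_camera_only_depth: bool


def savings_per_unit(lines: list[BomLine]) -> float:
    """Sum the cost of parts that are no longer needed."""
    return sum(
        line.unit_cost_usd
        for line in lines
        if not line.needed_with_camera_only_depth
    )


def fleet_savings(lines: list[BomLine], units: int) -> float:
    """Scale the per-unit saving to a production run."""
    return savings_per_unit(lines) * units


# Placeholder BOM: all parts and prices are hypothetical.
example_bom = [
    BomLine("lidar unit", 1000.0, needed_with_camera_only_depth=False),
    BomLine("RGB camera", 50.0, needed_with_camera_only_depth=True),
    BomLine("GPU compute module", 400.0, needed_with_camera_only_depth=True),
]

if __name__ == "__main__":
    print(f"saving per unit: ${savings_per_unit(example_bom):,.2f}")
    print(f"saving over 500 units: ${fleet_savings(example_bom, 500):,.2f}")
```

The sketch counts only removed parts. If going camera-only means adding compute, enter that as a cost increase before comparing totals.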