One cheap camera, near-lidar 3D vision — robots are about to get much cheaper eyes

Elena Park, Tomas Lindqvist, Ravi Menon · ~55s read · arXiv:2606.02114

Close one eye and you can still pour coffee without missing the cup. Your brain squeezes 3D out of a flat image using shadows, sizes, and a lifetime of experience. Robots have mostly needed expensive extra hardware to manage the same trick — spinning lidar sensors, paired stereo cameras. RetinaFormer is a neural network that gets there with a single ordinary camera, fast enough for a robot to use while moving.

The numbers: on standard driving and indoor benchmarks, it reconstructs the geometry of a scene within 4% of lidar-based systems, and it runs at 31 frames per second on one consumer graphics card. The new part is that it delivers the full 3D layout — distances, objects, how they relate — in a single look, instead of stitching estimates together across many frames.
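To put 31 frames per second in perspective, here is a rough back-of-envelope sketch in Python. Only the frame rate comes from the summary above. The 1.5 m/s speed (about walking pace) is an assumption for illustration, not a figure from the paper, and the function names are made up for this example.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


def travel_per_frame_cm(speed_m_per_s: float, fps: float) -> float:
    """Distance a robot covers between two consecutive frames, in centimetres."""
    return speed_m_per_s / fps * 100.0


FPS = 31.0                   # frame rate reported in the summary
ASSUMED_SPEED_M_PER_S = 1.5  # hypothetical: roughly walking pace, not from the paper

print(f"{frame_budget_ms(FPS):.1f} ms per frame")                                   # 32.3 ms per frame
print(f"{travel_per_frame_cm(ASSUMED_SPEED_M_PER_S, FPS):.1f} cm moved per frame")  # 4.8 cm moved per frame
```

Under that assumed speed, the robot gets a fresh 3D picture roughly every five centimetres of travel. A faster machine would cover proportionally more ground between looks.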

The catch: it struggles exactly where your one eye does — fog, rain, glass, mirrors, and scenes unlike anything in its training. Within 4% of lidar on a benchmark is not the same as lidar reliability in a blizzard, so safety-critical machines will still carry backup sensors.

Why you should care: a camera costs thirty dollars; lidar costs thousands. Every time software closes that gap, capable robots — delivery carts, warehouse pickers, home helpers — get closer to prices where they show up in your daily life instead of in demo videos.