OctoSense: Self-Supervised Learning for Multimodal Robot Perception



Self-Supervised Learning for Multimodal Robot Perception

Anthony Bisulco¹, Jeremy Wang¹, Kostas Daniilidis¹, Randall Balestriero², Pratik Chaudhari¹

¹GRASP Laboratory, University of Pennsylvania; ²Brown University

Dataset

Code

Colab

arXiv

One platform, eight sensors, one clock: synchronized multimodal driving across day, night, and degraded conditions.

Why multimodal self-supervised learning?

Self-supervised foundation models such as DINO, SigLIP, and V-JEPA have transformed robot perception, but they are vision-only, and in the real world no single sensor suffices. Cameras degrade under low light, high dynamic range, and rapid motion; LiDAR is accurate but sparse, with poor semantics. Every sensor has its own rate, resolution, noise, and failure modes, and the sensors fail in different ways. Robust robot perception needs representations that survive these failures, yet self-supervised learning has stayed almost entirely on vision and text.

“The defining enabling technology across all field applications is multi-sensor fusion robust to environmental degradation, the problem that keeps the most capable field robots indoors.”

Global Robotics Technology Roadmap 2025–2035

OctoSense takes this on directly. We release an open sensor platform, a 59-hour time-synchronized dataset, and a method: a late-fusion masked autoencoder that fuses every sensor into one representation and stays robust when individual sensors degrade.

Platform & dataset

An open-source sensor platform with eight time-synchronized sensors, and 59 hours / 2,474 km of driving: one of the largest event-inclusive robotics datasets, covering day, night, and degraded-sensor conditions.

Multimodal MAE

Modality-specific tokenizers feed a shared late-fusion masked autoencoder. Token caching at inference makes it real-time: 6.68 ms on an RTX 5090 and 112 ms on an embedded Orin NX.
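The page does not spell out how token caching works, so the following is only a minimal sketch of one plausible reading: each modality's tokens are kept in a cache and re-tokenized only when that sensor delivers new data, while slower sensors reuse their last tokens. The `TokenCache` class, its method names, and the stand-in tokenizers are all hypothetical and are not taken from the OctoSense code.

```python
# Sketch of per-modality token caching (an assumption about the idea,
# not the OctoSense implementation). All names here are hypothetical.


class TokenCache:
    """Keeps the most recent tokens per modality so that only modalities
    with fresh data are re-tokenized on each inference step."""

    def __init__(self, tokenizers):
        self.tokenizers = tokenizers                   # modality -> callable
        self.tokens = {}                               # modality -> cached tokens
        self.calls = {name: 0 for name in tokenizers}  # tokenizer call counts

    def update(self, fresh):
        """Re-tokenize only the modalities present in `fresh`, then return
        the concatenated token sequence that a fusion model would consume."""
        for name, sample in fresh.items():
            self.tokens[name] = self.tokenizers[name](sample)
            self.calls[name] += 1
        fused = []
        for name in self.tokenizers:                   # fixed modality order
            fused.extend(self.tokens.get(name, []))
        return fused


# Stand-in tokenizers that simply tag each value with its modality name.
tokenizers = {
    "rgb": lambda xs: [("rgb", v) for v in xs],
    "lidar": lambda xs: [("lidar", v) for v in xs],
}

cache = TokenCache(tokenizers)
step1 = cache.update({"rgb": [1, 2], "lidar": [9]})
step2 = cache.update({"rgb": [3, 4]})  # no new LiDAR sweep: cached tokens reused
print(step2)
print(cache.calls)
```

In this toy run the LiDAR tokenizer is called once while the RGB tokenizer is called twice, which is the kind of saving that would matter when sensors run at very different rates.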

Robust perception

Beats image-only foundation models on depth, flow, segmentation, and ego-motion, and the advantage grows at night and under sensor degradation.

The platform & dataset

OctoSense aligns all the sensors to a single timeline using our PPS time-sync hardware: a unique six-pulse identifier every four minutes and fifteen seconds lets every stream realign even after a dropped trigger. At native rates the platform produces ~1.7 GB/s; on-board compression (LiDAR/event packets, H.265 video) cuts that 21× to 78.7 MB/s with no dropped data. Calibration uses a retro-reflective circle on an AprilGrid jointly visible to the cameras and LiDAR.
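As a back-of-the-envelope check on the data rates quoted above, the short Python sketch below recomputes the compression ratio from the two figures and estimates storage for a recording session. Decimal units (1 GB = 1000 MB) and the one-hour session length are assumptions made for illustration; neither comes from the page.

```python
# Back-of-the-envelope check of the quoted data rates.
# Assumptions (not from the source): decimal units (1 GB = 1000 MB)
# and a hypothetical one-hour recording session.

NATIVE_RATE_MB_S = 1700.0     # "~1.7 GB/s" at native sensor rates
COMPRESSED_RATE_MB_S = 78.7   # quoted rate after on-board compression


def compression_ratio(native_mb_s: float, compressed_mb_s: float) -> float:
    """Ratio of the uncompressed to the compressed data rate."""
    return native_mb_s / compressed_mb_s


def storage_tb(rate_mb_s: float, hours: float) -> float:
    """Storage in terabytes needed to record at rate_mb_s for `hours`."""
    return rate_mb_s * hours * 3600.0 / 1e6


ratio = compression_ratio(NATIVE_RATE_MB_S, COMPRESSED_RATE_MB_S)
print(f"compression ratio:   {ratio:.1f}x")
print(f"one hour raw:        {storage_tb(NATIVE_RATE_MB_S, 1.0):.2f} TB")
print(f"one hour compressed: {storage_tb(COMPRESSED_RATE_MB_S, 1.0):.3f} TB")
```

With the rounded 1.7 GB/s figure the ratio lands slightly above 21, which is consistent with the quoted 21× given that the native rate is stated only approximately.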

The first release spans urban, suburban, and rural driving on Long Island and in Philadelphia across sunrise, daytime, sunset, and night, including sun-flare and packet-loss degradation. Every 5-second window is captioned (Gemma 4) and embedded (Qwen3) into a FAISS index, so the data is searchable in natural language.
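To illustrate what searching an embedding index over 5-second windows looks like, here is a minimal sketch that ranks windows by cosine similarity to a query vector. It uses a brute-force pure-Python search in place of FAISS, and tiny hand-made 3-dimensional vectors in place of real Qwen3 embeddings; the window IDs, vectors, and function names are hypothetical.

```python
# Toy nearest-neighbour search over per-window embeddings.
# Brute-force cosine similarity stands in for FAISS, and the 3-d vectors
# stand in for real caption embeddings (both are assumptions).
import math

WINDOW_S = 5  # each index entry describes one 5-second window


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def search(index, query, k=2):
    """Return the k (window_id, score) pairs most similar to the query."""
    q = normalize(query)
    scored = [(wid, sum(a * b for a, b in zip(vec, q))) for wid, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical embeddings for three consecutive windows.
toy_embeddings = {
    0: [1.0, 0.0, 0.0],
    1: [0.0, 1.0, 0.0],
    2: [0.7, 0.7, 0.0],
}
index = [(wid, normalize(vec)) for wid, vec in toy_embeddings.items()]

hits = search(index, query=[0.9, 0.1, 0.0], k=2)
for wid, score in hits:
    start = wid * WINDOW_S
    print(f"window {wid} ({start}-{start + WINDOW_S} s): score {score:.3f}")
```

In a real pipeline the query vector would come from embedding the user's text with the same model used for the captions, and an approximate index would replace the linear scan.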

| Modality | Sensor | Info | Rate |
|---|---|---|---|
| RGB (stereo) | 2× FLIR Blackfly S | 1920×1456 | 100 Hz |
| Event (stereo) | 2× SilkyEV VGA (Prophesee) | 640×480 | ≈7 MEv/s |
| Thermal | FLIR A35 | 320×256 | 50 Hz |
| LiDAR | Ouster OS1-64 | 64 × 2048 | 10 Hz |
| IMU | VectorNav VN-100T | Acc/Gyro/Mag/Baro | 400 Hz |
| GNSS | u-blox ZED-F9P | RTK (NTRIP) | 5 Hz |
| Proprioception | Vehicle CAN / quadruped joints | steering, throttle, brake / joint angles | 50–100 Hz |

Custom hardware

CAD of the sensor platform: stereo RGB + event cameras, LiDAR, thermal, IMU, and GPS on one mount.

The custom SyncBoard that hardware-triggers every sensor off one clock.

Everything here is open source: the mechanical CAD, the sensor mounts, and the custom electronics. The platform carries all eight sensors on one adjustable bar above a desktop-class CPU, powered by a 24 V battery for about an hour of mobile operation.

The board on the right is our custom SyncBoard, the heart of the platform's hardware time synchronization. A temperature-compensated oscillator and a microcontroller generate a pulse-per-second trigger and fan it out across the PCB to every sensor, so each stream can be aligned to a common timeline in post-processing.
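Once every stream carries timestamps on a common clock, post-processing alignment can be as simple as nearest-timestamp matching. The sketch below shows that generic technique, not the OctoSense pipeline: the function names, the tolerance value, and the synthetic 10 Hz and 100 Hz streams are assumptions chosen for illustration.

```python
# Generic nearest-timestamp alignment between two streams on a shared clock.
# A sketch only: names, tolerance, and synthetic data are hypothetical.
from bisect import bisect_left
from typing import List, Optional


def nearest_index(timestamps: List[float], t: float) -> int:
    """Index of the entry in a sorted, non-empty list closest to time t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer to t (ties go to the earlier one).
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align(reference: List[float], other: List[float],
          tol: float) -> List[Optional[int]]:
    """For each reference timestamp, return the index of the nearest sample
    in `other`, or None when nothing lies within `tol` seconds."""
    matches: List[Optional[int]] = []
    for t in reference:
        j = nearest_index(other, t) if other else None
        if j is not None and abs(other[j] - t) > tol:
            j = None
        matches.append(j)
    return matches


# Synthetic streams: LiDAR sweeps at 10 Hz, RGB frames at 100 Hz.
lidar_t = [k / 10 for k in range(5)]    # 0.0, 0.1, ..., 0.4 s
rgb_t = [k / 100 for k in range(50)]    # 0.00, 0.01, ..., 0.49 s
matches = align(lidar_t, rgb_t, tol=0.005)
print(matches)
```

Returning `None` when no sample falls within the tolerance makes gaps explicit, for example after a dropped frame, instead of silently pairing a sweep with a stale image.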

Diverse conditions

OctoSense spans a wide range of environments and lighting across Long Island and Philadelphia, from night-time glare and low-sun lens flare to roadside landmarks, water crossings, snow-covered descents, and unpaved forest roads. This breadth is exactly where single-camera models struggle and multi-sensor fusion pays off.

Degraded perception. Tunnels, blinding sun, lens flare, fog, and near-darkness routinely wash out the camera. OctoSense deliberately captures these failure cases, where image-only models break down and complementary sensors keep the scene observable.

Play with the time-synchronized data

A live Rerun viewer with a short clip (desktop-only) from a city drive, every sensor on one shared timeline: stereo RGB + event cameras, infrared, the LiDAR point cloud, IMU, GPS, and CAN signals, plus scene captions. Scrub the timeline, rotate the 3D view, and toggle streams, right here in...
