OctoSense: Self-Supervised Learning for Multimodal Robot Perception
Anthony Bisulco¹, Jeremy Wang¹, Kostas Daniilidis¹, Randall Balestriero², Pratik Chaudhari¹
¹GRASP Laboratory, University of Pennsylvania; ²Brown University
Dataset
Code
Colab
arXiv
One platform, eight sensors, one clock: synchronized multimodal driving data across day, night, and degraded conditions.
Why multimodal self-supervised learning?
Self-supervised foundation models such as DINO, SigLIP, and V-JEPA transformed robot perception, but they are vision-only, and in the real world no single sensor suffices. Cameras degrade under low light, high dynamic range, and rapid motion; LiDAR is accurate but sparse and carries little semantic information. Every sensor has its own rate, resolution, noise characteristics, and failure modes, and different sensors fail under different conditions. Robust robot perception needs representations that survive these failures, yet self-supervised learning has stayed almost entirely within vision and text.
“The defining enabling technology across all field applications is multi-sensor fusion robust to environmental degradation, the problem that keeps the most capable field robots indoors.”
Global Robotics Technology Roadmap 2025–2035
OctoSense takes this on directly. We release an open sensor platform, a 59-hour time-synchronized dataset, and a method: a late-fusion masked autoencoder that fuses every sensor into one representation and stays robust when individual sensors degrade.
Platform & dataset
An open-source sensor platform with eight time-synchronized sensors, and 59 hours / 2,474 km of driving (one of the largest event-inclusive robotics datasets) covering day, night, and degraded-sensor conditions.
Multimodal MAE
Modality-specific tokenizers feed a shared late-fusion masked autoencoder. Token caching at inference makes it real-time: 6.68 ms on an RTX 5090 and 112 ms on an embedded Orin NX.
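To make the late-fusion masking idea concrete, here is a minimal sketch. It is not the OctoSense implementation: the `tokenize` and `random_mask` helpers, the token counts, and the 0.75 mask ratio are all assumptions chosen for illustration. Each modality contributes its own tokens, the tokens are pooled into one sequence, and a random subset is hidden so that a model would have to reconstruct it from whatever remains visible, possibly from a different sensor.

```python
import random
from typing import Dict, List, Tuple

Token = Tuple[str, int]  # (modality name, token index within that modality)


def tokenize(streams: Dict[str, int]) -> List[Token]:
    """Stand-in for modality-specific tokenizers: emit one tag per token."""
    return [(name, i) for name, count in streams.items() for i in range(count)]


def random_mask(tokens: List[Token], mask_ratio: float, seed: int = 0):
    """Split the fused token sequence into visible and masked subsets."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)
    n_masked = int(round(mask_ratio * len(tokens)))
    masked_idx = set(order[:n_masked])
    visible = [t for i, t in enumerate(tokens) if i not in masked_idx]
    masked = [t for i, t in enumerate(tokens) if i in masked_idx]
    return visible, masked


# Hypothetical token counts per modality (assumed, not from the source).
tokens = tokenize({"rgb": 8, "event": 4, "lidar": 4})
visible, masked = random_mask(tokens, mask_ratio=0.75)
print(len(visible), "visible tokens,", len(masked), "masked tokens")
```

Because masking is applied after the modalities are pooled, an entire sensor can end up hidden in a given sample, which is one intuition for why a model trained this way might tolerate a degraded sensor.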
Robust perception
Beats image-only foundation models on depth, flow, segmentation, and ego-motion, and the advantage grows at night and under sensor degradation.
The platform & dataset
OctoSense aligns all sensors to a single timeline using our PPS (pulse-per-second) time-sync hardware: a unique six-pulse identifier, emitted every four minutes and fifteen seconds, lets every stream realign even after a dropped trigger. At native rates the platform produces ~1.7 GB/s; on-board compression (LiDAR/event packets, H.265 video) cuts that 21× to 78.7 MB/s with no dropped data. Calibration uses a retro-reflective circle on an AprilGrid that is jointly visible to the cameras and LiDAR.
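The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the text; the storage estimate additionally assumes that all 59 hours were recorded at the full compressed rate, which the source does not state, so treat it as a rough upper-level figure.

```python
COMPRESSED_RATE_MB_S = 78.7   # from the text
COMPRESSION_FACTOR = 21       # from the text
HOURS_RECORDED = 59           # from the text

# Raw rate implied by the compressed rate and the 21x factor.
implied_raw_mb_s = COMPRESSED_RATE_MB_S * COMPRESSION_FACTOR

# Assumption: every hour was logged at the full compressed rate.
storage_tb = COMPRESSED_RATE_MB_S * HOURS_RECORDED * 3600 / 1e6

# Interval between six-pulse identifiers, in seconds.
marker_period_s = 4 * 60 + 15

print(f"implied raw rate: {implied_raw_mb_s / 1000:.2f} GB/s")
print(f"estimated storage: {storage_tb:.1f} TB")
print(f"identifier period: {marker_period_s} s")
```

The implied raw rate of about 1.65 GB/s agrees with the quoted ~1.7 GB/s, and at one pulse per second the identifier recurs every 255 pulses.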
The first release spans urban, suburban, and rural driving on Long Island and in Philadelphia across sunrise, daytime, sunset, and night, including sun-flare and packet-loss degradation. Every 5-second window is captioned with Gemma 4 and embedded with Qwen3 into a FAISS index, so the data is searchable in natural language.
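The search step can be pictured as nearest-neighbor lookup over caption embeddings. The sketch below is a brute-force stand-in written in plain Python, not the project's FAISS pipeline: the three captions, their 3-dimensional vectors, and the query vector are invented for illustration, whereas real embeddings would come from the embedding model and have far more dimensions.

```python
import math
from typing import List, Tuple


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def search(query: List[float],
           index: List[Tuple[str, List[float]]],
           k: int = 2) -> List[Tuple[str, float]]:
    """Return the k captions whose embeddings are most similar to the query."""
    q = normalize(query)
    scored = []
    for caption, vec in index:
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, caption))
    scored.sort(reverse=True)
    return [(caption, score) for score, caption in scored[:k]]


# Toy index: one (caption, embedding) pair per 5-second window (values assumed).
toy_index = [
    ("night drive with oncoming headlight glare", [0.9, 0.1, 0.0]),
    ("sunny suburban street with parked cars", [0.1, 0.9, 0.1]),
    ("unpaved forest road in daylight", [0.0, 0.3, 0.9]),
]

# Pretend this vector is the embedding of the query "driving at night".
results = search([1.0, 0.0, 0.1], toy_index, k=2)
for caption, score in results:
    print(f"{score:.3f}  {caption}")
```

A dedicated vector index serves the same purpose as this loop but scales to many more windows than an exhaustive scan in pure Python could handle.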
| Modality | Sensor | Info | Rate |
|---|---|---|---|
| RGB (stereo) | 2× FLIR Blackfly S | 1920×1456 | 100 Hz |
| Event (stereo) | 2× SilkyEV VGA (Prophesee) | 640×480 | ≈7 MEv/s |
| Thermal | FLIR A35 | 320×256 | 50 Hz |
| LiDAR | Ouster OS1-64 | 64 × 2048 | 10 Hz |
| IMU | VectorNav VN-100T | Acc/Gyro/Mag/Baro | 400 Hz |
| GNSS | u-blox ZED-F9P | RTK (NTRIP) | 5 Hz |
| Proprioception | Vehicle CAN / quadruped joints | steering, throttle, brake / joint angles | 50–100 Hz |
Custom hardware
CAD of the sensor platform: stereo RGB + event cameras, LiDAR, thermal, IMU, and GPS on one mount.
The custom SyncBoard that hardware-triggers every sensor off one clock.
Everything here is open source: the mechanical CAD, the sensor mounts, and the custom electronics. The platform carries all eight sensors on one adjustable bar above a desktop-class CPU, powered by a 24 V battery for about an hour of mobile operation.
Our custom SyncBoard is the heart of the platform's hardware time synchronization. A temperature-compensated oscillator and a microcontroller generate a pulse-per-second trigger and fan it out across the PCB to every sensor, so each stream can be aligned to a common timeline in post-processing.
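Once every stream carries timestamps on the common timeline, pairing samples from sensors with different rates reduces to a nearest-timestamp lookup. The sketch below illustrates that post-processing step under stated assumptions: the timestamps are synthetic, the 300 µs LiDAR offset is invented, and `nearest_index` is a hypothetical helper rather than part of the released code.

```python
from bisect import bisect_left
from typing import List


def nearest_index(sorted_times_us: List[int], t_us: int) -> int:
    """Index of the timestamp closest to t_us in a sorted list (ties pick the earlier one)."""
    i = bisect_left(sorted_times_us, t_us)
    if i == 0:
        return 0
    if i == len(sorted_times_us):
        return len(sorted_times_us) - 1
    before, after = sorted_times_us[i - 1], sorted_times_us[i]
    return i if (after - t_us) < (t_us - before) else i - 1


# One second of synthetic timestamps in microseconds on the shared timeline.
rgb_us = [k * 10_000 for k in range(100)]            # 100 Hz camera frames
lidar_us = [k * 100_000 + 300 for k in range(10)]    # 10 Hz sweeps, assumed 300 us offset

# For each LiDAR sweep, find the closest camera frame.
pairs = [nearest_index(rgb_us, t) for t in lidar_us]
print(pairs)
```

Integer microseconds avoid floating-point rounding when comparing timestamps, and the binary search keeps each lookup logarithmic in the number of frames.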
Diverse conditions
OctoSense spans a wide range of environments and lighting across Long Island and Philadelphia, from night-time glare and low-sun lens flare to roadside landmarks, water crossings, snow-covered descents, and unpaved forest roads. This breadth is exactly where single-camera models struggle and multi-sensor fusion pays off.
Degraded perception. Tunnels, blinding sun, lens flare, fog, and near-darkness routinely wash out the camera. OctoSense deliberately captures these failure cases, where image-only models break down and complementary sensors keep the scene observable.
Play with the time-synchronized data
A live Rerun viewer (desktop-only) plays a short clip from a city drive with every sensor on one shared timeline: stereo RGB and event cameras, infrared, the LiDAR point cloud, IMU, GPS, and CAN signals, plus scene captions. Scrub the timeline, rotate the 3D view, and toggle streams, right here in...