OctoSense: Self-Supervised Learning for Multimodal Robot Perception

Self-Supervised Learning for Multimodal Robot Perception

Anthony Bisulco¹, Jeremy Wang¹, Kostas Daniilidis¹, Randall Balestriero², Pratik Chaudhari¹

¹GRASP Laboratory, University of Pennsylvania; ²Brown University

珞Dataset

Code

Colab

arXiv

One platform, eight sensors, one clock: synchronized multimodal driving data across day, night, and degraded conditions.

Why multimodal self-supervised learning?

Self-supervised foundation models such as DINO, SigLIP, and V-JEPA have transformed robot perception, but they are vision-only, and in the real world no single sensor suffices. Cameras degrade under low light, high dynamic range, and rapid motion; LiDAR is accurate but sparse and carries little semantic information. Every sensor has its own rate, resolution, noise profile, and failure modes, so different sensors fail under different conditions. Robust robot perception needs representations that survive these failures, yet self-supervised learning has focused almost entirely on vision and text.

“The defining enabling technology across all field applications is multi-sensor fusion robust to environmental degradation, the problem that keeps the most capable field robots indoors.”

Global Robotics Technology Roadmap 2025–2035

OctoSense takes this on directly. We release an open sensor platform, a 59-hour time-synchronized dataset, and a method: a late-fusion masked autoencoder that fuses every sensor into one representation and stays robust when individual sensors degrade.

Platform & dataset

An open-source sensor platform with eight time-synchronized sensors, plus 59 hours (2,474 km) of driving. It is one of the largest event-inclusive robotics datasets and covers day, night, and degraded-sensor conditions.

Multimodal MAE

Modality-specific tokenizers feed a shared late-fusion masked autoencoder. Token caching at inference makes it real-time: 6.68 ms on an RTX 5090 and 112 ms on an embedded Orin NX.

Robust perception

Beats image-only foundation models on depth, flow, segmentation, and ego-motion, and the advantage grows at night and under sensor degradation.
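The page does not spell out how token caching works. One plausible reading, given that the sensors run at very different rates, is that each modality's tokens are stored and only recomputed when that sensor delivers a new sample, so a slow sensor's tokens are reused across many fast-sensor updates. The sketch below illustrates that reading only; `TokenCache`, its methods, and the toy tokenizers are hypothetical names and are not taken from the OctoSense code.

```python
from typing import Callable, Dict, List, Tuple

Tokens = List[float]


class TokenCache:
    """Keeps the latest tokens per modality (hypothetical helper, not the authors' code)."""

    def __init__(self, tokenizers: Dict[str, Callable[[list], Tokens]]):
        self.tokenizers = tokenizers
        # modality -> (timestamp of the sample, tokens computed from it)
        self.cache: Dict[str, Tuple[float, Tokens]] = {}
        # number of tokenizer invocations per modality, kept for inspection
        self.calls = {name: 0 for name in tokenizers}

    def update(self, modality: str, timestamp: float, sample: list) -> None:
        """Tokenize a newly arrived sample and overwrite that modality's entry."""
        self.calls[modality] += 1
        self.cache[modality] = (timestamp, self.tokenizers[modality](sample))

    def fused_tokens(self) -> Tokens:
        """Concatenate cached tokens in a fixed (alphabetical) modality order."""
        fused: Tokens = []
        for name in sorted(self.cache):
            fused.extend(self.cache[name][1])
        return fused


# Toy tokenizers: stand-ins that map a raw sample to a one-element token list.
cache = TokenCache({
    "rgb": lambda frame: [float(sum(frame))],
    "lidar": lambda scan: [float(len(scan))],
})

# One LiDAR sweep arrives, then three RGB frames at 100 Hz spacing.
cache.update("lidar", 0.00, [1.0, 2.0, 3.0])
for i in range(3):
    cache.update("rgb", 0.01 * i, [i, i + 1])
    tokens = cache.fused_tokens()  # LiDAR tokens are reused, not recomputed

print(cache.calls)
print(tokens)
```

In this toy run the LiDAR tokenizer is invoked once while the RGB tokenizer is invoked three times, which is the kind of saving such a cache would provide if the shared encoder consumes the concatenated tokens.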

The platform & dataset

OctoSense aligns all sensors to a single timeline using our PPS (pulse-per-second) time-sync hardware. A unique six-pulse identifier, emitted every four minutes and fifteen seconds, lets every stream realign even after a dropped trigger. At native rates the platform produces ~1.7 GB/s; on-board compression (LiDAR/event packets, H.265 video) cuts that 21× to 78.7 MB/s with no dropped data. Calibration uses a retro-reflective circle on an AprilGrid that is jointly visible to the cameras and LiDAR.
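As a quick sanity check on the figures above, the arithmetic can be reproduced in a few lines. This is a back-of-the-envelope sketch: it treats "~1.7 GB/s" as exactly 1700 MB/s (decimal units) and assumes the compressed rate is sustained for a full hour, neither of which the page states.

```python
RAW_MB_PER_S = 1700.0         # "~1.7 GB/s" from the text, assuming 1 GB = 1000 MB
COMPRESSED_MB_PER_S = 78.7    # compressed rate from the text
SYNC_PERIOD_S = 4 * 60 + 15   # interval between six-pulse identifiers, in seconds

ratio = RAW_MB_PER_S / COMPRESSED_MB_PER_S
per_hour_gb = COMPRESSED_MB_PER_S * 3600 / 1000

print(f"compression ratio ~ {ratio:.1f}x")
print(f"compressed storage ~ {per_hour_gb:.0f} GB per hour")
print(f"sync identifier every {SYNC_PERIOD_S} s")
```

With the rounded raw rate, the ratio lands slightly above the quoted 21×, which is consistent with the raw figure having been rounded up to 1.7 GB/s. The identifier interval works out to 255 s.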

The first release spans urban, suburban, and rural driving on Long Island and in Philadelphia across sunrise, daytime, sunset, and night, including sun-flare and packet-loss degradation. Every 5-second window is captioned (Gemma 4) and embedded (Qwen3) into a FAISS index, so the data is searchable in natural language.
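The caption-then-embed-then-index pipeline can be illustrated with a small, dependency-free sketch. It is not the released search tool: the bag-of-words `embed` function is a toy stand-in for a learned text embedder such as Qwen3, the brute-force cosine ranking stands in for a FAISS index, and the captions are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

WINDOW_S = 5  # window length from the text, in seconds


def embed(text: str) -> Dict[str, float]:
    """Toy L2-normalized bag-of-words vector (stand-in for a learned embedder)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two normalized sparse vectors."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())


# Invented captions, keyed by window index (window k covers k*5 s to (k+1)*5 s).
captions = {
    0: "car driving through a tunnel at night",
    1: "sun flare on a suburban road",
    2: "rural road with snow at sunset",
}
index = [(window, embed(caption)) for window, caption in captions.items()]


def search(query: str, k: int = 1) -> List[Tuple[int, str]]:
    """Return (window start time in seconds, caption) for the k best matches."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [(window * WINDOW_S, captions[window]) for window, _ in ranked[:k]]


print(search("tunnel at night"))
print(search("snow"))
```

Each hit maps back to a window start time, so a natural-language query resolves to a position on the shared sensor timeline. A real index over the full dataset would replace the linear scan with an approximate or exact nearest-neighbor structure.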

Modality | Sensor | Info | Rate
RGB (stereo) | 2× FLIR Blackfly S | 1920×1456 | 100 Hz
Event (stereo) | 2× SilkyEV VGA (Prophesee) | 640×480 | ≈7 MEv/s
Thermal | FLIR A35 | 320×256 | 50 Hz
LiDAR | Ouster OS1-64 | 64 × 2048 | 10 Hz
IMU | VectorNav VN-100T | Acc/Gyro/Mag/Baro | 400 Hz
GNSS | u-blox ZED-F9P | RTK (NTRIP) | 5 Hz
Proprioception | Vehicle CAN / quadruped joints | steering, throttle, brake / joint angles | 50–100 Hz

Custom hardware

CAD of the sensor platform: stereo RGB + event cameras, LiDAR, thermal, IMU, and GPS on one mount.

The custom SyncBoard that hardware-triggers every sensor off one clock.

Everything here is open source: the mechanical CAD, the sensor mounts, and the custom electronics. The platform carries all eight sensors on one adjustable bar above a desktop-class CPU, powered by a 24 V battery for about an hour of mobile operation.

The board on the right is our custom SyncBoard, the heart of the platform's hardware time synchronization. A temperature-compensated oscillator and a microcontroller generate a pulse-per-second trigger and fan it out across the PCB to every sensor, so each stream can be aligned to a common timeline in post-processing.
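Once every stream carries timestamps on the same clock, a common post-processing step is to pair each sample of a slow sensor with the closest sample of a fast one. The sketch below shows that nearest-timestamp association for a 10 Hz LiDAR and a 100 Hz camera. It assumes ideal, jitter-free triggers and is not the project's alignment code; `trigger_times` and `nearest_index` are hypothetical helper names.

```python
from bisect import bisect_left
from typing import List


def trigger_times(rate_hz: float, duration_s: float, t0: float = 0.0) -> List[float]:
    """Ideal trigger timestamps for a sensor fired at rate_hz from a shared clock."""
    n = int(round(rate_hz * duration_s))
    return [t0 + i / rate_hz for i in range(n)]


def nearest_index(times: List[float], t: float) -> int:
    """Index of the entry in the sorted list `times` that is closest to t."""
    i = bisect_left(times, t)
    if i == 0:
        return 0
    if i == len(times):
        return len(times) - 1
    # Choose between the neighbor at or after t and the one before it.
    return i if times[i] - t < t - times[i - 1] else i - 1


lidar = trigger_times(10.0, 1.0)   # 10 Hz LiDAR sweeps over one second
rgb = trigger_times(100.0, 1.0)    # 100 Hz camera frames over one second

# For each LiDAR sweep, find the index of the closest RGB frame.
pairs = [(k, nearest_index(rgb, t)) for k, t in enumerate(lidar)]
print(pairs[:3])
```

With ideal triggers, every LiDAR sweep lines up with every tenth camera frame. With real data, the same lookup would run on recorded timestamps, and a maximum allowed time gap would typically be enforced before accepting a pair.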

Diverse conditions

OctoSense spans a wide range of environments and lighting across Long Island and Philadelphia, from night-time glare and low-sun lens flare to roadside landmarks, water crossings, snow-covered descents, and unpaved forest roads. This breadth is exactly where single-camera models struggle and multi-sensor fusion pays off.

Degraded perception. Tunnels, blinding sun, lens flare, fog, and near-darkness routinely wash out the camera. OctoSense deliberately captures these failure cases, where image-only models break down and complementary sensors keep the scene observable.

Play with the time-synchronized data

A live Rerun viewer with a short clip (desktop-only) from a city drive, every sensor on one shared timeline: stereo RGB + event cameras, infrared, the LiDAR point cloud, IMU, GPS, and CAN signals, plus scene captions. Scrub the timeline, rotate the 3D view, and toggle streams, right here in...
