Abstract

Teaser image

Mapping and scene representation are fundamental to reliable planning and navigation in mobile robots. While purely geometric maps using voxel grids allow for general navigation, obtaining up-to-date, spatially and semantically rich representations that scale to dynamic large-scale environments remains challenging. In this work, we present CURB-OSG, an open-vocabulary dynamic 3D scene graph engine that generates hierarchical decompositions of urban driving scenes via multi-agent collaboration. By fusing camera and LiDAR observations from multiple perceiving agents with unknown initial poses, our approach generates more accurate maps than a single agent while constructing a unified open-vocabulary semantic hierarchy of the scene. Unlike previous methods that rely on ground truth agent poses or are evaluated purely in simulation, CURB-OSG alleviates these constraints. We evaluate the capabilities of CURB-OSG on real-world multi-agent sensor data from multiple sessions of the Oxford Radar RobotCar dataset. We demonstrate improved mapping and object prediction accuracy through multi-agent collaboration and evaluate the environment partitioning capabilities of the proposed approach.

Technical Approach

Overview of our approach

Overview of our CURB-OSG approach operating on LiDAR and camera data from multiple agents. On each agent, we perform open-vocabulary perception that processes monocular images from the surround-view cameras to extract 2D object detections using Grounding DINO and dynamic object tracks using MASA. Simultaneously, we estimate each agent's LiDAR odometry via scan matching and construct keyframes that are sent to a central server. All object observations, both static and dynamic, are projected onto the filtered LiDAR point clouds and extracted to obtain 3D object observations relative to the keyframe poses. The central server receives the keyframes and runs graph-based SLAM coupled with LiDAR-based loop closure detection using Scan Context [Kim and Kim, 2018] to estimate a joint pose graph holding the historic poses of all agents. Finally, all object observations and the semantic point clouds are processed in our 3D scene graph construction module to obtain a unified, hierarchical representation of the environment.
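To make the camera-LiDAR fusion step more concrete, the following is a minimal sketch of how 2D open-vocabulary detections could be lifted to 3D object observations by projecting LiDAR points into the image and aggregating the points that fall inside each bounding box. The function name, detection format, and calibration inputs are illustrative assumptions, not the released CURB-OSG code.

```python
# Minimal sketch (not the released CURB-OSG code): lifting 2D open-vocabulary
# detections to 3D object observations by projecting LiDAR points into the
# camera image. Calibration matrices and the detection format are assumptions.
import numpy as np

def lift_detections_to_3d(points_lidar, boxes_2d, T_cam_lidar, K):
    """Associate LiDAR points with 2D bounding boxes to obtain 3D observations.

    points_lidar: (N, 3) LiDAR points in the keyframe (sensor) frame.
    boxes_2d:     list of (label, (u_min, v_min, u_max, v_max)) detections,
                  e.g. from an open-vocabulary detector such as Grounding DINO.
    T_cam_lidar:  (4, 4) extrinsic transform from the LiDAR to the camera frame.
    K:            (3, 3) camera intrinsic matrix.
    Returns a list of (label, centroid) 3D object observations in the LiDAR frame.
    """
    # Transform points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    in_front = pts_cam[:, 2] > 0.1
    pts_cam, pts_lidar_f = pts_cam[in_front], points_lidar[in_front]

    # Project to pixel coordinates.
    uvw = (K @ pts_cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]

    observations = []
    for label, (u_min, v_min, u_max, v_max) in boxes_2d:
        in_box = (
            (uv[:, 0] >= u_min) & (uv[:, 0] <= u_max)
            & (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max)
        )
        if in_box.sum() < 10:  # too few supporting points for a stable estimate
            continue
        # A robust centroid (median) of the in-box points serves as the
        # 3D object observation, expressed relative to the keyframe pose.
        centroid = np.median(pts_lidar_f[in_box], axis=0)
        observations.append((label, centroid))
    return observations
```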
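Similarly, the joint pose graph maintained by the central server can be pictured as keyframe nodes connected by odometry edges (consecutive keyframes of one agent) and loop-closure edges (descriptor matches within or across agents, e.g. via Scan Context). The sketch below uses hypothetical data structures for illustration; the actual pose optimization would be delegated to a graph-SLAM backend such as g2o or GTSAM.

```python
# Minimal sketch (assumed data structures, not the CURB-OSG server code):
# building a joint pose graph from keyframes received from multiple agents.
from dataclasses import dataclass, field

import numpy as np

@dataclass
class Keyframe:
    agent_id: str
    stamp: float
    pose: np.ndarray             # 4x4 pose estimate from LiDAR odometry
    scan_descriptor: np.ndarray  # e.g. a Scan Context descriptor for loop detection

@dataclass
class PoseGraph:
    nodes: list = field(default_factory=list)               # one Keyframe per node
    odometry_edges: list = field(default_factory=list)      # (i, j, relative 4x4)
    loop_closure_edges: list = field(default_factory=list)  # (i, j, relative 4x4)
    _last_node_of_agent: dict = field(default_factory=dict)

    def add_keyframe(self, kf: Keyframe) -> int:
        """Insert a keyframe and connect it to its agent's predecessor by odometry."""
        idx = len(self.nodes)
        self.nodes.append(kf)
        prev = self._last_node_of_agent.get(kf.agent_id)
        if prev is not None:
            relative = np.linalg.inv(self.nodes[prev].pose) @ kf.pose
            self.odometry_edges.append((prev, idx, relative))
        self._last_node_of_agent[kf.agent_id] = idx
        return idx

    def add_loop_closure(self, i: int, j: int, relative_pose: np.ndarray):
        """Add a loop-closure constraint, possibly linking different agents."""
        self.loop_closure_edges.append((i, j, relative_pose))
```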

Video

Code & Sample Data

A PyTorch-based software implementation of this project is available in our GitHub repository for academic use and is released under the GPLv3 license. For any commercial purpose, please contact the authors.

Publications

If you find our work useful, please consider citing our paper:

Tim Steinke, Martin Büchner, Niclas Vödisch, and Abhinav Valada
Collaborative Dynamic 3D Scene Graphs for Open-Vocabulary Urban Scene Understanding
arXiv preprint arXiv:2503.08474, 2025.

(PDF) (BibTeX)

Authors

Tim Steinke

University of Freiburg

Martin Büchner

University of Freiburg

Niclas Vödisch

University of Freiburg

Abhinav Valada

University of Freiburg

Acknowledgment

This work was funded by the German Research Foundation (DFG) Emmy Noether Program under grant number 468878300.