MM3DGS SLAM: Multi-modal 3D Gaussian Splatting for SLAM Using Vision, Depth, and Inertial Measurements

The University of Texas at Austin

Our model accurately tracks and maps 3D scenes from our UT-MM dataset. The left video shows the Ego-centric scene after Gaussian generation, and the right video shows the Square scene during concurrent Gaussian generation.

Abstract

Simultaneous localization and mapping is essential for position tracking and scene understanding. 3D Gaussian-based map representations enable photorealistic reconstruction and real-time rendering of scenes using multiple posed cameras. We show for the first time that using 3D Gaussians for map representation with unposed camera images and inertial measurements can enable accurate SLAM. Our method, MM3DGS, addresses the limitations of prior neural radiance field-based representations by enabling faster rendering, scale awareness, and improved trajectory tracking. Our framework enables keyframe-based mapping and tracking utilizing loss functions that incorporate relative pose transformations from pre-integrated inertial measurements, depth estimates, and measures of photometric rendering quality. We also release a multi-modal dataset, UT-MM, collected from a mobile robot equipped with a camera and an inertial measurement unit. Experimental evaluation on several scenes from the dataset shows that MM3DGS achieves a 3x improvement in tracking accuracy and a 5% improvement in photometric rendering quality compared to the current 3DGS SLAM state-of-the-art, while allowing real-time rendering of a high-resolution dense 3D map.
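As a rough illustration of how vision, depth, and inertial cues can be blended into a single tracking objective, the sketch below combines a photometric term, a depth term, and a relative-pose residual against the IMU pre-integrated motion. This is not the authors' exact formulation: the weights (w_rgb, w_depth, w_imu) and the SE(3) residual form are illustrative assumptions.

    # Minimal sketch of a combined tracking loss (assumed formulation, not the
    # paper's exact one). Inputs are PyTorch tensors; poses are 4x4 homogeneous
    # transforms.
    import torch

    def tracking_loss(rendered_rgb, observed_rgb,
                      rendered_depth, observed_depth,
                      est_rel_pose, imu_rel_pose,
                      w_rgb=0.9, w_depth=0.1, w_imu=0.05):
        """Combine rendering quality and inertial agreement into one scalar."""
        # Photometric term: L1 between the splatted render and the camera image.
        l_rgb = torch.abs(rendered_rgb - observed_rgb).mean()

        # Depth term: L1 against the depth estimate, ignoring invalid pixels.
        valid = observed_depth > 0
        l_depth = torch.abs(rendered_depth[valid] - observed_depth[valid]).mean()

        # Inertial term: penalize disagreement between the optimized relative
        # camera pose and the relative pose pre-integrated from IMU measurements.
        pose_err = torch.linalg.inv(imu_rel_pose) @ est_rel_pose
        l_imu = (pose_err - torch.eye(4)).abs().sum()

        return w_rgb * l_rgb + w_depth * l_depth + w_imu * l_imu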

Video

Framework Overview

Overview of the MM3DGS framework. We receive camera images and inertial measurements from a mobile robot. We utilize depth measurements and IMU pre-integration for pose optimization using a combined tracking loss. We apply a keyframe selection approach based on image covisibility and the NIQE metric across a sliding window, and we initialize new 3D Gaussians for keyframes with low opacity and high depth error. Finally, we optimize the parameters of the 3D Gaussians according to the mapping loss over the selected keyframes.
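The keyframe selection step can be pictured with the following minimal sketch: promote a frame once covisibility with the previous keyframe drops, preferring the candidate in the sliding window with the best (lowest) NIQE score. The covisibility estimate, threshold, and window contents here are illustrative assumptions, not the paper's exact procedure.

    # Hedged sketch of covisibility/NIQE keyframe selection over a sliding window.
    from dataclasses import dataclass

    @dataclass
    class Frame:
        idx: int
        covisibility: float  # fraction of last keyframe's content visible here (0..1)
        niqe: float          # NIQE score (lower = better perceptual quality)

    def select_keyframe(window, covis_thresh=0.8):
        """Return the frame to promote to a keyframe, or None.

        A new keyframe is needed once covisibility with the previous keyframe
        drops below `covis_thresh`; among candidates we prefer the sharpest
        image, i.e. the lowest NIQE score.
        """
        candidates = [f for f in window if f.covisibility < covis_thresh]
        if not candidates:
            return None
        return min(candidates, key=lambda f: f.niqe)

    # Example: the last frame has drifted enough and has the best image quality.
    window = [Frame(10, 0.95, 4.1), Frame(11, 0.78, 5.0), Frame(12, 0.74, 3.6)]
    print(select_keyframe(window))  # -> Frame(idx=12, covisibility=0.74, niqe=3.6)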

Visual Comparisons with Ground Truth and SplaTAM

Interactive side-by-side comparisons for two scenes: Ours vs. Ground Truth, and Ours vs. SplaTAM.

Scenes from our UT-MM Dataset

Top: Ego-drive scene. Bottom: Square scene. Our other scenes include two Ego-centric scenes, in which the robot circles an object, and three Straight scenes, in which the robot drives forward in a straight line.

BibTeX

@misc{sun2024mm3dgs,
      title={MM3DGS SLAM: Multi-modal 3D Gaussian Splatting for SLAM Using Vision, Depth, and Inertial Measurements},
      author={Lisong C. Sun and Neel P. Bhatt and Jonathan C. Liu and Zhiwen Fan and Zhangyang Wang and Todd E. Humphreys and Ufuk Topcu},
      year={2024},
      eprint={2404.00923},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}