CVPR Paper Highlight: Building 3D-aware AI from 2D image data

Shijie Zhou, a PhD student advised by Prof. Achuta Kadambi at the UCLA Henry Samueli School of Engineering, has received a Highlight Paper designation for the paper titled “Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields.” The recognition comes from the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), which will take place in Seattle, Washington, in June 2024. CVPR is a top research venue in AI, with an impact factor comparable to Nature, Science, and the New England Journal of Medicine. Being further recognized as one of the Highlight papers of CVPR is therefore a notable honor for the research team.

“This research paper addresses a long-standing puzzle in computer vision: the input data for vision are 2-dimensional (2D) photographs, but the inferences we want to make about the world are 3-dimensional, like ‘where is an object located?’ or ‘what path should a robot traverse?’ While AI models built on 2D photos already have powerful capabilities, more could be unlocked if these models became increasingly aware of the 3-dimensional (3D) world. Example use cases include robot navigation, autonomous driving, robotic surgery on 3-dimensional biological organs, and many more,” says Zhou, the first author of the paper.

Unfortunately, when AI models are trained on 2D data, obtaining 3D awareness is not straightforward: an algorithm can be developed to “lift” 2D feature representations into 3D, but this is an underdetermined inverse problem that is also compute-constrained. To solve this problem, the paper renders the possible 3D light paths using a Gaussian radiance field, which constrains the scene to a parametric representation. In contrast to a neural network, such an approach is faster, but it struggles to render high-dimensional feature embeddings, because their spatial resolution and channel count differ from those of RGB images. The paper introduces a further key contribution, a Parallel N-dimensional Gaussian Rasterizer, which addresses both algorithmic speed and rendering quality.
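The core idea can be illustrated with a toy sketch. In Gaussian Splatting, each Gaussian carries an opacity and a color, and the colors seen along a ray are alpha-composited front to back; the paper's rasterizer extends this so each Gaussian can also carry a high-dimensional feature vector that is composited the same way. The following is a minimal illustrative sketch of that compositing rule, not the paper's actual CUDA implementation; the function name and the toy 4-dimensional features are hypothetical.

```python
# Toy sketch of per-pixel feature compositing in Gaussian Splatting.
# Each Gaussian contributes an opacity `alpha` and an N-dimensional
# feature vector; contributions along a ray are alpha-composited front
# to back, exactly as RGB splatting composites colors. Feature 3DGS's
# rasterizer generalizes this so the vector can be a high-dimensional
# distilled embedding rather than a 3-channel color. (Hypothetical
# simplification for illustration only.)

def composite_features(gaussians):
    """gaussians: list of (alpha, feature) pairs sorted front to back."""
    n = len(gaussians[0][1])
    out = [0.0] * n
    transmittance = 1.0  # fraction of light not yet absorbed
    for alpha, feature in gaussians:
        weight = alpha * transmittance  # this Gaussian's contribution
        for k in range(n):
            out[k] += weight * feature[k]
        transmittance *= (1.0 - alpha)  # light remaining for those behind
    return out

# Two overlapping Gaussians carrying toy 4-dimensional features:
pixel_feature = composite_features([
    (0.6, [1.0, 0.0, 0.5, 0.2]),  # front Gaussian
    (0.8, [0.0, 1.0, 0.5, 0.9]),  # back Gaussian
])
print(pixel_feature)  # [0.6, 0.32, 0.46, 0.408]
```

Because the same weights are reused for every channel, the compositing cost grows only linearly with feature dimension, which is why a parallel N-dimensional rasterizer can remain fast even for large embeddings.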

“3D datasets are comparatively rare, since they require specialized sensing hardware. At the moment, and for the foreseeable future, the largest-scale computer vision datasets are made of 2D images and videos, so there is a need for algorithmic methods that can extract 3D awareness from 2D input data. It’s hard to achieve because 3D cues are only partially present in 2D photos, but it makes for an exciting research problem with practical impact,” says Achuta Kadambi, an Assistant Professor in Electrical Engineering and Computer Science at UCLA and a senior author of the paper.

“Distilling 2D foundation models into 3D scene representations is an exciting avenue toward 3D scene understanding,” says Vincent Sitzmann, an Assistant Professor of Electrical Engineering and Computer Science at MIT, who was not involved in the research. “This paper takes a great step toward accelerating feature fields dramatically via Gaussian Splatting, unlocking feature fields across many applications which previously would have been intractable.” Sitzmann further notes that new applications could include those “ranging from 3D editing in computer graphics to robotic manipulation.”

Currently, the paper demonstrates results on a smaller set of computer vision applications, including 3D semantic scene understanding, language-guided scene editing, and promptable segmentation in 3D scenes. All tasks run at ultra-fast inference speeds, outperforming previous baselines in both speed (2.7x faster feature rendering) and performance (a 23% improvement on a range of downstream tasks). These results are a significant step toward “the foundations for general-purpose AI modeling and contributions to the sciences of computer vision and AI/ML,” says Suya You, a senior researcher at the DEVCOM Army Research Laboratory (ARL) and a senior author of the paper, adding that he is “proud of being part of the research team working with such talented students and faculty” on the work. The work was done in collaboration between the Visual Machines Group (VMG) at UCLA led by Prof. Kadambi, an Army Research Laboratory (ARL) team led by Dr. Suya You, and the Visual Informatics Group at UT Austin (VITA) led by Prof. Atlas Wang. Project details, along with downloadable code, are available on the project page, and the work will be presented as a Highlight Paper this summer at CVPR.