Is Attention All That NeRF Needs?
ICLR 2023
- Mukund Varma T¹*
- Peihao Wang²*
- Xuxi Chen²
- Tianlong Chen²
- Subhashini Venugopalan³
- Zhangyang Wang²
- ¹Indian Institute of Technology Madras
- ²University of Texas at Austin
- ³Google Research
* denotes equal contribution
GNT's rendering capabilities on unseen scenes.
                        
News! Our work was presented by Prof. Atlas in his talk at the MIT Vision and Graphics Seminar on 10/17/22.
                    
Abstract
We present Generalizable NeRF Transformer (GNT), a pure, unified transformer-based architecture that efficiently reconstructs Neural Radiance Fields (NeRFs) on the fly from source views. Unlike prior works on NeRF that optimize a per-scene implicit representation by inverting a handcrafted rendering equation, GNT achieves generalizable neural scene representation and rendering by encapsulating two transformer-based stages. The first stage of GNT, called the view transformer, leverages multi-view geometry as an inductive bias for attention-based scene representation, and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighboring views. The second stage of GNT, named the ray transformer, renders novel views by ray marching and directly decodes the sequence of sampled point features using the attention mechanism. Our experiments demonstrate that when optimized on a single scene, GNT can successfully reconstruct NeRF without an explicit rendering formula, and even improves the PSNR by ~1.3 dB↑ on complex scenes thanks to the learnable ray renderer. When trained across various scenes, GNT consistently achieves state-of-the-art performance when transferring to the forward-facing LLFF dataset (LPIPS ~20%↓, SSIM ~25%↑) and the synthetic Blender dataset (LPIPS ~20%↓, SSIM ~4%↑). In addition, we show that depth and occlusion can be inferred from the learned attention maps, which implies that the pure attention mechanism is capable of learning a physically grounded rendering process. All these results bring us one step closer to the tantalizing hope of utilizing transformers as the "universal modeling tool", even for graphics.
Overview of Generalizable NeRF Transformer
Epipolar Geometry Constrained Scene Representation
For the first stage, we propose the view transformer to aggregate coordinate-aligned features from source views. To enforce multi-view geometry, we inject the inductive bias of epipolar constraints into the attention mechanism. Computing attention between every pair of inputs has O(N^2) memory complexity, which is computationally prohibitive when sampling thousands of points at the same time. Therefore, we propose to place only one read-out token in the query sequence, and let it iteratively summarize features from the other data points. This reduces the complexity of each layer to O(N).
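The complexity argument above can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention, which may differ from the attention variant GNT actually uses, and the names `readout_attention` and `summarize` are hypothetical. The point is only that a single read-out query produces N scores per layer rather than an N x N matrix.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def readout_attention(query, keys, values):
    """A single read-out query attends over N keys: N scores, not N x N."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights


def summarize(readout, features, num_layers=2):
    """Iteratively refine the read-out token with a residual update per layer."""
    for _ in range(num_layers):
        pooled, _ = readout_attention(readout, features, features)
        readout = [r + p for r, p in zip(readout, pooled)]
    return readout


# Toy input: three feature vectors standing in for features from source views.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = readout_attention([0.0, 0.0], features, features)
summary = summarize([0.0, 0.0], features)
```

Each layer here stores one weight per input feature, so memory grows linearly with the number of inputs.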
Attention Driven Volumetric Rendering
Volume rendering, which simulates light transport and occlusion in a radiance field, has been regarded as a key ingredient of NeRF's success. However, volume rendering still struggles to model sharp surfaces and complex interference patterns, such as specular reflection, refraction, and translucency. This motivates us to replace the handcrafted rendering function with a data-driven renderer, the ray transformer, which learns to project a 3D feature field onto 2D images with respect to specified camera poses.
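For contrast, the handcrafted rendering function being replaced can be sketched as the standard NeRF quadrature, in which each sample's weight is its opacity times the transmittance accumulated before it. This is a minimal illustration, not code from GNT; the function name `volume_render` and the toy inputs are assumptions.

```python
import math


def volume_render(sigmas, colors, deltas):
    """Composite samples along one ray with the fixed volume rendering rule.

    sigmas: densities per sample, colors: RGB triples per sample,
    deltas: distances between adjacent samples.
    """
    transmittance = 1.0  # fraction of light that reaches the current sample
    weights = []
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    pixel = [sum(w * c[ch] for w, c in zip(weights, colors)) for ch in range(3)]
    return pixel, weights


# Toy ray: empty space, then a nearly opaque green surface, then a blue sample behind it.
sigmas = [0.0, 50.0, 5.0]
deltas = [1.0, 1.0, 1.0]
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
pixel, weights = volume_render(sigmas, colors, deltas)
```

In this sketch the per-sample weights are fully determined by the densities. A learned renderer instead predicts how samples are combined, for example through attention weights over the sequence of point features.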
GNT Interpretation and Visualization
Conceptually, the view transformer attempts to find correspondences between the queried points and the source views.
The learned attention amounts to a likelihood score that a pixel on the source view is an image of the same point in 3D space, i.e., that no point lies between the target point and the pixel. In other words, the attention is occlusion-aware.

The ray transformer iteratively aggregates features according to the attention values.
Each attention value can be regarded as the importance of a point in forming the image, which reflects visibility and occlusion inferred from point-to-point interactions. In other words, the attention is depth-aware.
                    
Attention maps obtained from the view transformer and the ray transformer, indicating occlusion and depth reasoning respectively. With no explicit supervision, GNT learns to physically ground its attention maps.
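One way to read a depth estimate out of per-point attention along a ray is to take the attention-weighted average of the sample distances. The sketch below assumes this expected-depth readout for illustration; the exact procedure used to produce the depth maps may differ, and `expected_depth` and the toy values are hypothetical.

```python
def expected_depth(attention, distances):
    """Attention-weighted mean distance of the samples along one ray."""
    total = sum(attention)
    if total <= 0.0:
        raise ValueError("attention weights must have a positive sum")
    return sum(a * t for a, t in zip(attention, distances)) / total


# Toy ray with three samples; most attention falls on the middle sample.
attention = [0.1, 0.8, 0.1]
distances = [1.0, 2.0, 3.0]
depth = expected_depth(attention, distances)
```

When the attention concentrates on the samples near a visible surface, the estimate lands close to that surface's distance from the camera.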
Single Scene Rendering Results
Qualitative comparison for single-scene rendering. GNT recovers the shape of leaves more accurately (in Orchids) and models physical phenomena such as specular reflections (in Drums).
Cross Scene Rendering Results
Qualitative comparison for cross-scene generalization. GNT recovers the edges of petals more accurately (in Flowers) and handles regions that are only sparsely visible in the source views (in Fern).
Citation
@inproceedings{
    t2023is,
    title={Is Attention All That Ne{RF} Needs?},
    author={Mukund Varma T and Peihao Wang and Xuxi Chen and Tianlong Chen and Subhashini Venugopalan and Zhangyang Wang},
    booktitle={The Eleventh International Conference on Learning Representations},
    year={2023},
    url={https://openreview.net/forum?id=xE-LtsE-xx}
}