We propose NViST, a transformer-based model for novel-view synthesis from a single image, trained on a large-scale dataset of in-the-wild images with complex backgrounds. NViST transforms image inputs directly into a radiance field, adopting a scalable transformer-based architecture. In practice, NViST exploits the self-supervised features learnt by a masked autoencoder (MAE), and learns a novel decoder that translates features to 3D tokens via cross-attention and adaptive layer normalization. Our model is efficient at inference since only a single forward pass is needed to predict a 3D representation, unlike methods that require test-time optimization or multiple sampling passes, such as 3D-aware diffusion models.
We also tackle limitations of current novel-view synthesis models: unlike most generative models, which are trained in a category-specific manner, often on synthetic datasets or on masked inputs, our model is trained on MVImgNet, a large-scale dataset of real-world, casually captured videos spanning hundreds of object categories with diverse backgrounds.
We show results on unseen objects and categories from MVImgNet, and even on casual phone captures. We conduct qualitative and quantitative evaluations on MVImgNet and ShapeNet to show that our model represents a step towards enabling true in-the-wild novel-view synthesis from a single image.
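To make the single forward-pass pipeline described above concrete, here is a minimal inference sketch in PyTorch. The module names (mae_encoder, nvist_decoder, build_vector_matrix, volume_render) and their signatures are illustrative assumptions for exposition, not the released API.

```python
# Illustrative single-pass inference sketch (hypothetical module names, not the released API).
import torch

@torch.no_grad()
def render_novel_view(model, image, focal_norm, cam_distance, target_rays):
    """One forward pass from a single input image to a rendered novel view."""
    # 1. Finetuned MAE encoder: image -> self-supervised feature tokens + class token.
    feats, cls_token = model.mae_encoder(image)

    # 2. Decoder: output tokens cross-attend to the encoder features, conditioned on
    #    normalised focal lengths and camera distance via adaptive layer normalisation.
    out_tokens = model.nvist_decoder(feats, cls_token, focal_norm, cam_distance)

    # 3. MLPs reshape the output tokens into a vector-matrix radiance-field representation.
    field = model.build_vector_matrix(out_tokens)

    # 4. Standard volume rendering of the target rays; no test-time optimisation
    #    or repeated sampling is required.
    rgb = model.volume_render(field, target_rays)
    return rgb
```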
The contributions of this paper are as follows.
NViST is a feed-forward transformer-based model that takes a single in-the-wild image as input and renders a novel view. The encoder, a finetuned Masked Autoencoder (MAE), generates self-supervised feature tokens and a class token (cls). Our novel decoder, conditioned on normalised focal lengths and camera distance via adaptive layer normalisation, translates feature tokens into output tokens via cross-attention and reasons about occluded regions through self-attention.
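As a rough illustration of this decoder design, the sketch below implements one block with cross-attention to the encoder feature tokens, self-attention among the output tokens, and adaptive layer normalisation driven by the camera conditioning. It assumes a standard pre-norm transformer layout and a scale-and-shift AdaLN; dimensions and module names are our own assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Adaptive layer norm: scale/shift predicted from the camera conditioning
    (normalised focal lengths and camera distance). Illustrative sketch."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class DecoderBlock(nn.Module):
    """One NViST-style decoder block (sketch): output tokens cross-attend to the
    MAE feature tokens and self-attend to reason about occluded regions."""
    def __init__(self, dim=768, cond_dim=3, heads=12):
        super().__init__()
        self.norm_q = AdaLN(dim, cond_dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = AdaLN(dim, cond_dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = AdaLN(dim, cond_dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, out_tokens, feat_tokens, cond):
        # Cross-attention: pull information from the encoder's feature tokens.
        q = self.norm_q(out_tokens, cond)
        out_tokens = out_tokens + self.cross_attn(q, feat_tokens, feat_tokens)[0]
        # Self-attention: let output tokens exchange information (occluded regions).
        s = self.norm_s(out_tokens, cond)
        out_tokens = out_tokens + self.self_attn(s, s, s)[0]
        # Feed-forward MLP.
        out_tokens = out_tokens + self.mlp(self.norm_m(out_tokens, cond))
        return out_tokens
```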
We employ MLPs to reshape the output tokens into a vector-matrix representation that encodes the radiance field and can be queried for the color and density (sigma) of 3D points during volume rendering. NViST is trained end-to-end via a combination of losses: a photometric loss, a perceptual loss and a distortion-based regulariser. NViST is efficient at inference since it requires only a single forward pass to predict a 3D representation from an input image.
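The following sketch shows how a vector-matrix (TensoRF-style) representation can be queried and volume-rendered, assuming three axis-aligned plane/vector factor pairs; resolutions, channel counts and sampling details are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def query_vector_matrix(planes, vectors, xyz):
    """Query a vector-matrix radiance field at 3D points xyz in [-1, 1]^3.

    planes:  list of 3 tensors (1, C, R, R) for the XY, XZ, YZ planes
    vectors: list of 3 tensors (1, C, R, 1) for the matching Z, Y, X axes
    Returns per-point features (N, 3*C) to be decoded into color and density.
    Illustrative TensoRF-style VM query, not the released code.
    """
    # Pair each plane with the coordinate of its complementary axis.
    plane_coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
    line_coords = [xyz[:, 2], xyz[:, 1], xyz[:, 0]]
    feats = []
    for plane, vector, pc, lc in zip(planes, vectors, plane_coords, line_coords):
        grid_p = pc.view(1, -1, 1, 2)                                          # (1, N, 1, 2)
        grid_l = torch.stack([torch.zeros_like(lc), lc], -1).view(1, -1, 1, 2)
        f_plane = F.grid_sample(plane, grid_p, align_corners=True)[0, :, :, 0]  # (C, N)
        f_line = F.grid_sample(vector, grid_l, align_corners=True)[0, :, :, 0]  # (C, N)
        feats.append((f_plane * f_line).t())                                    # (N, C)
    return torch.cat(feats, dim=-1)

def volume_render(rgb, sigma, deltas):
    """Standard volume rendering along each ray; rgb (R, S, 3), sigma/deltas (R, S)."""
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], -1), -1)[:, :-1]
    weights = alpha * trans
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)                             # (R, 3)
```

Only the forward rendering path is sketched here; the photometric, perceptual and distortion-based losses mentioned above would be applied to the rendered pixels during training.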
@InProceedings{Jang_2024_CVPR,
author = {Jang, Wonbong and Agapito, Lourdes},
title = {NViST: In the Wild New View Synthesis from a Single Image with Transformers},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {10181-10193}
}
The research presented here has been partly supported by a sponsored research award from Cisco Research. This project made use of time on HPC Tier 2 facilities Baskerville (funded by EPSRC EP/T022221/1 and operated by ARC at the University of Birmingham) and JADE2 (funded by EPSRC EP/T022205/1). We are grateful to Niloy Mitra and Danail Stoyanov for fruitful discussions.