UBC Theses and Dissertations

Exploring neural network interpretability in visual understanding

Wang, Dan


Neural networks (NNs) have reached remarkable performance in computer vision. However, their numerous parameters and complex structures make NNs opaque to humans, and this failure to comprehend NNs may raise serious issues in real-world applications. My research aims to explore NN interpretability in diverse visual tasks from two perspectives: post-hoc explanation and intrinsic interpretability.

Convolutional neural networks (CNNs) have outperformed humans in image classification, yet the logic behind network decisions remains a puzzle. We therefore propose concept-harmonized hierarchical inference, a post-hoc explanation framework, to explain the decision-making process of CNNs. First, we interpret the layered feature representations of NNs with hierarchical visual semantics. Then, we explain NN feature learning as a bottom-up decision logic from low to high semantic levels, in which a deep-layer decision is decomposed into a sequence of shallow-layer sub-decisions.

With the evolution of virtual reality, researchers are focusing increasingly on inverse rendering: reconstructing a 3D scene from multi-view 2D images. In this field, NNs have achieved superior performance in novel view synthesis and 3D reconstruction. For both tasks, the key process is learning a 3D representation from the input views, for which prior methods designed a CNN-based single-view feature extraction and a pooling-based multi-view fusion as separate components. This incoherent design damages both their intrinsic interpretability and their performance. Therefore, we aim to design coherent, interpretable NNs that can adequately exploit the relationships present in the data.

For novel view synthesis, we propose a unified Transformer-based neural radiance field (TransNeRF) conditioned on source views to learn a generic 3D-scene representation. TransNeRF explores deep relationships between the target-rendering view and the source views. It also improves intrinsic interpretability by enhancing the shape and appearance consistency of a 3D scene. In experiments, TransNeRF outperforms prior neural rendering methods, and the interpretation results are consistent with human perception.

For 3D reconstruction, we reformulate the task as a sequence-to-sequence prediction and propose an end-to-end Transformer-based framework (EVolT). EVolT jointly explores multi-level associations between the input views and the output volume-based 3D representation within our encoder-decoder structure. EVolT achieves state-of-the-art accuracy in multi-view reconstruction with 70% fewer parameters than prior methods. Experimental results also suggest that EVolT has strong scaling capability.


Attribution-NonCommercial-NoDerivatives 4.0 International