UBC Theses and Dissertations

Diffusion models for visual content generation: challenges and insights
Liu, Jiahe

Abstract

Significant advances have recently been made in image and video generative models. Among these, diffusion models have demonstrated a strong capability for generating high-quality images and videos, attracting substantial attention within the field. Despite these achievements, however, diffusion models for visual content generation still face numerous challenges. In this thesis, we focus on two key challenges and propose potential solutions to address them. First, metrics for assessing generative models remain relatively underexplored, particularly in the domain of video generation. To bridge this gap, we propose the Fréchet Video Motion Distance (FVMD), a metric that evaluates motion consistency in generated videos. Specifically, we design explicit motion features based on key-point tracking and measure the similarity between these features via the Fréchet distance. We verify the effectiveness of FVMD through a sensitivity analysis in which noise is injected into real videos. Further, we carry out a large-scale human study demonstrating that our metric effectively detects temporal noise and aligns better with human perception of generated video quality than existing metrics. Second, diffusion models face challenges in compositionality and interpretability. While humans understand images structurally, generative models typically generate all pixels simultaneously. Latent Diffusion Models, widely used in this domain, rely on continuous latent variables from Variational Autoencoders (VAEs), which lack interpretability and structure. To address this, we propose DiffuseDRAW, a novel framework that incorporates structured latent variables into diffusion models. Our approach integrates non-parametric structured latent variables from NP-DRAW with discrete vector-quantized representations from VQ-GAN: the model transforms input images into combined discrete latent variables and applies a diffusion model in this discrete latent space. Dependencies between the structured and discrete latent variables are modeled with a Transformer backbone using cross-conditioning. Experiments on the CIFAR-10 and LSUN datasets show that our model outperforms prior structured generative models and is competitive with state-of-the-art diffusion models, while its compositionality and interpretability offer significant advantages for zero-shot latent space editing.
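For readers unfamiliar with Fréchet-distance-based metrics, the sketch below illustrates the kind of computation the abstract refers to: fit a Gaussian to each set of motion features and compute the Fréchet distance between the two Gaussians, as in the FID family of metrics. This is a minimal, hypothetical sketch under those assumptions, not the thesis implementation; the function name frechet_distance is illustrative, the key-point tracking step that produces the features is omitted, and each row of feats_real / feats_gen is assumed to be one precomputed motion feature vector.

```python
# Minimal sketch (not the thesis code) of a Fréchet distance between two
# sets of precomputed motion features, FID-style: fit a Gaussian to each
# feature set and compare the two Gaussians in closed form.
import numpy as np
from scipy import linalg


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each row of the inputs is assumed to be one feature vector
    (e.g. a motion descriptor extracted from one video).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the covariance product; drop the tiny imaginary
    # component that numerical error can introduce.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    # d^2 = ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 (cov_r cov_g)^{1/2})
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

A lower value indicates that the generated features are distributed more like the real ones; in an FVMD-style setting the inputs would be motion features derived from key-point tracks rather than raw frames.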

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International