UBC Theses and Dissertations
Diffusion models for visual content generation : challenges and insights
Liu, Jiahe
Abstract
Significant advances have recently been made in image and video generative models. Among these, diffusion models have demonstrated a strong capability for generating high-quality images and videos and have consequently attracted significant study. Despite these achievements, however, diffusion models for visual content generation still face numerous challenges. In this thesis, we focus on two key challenges facing diffusion models and propose potential solutions to address them.

First, research on metrics for assessing generative models remains relatively underexplored, particularly in the domain of video generation. To bridge this gap, we propose the Fréchet Video Motion Distance (FVMD), a metric that focuses on evaluating motion consistency in video generation. Specifically, we design explicit motion features based on key-point tracking and then measure the similarity between these features via the Fréchet distance. We verify the effectiveness of FVMD through a sensitivity analysis in which noise is injected into real videos. Further, we carry out a large-scale human study demonstrating that our metric effectively detects temporal noise and aligns better with human perception of generated video quality than existing metrics.

Second, diffusion models face challenges in compositionality and interpretability. While humans understand images structurally, generative models typically generate all pixels simultaneously. Latent Diffusion Models, widely used in this domain, rely on continuous latent variables from Variational Autoencoders (VAEs), which lack interpretability and structure. To address this, we propose DiffuseDRAW, a novel framework that combines structured latent variables with diffusion models. Our approach integrates non-parametric structured latent variables from NP-DRAW with discrete vector-quantized representations from VQ-GAN. Built upon VQ-GAN, our model transforms input images into combined discrete latent variables and applies a diffusion model in the discrete latent space. We model dependencies between the structured and discrete latent variables using a Transformer backbone with cross-conditioning. Experiments on CIFAR-10 and LSUN demonstrate that our model outperforms prior structured generative models and is competitive with state-of-the-art diffusion models. Moreover, its compositionality and interpretability offer significant advantages in zero-shot latent-space editing.
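The metric comparison described in the FVMD paragraph above ultimately reduces to a Fréchet distance between Gaussians fitted to motion features of real and generated videos. The following is a minimal sketch of that final step, assuming per-video motion feature vectors have already been extracted; the key-point-tracking feature extractor itself is not shown, and the function and array names are illustrative rather than the thesis's actual API.

```python
# Minimal sketch: Fréchet distance between Gaussians fitted to two sets of
# per-video motion features (real vs. generated). The motion-feature
# extractor (key-point tracking) is assumed to have produced these arrays;
# names and shapes are illustrative, not the thesis implementation.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_real, feats_gen: arrays of shape (num_videos, feature_dim).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the covariance product; small imaginary parts
    # arising from numerical error are discarded.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Lower values indicate that the motion statistics of generated videos are closer to those of real videos.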
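The DiffuseDRAW paragraph describes running a diffusion model over discrete VQ-GAN latents while conditioning on structured latent variables through a Transformer with cross-conditioning. The sketch below illustrates one way such a cross-conditioned denoising step could look, using an absorbing-state (masking) corruption over token indices; all module names, dimensions, and the corruption scheme are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of one discrete-latent denoising step in the spirit of the
# approach described above: VQ-codebook image tokens are randomly masked
# (an absorbing-state corruption), and a Transformer predicts the original
# tokens while cross-attending to structured latent tokens.
# All module names, sizes, and the corruption scheme are assumptions.
import torch
import torch.nn as nn

VOCAB, MASK_ID, SEQ_LEN, STRUCT_LEN, DIM = 1024, 1024, 256, 32, 512

class CrossConditionedDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB + 1, DIM)           # +1 for [MASK]
        self.pos_emb = nn.Parameter(torch.zeros(1, SEQ_LEN, DIM))
        layer = nn.TransformerDecoderLayer(DIM, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.head = nn.Linear(DIM, VOCAB)                      # per-token logits

    def forward(self, noisy_tokens, struct_latents):
        # noisy_tokens: (B, SEQ_LEN) VQ indices, some positions set to MASK_ID
        # struct_latents: (B, STRUCT_LEN, DIM) structured latent embeddings
        x = self.tok_emb(noisy_tokens) + self.pos_emb
        x = self.decoder(tgt=x, memory=struct_latents)         # cross-conditioning
        return self.head(x)                                    # (B, SEQ_LEN, VOCAB)

def corrupt(tokens, mask_prob):
    """Replace a random fraction of tokens with the [MASK] id."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_prob
    return torch.where(mask, torch.full_like(tokens, MASK_ID), tokens), mask

# Toy usage: one training step on random data.
model = CrossConditionedDenoiser()
tokens = torch.randint(0, VOCAB, (2, SEQ_LEN))
struct = torch.randn(2, STRUCT_LEN, DIM)
noisy, mask = corrupt(tokens, mask_prob=0.5)
logits = model(noisy, struct)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```

At sampling time, a model of this shape would start from fully masked tokens, iteratively re-predict them, and finally decode the resulting indices with the VQ-GAN decoder.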
Item Metadata
Title | Diffusion models for visual content generation : challenges and insights
Creator | Liu, Jiahe
Supervisor | |
Publisher | University of British Columbia
Date Issued | 2024
Genre | |
Type | |
Language | eng
Date Available | 2024-12-12
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0447491
URI | |
Degree | |
Program | |
Affiliation | |
Degree Grantor | University of British Columbia
Graduation Date | 2025-05
Campus | |
Scholarly Level | Graduate
Rights URI | |
Aggregated Source Repository | DSpace