UBC Theses and Dissertations


Repurposing large pretrained diffusion models for unsupervised visual understanding and efficient adaptation

Hedlin, Eric

Abstract

Large pretrained text-conditioned image generation models learn a compositional and structured latent representation of visual concepts, showcasing their rich understanding of the world through their ability to generate diverse, coherent images. These models link text descriptions to visual concepts, unifying concepts across a wide range of conditions, for example by capturing the relationships between the text input and the objects in a scene. This thesis explores how this link between text and visual concepts enables identifying consistent semantic regularities across images, where similar regions are mapped through the same text embedding. We show that this can be leveraged for tasks like semantic correspondence and consistent keypoint estimation, simply by optimizing a text embedding so that a given token attends strongly to a specific region of the image. We also exploit the model's capacity for one-shot personalization from a single image by training hypernetworks to quickly estimate network weights for subject-personalized generation; their convergence is only possible because of the smooth underlying representation of concepts learned by these models.

Concretely, this thesis leverages large pretrained diffusion models to address three key areas: semantic correspondence, unsupervised keypoint detection, and efficient hypernetwork-based adaptation for personalized model fine-tuning. For semantic correspondence, we optimize text tokens to focus attention on specific regions of an image, leveraging the latent knowledge of large pretrained models to identify correspondences from a single image without additional supervision. For unsupervised keypoint detection, we localize text tokens across a collection of images to identify common keypoints, using the collection to focus the model on a specific concept and relying on the pretrained model's knowledge to generalize without ground-truth keypoints. We also investigate hypernetwork-based methods that generate weights for large-model personalization conditioned on a single image, providing an efficient alternative to compute-intensive optimization without requiring ground-truth weights.

This work highlights the versatility of diffusion models, extending their utility beyond image generation while proposing scalable, efficient solutions for the downstream tasks of semantic correspondence, unsupervised keypoint estimation, and hypernetwork-based personalized fine-tuning.
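To make the core mechanism concrete, the following is a minimal PyTorch sketch (not the thesis code) of optimizing a single text-token embedding so that its cross-attention concentrates on a chosen region. Everything here is a stand-in assumption for illustration: in the actual method the attention comes from a pretrained Stable Diffusion UNet, where the softmax is normalized over text tokens rather than over spatial locations; here a frozen random projection plays the role of the per-pixel image features.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    H = W = 16   # spatial resolution of the attention map
    d = 64       # embedding dimension

    # Frozen stand-in for per-pixel image features
    # (the UNet cross-attention keys in the real setting).
    keys = torch.randn(H * W, d)

    # Binary mask marking the region the token should attend to
    # (e.g. a single annotated point or patch in the source image).
    mask = torch.zeros(H, W)
    mask[4:8, 4:8] = 1.0
    mask = mask.flatten()

    # The only learnable quantity: one text-token embedding.
    token = torch.randn(d, requires_grad=True)
    opt = torch.optim.Adam([token], lr=1e-2)

    for step in range(500):
        # Attention of the token over all spatial locations
        # (simplified: softmax over space rather than over tokens).
        attn = F.softmax(keys @ token / d ** 0.5, dim=0)
        # Maximize the attention mass inside the target region.
        loss = -(attn * mask).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # At test time the optimized token is applied to a *new* image;
    # wherever its attention peaks is the predicted corresponding
    # region (semantic correspondence) or keypoint.

The design choice this sketch illustrates is that the pretrained model stays entirely frozen: only the token embedding is optimized, so all generalization to new images comes from the knowledge already stored in the diffusion model.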


Rights

Attribution-NonCommercial-NoDerivatives 4.0 International