This project explores diffusion models for image generation and editing.
In the first part, I experiment with pre-trained diffusion models, implement diffusion sampling loops,
and use them for tasks including inpainting and creating optical illusions.
Note: seed=42 for all results shown.
I load the 2-stage DeepFloyd IF model from Hugging Face, which is a text-to-image model. Below are some example generations for the given prompts. Notice that the quality of the generated images improves with more inference steps, especially in the first set of realistic images.
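A rough sketch of how the two stages can be loaded and chained with the diffusers library (the checkpoint names, fp16 settings, and prompt are the standard ones from the model card, not necessarily my exact setup):

```python
import torch
from diffusers import DiffusionPipeline

# Stage I generates a 64x64 image; Stage II upscales it to 256x256.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None,
    variant="fp16", torch_dtype=torch.float16)

prompt = "an oil painting of a snowy mountain village"  # example prompt
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
image_64 = stage_1(prompt_embeds=prompt_embeds,
                   negative_prompt_embeds=negative_embeds,
                   num_inference_steps=20, output_type="pt").images
image_256 = stage_2(image=image_64, prompt_embeds=prompt_embeds,
                    negative_prompt_embeds=negative_embeds,
                    output_type="pil").images[0]
```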
The two main stages of diffusion are the forward and the reverse processes. Here I implement the forward process, which takes in a clean image and adds randomly sampled Gaussian noise scaled by values from a noise schedule (given by the pre-trained model).
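A minimal sketch of the forward process, assuming `alphas_cumprod` holds the cumulative products $\bar\alpha_t$ from the pre-trained model's noise schedule:

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps, eps
```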
Naively denoising images with a Gaussian blur filter does not work very well. I tried several kernel sizes to recover the images, but the results are still very noisy.
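The baseline is just a blur; something like the following sketch (kernel size and sigma are illustrative, not the exact values I swept over):

```python
import torch
import torchvision.transforms.functional as TF

def blur_denoise(noisy_image: torch.Tensor, kernel_size: int = 7, sigma: float = 2.0):
    # Classical "denoising": Gaussian-blur the noisy image and hope for the best.
    return TF.gaussian_blur(noisy_image, kernel_size=kernel_size, sigma=sigma)
```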
I can use the pretrained diffusion model to denoise in a single step. For each noisy image,
I estimate the noise by passing it through the model along with the text embedding for the prompt
"a high quality photo". Then I remove the noise by reversing the forward process in one step.
The results are much better than simply applying a blurring filter, but still not very sharp.
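A sketch of the one-step estimate, assuming `unet` is the Stage I UNet and `prompt_embeds` is the embedding of "a high quality photo" (variable names are mine):

```python
def one_step_denoise(unet, x_t, t, prompt_embeds, alphas_cumprod):
    """Estimate the clean image from x_t in a single step."""
    abar_t = alphas_cumprod[t]
    out = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    eps_hat = out[:, :3]  # IF's UNet also predicts a variance; keep the 3 noise channels
    # Invert x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps for x0:
    return (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
```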
With iterative denoising, I start from the noisy image at a large timestep and denoise it in strides, stepping backward through the timesteps. The image is recovered gradually as the timestep decreases.
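One strided update looks roughly like the sketch below (I drop the small added noise term $v_\sigma$ for brevity; `x0_hat` is the clean-image estimate from the current noise prediction):

```python
def denoise_step(x_t, x0_hat, t, t_prev, alphas_cumprod):
    """Move from timestep t down to the smaller timestep t_prev."""
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = abar_t / abar_prev
    beta = 1 - alpha
    return (abar_prev.sqrt() * beta / (1 - abar_t)) * x0_hat \
         + (alpha.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x_t
```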
We can also generate images from scratch by running the same iterative denoising starting from pure noise.
Here sampling is improved with classifier-free guidance, which involves calculating both a conditional and an unconditional noise estimate and extrapolating past the unconditional one. The results have noticeably higher quality.
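As a sketch (the guidance scale here is an assumed value, and `uncond_embeds` is the embedding of the empty prompt):

```python
def cfg_noise(unet, x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
    """Classifier-free guidance: extrapolate past the unconditional estimate."""
    eps_uncond = unet(x_t, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    eps_cond = unet(x_t, t, encoder_hidden_states=cond_embeds).sample[:, :3]
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```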
Here we look at the task of editing. Following the SDEdit algorithm, I run the forward process to get a noisy image, and then run the iterative denoising algorithm. The results show a series of edits progressively closer to the target image.
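In code this is just the two pieces above glued together; `iterative_denoise` is an assumed wrapper around the denoising loop:

```python
def sdedit(x_orig, t_start, alphas_cumprod):
    # Noise the original image to an intermediate timestep, then denoise it back.
    x_t, _ = forward(x_orig, t_start, alphas_cumprod)  # forward() from the earlier sketch
    return iterative_denoise(x_t, t_start)             # assumed denoising-loop wrapper
```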
We can do the same thing with hand-drawn pictures and web images, whose edits tend to look better than those of real photographs. I particularly like the flower drawing here and how its shape and colors show up in each edit.
For inpainting, I mask out part of an image and force the model to keep the original pixels everywhere else: after each denoising step, I swap in the (appropriately noised) original pixels outside the mask, which preserves the rest of the photo.
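Roughly, the step after each denoising update looks like this (mask = 1 where new content should be generated):

```python
def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """Re-impose the known pixels outside the mask after a denoising step."""
    x_orig_t, _ = forward(x_orig, t, alphas_cumprod)  # original image noised to timestep t
    return mask * x_t + (1 - mask) * x_orig_t
```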
We can also guide SDEdit with a text prompt. The first photo input is the campanile with the text prompt "rocket ship", the second is a cat with the prompt "photo of a dog", and the third is a hand-drawn flower with the prompt "oil painting of a campfire". I like how you can see the model recovering the shape of the cat in the second series of photos, and the colors in the third.
Visual anagrams are images that show one thing upright and another when flipped upside down. This is simple to implement: we flip the input image and calculate two different noise estimates, one for each orientation and prompt. I noticed that simply taking the average is not always enough, probably due to some model bias, so for some results I took an empirical weighted average of the two noise estimates. I think all the photos here turned out really well, and I really like the hipster barista holding a coffee cup!
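A sketch of the combined noise estimate (the weight is the empirical knob I mentioned; 0.5 is the plain average):

```python
import torch

def anagram_noise(unet, x_t, t, embeds_upright, embeds_flipped, w=0.5):
    """Combine the upright noise estimate with a flipped estimate of the flipped image."""
    eps1 = unet(x_t, t, encoder_hidden_states=embeds_upright).sample[:, :3]
    eps2 = unet(torch.flip(x_t, dims=[2]), t,
                encoder_hidden_states=embeds_flipped).sample[:, :3]
    eps2 = torch.flip(eps2, dims=[2])  # flip the estimate back before combining
    return w * eps1 + (1 - w) * eps2
```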
Hybrid images are created by taking noise estimates from two different text prompts and combining the low-frequency components of one with the high-frequency components of the other. I loved the way the first three hybrid images turned out. I think having the prompt styles match (e.g. watercolor in photo 3) really helps, but I also had to experiment a lot with various text prompts. The watercolor sunset and beach is a failure case, where the two prompts blended into each other rather than showing up at different viewing distances.
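A sketch of the frequency split using a Gaussian blur as the low-pass filter (kernel size and sigma are assumed values):

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet, x_t, t, embeds_far, embeds_near, kernel_size=33, sigma=2.0):
    """Low frequencies from the 'seen from afar' prompt, high frequencies from the 'up close' prompt."""
    eps_far = unet(x_t, t, encoder_hidden_states=embeds_far).sample[:, :3]
    eps_near = unet(x_t, t, encoder_hidden_states=embeds_near).sample[:, :3]
    low = TF.gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size=kernel_size, sigma=sigma)
    return low + high
```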
In the second part, I implement and train my own diffusion models from scratch on the MNIST dataset.
The first objective is to directly predict a clean image from a noisy image. We can formulate our objective as follows:
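In standard form this is the simple $L_2$ loss between the denoiser's output and the clean image, where $z = x + \sigma\epsilon$ is the noisy input:

$$\mathcal{L} = \mathbb{E}_{x,\,\epsilon}\big[\,\lVert D_\theta(z) - x \rVert^2\,\big],\qquad z = x + \sigma\epsilon,\quad \epsilon \sim \mathcal{N}(0, I).$$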
For this task, I created a noisy MNIST dataset by manually adding Gaussian noise to the torch MNIST dataset.
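Building the pairs is only a couple of lines; a sketch using torchvision's MNIST (paths and transforms are assumptions):

```python
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())

def add_noise(x, sigma=0.5):
    """Return the noisy input z = x + sigma * eps used for (noisy, clean) training pairs."""
    return x + sigma * torch.randn_like(x)
```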
The UNet architecture consists of a series of downsampling layers, followed by a bottleneck, and upsampling back to input size.
Training is straightforward and the loss curve is quite smooth. Here I show the results of denoising after the first and fifth epoch of training.
Epoch 1:
Epoch 5:
We trained at a noise level of sigma = 0.5, but the model performs reasonably well at other noise levels.
Now we change the objective to predicting the noise instead. This is a more difficult problem, but more robust, as it allows us to sample starting from a pure noise image. This is the new objective function:
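Written out in the standard DDPM form, with $x_t$ the noised image at timestep $t$:

$$\mathcal{L} = \mathbb{E}_{x,\,\epsilon,\,t}\big[\,\lVert \epsilon_\theta(x_t, t) - \epsilon \rVert^2\,\big],\qquad x_t = \sqrt{\bar\alpha_t}\,x + \sqrt{1-\bar\alpha_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0, I).$$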
We add time conditioning by injecting it into the decoding blocks through fully connected blocks.
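A sketch of the kind of fully connected block I mean (the exact layer sizes and activation are assumptions):

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps the (normalized) timestep to a per-channel signal injected into a decoder block."""
    def __init__(self, out_channels: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, out_channels), nn.GELU(),
                                 nn.Linear(out_channels, out_channels))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (B,) in [0, 1]  ->  (B, C, 1, 1) so it broadcasts over the feature maps
        return self.net(t[:, None]).unsqueeze(-1).unsqueeze(-1)
```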
I follow the DDPM training algorithm: pick a random clean image, add random noise at a random timestep, and train the UNet to predict that noise.
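One training step looks roughly like this (a sketch; `abar` is assumed to hold $\bar\alpha_t$ for t = 0..T, and the timestep is passed to the UNet normalized to [0, 1]):

```python
import torch
import torch.nn.functional as F

def train_step(unet, x0, abar, T, optimizer, device="cuda"):
    """One DDPM-style step: noise a clean batch at random timesteps, predict the noise."""
    x0 = x0.to(device)
    t = torch.randint(1, T + 1, (x0.shape[0],), device=device)
    eps = torch.randn_like(x0)
    ab = abar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```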
Training is noisier than before as it is a much harder task, but still achieves a good loss trend.
Now we can sample from pure noise! The sampled digits have clean backgrounds and good definition.
Sometimes the samples from before don't look like recognizable digits. We can improve on this by adding class conditioning, which lets us control which digit is generated. I did this by adding two more fully connected blocks that take in the class label as a one-hot vector. Dropout is also applied so that the class conditioning is masked out 10% of the time.
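A sketch of how the conditioning vector is built (function name and details are mine):

```python
import torch
import torch.nn.functional as F

def make_class_cond(labels: torch.Tensor, p_drop: float = 0.1) -> torch.Tensor:
    """One-hot class vectors, randomly zeroed out 10% of the time for classifier-free guidance."""
    c = F.one_hot(labels, num_classes=10).float()
    keep = (torch.rand(c.shape[0], device=c.device) >= p_drop).float()
    return c * keep[:, None]
```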
Training otherwise looks the same as before; classifier-free guidance only comes in at sampling time.
During sampling, I use classifier-free guidance by computing two noise estimates, one with the class conditioning masked out and one with it. The results look very good.
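A sketch of that guided noise estimate for the class-conditioned UNet (the guidance scale is an assumed value):

```python
import torch

def cfg_class_noise(unet, x_t, t_norm, c, gamma=5.0):
    """Classifier-free guidance with the class-conditioned UNet."""
    eps_uncond = unet(x_t, t_norm, torch.zeros_like(c))  # conditioning masked out
    eps_cond = unet(x_t, t_norm, c)
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```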
This project was rewarding, especially the second part, where I debugged a model built from scratch. There are many components, from the building blocks to training and sampling, and it was helpful for me to go through the motions of debugging and pinpointing exactly where to look for mistakes, despite it being painful at times. I also appreciated being able to work with pretrained models in part 1!