This project explores diffusion models for image generation and editing.
In the first part, I experiment with pre-trained diffusion models, implement diffusion sampling loops,
and use them for tasks including inpainting and creating optical illusions.
Note: seed=42 for all results shown.
I load the 2-stage DeepFloyd IF model from Hugging Face, which is a text-to-image model. Below are some example generations for the given prompts. Notice that the quality of the generated images improves with more inference steps, especially in the first set of realistic images.
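A rough sketch of how the two stages can be loaded and chained with the diffusers library (the checkpoint names, fp16 settings, and prompt are the standard ones from the model card, not necessarily my exact setup):

```python
import torch
from diffusers import DiffusionPipeline

# Stage I generates a 64x64 image; Stage II upscales it to 256x256.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None,
    variant="fp16", torch_dtype=torch.float16)

prompt = "an oil painting of a snowy mountain village"  # example prompt
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
image_64 = stage_1(prompt_embeds=prompt_embeds,
                   negative_prompt_embeds=negative_embeds,
                   num_inference_steps=20, output_type="pt").images
image_256 = stage_2(image=image_64, prompt_embeds=prompt_embeds,
                    negative_prompt_embeds=negative_embeds,
                    output_type="pil").images[0]
```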
The two main stages of diffusion are the forward and the reverse processes. Here I implement the forward process, which takes in a clean image and adds randomly sampled Gaussian noise scaled by values from a noise schedule (given by the pre-trained model).
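A minimal sketch of the forward process, assuming `alphas_cumprod` holds the cumulative products $\bar\alpha_t$ from the pre-trained model's noise schedule:

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps, eps
```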
Naively denoising images with a Gaussian blur filter does not work very well. I tried several kernel sizes to recover the images, but the results are still very noisy.
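The baseline is just a blur; something like the following sketch (kernel size and sigma are illustrative, not the exact values I swept over):

```python
import torch
import torchvision.transforms.functional as TF

def blur_denoise(noisy_image: torch.Tensor, kernel_size: int = 7, sigma: float = 2.0):
    # Classical "denoising": Gaussian-blur the noisy image and hope for the best.
    return TF.gaussian_blur(noisy_image, kernel_size=kernel_size, sigma=sigma)
```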
I can use the pretrained diffusion model to denoise in a single step. For each noisy image,
I estimate the noise by passing it through the model along with the text embedding for the prompt
"a high quality photo". Then I remove the noise by reversing the forward process in one step.
The results are much better than simply applying a blurring filter, but still not very sharp.
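A sketch of the one-step estimate, assuming `unet` is the Stage I UNet and `prompt_embeds` is the embedding of "a high quality photo" (variable names are mine):

```python
def one_step_denoise(unet, x_t, t, prompt_embeds, alphas_cumprod):
    """Estimate the clean image from x_t in a single step."""
    abar_t = alphas_cumprod[t]
    out = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    eps_hat = out[:, :3]  # IF's UNet also predicts a variance; keep the 3 noise channels
    # Invert x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps for x0:
    return (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
```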
With iterative denoising, I start from the noisy image at a large timestep and denoise it in strides, stepping backward through the timesteps. The image is recovered gradually as the timestep decreases.
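One strided update looks roughly like the sketch below (I drop the small added noise term $v_\sigma$ for brevity; `x0_hat` is the clean-image estimate from the current noise prediction):

```python
def denoise_step(x_t, x0_hat, t, t_prev, alphas_cumprod):
    """Move from timestep t down to the smaller timestep t_prev."""
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = abar_t / abar_prev
    beta = 1 - alpha
    return (abar_prev.sqrt() * beta / (1 - abar_t)) * x0_hat \
         + (alpha.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x_t
```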
We can also generate images from scratch by running the same iterative denoising starting from pure noise.
Here sampling is improved with classifier-free guidance, which involves calculating both a conditional and an unconditional noise estimate and extrapolating past the unconditional one. The results have noticeably higher quality.
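As a sketch (the guidance scale here is an assumed value, and `uncond_embeds` is the embedding of the empty prompt):

```python
def cfg_noise(unet, x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
    """Classifier-free guidance: extrapolate past the unconditional estimate."""
    eps_uncond = unet(x_t, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    eps_cond = unet(x_t, t, encoder_hidden_states=cond_embeds).sample[:, :3]
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```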
Here we look at the task of editing. Following the SDEdit algorithm, I run the forward process to get a noisy image, and then run the iterative denoising algorithm. The results show a series of edits progressively closer to the target image.
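In code this is just the two pieces above glued together; `iterative_denoise` is an assumed wrapper around the denoising loop:

```python
def sdedit(x_orig, t_start, alphas_cumprod):
    # Noise the original image to an intermediate timestep, then denoise it back.
    x_t, _ = forward(x_orig, t_start, alphas_cumprod)  # forward() from the earlier sketch
    return iterative_denoise(x_t, t_start)             # assumed denoising-loop wrapper
```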
We can do the same thing with hand-drawn pictures and web images, whose edits tend to look better than those of real photographs. I particularly like the flower drawing here and how its shape and colors show up in each edit.
For inpainting, I mask out part of an image and force the model to keep the original pixels everywhere else: after each denoising step, I swap in the (appropriately noised) original pixels outside the mask, which preserves the rest of the photo.
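Roughly, the step after each denoising update looks like this (mask = 1 where new content should be generated):

```python
def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """Re-impose the known pixels outside the mask after a denoising step."""
    x_orig_t, _ = forward(x_orig, t, alphas_cumprod)  # original image noised to timestep t
    return mask * x_t + (1 - mask) * x_orig_t
```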
We can also guide SDEdit with a text prompt. The first photo input is the campanile with the text prompt "rocket ship", the second is a cat with the prompt "photo of a dog", and the third is a hand-drawn flower with the prompt "oil painting of a campfire". I like how you can see the model recovering the shape of the cat in the second series of photos, and the colors in the third.
Visual anagrams are images that show one thing upright and another when flipped upside down. This is simple to implement: we flip the input image and calculate two different noise estimates, one for each orientation and prompt. I noticed that simply taking the average is not always enough, probably due to some model bias, so for some results I took an empirical weighted average of the two noise estimates. I think all the photos here turned out really well, and I really like the hipster barista holding a coffee cup!
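A sketch of the combined noise estimate (the weight is the empirical knob I mentioned; 0.5 is the plain average):

```python
import torch

def anagram_noise(unet, x_t, t, embeds_upright, embeds_flipped, w=0.5):
    """Combine the upright noise estimate with a flipped estimate of the flipped image."""
    eps1 = unet(x_t, t, encoder_hidden_states=embeds_upright).sample[:, :3]
    eps2 = unet(torch.flip(x_t, dims=[2]), t,
                encoder_hidden_states=embeds_flipped).sample[:, :3]
    eps2 = torch.flip(eps2, dims=[2])  # flip the estimate back before combining
    return w * eps1 + (1 - w) * eps2
```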
Hybrid images are created by taking noise estimates from two different text prompts and combining the low-frequency components of one with the high-frequency components of the other. I loved the way the first three hybrid images turned out. I think having the prompt styles match (e.g. watercolor in photo 3) really helps, but I also had to experiment a lot with various text prompts. The watercolor sunset and beach is a failure case, where the two prompts blended into each other rather than showing up at different viewing distances.
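A sketch of the frequency split using a Gaussian blur as the low-pass filter (kernel size and sigma are assumed values):

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet, x_t, t, embeds_far, embeds_near, kernel_size=33, sigma=2.0):
    """Low frequencies from the 'seen from afar' prompt, high frequencies from the 'up close' prompt."""
    eps_far = unet(x_t, t, encoder_hidden_states=embeds_far).sample[:, :3]
    eps_near = unet(x_t, t, encoder_hidden_states=embeds_near).sample[:, :3]
    low = TF.gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size=kernel_size, sigma=sigma)
    return low + high
```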
In the second part, I implement and train my own diffusion models from scratch on the MNIST dataset.
The first objective is to directly predict a clean image from a noisy image. We can formulate our objective as follows:
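In standard form this is the simple $L_2$ loss between the denoiser's output and the clean image, where $z = x + \sigma\epsilon$ is the noisy input:

$$\mathcal{L} = \mathbb{E}_{x,\,\epsilon}\big[\,\lVert D_\theta(z) - x \rVert^2\,\big],\qquad z = x + \sigma\epsilon,\quad \epsilon \sim \mathcal{N}(0, I).$$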
For this task, I created a noisy MNIST dataset by manually adding Gaussian noise to the torch MNIST dataset.
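Building the pairs is only a couple of lines; a sketch using torchvision's MNIST (paths and transforms are assumptions):

```python
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())

def add_noise(x, sigma=0.5):
    """Return the noisy input z = x + sigma * eps used for (noisy, clean) training pairs."""
    return x + sigma * torch.randn_like(x)
```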
The UNet architecture consists of a series of downsampling layers, followed by a bottleneck, and upsampling back to input size.
Training is straightforward and the loss curve is quite smooth. Here I show the results of denoising after the first and fifth epoch of training.
Epoch 1:
Epoch 5:
We trained at a noise level of sigma = 0.5, but the model performs reasonably well at other noise levels.
Now we change the objective to predicting the noise instead. This is a more difficult problem, but more robust, as it allows us to sample starting from a pure noise image. This is the new objective function:
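Written out in the standard DDPM form, with $x_t$ the noised image at timestep $t$:

$$\mathcal{L} = \mathbb{E}_{x,\,\epsilon,\,t}\big[\,\lVert \epsilon_\theta(x_t, t) - \epsilon \rVert^2\,\big],\qquad x_t = \sqrt{\bar\alpha_t}\,x + \sqrt{1-\bar\alpha_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0, I).$$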
We add time conditioning by injecting it into the decoding blocks through fully connected blocks.
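A sketch of the kind of fully connected block I mean (the exact layer sizes and activation are assumptions):

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps the (normalized) timestep to a per-channel signal injected into a decoder block."""
    def __init__(self, out_channels: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, out_channels), nn.GELU(),
                                 nn.Linear(out_channels, out_channels))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (B,) in [0, 1]  ->  (B, C, 1, 1) so it broadcasts over the feature maps
        return self.net(t[:, None]).unsqueeze(-1).unsqueeze(-1)
```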
I follow the DDPM training algorithm: pick a random clean image, add random noise at a random timestep, and train the UNet to predict that noise.
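One training step looks roughly like this (a sketch; `abar` is assumed to hold $\bar\alpha_t$ for t = 0..T, and the timestep is passed to the UNet normalized to [0, 1]):

```python
import torch
import torch.nn.functional as F

def train_step(unet, x0, abar, T, optimizer, device="cuda"):
    """One DDPM-style step: noise a clean batch at random timesteps, predict the noise."""
    x0 = x0.to(device)
    t = torch.randint(1, T + 1, (x0.shape[0],), device=device)
    eps = torch.randn_like(x0)
    ab = abar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```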
Training is noisier than before as it is a much harder task, but still achieves a good loss trend.
Now we can sample from pure noise! The sampled digits have clean backgrounds and good definition.
Sometimes the samples from before don't look like recognizable digits. We can improve on this by adding class conditioning, which lets us control which digit is generated. I did this by adding two more fully connected blocks that take in the class label as a one-hot vector. Dropout is also applied so that the class conditioning is masked out 10% of the time.
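A sketch of how the conditioning vector is built (function name and details are mine):

```python
import torch
import torch.nn.functional as F

def make_class_cond(labels: torch.Tensor, p_drop: float = 0.1) -> torch.Tensor:
    """One-hot class vectors, randomly zeroed out 10% of the time for classifier-free guidance."""
    c = F.one_hot(labels, num_classes=10).float()
    keep = (torch.rand(c.shape[0], device=c.device) >= p_drop).float()
    return c * keep[:, None]
```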
Training otherwise looks the same as before; classifier-free guidance only comes in at sampling time.
During sampling, I use classifier-free guidance by computing two noise estimates, one with the class conditioning masked out and one with it. The results look very good.
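A sketch of that guided noise estimate for the class-conditioned UNet (the guidance scale is an assumed value):

```python
import torch

def cfg_class_noise(unet, x_t, t_norm, c, gamma=5.0):
    """Classifier-free guidance with the class-conditioned UNet."""
    eps_uncond = unet(x_t, t_norm, torch.zeros_like(c))  # conditioning masked out
    eps_cond = unet(x_t, t_norm, c)
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```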
This project was rewarding, especially the second part, where I debugged a model built from scratch. There are many components, from the building blocks to training and sampling, and it was helpful for me to go through the motions of debugging and pinpointing exactly where to look for mistakes, despite it being painful at times. I also appreciated being able to work with pretrained models in part 1!