Silky Smooth Animation with Stable Diffusion & Friends
While I have been having a lot of fun lately creating my own models for use with Stable Diffusion, I keep bumping up against the limits of its video creation workflows.
Deforum's approach, while admittedly opaque with its fields of decimal numbers, trigonometric functions, and JSON-formatted prompt timeline, actually makes a lot of sense to me. Don't get me wrong, the Stable Diffusion video tooling space could use a lot of work, but that isn't my stumbling block. The problem is that I'm not aesthetically interested in the result. Because of its process (take a generated image, add noise to it, diffuse it again, repeat for however many frames), the output has an unstable, jittery, flickery quality to it.
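To make that process concrete, here's a minimal sketch of that feedback loop using the diffusers library. This is not Deforum's actual code (Deforum also warps each frame with zoom/rotation/translation schedules before re-diffusing, which is what many of those decimal fields control); the model ID, frame count, and strength value are placeholder choices.

```python
# Minimal sketch of the re-noise-and-re-diffuse loop behind this style of animation.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # placeholder model choice
txt2img = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "impasto sunset landscape"
frames = [txt2img(prompt).images[0]]  # seed frame from plain text-to-image

for i in range(120):  # "repeat for however many frames"
    # Each frame starts from the previous one, gets partially re-noised
    # (controlled by `strength`), and is diffused again. That per-frame
    # re-noising is where the jittery, flickery quality comes from.
    frame = img2img(prompt=prompt, image=frames[-1], strength=0.45).images[0]
    frames.append(frame)

for i, frame in enumerate(frames):
    frame.save(f"frame_{i:04d}.png")
```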
I've seen some folks make this look work for them, but I haven't quite figured out the settings to get something I'm happy with for my own work. There's an inherent stop-motion quality to this image-to-image approach, and I'm used to video synthesis being really flowy and organic. As organic as voltages can be, I guess.
My previous experiments with AI video have centered on StyleGAN (StyleGAN2, specifically). What draws me to StyleGAN is that the video is very flowy and organic. Sure, it has a characteristic "morphing" quality, but I'd take characteristic "morphing" over characteristic "jittering" at the moment.
Dealing with StyleGAN
With StyleGAN, the real power lies in training your own models. There are plenty of models out there to play with, but unless you want to be generating videos of someone else's very specifically curated dataset, you gotta make your own.
I'll be honest: making your own StyleGAN models is not easy or straightforward, and the barriers have only gotten higher in the past couple of months.
First, you need thousands of images. They need to be square. They need to be scaled to a power of 2 (128px, 256px, 512px, etc.). These images should ideally be aligned or positioned similarly. They need to be similar enough that StyleGAN can learn the generalities of the image content, but not too similar or you'll get overfitting.
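To give a sense of what that preprocessing step involves, here's the kind of pass I mean, using Pillow to center-crop everything to a square and resize it to 512 × 512; the folder names are placeholders.

```python
# Rough preprocessing sketch: center-crop to a square, resize to a power of two.
from pathlib import Path
from PIL import Image

SRC, DST, SIZE = Path("raw_images"), Path("dataset_512"), 512  # placeholder paths
DST.mkdir(exist_ok=True)

for path in sorted(SRC.glob("*.png")):
    img = Image.open(path).convert("RGB")
    side = min(img.size)                      # largest centered square crop
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side))
    img.resize((SIZE, SIZE), Image.LANCZOS).save(DST / path.name)
```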
Second, you need to train your model. At minimum, this takes a few 24-hour days of constantly running code on a GPU; for better results, you're looking at a week or more. Up until a few months ago, an economical option for GPU access was Google Colab. In September, they changed their pricing model to be compute credit-based, throwing a lot of cold water on that approach. You'll burn through those credits in no time, even on one of their Pro+ plans. Be sure to save some credits, though; you'll need them to run the Colab notebooks to actually generate the video from your model.
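For reference, here's roughly what a training run looks like if you use NVIDIA's stylegan2-ada-pytorch repo (an assumption on my part; other StyleGAN forks ship similar scripts). Paths and flag values are illustrative only.

```python
# Sketch of a training run, assuming NVIDIA's stylegan2-ada-pytorch repo is
# cloned into the working directory. All paths and flag values are placeholders.
import subprocess

# Pack the square images into the zip format the repo's training script expects.
subprocess.run([
    "python", "dataset_tool.py",
    "--source=dataset_512",
    "--dest=datasets/sunsets.zip",
], check=True)

# Train, resuming from the pretrained FFHQ weights (transfer learning) rather
# than starting from scratch; this converges far faster.
subprocess.run([
    "python", "train.py",
    "--outdir=training-runs",
    "--data=datasets/sunsets.zip",
    "--gpus=1",
    "--resume=ffhq512",   # start from the 512px FFHQ checkpoint
    "--snap=10",          # save a snapshot every 10 ticks
], check=True)
```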
It's really no wonder a lot of machine learning-based art has moved from StyleGAN to Stable Diffusion, Midjourney, and DALL-E. It's cheaper, quicker, and you get access to a far larger range of content.
Best of Both Worlds
So how do we get the silky smooth StyleGAN animation with Stable Diffusion? You train a StyleGAN model from Stable Diffusion images, of course.
Stable Diffusion is in many ways ideal for this purpose. Oh, you need a lot of 512px × 512px images? SD spits out 512px square images all day long. If you want them aligned, use img2img to generate similar compositions. Generate lots of variations and you won't have overfitting issues. Boom. Easy.
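As a sketch of what that dataset-building loop can look like with diffusers (model ID, prompt, and counts are all placeholders): a plain text-to-image loop with a different seed per image gets you thousands of square, varied-but-on-theme images.

```python
# Sketch of bulk dataset generation with diffusers; every value is a placeholder.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "impasto sunset landscape, thick textured oil paint"

for seed in range(2000):
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]   # 512 x 512 by default
    image.save(f"raw_images/sunset_{seed:05d}.png")
    # For tighter alignment, swap in StableDiffusionImg2ImgPipeline with one
    # reference composition and a moderate strength value.
```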
There's really no getting around the downsides of training, unfortunately. It takes time and costs money. That's just the reality. 🤷
Impasto Sunsets
When someone presents you with a magical art machine that can conjure images from whatever words you feed it, you may be inclined to feed it words that describe the impossible. The impossible for me is actually impasto painting. I'm not a great painter, and I've never really been able to pull off that thick and dimensional look.
Long before I had any kind of handle on prompt engineering, I just typed "impasto sunset landscape" and variations thereof and it was giving me back these really gorgeous, convincing images. So I tweaked the code of the Colab notebook I was using and had it generate 2,300 impasto sunsets. It took a little over an hour.
As I had recently been working in StyleGAN, I immediately made the connection: "this should be a StyleGAN model." And lucky for me, I started this process at the tail end of August 2022, a few weeks before Colab changed its pricing.
So I trained this model on and off for about a week and a half, for a total of 10,880 iterations. I probably didn't even need to train that long. Because I was training from an existing model (FFHQ), it learned remarkably quickly what an impasto sunset is. As it trained, it picked up more detail and more variety.
Okay, so now I have a StyleGAN model that can generate infinite images of impasto sunsets. Well, I already had that with Stable Diffusion. So let's do what SD can't (yet), and make it move organically:
Isn't that cool? I used the circular loop generation method to create a sunrise animation from the model. One of the great things you can do with StyleGAN that you can't with Stable Diffusion video is have it end where it starts: the above video loops perfectly and will run infinitely, if you so choose.
True to its name, a noise loop is noisier and more chaotic, but still far smoother than what SD can do. Watch the sun zip around the sky in a pleasantly unnatural way.
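For the curious, here's a numpy-only sketch of the latent paths behind these two kinds of loop. Rendering frames means feeding each z vector through the trained generator, which I've left out, and the noise-loop half is a dependency-free stand-in, since the notebooks I've seen typically sample OpenSimplex noise instead.

```python
# Numpy-only sketch of the two latent paths; the generator call itself is omitted.
import numpy as np

DIM, N_FRAMES = 512, 240           # StyleGAN2 z-dimension, frames per loop
rng = np.random.default_rng(0)
thetas = np.linspace(0.0, 2.0 * np.pi, N_FRAMES, endpoint=False)

# Circular loop: orbit the plane spanned by two random latent vectors.
# The path returns exactly to its starting point, so the video loops seamlessly.
a, b = rng.standard_normal(DIM), rng.standard_normal(DIM)
circular_z = np.stack([np.cos(t) * a + np.sin(t) * b for t in thetas])

# Noise loop (stand-in): a smooth periodic wiggle per latent dimension, using
# random-phase sinusoids to keep the sketch dependency-free. Wobblier than the
# circular orbit, but the path still closes on itself.
phases = rng.uniform(0.0, 2.0 * np.pi, DIM)
freqs = rng.integers(1, 4, DIM)    # whole numbers of cycles so the path loops
noise_z = np.stack([np.sin(freqs * t + phases) for t in thetas])

# Each row of circular_z / noise_z is one frame's z vector for the generator.
```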
Speaking of unnatural, some of my favorite results came from pushing the truncation value harder than normal and ending up in much more abstract territory. It's definitely more textural, and the colors shifted away from the sunset palette. At points it looks like an abalone shell or chromatography.
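For context, the truncation trick is just a linear pull toward (or, past ψ = 1, a push away from) the generator's average latent; in stylegan2-ada-pytorch it's exposed as the truncation_psi argument, and the sketch below is only the underlying formula.

```python
# The truncation trick reduced to its formula. w_avg is the generator's average
# latent (G.mapping.w_avg in stylegan2-ada-pytorch); w is a mapped latent vector.
def truncate(w, w_avg, psi):
    # psi < 1 pulls samples toward the average (safer, blander images);
    # psi > 1 pushes past the sampled latent into weirder, more abstract territory.
    return w_avg + psi * (w - w_avg)
```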
Is It Worth It?
The short answer is: if you like what you get out of it, then yes, the process is worth it. The longer answer is: it depends.
Training a model is no guarantee of a good result. Beyond the technical requirements, StyleGAN definitely wants its input data to have a certain consistency of form before it gives you something satisfying. Figuring out what that form is can take some trial and error.
My first experiments with training my own StyleGAN models weren't that successful, to be honest. I really wanted to create a model based on stills from my video synth work. I ended up with a couple of different models, but because the dataset was too diverse, StyleGAN didn't really seem to know what to do with it. And it showed in the images and videos it created.
Honestly, I think I lucked out with my impasto sunset prompt. Most of the images had a clear horizon line; a good amount had a sun; there was a diversity of colors, but they stuck to a similar palette.
If I were to train a new StyleGAN model based on Stable Diffusion outputs, I would use img2img or the Stable Diffusion 2 depth map model. Those approaches would generate images with a similar enough structure for StyleGAN to be effective.
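Here's a sketch of the depth-map route with diffusers, assuming the stabilityai/stable-diffusion-2-depth checkpoint; the reference image, prompt, and strength are placeholders. The depth map extracted from the reference keeps every variation on roughly the same layout while the prompt and seed vary the surface.

```python
# Sketch of building a structurally consistent dataset with SD2's depth model.
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from PIL import Image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

reference = Image.open("reference_sunset.png")   # placeholder reference image
prompt = "impasto sunset landscape, thick textured oil paint"

for seed in range(2000):
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt=prompt, image=reference, strength=0.7,
                 generator=generator).images[0]
    image.save(f"raw_images/depth_{seed:05d}.png")
```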
It's a technique that definitely takes a lot of playing around. It also takes a lot of babysitting a GPU for a week or more. And at this point, it's going to take some cash to pull off.