This is an older video (3 years old, which is ancient by internet standards), but they break down the principles of GANs (generative adversarial networks) and image generation with just pen and paper, which is pretty impressive.

I still find it wild that image generators came about from training models to remove noise from an image (that’s the “diffusion” in diffusion models). Yes, an AI image starts out as the static you used to see on TV that your parents told you not to stare at:

There’s a reason noise specifically was picked (it can be created algorithmically and therefore also undone reliably with algorithms), and the way you train an image generation model is twofold: first, a noising process (just an algorithm, nothing learned) adds noise to an image whose content you know, keeping track of how much it added at each pass. So for any image it noises up, you the researcher know: what the original depicts, how much noise has been added overall, and which pass/step of the noising process it’s at (with the last step being 100% noise).
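
To make that concrete, here’s a minimal sketch of the forward (noising) process in the style of DDPM-type diffusion models, in numpy. Everything here (the schedule constants, the 64×64 random “image”) is an illustrative assumption, not something from the video:

```python
import numpy as np

def make_schedule(num_steps=24, beta_start=1e-4, beta_end=0.2):
    """Noise schedule: how much noise gets mixed in at each pass."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)   # cumulative "signal remaining"
    return alpha_bars

def noise_image(x0, step, alpha_bars, rng):
    """Noise a known clean image x0 up to a given step.

    Returns the noisy image and the exact noise that was mixed in,
    so the researcher knows the original, the noise, and the step.
    """
    eps = rng.standard_normal(x0.shape)    # pure Gaussian noise
    a = alpha_bars[step]
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.random((64, 64))                  # stand-in for a clean image
alpha_bars = make_schedule()
x4, eps4 = noise_image(x0, step=4, alpha_bars=alpha_bars, rng=rng)
```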

Over however many passes you choose, the image will go from 0% to 100% noise. It’s not linear as in adding 10% noise each time; there are a few different schedules that decide how much noise to add when. You can tell it 100% will be reached at 24 passes, for example, or 60, and it might start with 1% or 2% noise increments at first and then go from 90% to 100% noise in the final step. Different schedules produce different results.
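
For a feel of how those schedules differ, here’s a quick numpy comparison of a linear-beta schedule against the cosine schedule from the diffusion literature; the constants are assumptions, only the shapes of the curves matter:

```python
import numpy as np

def linear_alpha_bars(num_steps):
    """Linear betas: small noise increments at first, bigger ones near the end."""
    betas = np.linspace(1e-4, 0.2, num_steps)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bars(num_steps, s=0.008):
    """Cosine schedule: slow at both ends, faster in the middle."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f[1:] / f[0]

for n in (24, 60):                          # "100% at 24 passes, or 60"
    lin, cos = linear_alpha_bars(n), cosine_alpha_bars(n)
    print(n, "steps | signal left at the last step:",
          round(float(lin[-1]), 4), "(linear) vs",
          round(float(cos[-1]), 4), "(cosine)")
```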

Anyway. Once you’ve made this set, you bring in a second model and give it these noisy images. You might start with a low-noise image where the original content is still largely recognizable, and ask the model: this is a picture of a rabbit at step 4 of the noising process, made with noising schedule X. Can you denoise it and bring it back to step 3?

And it will do it thanks to very advanced math.
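
That math is less mystical than it sounds: in the common setup, the model is trained to predict the noise that was mixed in, and “step 4 back to step 3” is computed from that guess. A rough PyTorch sketch of one training step; the tiny linear “model” is a placeholder, not a real architecture:

```python
import torch

# Placeholder denoiser: a real one is a U-Net that also takes the
# step number (and, later, a text prompt) as input.
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(64 * 64, 64 * 64),
    torch.nn.Unflatten(1, (1, 64, 64)),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
alpha_bars = torch.cumprod(1 - torch.linspace(1e-4, 0.2, 24), dim=0)

def train_step(x0):
    """Noise a known image to a random step, then ask the model to
    predict exactly which noise was added (MSE loss on the noise)."""
    t = torch.randint(0, 24, (1,))
    eps = torch.randn_like(x0)
    a = alpha_bars[t].view(-1, 1, 1, 1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * eps   # noisy image at step t
    loss = torch.nn.functional.mse_loss(model(xt), eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

x0 = torch.rand(1, 1, 64, 64)   # stand-in for the known rabbit picture
print(train_step(x0))
```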

Over time, the second model learns to undo higher and higher noise levels, but always one step at a time: it’s literally reversing the noising process, which was itself applied one step at a time. Eventually you will be able to tell that model “this is a picture of a rabbit at step 24 of the noising process [100% noise], can you denoise it and bring it to step 23?”
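
Each of those one-step reversals has a closed form once the model has guessed the noise. A sketch of a single DDPM-style reverse step; the `predicted_eps` argument stands in for a trained model’s output, which is assumed rather than implemented here:

```python
import numpy as np

betas = np.linspace(1e-4, 0.2, 24)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def reverse_step(xt, t, predicted_eps, rng):
    """Estimate the image at step t-1 from the image at step t plus
    the model's guess of the noise that was added."""
    mean = xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * predicted_eps
    mean = mean / np.sqrt(alphas[t])
    if t == 0:
        return mean                       # last step: no fresh noise
    z = rng.standard_normal(xt.shape)     # a little fresh noise keeps
    return mean + np.sqrt(betas[t]) * z   # the samples from collapsing

rng = np.random.default_rng(0)
x24 = rng.standard_normal((64, 64))       # "step 24", i.e. pure noise
x23 = reverse_step(x24, t=23, predicted_eps=np.zeros_like(x24), rng=rng)
```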

But here’s the catch. Eventually, we stop giving this model noise it knows, and instead generate jpegs of noise that it’s never seen before. These are images that were only ever noise; there’s nothing in them but mathematically arranged pixels.

This means there’s theoretically nothing for the network to denoise, so what happens? Well, the model still thinks it’s undoing noise, so it will find what it thinks is there: whatever your prompt describes. When you write an image gen prompt, you are essentially telling the denoising model “this is a 100% noisy picture of a rabbit at final step N [you can select the number of steps when prompting], can you bring it to noise level N-1?”, and then the process repeats to N-2, N-3, and so on until it reaches step 0, i.e. 0% noise left.
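
Putting it all together, prompting is just that reverse loop run on fresh noise. A self-contained sketch; the `model` stub and `generate` helper are hypothetical stand-ins (a real denoiser is a text-conditioned network, not a function returning zeros):

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.2, 24)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def model(xt, t, prompt):
    # Hypothetical trained denoiser: would return its guess of the
    # noise hiding the prompt's content. Stubbed out for the sketch.
    return np.zeros_like(xt)

def generate(prompt, num_steps=24, shape=(64, 64)):
    """Start from noise the model has never seen and walk it back
    from step N to step 0, one denoising pass at a time."""
    xt = rng.standard_normal(shape)        # the fresh "TV static"
    for t in reversed(range(num_steps)):   # N-1, N-2, ..., 0
        eps = model(xt, t, prompt)
        mean = xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps
        mean = mean / np.sqrt(alphas[t])
        if t == 0:
            xt = mean                      # step 0: "0% noise left"
        else:
            xt = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return xt

image = generate("a rabbit")
```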

And you’ve just generated an image out of pure noise.

You are literally tricking math into seeing something that’s not there. In TV static, lol (okay, it’s technically not TV static; there are more complex algorithms involved than plain Gaussian noise, but you get the gist).

If it still doesn’t make too much sense (there’s also a ton of other training that goes on to get convincing results), visualize it like this: when you give the AI 100% noise and tell it there’s a rabbit in there, it will look for patterns it knows from previous training. Your 100% noise has some ‘shapes’ in it; it’s not a completely random distribution, it has math behind it that just makes it look random. If you look at the pic of static above, you might see shapes too; they appear and disappear. So the model will find an area of pixels that kinda looks like the legs or ears of a rabbit based on the patterns it successfully learned in training, and then decide “oh, I know this pattern, this is clearly the ears of the rabbit I’m supposed to find, let’s denoise it into what I know rabbit ears look like when they match this pattern of pixels”.