When training an autoencoder, the objective is for the model to reconstruct its own input. For this reason, the target value y is equal to the input itself. This makes autoencoders a form of unsupervised learning, since no external labels are required; the model simply learns to reproduce the data it receives.
One limitation of a standard autoencoder is that it is difficult to generate new realistic samples from it. This is because the distribution of the latent (hidden) representations is not explicitly constrained or known. As a result, randomly sampling from the latent space does not guarantee meaningful outputs. To address this issue, Variational Autoencoders (VAEs) were introduced. VAEs impose a probabilistic structure on the latent space, usually a standard normal distribution, making sampling possible and more reliable.
Another advantage of VAEs is that they are generally easier to train than Generative Adversarial Networks (GANs). While GANs rely on a delicate balance between two competing neural networks, VAEs optimize a single, well-defined objective function, which leads to more stable training.
In a VAE, new data samples are generated by first sampling a latent vector from a D-dimensional standard normal distribution and then passing this vector through the decoder. The VAE loss function includes not only a reconstruction term, such as mean squared error, but also a Kullback–Leibler (KL) divergence term. This additional term ensures that the learned latent distribution remains close to the desired prior distribution, enabling meaningful sampling.
Moving to diffusion models, these systems generate data by progressively removing noise from a sample. Neural networks are used specifically in the reverse diffusion process, where they learn how to denoise data step by step. The forward process, which adds noise, is fixed and does not involve learning. One drawback of diffusion models is the large number of reverse steps required for generation, but this can be mitigated through model distillation, where a secondary model is trained to produce similar results in fewer steps.
In Latent Diffusion Models (LDMs), such as Stable Diffusion, the diffusion process operates in a compressed latent space instead of directly on pixel data, improving efficiency. Text conditioning in these models is achieved by encoding text (for example, using CLIP embeddings) and injecting this information into a UNet architecture through cross-attention mechanisms. This allows the model to align generated images with the input text.
Finally, the dimensionality of the latent representation in a diffusion model remains consistent with the representation used (either the original data space or a compressed latent space). Stable Diffusion, in particular, is based on the Latent Diffusion Model (LDM) architecture, which combines efficiency with high-quality image generation.

© Image. https://ca.wikipedia.org/wiki/Autoencoder
Bonus
Written down your ideas.
@Yolanda Muriel Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)

