Understanding Self-Supervised Learning II
Published on Tuesday, 08-07-2025

(Adapted from CS224N and MIT 6.S191)
Self-supervised learning (SSL) has emerged as a powerful paradigm in machine learning, enabling models to learn meaningful representations from unlabeled data. In this blog post, we’ll explore key concepts from a lecture on self-supervised learning, focusing on generative and discriminative models, autoencoders, variational autoencoders (VAEs), and their implementations using PyTorch. We’ll break down each concept clearly, provide mathematical derivations for VAEs, and include code snippets where applicable.
1. Supervised vs. Unsupervised Learning
Supervised Learning
Supervised learning involves training models on labeled data, where each input $x$ is paired with a label $y$. The goal is to learn a function that maps inputs to outputs, such as in classification or regression tasks.
- Data: Pairs of $(x, y)$, where $x$ is the input and $y$ is the label.
- Goal: Learn a mapping $f: x \mapsto y$.
- Examples: Image classification, object detection, semantic segmentation.
Unsupervised Learning
Unsupervised learning deals with unlabeled data, where only inputs $x$ are available. The objective is to uncover hidden structures or patterns in the data.
- Data: Only $x$, no labels.
- Goal: Discover underlying structures, such as clusters or latent representations.
- Examples: Clustering, dimensionality reduction, density estimation.
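To make the dimensionality-reduction case concrete, here is a minimal sketch of PCA computed with a singular value decomposition in PyTorch. The synthetic data and all variable names are assumptions made for this illustration; they do not come from the lecture.

```python
import torch

torch.manual_seed(0)

# Synthetic unlabeled data (assumed for illustration): 200 points in 3-D that
# mostly vary along one direction, plus a little isotropic noise.
direction = torch.tensor([1.0, 2.0, -1.0])
direction = direction / direction.norm()
X = torch.randn(200, 1) * 3.0 * direction + torch.randn(200, 3) * 0.1

# Center the data, then take the SVD; rows of Vh are the principal directions.
X_centered = X - X.mean(dim=0)
U, S, Vh = torch.linalg.svd(X_centered, full_matrices=False)

# Project onto the first principal component to get a 1-D representation.
X_reduced = X_centered @ Vh[0]

# Fraction of the total variance captured by each component.
explained = S**2 / (S**2).sum()
print(f"Explained variance ratio: {explained.tolist()}")
```

No labels are used anywhere: the structure (one dominant direction) is recovered from the inputs alone.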
PyTorch Example: Simple Supervised Learning
Here’s a basic PyTorch example for supervised learning using a linear model for regression:
import torch
import torch.nn as nn
import torch.optim as optim

# Generate synthetic data: y = 3x + 2 plus small Gaussian noise
X = torch.randn(100, 1)
y = 3 * X + 2 + torch.randn(100, 1) * 0.1

# Define a linear model
model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(1000):
    optimizer.zero_grad()
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()

print(f'Learned parameters: {model.weight.item():.2f}, {model.bias.item():.2f}')

This code trains a linear model to predict $y$ from $x$, demonstrating supervised learning. The learned weight and bias should land close to the true values 3 and 2.
2. Generative vs. Discriminative Models
Discriminative Models
Discriminative models learn the conditional probability distribution $p(y \mid x)$, focusing on mapping inputs to labels. They are typically used in supervised learning tasks.
- Definition: Learn $p(y \mid x)$.
- Examples: Logistic regression, support vector machines, neural classifiers.
Generative Models
Generative models learn the joint probability distribution $p(x, y)$ or, when no labels are available, the data distribution $p(x)$, allowing them to generate new data samples. In self-supervised learning, they are used to model the data distribution without labels.
- Definition: Learn $p(x, y)$ or $p(x)$.
- Examples: Variational autoencoders, generative adversarial networks (GANs).
Conditional Generative Models
These models learn $p(x \mid y)$, generating data conditioned on specific labels or inputs, such as text-to-image models.
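As a toy illustration of conditional generation, the sketch below fits one diagonal Gaussian per class and then samples $x$ given a chosen label. The two-class synthetic data and the helper name `sample_given_label` are assumptions made for this example; real conditional models such as text-to-image systems are far more elaborate.

```python
import torch

torch.manual_seed(0)

# Synthetic labeled data (assumed): two classes with different 2-D means.
means = torch.tensor([[0.0, 0.0], [4.0, 4.0]])
y = torch.randint(0, 2, (1000,))
x = means[y] + 0.5 * torch.randn(1000, 2)

# Model p(x | y) with one diagonal Gaussian per class, estimated from the data.
class_mean = torch.stack([x[y == k].mean(dim=0) for k in range(2)])
class_std = torch.stack([x[y == k].std(dim=0) for k in range(2)])

def sample_given_label(label, n):
    """Draw n samples from the fitted p(x | y = label). Hypothetical helper."""
    return class_mean[label] + class_std[label] * torch.randn(n, 2)

samples = sample_given_label(1, 500)
print(samples.mean(dim=0))  # should land near the class-1 mean
```

Changing the label changes which region of data space the samples come from, which is the defining behavior of a conditional generative model.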
PyTorch Example: Simple Generative Model
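Here’s a basic generative model that fits a 2D Gaussian distribution by maximum likelihood and generates samples by transforming latent noise:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributions as dist

# Generate synthetic 2D Gaussian data
data = torch.randn(1000, 2) * 0.5 + torch.tensor([2.0, 3.0])

# Define a generative model: x = mu + sigma * z with z ~ N(0, I)
class GenerativeModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(2))
        self.log_sigma = nn.Parameter(torch.zeros(2))

    def forward(self, z):
        # Map latent noise to a sample from the model distribution
        return self.mu + self.log_sigma.exp() * z

    def log_prob(self, x):
        # Log-density of x under the model, summed over the two dimensions
        return dist.Normal(self.mu, self.log_sigma.exp()).log_prob(x).sum(dim=1)

model = GenerativeModel()
optimizer = optim.Adam(model.parameters(), lr=0.05)

# Training loop: minimize the negative log-likelihood of the data
for epoch in range(1000):
    optimizer.zero_grad()
    loss = -model.log_prob(data).mean()
    loss.backward()
    optimizer.step()

# Generate new samples from latent noise
with torch.no_grad():
    generated = model(torch.randn(1000, 2))
print(f"Generated mean: {generated.mean(dim=0)}, std: {generated.std(dim=0)}")

This model learns the mean and standard deviation of the data and generates new samples by transforming noise $z \sim \mathcal{N}(0, I)$, illustrating the concept of generative modeling.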
3. Autoencoders

Autoencoders are neural networks designed to learn a compressed representation of the input data in an unsupervised manner. They consist of an encoder that maps the input to a latent space and a decoder that reconstructs the input from the latent representation.
- Objective: Minimize reconstruction loss, i.e., ensure the output is as close as possible to the input.
- Applications: Dimensionality reduction, denoising, feature learning.
PyTorch Example: Autoencoder
Here’s an implementation of a simple autoencoder for MNIST digits:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# Load MNIST dataset
transform = transforms.ToTensor()
train_dataset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)

# Define autoencoder
class Autoencoder(nn.Module):
    def __init__(self):
        super(Autoencoder, self).__init__()
        # Encoder: 784 -> 128 -> 32
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 32)
        )
        # Decoder: 32 -> 128 -> 784, with outputs squashed into [0, 1]
        self.decoder = nn.Sequential(
            nn.Linear(32, 128),
            nn.ReLU(),
            nn.Linear(128, 28 * 28),
            nn.Sigmoid()
        )

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        latent = self.encoder(x)
        reconstructed = self.decoder(latent)
        return reconstructed

model = Autoencoder()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(10):
    for data, _ in train_loader:
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, data.view(-1, 28 * 28))
        loss.backward()
        optimizer.step()
    # Loss of the last batch in this epoch
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

This autoencoder compresses MNIST images into a 32-dimensional latent space and reconstructs them, minimizing the mean squared error.
4. Variational Autoencoders (VAEs)

Variational Autoencoders (VAEs) extend autoencoders by introducing a probabilistic approach to the latent space. Instead of mapping inputs to fixed latent vectors, VAEs learn a distribution (typically Gaussian) characterized by a mean $\mu$ and standard deviation $\sigma$. This allows for sampling from the latent space to generate new data.
- Key Components:
  - Encoder: Outputs $\mu$ and $\sigma$ for the latent distribution.
  - Sampling Layer: Samples $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
  - Decoder: Reconstructs the input from $z$.
  - Loss Function: Combines reconstruction loss (e.g., MSE) and a regularization term (KL-divergence) to ensure the latent distribution is close to a prior (usually $\mathcal{N}(0, I)$).
Why VAEs for Sampling?
Standard autoencoders (AEs) map inputs to a fixed point in the latent space, which is useful for reconstruction but not for generating new data. The latent space of an AE is often unstructured, with no guarantee that sampling arbitrary points will produce meaningful outputs. VAEs address this by modeling the latent space as a probability distribution, typically a Gaussian, defined by $\mu$ and $\sigma$. This probabilistic approach ensures:
- Smooth Latent Space: The latent space is continuous, meaning points close to each other produce similar outputs when decoded.
- Generative Capability: By sampling from the prior distribution (e.g., $\mathcal{N}(0, I)$), VAEs can generate new data points that resemble the training data.
- Regularization: The KL-divergence term in the loss function encourages the latent distribution to be close to a standard normal distribution, preventing the model from overfitting to specific points and ensuring the latent space is well-behaved for sampling.
For example, in an AE, sampling a random point in the latent space might produce gibberish because the latent space lacks structure. In a VAE, the encoder learns $\mu$ and $\sigma$ for each input, and the sampling process ensures that generated samples are meaningful, as the latent space is regularized to follow a known distribution.
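The generation step itself is short: draw $z$ from the prior and pass it through the decoder. The sketch below shows only the mechanics, using an untrained stand-in decoder; the layer sizes and the name `latent_dim` are assumptions for illustration, and a trained VAE decoder would take its place in practice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

latent_dim = 32  # assumed latent size for this sketch

# Untrained stand-in decoder (hypothetical); a trained VAE decoder would go here.
decoder = nn.Sequential(
    nn.Linear(latent_dim, 128),
    nn.ReLU(),
    nn.Linear(128, 28 * 28),
    nn.Sigmoid(),
)

# Generation: draw z from the prior N(0, I) and decode it into image space.
with torch.no_grad():
    z = torch.randn(16, latent_dim)
    generated = decoder(z).view(16, 1, 28, 28)

print(generated.shape)  # torch.Size([16, 1, 28, 28])
```

With a trained decoder the same two lines inside `torch.no_grad()` produce new digit-like images; with a plain autoencoder there is no prior to sample from, which is the gap the VAE closes.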
Derivations of VAEs
To understand VAEs, we derive their objective function, the Evidence Lower Bound (ELBO), and explain the reparameterization trick, which enables training with gradient-based methods.
Objective: Maximizing the Marginal Likelihood
The goal of a VAE is to model the data distribution $p(x)$ by introducing a latent variable $z$. We want to maximize the marginal likelihood $p_\theta(x)$, which is the probability of the observed data under the model parameterized by $\theta$:
$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$
where $p(z)$ is a prior distribution (typically $\mathcal{N}(0, I)$), and $p_\theta(x \mid z)$ is the likelihood of the data given the latent variable, modeled by the decoder.
However, computing this integral is intractable: it has no closed form when the decoder is a neural network, and integrating numerically over a high-dimensional latent space is infeasible. VAEs use variational inference to approximate the true posterior $p_\theta(z \mid x)$ with a variational distribution $q_\phi(z \mid x)$, parameterized by $\phi$ (the encoder).
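In one dimension the marginal likelihood can still be estimated by brute force, which helps build intuition for what the integral means. The sketch below uses a toy model chosen so that the exact answer is known: $z \sim \mathcal{N}(0, 1)$ and $x \mid z \sim \mathcal{N}(z, 1)$, which gives $x \sim \mathcal{N}(0, 2)$. The toy model and the observed value are assumptions made for this example.

```python
import math
import torch
import torch.distributions as dist

torch.manual_seed(0)

x = torch.tensor(1.5)  # a single observation (assumed)

# Draw many latent samples from the prior p(z) = N(0, 1).
prior = dist.Normal(0.0, 1.0)
z = prior.sample((100_000,))

# log p(x) ~= log (1/N) * sum_i p(x | z_i), computed stably with logsumexp.
log_lik = dist.Normal(z, 1.0).log_prob(x)
log_px_mc = torch.logsumexp(log_lik, dim=0) - math.log(z.numel())

# Exact marginal for this toy model: x ~ N(0, 1 + 1), i.e. std = sqrt(2).
log_px_exact = dist.Normal(0.0, math.sqrt(2.0)).log_prob(x)

print(f"Monte Carlo: {log_px_mc.item():.4f}, exact: {log_px_exact.item():.4f}")
```

This works here because the latent space is one-dimensional. As the latent dimension grows, prior samples that explain a given $x$ tend to become vanishingly rare, so the estimate needs impractically many samples; that is the motivation for learning an encoder that proposes good values of $z$.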
Evidence Lower Bound (ELBO)
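We aim to maximize the log-likelihood $\log p_\theta(x)$. Using variational inference, we derive a lower bound on this quantity, known as the ELBO. Starting with the log-likelihood:
$$\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz$$
We introduce $q_\phi(z \mid x)$ and use Jensen’s inequality to derive the ELBO:
$$\log p_\theta(x) = \log \mathbb{E}_{q_\phi(z \mid x)}\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] \geq \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]$$
This gives the ELBO:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$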
The ELBO consists of two terms:
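- Reconstruction Loss: $\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]$, which encourages the decoded samples to match the input data. In practice, its negative is computed as mean squared error (for a Gaussian likelihood) or binary cross-entropy (for a Bernoulli likelihood).
- KL-Divergence: $D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$, which regularizes $q_\phi(z \mid x)$ to be close to the prior $p(z)$, ensuring a structured latent space.
Assuming $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and $p(z) = \mathcal{N}(0, I)$, the KL-divergence term has a closed-form expression:
$$D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
where $d$ is the dimensionality of the latent space, and $\mu_j$, $\sigma_j$ are the mean and standard deviation of the $j$-th dimension.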
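A quick numerical check can confirm the closed-form KL expression against PyTorch’s built-in `kl_divergence`. The values of `mu` and `logvar` below are random stand-ins for encoder outputs, assumed purely for this sketch.

```python
import torch
import torch.distributions as dist

torch.manual_seed(0)

# Hypothetical encoder outputs for a batch of 4 inputs and an 8-D latent space.
mu = torch.randn(4, 8)
logvar = torch.randn(4, 8)  # log of sigma^2
std = torch.exp(0.5 * logvar)

# Closed form: -1/2 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2), per input.
kl_closed = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

# Reference value from torch.distributions, summed over latent dimensions.
q = dist.Normal(mu, std)
p = dist.Normal(torch.zeros_like(mu), torch.ones_like(std))
kl_reference = dist.kl_divergence(q, p).sum(dim=1)

print(kl_closed)
print(kl_reference)
```

Parameterizing the encoder with the log-variance rather than $\sigma$ directly keeps the standard deviation positive without any constraint on the network output.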
Reparameterization Trick
To make the ELBO differentiable, we need to sample $z$ in a way that allows gradients to flow through the sampling process. Direct sampling from $q_\phi(z \mid x)$ is not differentiable, so we use the reparameterization trick:
$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
Here, $\mu$ and $\sigma$ are outputs of the encoder, and $\epsilon$ is a random variable. This reparameterization allows gradients to propagate through $\mu$ and $\sigma$, enabling optimization via backpropagation.
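The effect of the trick can be seen directly with autograd. In the sketch below, `mu` and `logvar` are hypothetical leaf tensors standing in for encoder outputs, and the squared-norm loss is an arbitrary differentiable function chosen only so the gradients are easy to verify by hand.

```python
import torch

torch.manual_seed(0)

# Hypothetical encoder outputs, marked as requiring gradients.
mu = torch.zeros(3, requires_grad=True)
logvar = torch.zeros(3, requires_grad=True)

# Reparameterized sample: the randomness lives in eps, not in mu or sigma.
eps = torch.randn(3)
z = mu + torch.exp(0.5 * logvar) * eps

# Any differentiable function of z now has gradients w.r.t. mu and logvar.
loss = (z ** 2).sum()
loss.backward()

print(mu.grad)      # equals 2 * z, since dz/dmu = 1
print(logvar.grad)  # equals 2 * z * (0.5 * sigma * eps)
```

Because `eps` is sampled independently of the parameters, the sampling step is just a constant in the computation graph, and backpropagation treats `z` as an ordinary function of `mu` and `logvar`.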
Final Loss Function
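The VAE loss is the negative ELBO, combining the reconstruction loss and KL-divergence:
$$\mathcal{L}_{VAE} = -\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] + D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
In practice, for a batch of data, the expectation is approximated by a single sample of $z$ per input, and the reconstruction loss is computed as binary cross-entropy or mean squared error.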
Continuity and Completeness
VAEs aim for:
- Continuity: Points close in latent space produce similar decoded outputs.
- Completeness: Sampling from the latent space yields meaningful outputs.
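Continuity is usually inspected by interpolating between two latent points and decoding each step. The sketch below shows the mechanics with an untrained stand-in decoder; the decoder architecture and the number of steps are assumptions for illustration, not the lecture’s setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

latent_dim = 32  # assumed latent size

# Untrained stand-in decoder (hypothetical); use a trained VAE decoder in practice.
decoder = nn.Sequential(nn.Linear(latent_dim, 28 * 28), nn.Sigmoid())

z_start = torch.randn(latent_dim)
z_end = torch.randn(latent_dim)

# Eight evenly spaced points on the straight line from z_start to z_end.
alphas = torch.linspace(0.0, 1.0, steps=8).unsqueeze(1)  # shape (8, 1)
z_path = (1 - alphas) * z_start + alphas * z_end          # shape (8, 32)

with torch.no_grad():
    frames = decoder(z_path)                              # shape (8, 784)

# Distance between consecutive decoded frames along the path.
step_sizes = (frames[1:] - frames[:-1]).norm(dim=1)
print(step_sizes)
```

For a well-regularized VAE, consecutive frames should change gradually and every intermediate frame should still look like plausible data, which corresponds to continuity and completeness respectively.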
PyTorch Example: Variational Autoencoder
Here’s a VAE implementation for MNIST, incorporating the derived loss function:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# Load MNIST dataset
transform = transforms.ToTensor()
train_dataset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)

# Define VAE
class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU()
        )
        # Two heads: mean and log-variance of q(z | x)
        self.fc_mu = nn.Linear(64, 32)
        self.fc_logvar = nn.Linear(64, 32)
        self.decoder = nn.Sequential(
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, 28 * 28),
            nn.Sigmoid()
        )

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps, with eps ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        reconstructed = self.decoder(z)
        return reconstructed, mu, logvar

# Loss function: negative ELBO, summed over the batch
def vae_loss(reconstructed, x, mu, logvar):
    recon_loss = nn.functional.binary_cross_entropy(reconstructed, x.view(-1, 28 * 28), reduction='sum')
    kl_div = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_div

model = VAE()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(10):
    total_loss = 0
    for data, _ in train_loader:
        optimizer.zero_grad()
        reconstructed, mu, logvar = model(data)
        loss = vae_loss(reconstructed, data, mu, logvar)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    # Average loss per training example
    print(f'Epoch {epoch+1}, Loss: {total_loss / len(train_loader.dataset):.4f}')

This VAE learns a probabilistic latent space, enabling both reconstruction and generation of MNIST digits.
5. Language Modeling with RNNs
The lecture mentions language modeling using recurrent neural networks (RNNs) to model the probability distribution $p(x_t \mid x_1, \dots, x_{t-1})$. RNNs are suited for sequential data, predicting the next token in a sequence based on previous tokens.
PyTorch Example: Simple RNN Language Model
Here’s a basic RNN language model for character-level text generation:
import torch
import torch.nn as nn
import torch.optim as optim

# Sample text data
text = "hello world"
chars = sorted(list(set(text)))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}

# Prepare data: the target is the input shifted by one character
input_seq = torch.tensor([char_to_idx[c] for c in text[:-1]], dtype=torch.long)
target_seq = torch.tensor([char_to_idx[c] for c in text[1:]], dtype=torch.long)

# Define RNN model
class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super(RNNLanguageModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden):
        x = self.embedding(x)
        out, hidden = self.rnn(x, hidden)
        out = self.fc(out)
        return out, hidden

model = RNNLanguageModel(len(chars), 10, 20)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
for epoch in range(1000):
    hidden = None
    optimizer.zero_grad()
    outputs, hidden = model(input_seq.unsqueeze(0), hidden)
    loss = criterion(outputs.squeeze(0), target_seq)
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item():.4f}')

This model learns to predict the next character in the sequence, demonstrating generative modeling for text.
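Generating text from such a model means sampling one character at a time and feeding each sample back in as the next input. The sketch below shows that loop with untrained stand-in modules that mirror the layout of the model above; the function name `generate` and the sampling length are assumptions for illustration, and the output is random until the modules are trained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

chars = sorted(set("hello world"))
char_to_idx = {c: i for i, c in enumerate(chars)}

# Untrained stand-in modules (hypothetical) with the same layout as the model above.
embedding = nn.Embedding(len(chars), 10)
rnn = nn.RNN(10, 20, batch_first=True)
fc = nn.Linear(20, len(chars))

def generate(start_char, length):
    """Sample `length` characters autoregressively, one step at a time."""
    idx = torch.tensor([[char_to_idx[start_char]]])  # shape (1, 1)
    hidden = None
    out_chars = [start_char]
    with torch.no_grad():
        for _ in range(length):
            out, hidden = rnn(embedding(idx), hidden)
            # Next-character distribution p(x_t | x_1, ..., x_{t-1})
            probs = torch.softmax(fc(out[:, -1]), dim=-1)  # shape (1, vocab)
            idx = torch.multinomial(probs, num_samples=1)  # shape (1, 1)
            out_chars.append(chars[idx.item()])
    return "".join(out_chars)

sample = generate("h", 20)
print(sample)
```

The hidden state carries the context forward, so each step only needs the most recently sampled character as input.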
6. Latent Space Properties
For VAEs and similar models, the latent space should exhibit:
- Independence: Latent variables should be uncorrelated, often enforced by a diagonal prior.
- Disentanglement: Each latent dimension should control a distinct feature of the data.
These properties ensure that the latent space is interpretable and useful for generation. For example, the encoder in a VAE computes $\mu$ and $\sigma$, and the sampling layer uses:
$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
to generate latent vectors.
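One rough diagnostic for the independence property is the correlation matrix of the latent codes. The sketch below uses random draws as a stand-in for encoder means collected over a dataset; the tensor `latents` and its size are assumptions for illustration. Note that low correlation is only a necessary sign, not proof of independence or disentanglement.

```python
import torch

torch.manual_seed(0)

# Stand-in for encoder means collected over a dataset (assumed): 5000 codes, 4 dims.
latents = torch.randn(5000, 4)

# torch.corrcoef expects variables in rows, so transpose to shape (4, 5000).
corr = torch.corrcoef(latents.T)

# Largest absolute off-diagonal correlation; values near 0 mean uncorrelated dims.
off_diag = corr - torch.diag(torch.diag(corr))
print(off_diag.abs().max())
```

With real encoder outputs in place of `latents`, large off-diagonal entries would suggest that two latent dimensions are encoding overlapping information.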
7. Conclusion
Self-supervised learning, through generative and discriminative models, enables powerful representation learning without labeled data. Autoencoders provide a foundation for learning compressed representations, while VAEs introduce probabilistic modeling for generation: the KL term of the ELBO keeps the latent space structured, and the reparameterization trick, $z = \mu + \sigma \odot \epsilon$, makes sampling differentiable. Language modeling with RNNs showcases generative modeling for sequential data. Using PyTorch, we can implement these concepts efficiently, leveraging neural networks to uncover hidden structures in data.
By understanding these models, their derivations, and their implementations, we can harness the potential of self-supervised learning for a wide range of applications, from image generation to natural language processing.
Thank you for reading! Feel free to experiment with the provided PyTorch code and explore further.