Understanding Self-Supervised Learning II

Published on Tuesday, 08-07-2025

#Tutorials

(Adapted from CS224N and MIT6S191)

Self-supervised learning (SSL) has emerged as a powerful paradigm in machine learning, enabling models to learn meaningful representations from unlabeled data. In this blog post, we’ll explore key concepts from a lecture on self-supervised learning, focusing on generative and discriminative models, autoencoders, variational autoencoders (VAEs), and their implementations using PyTorch. We’ll break down each concept clearly, provide mathematical derivations for VAEs, and include code snippets where applicable.

1. Supervised vs. Unsupervised Learning

Supervised Learning

Supervised learning involves training models on labeled data, where each input $x$ is paired with a label $y$. The goal is to learn a function that maps inputs to outputs, such as in classification or regression tasks.

- Data: Pairs of $(x, y)$, where $y$ is the label.
- Goal: Learn a mapping $f: x \rightarrow y$.
- Examples: Image classification, object detection, semantic segmentation.

Unsupervised Learning

Unsupervised learning deals with unlabeled data, where only inputs $x$ are available. The objective is to uncover hidden structures or patterns in the data.

- Data: Only $x$, no labels.
- Goal: Discover underlying structures, such as clusters or latent representations.
- Examples: Clustering, dimensionality reduction, density estimation.
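
Clustering, the first example in the list above, can be sketched in a few lines. The following is a minimal 1D k-means sketch using only the Python standard library; the function name `kmeans_1d`, the synthetic data, and the choice of two clusters are illustrative assumptions rather than anything taken from the lecture.

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on 1D points; returns the sorted centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise centroids from the data
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return sorted(centroids)

rng = random.Random(0)
# Two unlabeled groups, centred near 0 and near 10 (hypothetical data)
points = [rng.gauss(0.0, 1.0) for _ in range(100)]
points += [rng.gauss(10.0, 1.0) for _ in range(100)]

centroids = kmeans_1d(points, k=2)
print(centroids)
```

No labels are used anywhere: the two groups are recovered purely from the structure of $x$.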

PyTorch Example: Simple Supervised Learning

Here’s a basic PyTorch example for supervised learning using a linear model for regression:

import torch
import torch.nn as nn
import torch.optim as optim

# Generate synthetic data
X = torch.randn(100, 1)
y = 3 * X + 2 + torch.randn(100, 1) * 0.1

# Define a linear model
model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(1000):
    optimizer.zero_grad()
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()

print(f'Learned parameters: {model.weight.item():.2f}, {model.bias.item():.2f}')

This code trains a linear model to predict $y$ from $x$, demonstrating supervised learning. The learned weight and bias should land close to the true values 3 and 2 used to generate the data.

2. Generative vs. Discriminative Models

Discriminative Models

Discriminative models learn the conditional probability distribution $p(y|x)$, focusing on mapping inputs to labels. They are typically used in supervised learning tasks.

- Definition: Learn $p(y|x)$.
- Examples: Logistic regression, support vector machines, neural classifiers.
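
To make $p(y|x)$ concrete, here is a minimal logistic regression sketch in PyTorch. The toy task (label 1 when $x > 0$), the learning rate, and the number of steps are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic binary data: class 1 when x > 0 (hypothetical toy task)
X = torch.randn(200, 1)
y = (X > 0).float()

# Logistic regression: a linear layer whose output is the logit of p(y=1|x)
clf = nn.Linear(1, 1)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(clf.parameters(), lr=0.5)

for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(clf(X), y)
    loss.backward()
    optimizer.step()

# The sigmoid of the logit is the model's estimate of p(y=1|x)
with torch.no_grad():
    p_pos = torch.sigmoid(clf(torch.tensor([[2.0]]))).item()
    p_neg = torch.sigmoid(clf(torch.tensor([[-2.0]]))).item()
print(f"p(y=1|x=2) = {p_pos:.3f}, p(y=1|x=-2) = {p_neg:.3f}")
```

The model only learns where the boundary between classes lies; it says nothing about how the inputs $x$ themselves are distributed, which is the key difference from a generative model.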

Generative Models

Generative models learn the data distribution $p(x)$, or the joint distribution $p(x, y)$ when labels are available, allowing them to generate new data samples. In self-supervised learning, they are used to model the data distribution without labels.

- Definition: Learn $p(x)$ or $p(x, y)$.
- Examples: Variational autoencoders, generative adversarial networks (GANs).

Conditional Generative Models

These models learn $p(x|y)$, generating data conditioned on specific labels or inputs, such as text-to-image models.
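
As a deliberately tiny illustration of $p(x|y)$, the sketch below fits one Gaussian per label and then samples conditioned on a chosen label. The data, the two labels, and the helper name `sample_given_label` are hypothetical; real conditional generators such as text-to-image models use neural networks rather than a per-class Gaussian.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical labeled 1D data: label 0 clusters near -2, label 1 near +3
data = [(rng.gauss(-2.0, 0.5), 0) for _ in range(500)]
data += [(rng.gauss(3.0, 0.5), 1) for _ in range(500)]

# "Training": estimate a Gaussian p(x|y) per label by maximum likelihood
params = {}
for label in (0, 1):
    xs = [x for x, y in data if y == label]
    params[label] = (statistics.fmean(xs), statistics.pstdev(xs))

def sample_given_label(label, n):
    """Draw n samples from the fitted p(x|y=label)."""
    mean, std = params[label]
    return [rng.gauss(mean, std) for _ in range(n)]

samples_y1 = sample_given_label(1, 1000)
print(statistics.fmean(samples_y1))
```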

PyTorch Example: Simple Generative Model

Here’s a basic generative model using a neural network to model a 2D Gaussian distribution:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributions as dist

# Generate synthetic 2D Gaussian data
data = torch.randn(1000, 2) * 0.5 + torch.tensor([2.0, 3.0])

# Define a generative model
class GenerativeModel(nn.Module):
    def __init__(self):
        super(GenerativeModel, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(2, 64),
            nn.ReLU(),
            nn.Linear(64, 2)
        )
    
    def forward(self, z):
        return self.fc(z)

model = GenerativeModel()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(1000):
    optimizer.zero_grad()
    z = torch.randn(1000, 2)  # Latent noise
    generated = model(z)
    loss = -dist.Normal(generated, 0.5).log_prob(data).mean()
    loss.backward()
    optimizer.step()

print("Training complete!")

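This toy model maps latent noise to 2D points and is trained with a Gaussian log-likelihood, illustrating the basic mechanics of generative modeling. Note that because each noise vector $z$ is paired with an arbitrary data point, this objective mainly pulls the outputs toward the data mean (around $(2, 3)$) rather than reproducing the full spread of the distribution. The VAE in Section 4 addresses this by learning how data points map to latent codes.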

3. Autoencoders

Autoencoders are neural networks designed to learn a compressed representation of the input data in an unsupervised manner. They consist of an encoder that maps the input to a latent space and a decoder that reconstructs the input from the latent representation.

- Objective: Minimize reconstruction loss, i.e., ensure the output is as close as possible to the input.
- Applications: Dimensionality reduction, denoising, feature learning.

PyTorch Example: Autoencoder

Here’s an implementation of a simple autoencoder for MNIST digits:

import torch
import torch.nn as nn
import torch.optim as optim  # needed for optim.Adam below
import torchvision
import torchvision.transforms as transforms

# Load MNIST dataset
transform = transforms.ToTensor()
train_dataset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)

# Define autoencoder
class Autoencoder(nn.Module):
    def __init__(self):
        super(Autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 32)
        )
        self.decoder = nn.Sequential(
            nn.Linear(32, 128),
            nn.ReLU(),
            nn.Linear(128, 28 * 28),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        x = x.view(-1, 28 * 28)
        latent = self.encoder(x)
        reconstructed = self.decoder(latent)
        return reconstructed

model = Autoencoder()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(10):
    for data, _ in train_loader:
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, data.view(-1, 28 * 28))
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

This autoencoder compresses MNIST images into a 32-dimensional latent space and reconstructs them, minimizing the mean squared error.

4. Variational Autoencoders (VAEs)

Variational Autoencoders (VAEs) extend autoencoders by introducing a probabilistic approach to the latent space. Instead of mapping inputs to fixed latent vectors, VAEs learn a distribution (typically Gaussian) characterized by a mean $\mu$ and standard deviation $\sigma$. This allows for sampling from the latent space to generate new data.

Key Components:
- Encoder: Outputs $\mu$ and $\sigma$ for the latent distribution.
- Sampling Layer: Samples $z = \mu + \sigma \cdot \epsilon$, where $\epsilon \sim \mathcal{N}(0,1)$.
- Decoder: Reconstructs the input from $z$.
- Loss Function: Combines reconstruction loss (e.g., MSE) and a regularization term (KL-divergence) to ensure the latent distribution is close to a prior (usually $\mathcal{N}(0,1)$).

Why VAEs for Sampling?

Standard autoencoders (AEs) map inputs to a fixed point in the latent space, which is useful for reconstruction but not for generating new data. The latent space of an AE is often unstructured, with no guarantee that sampling arbitrary points will produce meaningful outputs. VAEs address this by modeling the latent space as a probability distribution, typically a Gaussian, defined by $\mu$ and $\sigma$. This probabilistic approach ensures:

- Smooth Latent Space: The latent space is continuous, meaning points close to each other produce similar outputs when decoded.
- Generative Capability: By sampling from the prior distribution (e.g., $\mathcal{N}(0,1)$), VAEs can generate new data points that resemble the training data.
- Regularization: The KL-divergence term in the loss function encourages the latent distribution to be close to a standard normal distribution, preventing the model from overfitting to specific points and ensuring the latent space is well-behaved for sampling.

For example, in an AE, decoding a random point in the latent space might produce gibberish because the latent space lacks structure. In a VAE, the encoder learns $q_{\phi}(z|x)$ and training samples $z = \mu + \sigma \cdot \epsilon$ around each encoding, while the KL term keeps these distributions close to the prior. Latent vectors drawn from the prior are therefore much more likely to decode to meaningful outputs.

Derivations of VAEs

To understand VAEs, we derive their objective function, the Evidence Lower Bound (ELBO), and explain the reparameterization trick, which enables training with gradient-based methods.

Objective: Maximizing the Marginal Likelihood

The goal of a VAE is to model the data distribution $p(x)$ by introducing a latent variable $z$. We want to maximize the marginal likelihood $p_{\theta}(x)$, which is the probability of the observed data $x$ under the model parameterized by $\theta$:

$$p_{\theta}(x) = \int p_{\theta}(x|z) p(z) \, dz$$

where $p(z)$ is a prior distribution (typically $\mathcal{N}(0,1)$), and $p_{\theta}(x|z)$ is the likelihood of the data given the latent variable, modeled by the decoder.

However, computing this integral is intractable: the decoder is a nonlinear neural network, so there is no closed form, and numerical integration over a high-dimensional latent space is infeasible. VAEs use variational inference to approximate the true posterior $p_{\theta}(z|x)$ with a variational distribution $q_{\phi}(z|x)$, parameterized by $\phi$ (the encoder).
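
To get a feel for the integral, the sketch below estimates $p_{\theta}(x)$ by naive Monte Carlo in a toy model where the answer is known. The model is an assumption made for illustration: $z \sim \mathcal{N}(0,1)$ and $x|z \sim \mathcal{N}(z, 1)$, so the exact marginal is $x \sim \mathcal{N}(0, 2)$.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

rng = random.Random(0)
x = 1.5
n_samples = 100_000

# Naive Monte Carlo: average the likelihood p(x|z) over samples z ~ p(z)
estimate = sum(
    normal_pdf(x, rng.gauss(0.0, 1.0), 1.0) for _ in range(n_samples)
) / n_samples

# For this toy model the marginal is known exactly: x ~ N(0, 2)
exact = normal_pdf(x, 0.0, 2.0)
print(f"Monte Carlo: {estimate:.4f}, exact: {exact:.4f}")
```

In one dimension this works well. With a neural decoder and a high-dimensional $z$, most prior samples typically explain a given $x$ very poorly, so the same estimator tends to need an impractical number of samples, which is one way to see why a learned $q_{\phi}(z|x)$ is useful.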

Evidence Lower Bound (ELBO)

We aim to maximize the log-likelihood $\log p_{\theta}(x)$. Using variational inference, we derive a lower bound on this quantity, known as the ELBO. Starting with the log-likelihood:

$$\log p_{\theta}(x) = \log \int p_{\theta}(x|z) p(z) \, dz$$

We introduce $q_{\phi}(z|x)$ and use Jensen’s inequality to derive the ELBO:

$$\log p_{\theta}(x) = \log \int q_{\phi}(z|x) \frac{p_{\theta}(x|z) p(z)}{q_{\phi}(z|x)} \, dz \geq \int q_{\phi}(z|x) \log \frac{p_{\theta}(x|z) p(z)}{q_{\phi}(z|x)} \, dz$$

This gives the ELBO:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_{\phi}(z|x)} [\log p_{\theta}(x|z)] - \text{KL}(q_{\phi}(z|x) \| p(z))$$
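
The step from the Jensen bound to the two-term form above comes from splitting the logarithm. The last line, included for reference, is the standard identity for the gap between the log-likelihood and the ELBO; it is not derived in the text above, and it shows that the bound is tight exactly when $q_{\phi}(z|x)$ matches the true posterior.

```latex
\begin{aligned}
\int q_{\phi}(z|x) \log \frac{p_{\theta}(x|z)\, p(z)}{q_{\phi}(z|x)} \, dz
  &= \int q_{\phi}(z|x) \log p_{\theta}(x|z) \, dz
   - \int q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p(z)} \, dz \\
  &= \mathbb{E}_{q_{\phi}(z|x)} [\log p_{\theta}(x|z)]
   - \text{KL}(q_{\phi}(z|x) \| p(z)) \\[4pt]
\log p_{\theta}(x) - \mathcal{L}(\theta, \phi; x)
  &= \text{KL}(q_{\phi}(z|x) \| p_{\theta}(z|x)) \;\geq\; 0
\end{aligned}
```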

The ELBO consists of two terms:

- Reconstruction term: $\mathbb{E}_{q_{\phi}(z|x)} [\log p_{\theta}(x|z)]$, which encourages the decoded samples to match the input data. Its negative is the reconstruction loss: with a Gaussian likelihood it reduces to mean squared error (up to scaling and additive constants), and with a Bernoulli likelihood to binary cross-entropy.
- KL-Divergence: $\text{KL}(q_{\phi}(z|x) \| p(z))$, which regularizes $q_{\phi}(z|x)$ to be close to the prior $p(z)$, ensuring a structured latent space.

Assuming $q_{\phi}(z|x) = \mathcal{N}(\mu, \sigma^2)$ and $p(z) = \mathcal{N}(0,1)$, the KL-divergence term has a closed-form expression:

$$\text{KL}(q_{\phi}(z|x) \| p(z)) = -\frac{1}{2} \sum_{i=1}^d \left( 1 + \log \sigma_i^2 - \mu_i^2 - \sigma_i^2 \right)$$

where $d$ is the dimensionality of the latent space, and $\mu_i$, $\sigma_i$ are the mean and standard deviation of the $i$-th dimension.
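
As a sanity check, the closed-form expression can be compared against PyTorch's own KL computation. This is a small sketch with made-up encoder outputs (`mu` and `logvar` are random stand-ins, and the 4-dimensional latent is an arbitrary choice).

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

# Hypothetical encoder outputs for one input with a 4-dimensional latent
mu = torch.randn(4)
logvar = torch.randn(4)          # log(sigma^2), as in typical VAE code
sigma = torch.exp(0.5 * logvar)

# Closed-form expression from the text, summed over latent dimensions
kl_closed_form = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

# Reference: per-dimension KL between N(mu_i, sigma_i^2) and N(0, 1), summed
prior = Normal(torch.zeros(4), torch.ones(4))
kl_reference = kl_divergence(Normal(mu, sigma), prior).sum()

print(kl_closed_form.item(), kl_reference.item())
```

The two numbers should agree up to floating-point error. Parameterizing the encoder by $\log \sigma^2$ rather than $\sigma$ is a common convention because it keeps $\sigma$ positive without any constraint.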

Reparameterization Trick

To make the ELBO differentiable, we need to sample $z \sim q_{\phi}(z|x)$ in a way that allows gradients to flow through the sampling process. Direct sampling from $\mathcal{N}(\mu, \sigma^2)$ is not differentiable, so we use the reparameterization trick:

$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0,1)$$

Here, $\mu$ and $\sigma$ are outputs of the encoder, and $\epsilon$ is a random variable. This reparameterization allows gradients to propagate through $\mu$ and $\sigma$, enabling optimization via backpropagation.
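
The effect on gradients can be checked directly with autograd. In this sketch, `mu` and `logvar` are hand-picked stand-ins for encoder outputs and the squared-norm loss is arbitrary; only the gradient flow matters.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

# Stand-ins for encoder outputs; requires_grad mimics trainable parameters
mu = torch.tensor([0.5, -1.0], requires_grad=True)
logvar = torch.tensor([0.0, 0.2], requires_grad=True)
sigma = torch.exp(0.5 * logvar)

# Direct sampling cuts the graph; the reparameterized rsample() keeps it
plain = Normal(mu, sigma).sample()
reparam = Normal(mu, sigma).rsample()
print(plain.requires_grad, reparam.requires_grad)  # False True

# Manual reparameterization: the randomness lives entirely in eps
eps = torch.randn(2)
z = mu + sigma * eps

# Any differentiable function of z now has gradients w.r.t. mu and logvar
loss = (z ** 2).sum()
loss.backward()

print(mu.grad)      # equals 2 * z
print(logvar.grad)  # equals 2 * z * 0.5 * sigma * eps
```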

Final Loss Function

The VAE loss is the negative ELBO, combining the reconstruction loss and KL-divergence:

$$\mathcal{L}_{\text{VAE}} = -\mathbb{E}_{q_{\phi}(z|x)} [\log p_{\theta}(x|z)] + \text{KL}(q_{\phi}(z|x) \| p(z))$$

In practice, for a batch of data, the expectation is approximated by a single sample, and the reconstruction loss is computed as binary cross-entropy or mean squared error.

Continuity and Completeness

VAEs aim for:

- Continuity: Points close in latent space produce similar decoded outputs.
- Completeness: Sampling from the latent space yields meaningful outputs.
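
One common way to probe continuity is to decode points along a straight line between two latent codes. The sketch below uses an untrained stand-in decoder (32-dimensional latent and 28x28 output, matching the sizes used in this post), so it demonstrates the mechanics only; the decoded frames are not meaningful digits.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Untrained stand-in decoder; a trained VAE decoder would be used in practice
decoder = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 28 * 28), nn.Sigmoid(),
)

z_a, z_b = torch.randn(1, 32), torch.randn(1, 32)  # two latent codes
t = torch.linspace(0, 1, steps=8).unsqueeze(1)     # weights, shape (8, 1)
z_path = (1 - t) * z_a + t * z_b                   # line from z_a to z_b, shape (8, 32)

with torch.no_grad():
    frames = decoder(z_path).view(-1, 28, 28)      # 8 decoded images

# Distance between consecutive frames: small values indicate a smooth path
step_sizes = (frames[1:] - frames[:-1]).flatten(1).norm(dim=1)
print(frames.shape, step_sizes)
```

With a trained VAE, plotting `frames` side by side would be expected to show one digit morphing gradually into another.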

PyTorch Example: Variational Autoencoder

Here’s a VAE implementation for MNIST, incorporating the derived loss function:

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# Load MNIST dataset
transform = transforms.ToTensor()
train_dataset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)

# Define VAE
class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(64, 32)
        self.fc_logvar = nn.Linear(64, 32)
        self.decoder = nn.Sequential(
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, 28 * 28),
            nn.Sigmoid()
        )
    
    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std
    
    def forward(self, x):
        x = x.view(-1, 28 * 28)
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        reconstructed = self.decoder(z)
        return reconstructed, mu, logvar

# Loss function
def vae_loss(reconstructed, x, mu, logvar):
    recon_loss = nn.functional.binary_cross_entropy(reconstructed, x.view(-1, 28 * 28), reduction='sum')
    kl_div = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_div

model = VAE()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(10):
    total_loss = 0
    for data, _ in train_loader:
        optimizer.zero_grad()
        reconstructed, mu, logvar = model(data)
        loss = vae_loss(reconstructed, data, mu, logvar)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f'Epoch {epoch+1}, Loss: {total_loss / len(train_loader.dataset):.4f}')

This VAE learns a probabilistic latent space, enabling both reconstruction and generation of MNIST digits.
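
Generation then amounts to drawing $z$ from the prior and running only the decoder. The helper below is a hypothetical sketch (`generate` is not part of the code above), shown with an untrained stand-in decoder that mirrors the layer sizes of the VAE decoder so that the block runs on its own.

```python
import torch
import torch.nn as nn

def generate(decoder, n_samples, latent_dim=32):
    """Decode n_samples latent vectors drawn from the prior N(0, I)."""
    z = torch.randn(n_samples, latent_dim)
    with torch.no_grad():
        return decoder(z).view(-1, 28, 28)

# Untrained stand-in with the same layer sizes as the VAE decoder above
stand_in_decoder = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 28 * 28), nn.Sigmoid(),
)

images = generate(stand_in_decoder, n_samples=16)
print(images.shape)  # torch.Size([16, 28, 28])
```

After training the VAE above, calling `generate(model.decoder, 16)` would be the analogous step for sampling new digits.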

5. Language Modeling with RNNs

The lecture mentions language modeling using recurrent neural networks (RNNs) to model the probability distribution $p(x)$. RNNs are suited for sequential data, predicting the next token in a sequence based on previous tokens.
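
Next-token prediction models $p(x)$ through the chain rule, $p(x_1, \dots, x_T) = \prod_t p(x_t | x_1, \dots, x_{t-1})$. The sketch below illustrates this with a count-based character bigram model, a deliberate simplification that conditions only on the previous character, whereas an RNN summarizes the whole history in its hidden state. The helper names are illustrative.

```python
from collections import Counter, defaultdict
import math

text = "hello world"

# Count how often each character follows each other character
follow_counts = defaultdict(Counter)
for prev, nxt in zip(text[:-1], text[1:]):
    follow_counts[prev][nxt] += 1

def next_char_prob(prev, nxt):
    """Estimate p(next | prev) from counts (no smoothing)."""
    total = sum(follow_counts[prev].values())
    return follow_counts[prev][nxt] / total if total else 0.0

def sequence_log_prob(seq):
    """Chain rule: sum of log p(x_t | x_{t-1}), taking the first char as given.

    Raises ValueError for transitions never seen in the text (log of zero).
    """
    return sum(math.log(next_char_prob(p, n)) for p, n in zip(seq[:-1], seq[1:]))

print(next_char_prob("l", "l"))     # 'l' is followed by 'l', 'o', 'd' once each
print(sequence_log_prob("hello"))
```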

PyTorch Example: Simple RNN Language Model

Here’s a basic RNN language model for character-level text generation:

import torch
import torch.nn as nn
import torch.optim as optim

# Sample text data
text = "hello world"
chars = sorted(list(set(text)))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}

# Prepare data
input_seq = torch.tensor([char_to_idx[c] for c in text[:-1]], dtype=torch.long)
target_seq = torch.tensor([char_to_idx[c] for c in text[1:]], dtype=torch.long)

# Define RNN model
class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super(RNNLanguageModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
    
    def forward(self, x, hidden):
        x = self.embedding(x)
        out, hidden = self.rnn(x, hidden)
        out = self.fc(out)
        return out, hidden

model = RNNLanguageModel(len(chars), 10, 20)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
for epoch in range(1000):
    hidden = None
    optimizer.zero_grad()
    outputs, hidden = model(input_seq.unsqueeze(0), hidden)
    loss = criterion(outputs.squeeze(0), target_seq)
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item():.4f}')

This model learns to predict the next character in the sequence, demonstrating generative modeling for text.

6. Latent Space Properties

For VAEs and similar models, the latent space should exhibit:

- Independence: Latent variables should be uncorrelated, often encouraged by a prior with diagonal covariance.
- Disentanglement: Each latent dimension should control a distinct feature of the data.

These properties ensure that the latent space is interpretable and useful for generation. For example, the encoder in a VAE computes $q_{\phi}(z|x)$, and the sampling layer uses:

$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0,1)$$

to generate latent vectors.
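
A simple diagnostic for the independence property is the empirical correlation between latent dimensions. The sketch below uses samples from $\mathcal{N}(0, I)$ as hypothetical latent codes; in practice one would pass the encoder means for a batch of inputs instead.

```python
import torch

torch.manual_seed(0)

# Hypothetical latent codes: 5000 samples from a 4-dimensional N(0, I)
z = torch.randn(5000, 4)

# torch.corrcoef treats rows as variables, so pass the transpose (4, 5000)
corr = torch.corrcoef(z.T)

# Largest absolute correlation between two different latent dimensions
off_diag = corr - torch.diag(torch.diag(corr))
max_corr = off_diag.abs().max().item()
print(f"max |correlation| between dimensions: {max_corr:.3f}")
```

Low correlation is only a necessary condition: uncorrelated dimensions are not guaranteed to be independent, and neither property by itself implies disentanglement.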

7. Conclusion

Self-supervised learning, through generative and discriminative models, enables powerful representation learning without labeled data. Autoencoders provide a foundation for learning compressed representations, while VAEs introduce probabilistic modeling for generation: the KL term of the ELBO keeps the latent space structured, and the reparameterization trick $z = \mu + \sigma \cdot \epsilon$ makes the objective trainable by backpropagation. Language modeling with RNNs showcases generative modeling for sequential data. Using PyTorch, we can implement these concepts efficiently, leveraging neural networks to uncover hidden structures in data.

By understanding these models, their derivations, and their implementations, we can harness the potential of self-supervised learning for a wide range of applications, from image generation to natural language processing.

Thank you for reading! Feel free to experiment with the provided PyTorch code and explore further.