VAE — Fashion-MNIST (PyTorch)
Author: Felipe Maluli de Carvalho Dias
Course Activity: Variational Autoencoder (VAE), Individual Assignment
Before continuing, unzip the fashion-mnist_train.csv file in the data folder; the uncompressed file is too large to commit to Git.
This page is the report for the VAE activity. The full implementation and all experiments are in solution_exercises.ipynb, and every figure below was generated from that notebook and saved into assets/.
1. Assignment & Scope
Dataset: Fashion-MNIST
Model: Variational Autoencoder (VAE) implemented in PyTorch
Delivery: GitHub Pages (this report) + repository with code and assets
From the assignment brief, this work implements:
- ✅ Data preparation: load, normalize to [0,1], train/validation split
- ✅ VAE model: encoder + decoder + reparameterization trick
- ✅ Training: ELBO loss (reconstruction + KL), monitoring of losses and reconstructions
- ✅ Evaluation: reconstructions on validation set and sampling from the prior
- ✅ Visualization: original vs reconstructed images and latent space plots
- ✅ Report: this page, summarizing findings, challenges and insights
Delivery/format constraints (individual work, GitHub Pages link, AI usage disclosure, etc.) are respected here; this text is what will be submitted as the report.
2. Data Preparation
- Loaded Fashion-MNIST (28×28 grayscale images, 10 classes).
- Converted images to tensors and normalized pixel values to the [0,1] range.
- Split the data into:
  - Train: 90%
  - Validation: 10%
This matches the requirement to load, normalize, and split the dataset before training the VAE.
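A minimal sketch of the normalization and 90/10 split described above. It is not the notebook's actual code: the helper name `make_splits`, the seed, and the synthetic stand-in data are illustrative assumptions, and the CSV layout mentioned in the comments (a `label` column followed by 784 pixel columns) is assumed rather than confirmed here.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, random_split


def make_splits(pixels, labels, val_fraction=0.1, seed=0):
    """Scale 0-255 pixel values to [0, 1] and split into train/validation subsets."""
    x = torch.as_tensor(pixels, dtype=torch.float32) / 255.0  # shape (N, 784)
    y = torch.as_tensor(labels, dtype=torch.long)
    dataset = TensorDataset(x, y)
    n_val = int(round(val_fraction * len(dataset)))
    n_train = len(dataset) - n_val
    generator = torch.Generator().manual_seed(seed)  # makes the split reproducible
    return random_split(dataset, [n_train, n_val], generator=generator)


# Synthetic stand-in so the sketch runs without the data file.
# With the real CSV one would typically do something like (assumed path and columns):
#   df = pd.read_csv("data/fashion-mnist_train.csv")
#   labels = df["label"].to_numpy()
#   pixels = df.drop(columns="label").to_numpy()
rng = np.random.default_rng(0)
fake_pixels = rng.integers(0, 256, size=(1000, 784), dtype=np.uint8)
fake_labels = rng.integers(0, 10, size=1000)

train_set, val_set = make_splits(fake_pixels, fake_labels)
print(len(train_set), len(val_set))
```

Fixing the generator seed keeps the validation set identical across runs, which makes reconstruction grids comparable between experiments.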
3. VAE Architecture & Implementation
All models are implemented in PyTorch.
Encoder
- Flattens 28×28 images into vectors.
- Applies a sequence of linear layers with non-linearities to produce:
  - \(\mu(x)\): mean vector of the approximate posterior \(q_\phi(z\mid x)\)
  - \(\log \sigma^2(x)\): log-variance vector
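The encoder described above could look roughly like the following sketch. The layer sizes (`hidden_dim=256`, `latent_dim=2`), the single hidden layer, and the ReLU activation are assumptions chosen for illustration; the notebook may use different values.

```python
import torch
from torch import nn


class Encoder(nn.Module):
    """Maps an image to the mean and log-variance of q(z | x)."""

    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(),                      # (B, 1, 28, 28) -> (B, 784)
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
        )
        # Two separate heads share the same hidden features.
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.backbone(x)
        return self.mu_head(h), self.logvar_head(h)


encoder = Encoder()
mu, logvar = encoder(torch.rand(8, 1, 28, 28))
print(mu.shape, logvar.shape)
```

Predicting the log-variance rather than the variance keeps the output unconstrained, so no positivity clamp is needed on the layer.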
Reparameterization Trick
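To sample \(z\) in a differentiable way, the noise is drawn separately and then shifted and scaled by the encoder outputs:
$$ z = \mu(x) + \sigma(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad \sigma(x) = \exp\left(\tfrac{1}{2} \log \sigma^2(x)\right) $$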
This allows gradients to flow through \(\mu\) and \(\sigma\) while still sampling stochastically, satisfying the assignment requirement to implement the reparameterization trick.
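As a small sketch of this step, the function below draws the noise outside the computation path of \(\mu\) and \(\log \sigma^2\) and then checks that gradients reach both. The function name `reparameterize` is an illustrative choice, not necessarily the one used in the notebook.

```python
import torch


def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)   # sigma recovered from the log-variance
    eps = torch.randn_like(std)     # the randomness does not depend on the parameters
    return mu + eps * std


mu = torch.zeros(4, 2, requires_grad=True)
logvar = torch.zeros(4, 2, requires_grad=True)

z = reparameterize(mu, logvar)
z.sum().backward()

print(mu.grad)      # dz/dmu is 1 for every element
print(logvar.grad)  # dz/dlogvar is 0.5 * sigma * eps, so it varies with the noise
```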
Decoder
- Takes latent vector \(z\) and maps it back to the image space using linear layers with non-linearities.
- Outputs logits for each pixel; the reconstruction loss is computed with BCEWithLogitsLoss, appropriate for normalized pixel intensities in \([0,1]\).
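A sketch of a decoder in this style is shown below. The hidden size and activation are assumptions; the point is only that the network returns raw logits, that the sigmoid is folded into `BCEWithLogitsLoss` during training, and that it is applied explicitly only when images are needed for display.

```python
import torch
from torch import nn


class Decoder(nn.Module):
    """Maps a latent vector to one logit per pixel."""

    def __init__(self, latent_dim=2, hidden_dim=256, output_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),  # no sigmoid here: outputs are logits
        )

    def forward(self, z):
        return self.net(z)


decoder = Decoder()
z = torch.randn(8, 2)
x = torch.rand(8, 784)  # stand-in for flattened images in [0, 1]

logits = decoder(z)
# Summed over pixels, averaged over the batch.
recon = nn.BCEWithLogitsLoss(reduction="sum")(logits, x) / x.size(0)
# For visualization only: convert logits to pixel intensities.
images = torch.sigmoid(logits).view(-1, 1, 28, 28)
print(recon.item(), images.shape)
```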
Loss (ELBO)
Training maximizes the Evidence Lower Bound (ELBO), which in practice means minimizing its negative as the loss:
$$ \text{ELBO}(x) = -\text{BCE}(x, \hat{x}) - \beta \, D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right) $$
- Reconstruction loss: BCEWithLogits between original image and reconstruction.
- Regularization: KL divergence between \(q_\phi(z \mid x)\) and \(p(z) = \mathcal{N}(0, I)\).
- The code uses a standard \(\beta = 1\) VAE (no special weighting) unless configured otherwise.
This section ensures that the core requirement “VAE: encoder, decoder, reparameterization and ELBO” is fully implemented.
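The loss can be sketched as follows, using the closed form of the KL divergence between a diagonal Gaussian and \(\mathcal{N}(0, I)\), namely \(-\tfrac{1}{2} \sum_j \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)\). The function name `negative_elbo` and the choice to sum over pixels and average over the batch are assumptions; the notebook may reduce differently.

```python
import torch
import torch.nn.functional as F


def negative_elbo(logits, x, mu, logvar, beta=1.0):
    """Return (loss, reconstruction term, KL term), each averaged over the batch."""
    batch = x.size(0)
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum") / batch
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch
    return recon + beta * kl, recon, kl


x = torch.rand(4, 784)
logits = torch.zeros(4, 784)   # sigmoid(0) = 0.5 for every pixel
mu = torch.zeros(4, 2)
logvar = torch.zeros(4, 2)     # posterior equals the prior, so KL is 0

loss, recon, kl = negative_elbo(logits, x, mu, logvar)
print(loss.item(), recon.item(), kl.item())
```

With all logits at zero, every pixel contributes \(\log 2\) to the reconstruction term regardless of its target, which gives a quick sanity value of \(784 \log 2 \approx 543.4\) per image for an untrained model.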
4. Training Procedure
- Optimizer: Adam (standard learning rate for VAEs on MNIST-like data).
- Tracked both training and validation ELBO over epochs.
- Periodically:
  - Logged loss values
  - Generated reconstructions on a fixed validation batch
  - Sampled from the prior to visually check diversity and quality of generated images
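The procedure above can be sketched as a short loop that tracks the training and validation loss per epoch. Everything specific here is an assumption made so the example runs on its own: the `TinyVAE` model, the learning rate of `1e-3`, the batch size, the three epochs, and the random tensors standing in for the real data.

```python
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


class TinyVAE(nn.Module):
    def __init__(self, latent_dim=2, hidden_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(784, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 784)
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar


def loss_fn(logits, x, mu, logvar, beta=1.0):
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (recon + beta * kl) / x.size(0)


def run_epoch(model, loader, optimizer=None):
    """One pass over the loader; updates weights only when an optimizer is given."""
    training = optimizer is not None
    model.train(training)
    total, count = 0.0, 0
    with torch.set_grad_enabled(training):
        for (x,) in loader:
            logits, mu, logvar = model(x)
            loss = loss_fn(logits, x, mu, logvar)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * x.size(0)
            count += x.size(0)
    return total / count


torch.manual_seed(0)
# Random stand-ins for the flattened, normalized train/validation images.
train_loader = DataLoader(TensorDataset(torch.rand(256, 784)), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(torch.rand(64, 784)), batch_size=64)

model = TinyVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed learning rate

history = []
for epoch in range(3):
    train_loss = run_epoch(model, train_loader, optimizer)
    val_loss = run_epoch(model, val_loader)
    history.append((train_loss, val_loss))
    print(f"epoch {epoch + 1}: train={train_loss:.2f} val={val_loss:.2f}")
```

Reusing one `run_epoch` function for both phases keeps the train and validation numbers directly comparable, since they are computed with the same loss and the same reduction.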
Training curves (VAE):
These plots demonstrate that the model is learning (loss decreases and stabilizes), aligning with the rubric item on training and evaluation quality.
5. Results & Visualizations
5.1 Reconstructions (Original vs VAE Output)
To assess reconstruction quality, I passed validation images through the encoder–decoder pipeline and compared them visually.
Reconstructions (validation):
Findings:
- The VAE captures the overall shape and category of most items (e.g., shoes vs shirts).
- Fine details like textures, small patterns, and sharp edges tend to be blurry. This is expected from VAEs: the pixel-wise reconstruction loss averages over plausible outputs, and the KL regularization pushes towards smoother latent representations.
- Some classes with very distinct silhouettes (e.g., boots) are reconstructed more clearly than those that differ mainly by subtle texture differences.
5.2 Latent Space Visualization
When the latent space is 2-D, it is plotted directly; for higher-dimensional latent spaces, the latent means are first reduced to 2-D with PCA. In both cases I encoded validation examples and plotted the mean \(\mu\) of each sample in the latent space, colored by class.
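A sketch of the projection step, assuming the latent means have already been collected into one tensor. It computes PCA through an SVD of the centered data rather than through a library PCA class, which may differ from what the notebook does; the helper name `project_to_2d` and the random stand-in data are illustrative.

```python
import torch


def project_to_2d(mu):
    """Return 2-D coordinates for plotting: identity for 2-D latents, PCA otherwise."""
    if mu.size(1) == 2:
        return mu
    centered = mu - mu.mean(dim=0, keepdim=True)
    # Rows of vh are the principal directions, ordered by decreasing singular value.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return centered @ vh[:2].T


torch.manual_seed(0)
mu = torch.randn(500, 16)             # stand-in for encoded validation means
labels = torch.randint(0, 10, (500,)) # stand-in for class labels

coords = project_to_2d(mu)
print(coords.shape)

# Plotting would then be a single scatter call, for example:
#   plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=5)
```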
Latent space (μ):
Insights from the latent space:
- Points belonging to the same class tend to form loose clusters in the latent space.
- Some classes that are visually similar in Fashion-MNIST (e.g., different types of tops) partially overlap, which matches our intuition about the dataset.
- The latent representation seems to organize items roughly by shape and style, even though no labels are used in the VAE training.
5.3 Sampling from the Prior
To evaluate the generative side of the VAE, I sampled \(z \sim \mathcal{N}(0, I)\) from the prior and passed these latent vectors through the decoder:
Samples from prior:
Observations:
- Many sampled images are recognizable as Fashion-MNIST categories (e.g., shoes, tops), indicating that the decoder has learned a coherent mapping from the latent space to the image space.
- As with reconstructions, samples are somewhat smooth/blurred, which is typical of standard VAEs.
- Occasionally there are ambiguous images, which is expected when sampling from less populated regions of the latent space.
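The sampling step in this subsection amounts to a few lines, sketched here with an untrained stand-in decoder so the example runs on its own. The latent size and layer widths are assumptions; with the trained model, its decoder would be passed in instead.

```python
import torch
from torch import nn

latent_dim = 2  # assumed; must match the trained model
# Untrained stand-in with the same interface as a logit-producing decoder.
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784))


@torch.no_grad()
def sample_from_prior(decoder, n_samples, latent_dim):
    """Draw z ~ N(0, I) and decode to images with pixel values in [0, 1]."""
    decoder.eval()
    z = torch.randn(n_samples, latent_dim)
    return torch.sigmoid(decoder(z)).view(n_samples, 1, 28, 28)


samples = sample_from_prior(decoder, 16, latent_dim)
print(samples.shape)
```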
6. Challenges Faced & Insights Gained
6.1 Challenges
- Balancing Reconstruction vs KL
  - If the KL term dominated too much early in training, reconstructions became almost meaningless (posterior collapse).
  - If KL was too weak, the model over-fit reconstructions, but the latent space became less smooth and sampling quality degraded.
- Stability of Training
  - With learning rates that were too high, training could oscillate and the validation loss would not improve.
  - Using a moderate learning rate and tracking both train and validation curves helped detect overfitting and instability.
- Latent Dimension Choice
  - Very low dimensional latent spaces (e.g., 2-D) are great for visualization but can limit reconstruction quality.
  - Higher dimensional spaces improve reconstruction but require dimensionality reduction (PCA/t-SNE/UMAP) for visualization.
6.2 Insights
- Trade-off between fidelity and structure: VAEs explicitly trade some reconstruction fidelity for a nicely structured, continuous latent space. This is visible in the smoother, blurrier outputs and the meaningful latent clusters.
- Latent space as a semantic map: Even though the model is trained without labels, similar classes end up near each other in the latent space. This reinforces the idea that VAEs can learn a semantic representation of the dataset.
- Sampling vs reconstruction: Good reconstructions do not automatically guarantee good samples. Monitoring both helped diagnose whether the latent space was actually well-aligned with the prior.
7. Extra Experiments: Autoencoder (AE) Baseline
For extra credit, I implemented a standard Autoencoder (AE) on the same dataset to compare with the VAE.
AE baseline visualizations:
Comparison with VAE:
- The AE generally produces sharper reconstructions than the VAE, especially for fine details, because it does not regularize the latent distribution to match a prior.
- However, sampling from the AE is not straightforward: its latent space is not trained to follow a simple distribution like \(p(z) = \mathcal{N}(0, I)\), so random latent vectors often decode to noisy or meaningless images.
- The VAE, in contrast, offers slightly blurrier images but a much more usable latent space for generation and interpolation.
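To make the structural difference concrete, here is a sketch of what a plain autoencoder baseline might look like. The layer sizes and the use of BCE-with-logits as its reconstruction loss are assumptions, since the report does not state which loss the AE uses.

```python
import torch
from torch import nn
import torch.nn.functional as F


class AutoEncoder(nn.Module):
    """Deterministic baseline: one code per image, no sampling and no KL term."""

    def __init__(self, latent_dim=2, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 784),
        )

    def forward(self, x):
        z = self.encoder(x)  # a single point, not a distribution
        return self.decoder(z), z


model = AutoEncoder()
x = torch.rand(8, 1, 28, 28)
logits, z = model(x)
# Only a reconstruction term: nothing constrains where the codes z end up.
loss = F.binary_cross_entropy_with_logits(logits, x.flatten(1))
print(loss.item(), z.shape)
```

Because no term ties the codes to a known distribution, there is no principled way to pick a random `z` for generation, which is the limitation noted in the comparison above.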
8. How to Reproduce
python -m venv env
source env/bin/activate # Windows: env\Scripts\activate
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install pandas numpy matplotlib notebook
jupyter notebook solution_exercises.ipynb
Running the notebook regenerates:
- Training and validation curves
- Reconstruction grids
- Latent space plots
- Prior samples
- AE baseline figures (if the AE cell is run)
AI Use
AI tools were used to assist in the implementation and in drafting this report (structure, phrasing, and code suggestions). All code and analysis were reviewed and understood by me, and I am able to explain each part of the implementation and the conclusions drawn.







