Continuing my series on building the classical convolutional neural networks that revolutionized computer vision over the last two decades, we will next build VGG, a very deep convolutional neural network, from scratch using PyTorch. The previous articles in the series, covering LeNet-5 and AlexNet, are available on my profile.
As before, we will look into the architecture and intuition behind VGG and how its results compared at the time. We will then explore our dataset, CIFAR-100, and load it into our program using memory-efficient code. Next, we will implement VGG16 from scratch using PyTorch (the number refers to the number of weight layers; the two common variants are VGG16 and VGG19), train it on our dataset, and evaluate it on our test set to see how it performs on unseen data.
Building on the foundation established by AlexNet, VGG focuses on another important aspect of Convolutional Neural Networks (CNNs): depth. Developed by Simonyan and Zisserman, the VGG architecture consists of 16 weight layers (13 convolutional and 3 fully connected) and can be extended to 19, leading to the two versions known as VGG-16 and VGG-19. All convolutional layers use small 3x3 filters: stacking several small filters covers the same receptive field as one large filter while using fewer parameters and adding more non-linearities. For more detailed information about the network, you can refer to the official paper titled “Very Deep Convolutional Networks for Large-Scale Image Recognition.”
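To make that design choice concrete, here is a quick sketch (my addition, not from the paper's codebase) comparing the parameter counts of two stacked 3x3 convolutions against a single 5x5 convolution with the same receptive field:

import torch.nn as nn

channels = 64

# Two stacked 3x3 convolutions: a 5x5 receptive field with two non-linearities
stacked = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    nn.ReLU(),
)

# One 5x5 convolution: the same receptive field, but only one non-linearity
single = nn.Conv2d(channels, channels, kernel_size=5, padding=2)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stacked), 'vs', count(single))  # 73856 vs 102464 parameters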
VGG16 architecture
Before building the model, one of the most important things in any Machine Learning project is to load, analyze, and pre-process the dataset. In this article, we’ll be using the CIFAR-100 dataset. This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs). We’ll be using the “fine” label here. Here’s the list of classes in the CIFAR-100:
Class List for the CIFAR-100 dataset
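If you want to see the class names yourself, torchvision exposes the fine-grained labels directly on the dataset object. A quick sketch (my addition; it downloads the dataset into ./data):

from torchvision import datasets

# Download CIFAR-100 and print its 100 fine-grained class names
cifar100 = datasets.CIFAR100(root='./data', train=True, download=True)
print(len(cifar100.classes))  # 100
print(cifar100.classes[:5])   # ['apple', 'aquarium_fish', 'baby', 'bear', 'beaver']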
We’ll be working mainly with torch (used for building and training the model), torchvision (for data loading and processing; it contains datasets and methods for pre-processing them for computer vision), and numpy (for numerical manipulation). We will also define a device variable so the program can use the GPU if one is available.
import numpy as np
import torch
import torch.nn as nn
from torchvision import datasets
from torchvision import transforms
from torch.utils.data.sampler import SubsetRandomSampler
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torchvision is a library that provides easy access to tons of computer vision datasets and methods to pre-process these datasets in an easy and intuitive manner. Here is what the loading code does:

- We define a function data_loader that returns either the train/validation data or the test data, depending on its arguments.
- normalize holds the mean and standard deviation of each channel (red, green, and blue) in the dataset. These can be calculated manually, but are also available online. It is used in the transform variable, where we resize the data, convert it to tensors, and then normalize it.
- If the test argument is true, we simply load the test split of the dataset and return it using a data loader (explained below).
- If test is false (the default behaviour), we load the train split of the dataset and randomly split it into train and validation sets (0.9:0.1).

def data_loader(data_dir,
                batch_size,
                random_seed=42,
                valid_size=0.1,
                shuffle=True,
                test=False):

    # commonly cited per-channel statistics for CIFAR-100
    normalize = transforms.Normalize(
        mean=[0.5071, 0.4865, 0.4409],
        std=[0.2673, 0.2564, 0.2762],
    )

    # define transforms: VGG expects 224x224 inputs
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        normalize,
    ])

    if test:
        dataset = datasets.CIFAR100(
            root=data_dir, train=False,
            download=True, transform=transform,
        )

        data_loader = torch.utils.data.DataLoader(
            dataset, batch_size=batch_size, shuffle=shuffle
        )

        return data_loader

    # load the dataset
    train_dataset = datasets.CIFAR100(
        root=data_dir, train=True,
        download=True, transform=transform,
    )

    valid_dataset = datasets.CIFAR100(
        root=data_dir, train=True,
        download=True, transform=transform,
    )

    num_train = len(train_dataset)
    indices = list(range(num_train))
    split = int(np.floor(valid_size * num_train))

    if shuffle:
        np.random.seed(random_seed)
        np.random.shuffle(indices)

    train_idx, valid_idx = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_sampler)

    valid_loader = torch.utils.data.DataLoader(
        valid_dataset, batch_size=batch_size, sampler=valid_sampler)

    return (train_loader, valid_loader)


# CIFAR100 dataset
train_loader, valid_loader = data_loader(data_dir='./data',
                                         batch_size=64)

test_loader = data_loader(data_dir='./data',
                          batch_size=64,
                          test=True)
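As a quick sanity check (my addition, not part of the original tutorial), you can pull one batch from the train loader and confirm the shapes match what the model will expect:

images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([64, 3, 224, 224])
print(labels.shape)  # torch.Size([64])
print(labels.min().item(), labels.max().item())  # fine labels fall in [0, 99]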
To build the model from scratch, we first need to understand how model definitions work in torch and the different types of layers that we'll be using here:

- Every custom model needs to inherit from the nn.Module class, as it provides the basic functionality required for training. We define the layers of the network in the __init__ function, and then specify the sequence in which those layers are applied to the input in the forward function.

Let's now define the various types of layers that we are using here:

- nn.Conv2d: These convolutional layers take the input and output channel numbers as arguments, along with the kernel size of the filter. They also accept stride and padding arguments if needed.
- nn.BatchNorm2d: This applies batch normalization to the output of the convolutional layer.
- nn.ReLU: This is the activation applied to various outputs in the network.
- nn.MaxPool2d: This applies max pooling to the output using the specified kernel size.
- nn.Dropout: This applies dropout to the output with a specified probability.
- nn.Linear: This is a fully connected layer.
- nn.Sequential: This is not a layer type, but it helps combine different operations that are part of the same step.

Using this knowledge, we can now build our VGG16 model using the architecture in the paper:
class VGG16(nn.Module):
    def __init__(self, num_classes=10):
        super(VGG16, self).__init__()
        # Block 1: two 3x3 convs, 64 channels, followed by 2x2 max pooling
        self.layer1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU())
        self.layer2 = nn.Sequential(
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        # Block 2: two 3x3 convs, 128 channels, followed by 2x2 max pooling
        self.layer3 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU())
        self.layer4 = nn.Sequential(
            nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        # Block 3: three 3x3 convs, 256 channels, followed by 2x2 max pooling
        self.layer5 = nn.Sequential(
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU())
        self.layer6 = nn.Sequential(
            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU())
        self.layer7 = nn.Sequential(
            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        # Block 4: three 3x3 convs, 512 channels, followed by 2x2 max pooling
        self.layer8 = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU())
        self.layer9 = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU())
        self.layer10 = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        # Block 5: three 3x3 convs, 512 channels, followed by 2x2 max pooling
        self.layer11 = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU())
        self.layer12 = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU())
        self.layer13 = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        # Classifier: two 4096-unit fully connected layers with dropout,
        # then a final layer mapping to num_classes
        self.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(7*7*512, 4096),
            nn.ReLU())
        self.fc1 = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU())
        self.fc2 = nn.Sequential(
            nn.Linear(4096, num_classes))

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = self.layer5(out)
        out = self.layer6(out)
        out = self.layer7(out)
        out = self.layer8(out)
        out = self.layer9(out)
        out = self.layer10(out)
        out = self.layer11(out)
        out = self.layer12(out)
        out = self.layer13(out)
        # Flatten the 7x7x512 feature map before the fully connected layers
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        out = self.fc1(out)
        out = self.fc2(out)
        return out
VGG16 from Scratch
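A quick sanity check (my addition) is to run a dummy batch through an instance of the model and confirm the output shape and size:

model = VGG16(num_classes=100)
dummy = torch.randn(2, 3, 224, 224)  # a fake batch of two images
print(model(dummy).shape)            # torch.Size([2, 100])
print(sum(p.numel() for p in model.parameters()))  # roughly 134 million parameters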
Optimizing the hyperparameters is an important part of any machine or deep learning project. Here we won't experiment with different values, but we do need to define them beforehand: the number of epochs, the batch size, the learning rate, and the loss function, along with the optimizer.
num_classes = 100
num_epochs = 20
batch_size = 64  # must match the batch size passed to data_loader above
learning_rate = 0.005

model = VGG16(num_classes).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=0.005, momentum=0.9)
Setting the hyper-parameters
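The VGG paper decreases the learning rate by a factor of 10 when the validation accuracy stops improving. We skip that here to keep things simple, but if you want to try it, a minimal sketch (my addition) using PyTorch's built-in scheduler looks like this:

# Optional: reduce the learning rate when validation accuracy plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1, patience=3)
# After each epoch's validation, call:
# scheduler.step(validation_accuracy)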
We are now ready to train our model. We'll first look at how training works in torch and then turn to the code. For every batch in every epoch, we:

- Move the images and labels to the configured device.
- Run the forward pass and compute the loss with the criterion.
- Clear the old gradients with optimizer.zero_grad(), backpropagate with loss.backward(), and update the weights with optimizer.step().

At the end of each epoch, we print the latest loss and evaluate the model on the validation set, disabling gradient computation with torch.no_grad() since we don't need gradients during evaluation.

Now, we combine all of this into the following code:
total_step = len(train_loader)

for epoch in range(num_epochs):
    model.train()  # enable dropout and batch-norm updates during training
    for i, (images, labels) in enumerate(train_loader):
        # Move tensors to the configured device
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
          .format(epoch + 1, num_epochs, i + 1, total_step, loss.item()))

    # Validation
    model.eval()  # disable dropout and use running batch-norm statistics
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in valid_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            del images, labels, outputs

        print('Accuracy of the network on the {} validation images: {} %'.format(5000, 100 * correct / total))
Training
The output of the above code shows that the model is actually learning, as the loss decreases with every epoch:
Training Losses
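With 100 classes, top-5 accuracy (the metric the VGG paper reports on ImageNet) is also worth tracking. This snippet is my addition; it counts a prediction as correct if the true label is among the five highest-scoring classes:

# Inside the validation loop, after computing outputs:
_, top5 = outputs.topk(5, dim=1)                        # indices of the 5 largest logits
correct_top5 = top5.eq(labels.view(-1, 1)).any(dim=1)   # True where the label is among them
# accumulate correct_top5.sum().item() just like the top-1 counter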
For testing, we use the same code as validation but with the test_loader:
model.eval()  # make sure dropout and batch norm are in evaluation mode
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        del images, labels, outputs

    print('Accuracy of the network on the {} test images: {} %'.format(10000, 100 * correct / total))
Testing
Using the above code and training the model for 20 epochs, we achieved an accuracy of 75% on the test set.
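If you want to keep the trained weights rather than retrain from scratch next time, here is a minimal sketch (my addition; the filename is arbitrary) for saving and restoring them:

# Save only the learned parameters (the recommended approach)
torch.save(model.state_dict(), 'vgg16_cifar100.pth')  # hypothetical filename

# Later: rebuild the architecture and load the weights back
model = VGG16(num_classes=100).to(device)
model.load_state_dict(torch.load('vgg16_cifar100.pth'))
model.eval()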
Let’s now conclude what we did in this article: we covered the intuition behind VGG and its use of small 3x3 filters in a very deep network, loaded and pre-processed the CIFAR-100 dataset with memory-efficient data loaders, implemented VGG16 from scratch in PyTorch, trained it for 20 epochs, and evaluated it on unseen test data.

This article gives you a good introduction and hands-on practice, but you'll learn much more if you extend it yourself and see what else you can do: for example, experiment with different hyperparameter values, or add the extra convolutional layers to turn the model into VGG19.