

  • This tutorial offers an overview of the preliminary setup, training process, loss functions and optimizers in PyTorch.
  • We cover a practical demonstration of PyTorch with an example from Vision and another from NLP.

Getting started

Creating a virtual environment

  • Because different projects often depend on different versions of Python modules, it is good practice to use a separate virtual environment for each project.
  • Python Setup: Remote vs. Local offers an in-depth coverage of the various remote and local options available.

Using a GPU?

  • Note that your GPU needs to be set up first (drivers, CUDA and CuDNN).
  • For PyTorch, code changes are needed to support a GPU (unlike TensorFlow which can transparently handle GPU-usage) – follow the instructions here.
  • We recommend the following code hierarchy to organize your data, model code, experiments, results and logs:
  • Purpose each file or directory serves:
    • data/: will contain all the data of the project (generally not stored on GitHub), with an explicit train/dev/test split.
    • experiments: contains the different experiments (will be explained in the following section).
    • model/: module defining the model and functions used in train or eval. Different for our PyTorch and TensorFlow examples.
    • creates or transforms the dataset, build the split into train/dev/test.
    • train the model on the input data, and evaluate each epoch on the dev set.
    • run multiple times with different hyperparameters.
    • explore different experiments in a directory and display a nice table of the results.
    • evaluate the model on the test set (should be run once at the end of your project).

Running experiments

  • Now that we have understood the structure of the code, we can try to train a model on the data, using the script:
python --model_dir experiments/base_model
  • We need to pass the model directory in argument, where the hyperparameters are stored in a JSON file named params.json. Different experiments will be stored in different directories, each with their own params.json file. Here is an example:


{
    "learning_rate": 1e-3,
    "batch_size": 32,
    "num_epochs": 20
}
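  • As a minimal sketch of how such a file could be read, the snippet below loads a params.json and exposes its keys as attributes. The Params class and the temporary directory are hypothetical helpers introduced for illustration; they are not necessarily what the project's own code uses.

```python
import json
import os
import tempfile


class Params:
    """Hypothetical helper: exposes the keys of a JSON file as attributes."""

    def __init__(self, json_path):
        with open(json_path) as f:
            self.__dict__.update(json.load(f))


# Write an example params.json into a temporary model directory (assumption:
# in a real project this file would already exist under experiments/...).
model_dir = tempfile.mkdtemp()
json_path = os.path.join(model_dir, "params.json")
with open(json_path, "w") as f:
    json.dump({"learning_rate": 1e-3, "batch_size": 32, "num_epochs": 20}, f)

# Load it back and access the hyperparameters by name.
params = Params(json_path)
print(params.learning_rate, params.batch_size, params.num_epochs)
```

  • Keeping hyperparameters in a file rather than in the code means every experiment directory records exactly the settings that produced its results.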

The structure of experiments after running a few different models might look like this (try to give meaningful names to the directories depending on what experiment you are running):


Each directory after training will contain multiple things:

  • params.json: the list of hyperparameters, in JSON format
  • train.log: the training log (everything we print to the console)
  • train_summaries: train summaries for TensorBoard (TensorFlow only)
  • eval_summaries: eval summaries for TensorBoard (TensorFlow only)
  • last_weights: weights saved from the 5 last epochs
  • best_weights: best weights (based on dev accuracy)

Training and evaluation

  • To train a model with the parameters provided in the configuration file experiments/base_model/params.json, the user-interface should be:
python --model_dir experiments/base_model
  • Once training is done, we can evaluate on the test set using:
python --model_dir experiments/base_model
  • We provide an example that will call with different values of learning rate. We first create a directory with a params.json file that contains the other hyperparameters.
  • Next, call python --parent_dir experiments/learning_rate to train and evaluate a model with the different learning-rate values defined in the script. This will create a new directory for each experiment under experiments/learning_rate/.

  • The output would resemble the hierarchy below:


Display the results of multiple experiments

  • If you want to aggregate the metrics computed in each experiment (the metrics_eval_best_weights.json files), simply run:
python --parent_dir experiments/learning_rate
  • It will display a Markdown-compatible table summarizing the results, like this:

|                                   | accuracy | loss   |
|-----------------------------------|----------|--------|
| base_model                        | 0.989    | 0.0550 |
| learning_rate/learning_rate_0.01  | 0.939    | 0.0324 |
| learning_rate/learning_rate_0.001 | 0.979    | 0.0623 |
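  • The aggregation step can be sketched roughly as follows. The function names aggregate_metrics and to_markdown are hypothetical, and the directory layout is recreated in a temporary folder purely for illustration; only the file name metrics_eval_best_weights.json and the example numbers come from the text above.

```python
import json
import os
import tempfile


def aggregate_metrics(parent_dir, filename="metrics_eval_best_weights.json"):
    """Collect every metrics file under parent_dir, keyed by experiment path."""
    results = {}
    for root, _dirs, files in os.walk(parent_dir):
        if filename in files:
            with open(os.path.join(root, filename)) as f:
                results[os.path.relpath(root, parent_dir)] = json.load(f)
    return results


def to_markdown(results):
    """Render the collected metrics as a Markdown table."""
    headers = sorted({key for metrics in results.values() for key in metrics})
    lines = ["| | " + " | ".join(headers) + " |",
             "|---|" + "---|" * len(headers)]
    for name in sorted(results):
        row = [str(results[name].get(h, "")) for h in headers]
        lines.append("| " + name + " | " + " | ".join(row) + " |")
    return "\n".join(lines)


# Recreate two experiment directories in a temporary folder (illustration only).
parent_dir = tempfile.mkdtemp()
examples = [("learning_rate_0.01", {"accuracy": 0.939, "loss": 0.0324}),
            ("learning_rate_0.001", {"accuracy": 0.979, "loss": 0.0623})]
for name, metrics in examples:
    exp_dir = os.path.join(parent_dir, name)
    os.makedirs(exp_dir)
    with open(os.path.join(exp_dir, "metrics_eval_best_weights.json"), "w") as f:
        json.dump(metrics, f)

results = aggregate_metrics(parent_dir)
table = to_markdown(results)
print(table)
```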

PyTorch Introduction

Goals of this tutorial

  • Learn more about PyTorch.
  • Learn an example of how to correctly structure a deep learning project in PyTorch.
  • Understand the key aspects of the code well-enough to modify it to suit your needs.


Resources

  • The main PyTorch homepage.
  • The official tutorials cover a wide variety of use cases: attention-based sequence-to-sequence models, Deep Q-Networks, neural transfer and much more!
  • A quick crash course in PyTorch.
  • Justin Johnson’s repository that introduces fundamental PyTorch concepts through self-contained examples.
  • Tons of resources in this list.

Code Layout

  • We recommend the following code hierarchy to organize your data, model code, experiments, results and logs:
  • model/ specifies the neural network architecture, the loss function and evaluation metrics
  • model/ specifies how the data should be fed to the network
  • contains the main training loop
  • contains the main loop for evaluating the model
  • utility functions for handling hyperparams/logging/storing model

We recommend reading through to get a high-level overview.

Once you get the high-level idea, depending on your task and dataset, you might want to modify

  • model/ to change the model, i.e., how you transform your input into your prediction as well as your loss, etc.
  • model/ to change the way you feed data to the model.
  • and to make changes specific to your problem, if required

Tensors and variables

  • Before going further, we strongly suggest going through 60 Minute Blitz with PyTorch to gain an understanding of PyTorch basics. Here’s a sneak peek.

  • PyTorch Tensors are similar in behavior to NumPy’s arrays.

import torch
a = torch.Tensor([[1,2],[3,4]])
print(a)    # Prints a torch.FloatTensor of size 2x2 
            # tensor([[1., 2.],
            #         [3., 4.]])

print(a**2) # Prints a torch.FloatTensor of size 2x2 
            # tensor([[ 1.,  4.],
            #         [ 9., 16.]])
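  • PyTorch Variables allow you to wrap a Tensor and record operations performed on it. This allows you to perform automatic differentiation. (Since PyTorch 0.4, Tensors created with requires_grad=True record operations themselves; Variable is kept as a deprecated wrapper that simply returns a Tensor, so the code below still runs.)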
from torch.autograd import Variable
a = Variable(torch.Tensor([[1, 2], [3, 4]]), requires_grad=True)
print(a)            # Prints a torch.FloatTensor of size 2x2 
                    # tensor([[1., 2.],
                    #         [3., 4.]], requires_grad=True)

y = torch.sum(a**2) # 1 + 4 + 9 + 16
print(y)            # Prints a [torch.FloatTensor of size 1]
                    # tensor(30., grad_fn=<SumBackward0>)

y.backward()        # compute gradients of y wrt a
print(a.grad)       # print dy/da_ij = 2*a_ij for a_11, a_12, a_21, a_22
                    # Prints [torch.FloatTensor of size 2x2]
                    # tensor([[2., 4.], 
                    #         [6., 8.]])
  • This prelude should give you a sense of the things to come. PyTorch packs elegance and expressiveness in its minimalist and intuitive syntax. Make sure to familiarize yourself with some more examples from the resources section before moving ahead.
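  • For reference, here is a minimal sketch of the same gradient computation written without the Variable wrapper, assuming PyTorch 0.4 or later, where a tensor created with requires_grad=True tracks operations directly.

```python
import torch

# A tensor that records the operations applied to it.
a = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)

y = torch.sum(a ** 2)  # 1 + 4 + 9 + 16
y.backward()           # compute dy/da

print(y.item())        # value of the sum
print(a.grad)          # dy/da_ij = 2 * a_ij
```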

Core training step

  • Let’s begin with a look at what the heart of our training algorithm looks like. The five lines below pass a batch of inputs through the model, calculate the loss, perform backpropagation and update the parameters.
output_batch = model(train_batch)           # compute model output
loss = loss_fn(output_batch, labels_batch)  # calculate loss

optimizer.zero_grad()  # clear previous gradients
loss.backward()        # compute gradients of all variables wrt loss

optimizer.step()       # perform updates using calculated gradients
  • Each of the variables train_batch, labels_batch, output_batch and loss is a PyTorch Variable and allows derivatives to be automatically calculated.

  • All the other code that we write is built around this – the exact specification of the model, how to fetch a batch of data and labels, computation of the loss and the details of the optimizer. Next, we’ll cover how to write a simple model in PyTorch, compute the loss and define an optimizer. The subsequent posts each cover a case of fetching data – one for image data and another for text data.

  • Key takeaways

    • The training process consists of three major components in the following order: opt.zero_grad(), loss.backward() and opt.step().
    • zero_grad() clears old gradients from the last step (otherwise you’d just accumulate the gradients from all loss.backward() calls).
    • loss.backward() computes the derivative of the loss w.r.t. the parameters (or any function requiring gradients) using backpropagation.
    • opt.step() causes the optimizer to take a step based on the gradients of the parameters.
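  • To see the three calls working together, here is a small self-contained sketch that fits a linear model to toy data. The data (y = 3x + 1), the model size, the learning rate and the number of iterations are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data: y = 3x + 1 (illustrative assumption).
train_batch = torch.linspace(-1, 1, 64).unsqueeze(1)
labels_batch = 3 * train_batch + 1

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

initial_loss = loss_fn(model(train_batch), labels_batch).item()

for step in range(200):
    output_batch = model(train_batch)           # compute model output
    loss = loss_fn(output_batch, labels_batch)  # calculate loss

    optimizer.zero_grad()  # clear gradients from the previous step
    loss.backward()        # compute gradients of the loss wrt the parameters
    optimizer.step()       # update the parameters

final_loss = loss.item()
print(initial_loss, final_loss)
```

  • Omitting optimizer.zero_grad() in this loop would add each new gradient to the previous ones, so the updates would no longer correspond to the current batch.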

Models in PyTorch

  • A model can be defined in PyTorch by subclassing the torch.nn.Module class. The model is defined in two steps. We first specify the parameters of the model, and then outline how they are applied to the inputs. For operations that do not involve trainable parameters (activation functions such as ReLU, operations like maxpool), we generally use the torch.nn.functional module.
  • Here’s an example of a single hidden layer neural network borrowed from here:
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.

        D_in: input dimension
        H: dimension of hidden layer
        D_out: output dimension
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = nn.Linear(D_in, H)
        self.linear2 = nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Variable of input data and we must
        return a Variable of output data. We can use Modules defined in the
        constructor as well as arbitrary operators on Variables.
        """
        h_relu = F.relu(self.linear1(x))
        y_pred = self.linear2(h_relu)
        return y_pred
  • The __init__ function initializes the two linear layers of the model. PyTorch takes care of the proper initialization of the parameters you specify. In the forward function, we first apply the first linear layer, apply ReLU activation and then apply the second linear layer. The module assumes that the first dimension of x is the batch size. If the input to the network is simply a vector of dimension \(100\), and the batch size is \(32\), then the dimension of x would be \(32 \times 100\). Let’s see an example of how to define a model and compute a forward pass:
# N is batch size; D_in is input dimension;
# H is the dimension of the hidden layer; D_out is output dimension.
N, D_in, H, D_out = 32, 100, 50, 10

# Create random Tensors to hold inputs and outputs, and wrap them in Variables
x = Variable(torch.randn(N, D_in)) # dim: 32 x 100

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Forward pass: Compute predicted y by passing x to the model
y_pred = model(x) # dim: 32 x 10
  • More complex models follow the same layout, and we’ll see two of them in the subsequent posts.

Loss functions

  • PyTorch comes with many standard loss functions available for you to use in the torch.nn module. From the documentation, here’s a gist of what PyTorch has to offer in terms of loss functions:
| Loss function | Description |
|---|---|
| nn.L1Loss() | Creates a criterion that measures the mean absolute error (MAE) between each element in the input \(x\) and target \(y\). |
| nn.MSELoss() | Creates a criterion that measures the mean squared error (squared L2 norm) between each element in the input \(x\) and target \(y\). |
| nn.CrossEntropyLoss() | This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class. |
| nn.CTCLoss() | The Connectionist Temporal Classification loss. |
| nn.NLLLoss() | The negative log likelihood loss. |
| nn.PoissonNLLLoss() | Negative log likelihood loss with Poisson distribution of target. |
| nn.KLDivLoss() | The Kullback-Leibler divergence loss measure. |
| nn.BCELoss() | Creates a criterion that measures the Binary Cross Entropy between the target and the output. |
| nn.BCEWithLogitsLoss() | This loss combines a Sigmoid layer and the BCELoss in one single class. |
| nn.MarginRankingLoss() | Creates a criterion that measures the loss given inputs \(x_1, x_2\), two 1D mini-batch Tensors, and a label 1D mini-batch tensor \(y\) (containing \(1\) or \(-1\)). |
| nn.HingeEmbeddingLoss() | Measures the loss given an input tensor \(x\) and a labels tensor \(y\) (containing \(1\) or \(-1\)). |
| nn.MultiLabelMarginLoss() | Creates a criterion that optimizes a multi-class multi-classification hinge loss (margin-based loss) between input \(x\) (a 2D mini-batch Tensor) and output \(y\) (which is a 2D Tensor of target class indices). |
| nn.SmoothL1Loss() | Creates a criterion that uses a squared term if the absolute element-wise error falls below \(1\) and an L1 term otherwise. |
| nn.SoftMarginLoss() | Creates a criterion that optimizes a two-class classification logistic loss between input tensor \(x\) and target tensor \(y\) (containing \(1\) or \(-1\)). |
| nn.MultiLabelSoftMarginLoss() | Creates a criterion that optimizes a multi-label one-versus-all loss based on max-entropy, between input \(x\) and target \(y\) of size \((N, C)\). |
| nn.CosineEmbeddingLoss() | Creates a criterion that measures the loss given input tensors \(x_1, x_2\) and a Tensor label \(y\) with values \(1\) or \(-1\). |
| nn.MultiMarginLoss() | Creates a criterion that optimizes a multi-class classification hinge loss (margin-based loss) between input \(x\) (a 2D mini-batch Tensor) and output \(y\) (which is a 1D tensor of target class indices, \(0 \leq y \leq \text{x.size}(1)-1\)). |
| nn.TripletMarginLoss() | Creates a criterion that measures the triplet loss given input tensors \(x_1, x_2, x_3\) and a margin with a value greater than \(0\). |
  • Full API details are on PyTorch’s torch.nn module page.
  • Here’s a simple example of how to calculate Cross Entropy Loss. Let’s say our model solves a multi-class classification problem with \(C\) labels. Then for a batch of size \(N\), out is a PyTorch Variable of dimension \(N \times C\) that is obtained by passing an input batch through the model.
  • We also have a target Variable of size \(N\), where each element is the class for that example, i.e., a label in \(\text{[0, …, C-1]}\). You can define the loss function and compute the loss as follows:
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(out, target)
  • PyTorch makes it very easy to extend this and write your own custom loss function. We can write our own Cross Entropy Loss function as below (note the NumPy-esque syntax):
def myCrossEntropyLoss(outputs, labels):
    batch_size = outputs.size()[0]               # batch_size
    outputs = F.log_softmax(outputs, dim=1)      # compute the log of softmax values
    outputs = outputs[range(batch_size), labels] # pick the values corresponding to the labels
    return -torch.sum(outputs)/batch_size        # average the loss over the batch
  • This was a fairly trivial example of writing our own loss function. In the section on NLP, we’ll see an interesting use of custom loss functions.
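  • As a quick sanity check, a hand-written cross entropy of this kind should agree with nn.CrossEntropyLoss under its default mean reduction. The sketch below compares the two on random data; the function name my_cross_entropy, the batch size of 8 and the 5 classes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def my_cross_entropy(outputs, labels):
    batch_size = outputs.size(0)
    log_probs = F.log_softmax(outputs, dim=1)                 # log of softmax values
    picked = log_probs[torch.arange(batch_size), labels]      # log-prob of the true class
    return -picked.sum() / batch_size                         # mean over the batch


torch.manual_seed(0)
out = torch.randn(8, 5)             # N x C scores from a hypothetical model
target = torch.randint(0, 5, (8,))  # N class indices in [0, C-1]

builtin = nn.CrossEntropyLoss()(out, target)
custom = my_cross_entropy(out, target)
print(torch.allclose(builtin, custom))
```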


Optimizers

  • The torch.optim package provides an easy-to-use interface for common optimization algorithms. PyTorch offers a number of built-in optimizers, such as:
| Optimizer | Description |
|---|---|
| torch.optim.Adagrad() | Proposed in “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization”. |
| torch.optim.Adadelta() | Proposed in “ADADELTA: An Adaptive Learning Rate Method”. |
| torch.optim.Adam() | Proposed in “Adam: A Method for Stochastic Optimization”. |
| torch.optim.AdamW() | A variant of Adam, proposed in “Decoupled Weight Decay Regularization”. |
| torch.optim.SparseAdam() | A lazy version of Adam algorithm suitable for sparse tensors, where only moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters. |
| torch.optim.Adamax() | A variant of Adam based on infinity norm, proposed in “Adam: A Method for Stochastic Optimization”. |
| torch.optim.ASGD() | Averaged SGD, proposed in “Acceleration of stochastic approximation by averaging”. |
| torch.optim.LBFGS() | L-BFGS algorithm, heavily inspired by minFunc. |
| torch.optim.RMSprop() | Proposed by G. Hinton in his course. The centered version first appears in “Generating Sequences With Recurrent Neural Networks”. |
| torch.optim.Rprop() | Implements the resilient backpropagation algorithm. |
| torch.optim.SGD() | Implements stochastic gradient descent (optionally with momentum). Nesterov momentum is based on the formula from “On the importance of initialization and momentum in deep learning”. |
  • Full API details are on PyTorch’s torch.optim package page.
  • Here’s how you can instantiate your desired optimizer using torch.optim:
# pick an SGD optimizer
optimizer = torch.optim.SGD(model.parameters(), lr = 0.01, momentum=0.9)

# or pick ADAM
optimizer = torch.optim.Adam(model.parameters(), lr = 0.0001)
  • You pass in the parameters of the model that need to be updated on every iteration. You can also specify more complex methods such as per-layer or even per-parameter learning rates.
  • Once gradients have been computed using loss.backward(), calling optimizer.step() updates the parameters as defined by the optimization algorithm.
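  • Per-layer learning rates can be expressed by passing a list of parameter groups instead of a single iterable. The sketch below is one way to do it; the two-layer model and the specific learning rates are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# A small stand-in model (illustrative assumption).
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))

optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters()},              # falls back to the default lr
        {"params": model[2].parameters(), "lr": 1e-3},  # overrides the default lr
    ],
    lr=1e-2,
    momentum=0.9,
)

# Each group carries its own hyperparameters.
lrs = [group["lr"] for group in optimizer.param_groups]
print(lrs)
```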

Training vs. evaluation

  • Before training the model, call model.train(). Likewise, call model.eval() before validating or testing the model.
  • These calls switch layers such as dropout and batch normalization between their training and evaluation behavior. Note that model.eval() does not by itself disable gradient computation.
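  • The effect of the two modes can be seen on a tiny model containing a dropout layer. This is only an illustrative sketch: the layer sizes and dropout probability are arbitrary, and torch.no_grad() is used during evaluation as a common companion to model.eval() to skip gradient bookkeeping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5))
x = torch.randn(4, 10)

model.train()          # dropout is active: random elements are zeroed
train_out = model(x)

model.eval()           # dropout is disabled: the output is deterministic
with torch.no_grad():  # no gradients are needed for evaluation
    eval_out_1 = model(x)
    eval_out_2 = model(x)

print(torch.equal(eval_out_1, eval_out_2))
```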

Computing metrics

  • By this stage you should be able to understand most of the code in and (except how we fetch the data, which we’ll come to in the subsequent posts). Apart from keeping an eye on the loss, it is also helpful to monitor other metrics such as accuracy, precision or recall. To do this, you can define your own metric functions for a batch of model outputs in the model/ file.
  • In order to make it easier, we convert the PyTorch Variables into NumPy arrays before passing them into the metric functions.
  • For a multi-class classification problem as set up in the section on loss functions, we can write a function to compute accuracy using NumPy as:
def accuracy(out, labels):
    outputs = np.argmax(out, axis=1)
    return np.sum(outputs==labels)/float(labels.size)
  • You can add your own metrics in the model/ file. Once you are done, simply add them to the metrics dictionary:
metrics = { 'accuracy': accuracy,
            # add your own custom metrics,
          }

Saving and loading models

  • We define utility functions to save and load models in To save your model, call:
state = {'epoch': epoch + 1,
        'state_dict': model.state_dict(),
        'optim_dict' : optimizer.state_dict()}
                      is_best=is_best,      # True if this is the model with best metrics
                      checkpoint=model_dir) # path to folder
  • internally uses the, filepath) method to save the state dictionary that is defined above. You can add more items to the dictionary, such as metrics. The model.state_dict() stores the parameters of the model and optimizer.state_dict() stores the state of the optimizer (such as per-parameter learning rate).

  • To load the saved state from a checkpoint, you may use:

utils.load_checkpoint(restore_path, model, optimizer)
  • The optimizer argument is optional and you may choose to restart with a new optimizer. load_checkpoint internally loads the saved checkpoint and restores the model weights and the state of the optimizer.
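  • The save/load round trip can be sketched with torch.save and torch.load as below. The helper names save_state and load_state, the file name last.pth.tar and the temporary directory are hypothetical and are not the project's own utilities; only the dictionary keys follow the example above.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_state(state, checkpoint_dir, filename="last.pth.tar"):
    """Hypothetical helper: write the state dictionary to checkpoint_dir."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    filepath = os.path.join(checkpoint_dir, filename)
    torch.save(state, filepath)
    return filepath


def load_state(filepath, model, optimizer=None):
    """Hypothetical helper: restore model (and optionally optimizer) state."""
    state = torch.load(filepath)
    model.load_state_dict(state["state_dict"])
    if optimizer is not None:
        optimizer.load_state_dict(state["optim_dict"])
    return state


model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
state = {"epoch": 1,
         "state_dict": model.state_dict(),
         "optim_dict": optimizer.state_dict()}
filepath = save_state(state, tempfile.mkdtemp())

# Restore into a freshly initialized model and optimizer.
restored = nn.Linear(4, 2)
restored_optimizer = torch.optim.SGD(restored.parameters(), lr=0.01, momentum=0.9)
loaded = load_state(filepath, restored, restored_optimizer)
print(torch.equal(restored.weight, model.weight))
```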

Using the GPU

  • Interspersed through the code you will find lines such as:
model = net.Net(params).cuda() if params.cuda else net.Net(params)

if params.cuda:
    batch_data, batch_labels = batch_data.cuda(), batch_labels.cuda()
  • PyTorch makes the use of the GPU explicit and transparent using these commands. Calling .cuda() on a model/Tensor/Variable sends it to the GPU. In order to train a model on the GPU, all the relevant parameters and Variables must be sent to the GPU using .cuda().
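  • An alternative to branching on a flag is to pick a torch.device once and move everything to it with .to(device). The sketch below assumes a small linear model and random data purely for illustration; it runs on the CPU when no GPU is available.

```python
import torch
import torch.nn as nn

# Use the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(100, 10).to(device)          # move the parameters
batch_data = torch.randn(32, 100).to(device)   # move the inputs

output = model(batch_data)  # computed on the chosen device
print(output.shape, output.device)
```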

Painless debugging

  • With its clean and minimal design, PyTorch makes debugging a breeze. You can place breakpoints using import pdb; pdb.set_trace() at any line in your code. You can then execute further computations, examine the PyTorch Tensors/Variables and pinpoint the root cause of the error.

  • That concludes the introduction to the PyTorch code examples. Next, we take up one example from vision and one from NLP to understand how we load data and define models specific to each domain.

Vision: Predicting labels from images of hand signs

Goals of this tutorial

  • Learn how to use PyTorch to load image data efficiently.
  • Formulate a convolutional neural network in code.
  • Understand the key aspects of the code well-enough to modify it to suit your needs.

Problem setup

  • We’ll use the SIGNS dataset. The dataset consists of \(1080\) training images and \(120\) test images.
  • Each image from this dataset is a picture of a hand making a sign that represents a number between \(0\) and \(5\). For our particular use-case, we’ll scale down images to size \(64 \times 64\).

Structure of the dataset

  • For the vision example, we will use the SIGNS dataset created for the Coursera Deep Learning Specialization. The dataset is hosted on Google Drive; download it here.

  • This will download the SIGNS dataset (~1.1 GB) containing photos of hand signs representing numbers between 0 and 5. Here is the structure of the data:

  • The images are named following {label}_IMG_{id}.jpg where the label is in \(\text{[0, 5]}\).

  • Once the download is complete, move the dataset into the data/SIGNS folder. Run python which will resize the images to size \((64, 64)\). The new resized dataset will be located by default in data/64x64_SIGNS.

Creating a PyTorch dataset

  • torch.utils.data provides some nifty functionality for loading data. We use torch.utils.data.Dataset, which is an abstract class representing a dataset. To make our own SIGNSDataset class, we need to inherit the Dataset class and override the following methods:
    • __len__: so that len(dataset) returns the size of the dataset
    • __getitem__: to support indexing using dataset[i] to get the ith image
  • We then define our class as below:
import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class SIGNSDataset(Dataset):
    def __init__(self, data_dir, transform):
        #store filenames
        self.filenames = os.listdir(data_dir)
        self.filenames = [os.path.join(data_dir, f) for f in self.filenames]

        #the first character of the filename contains the label
        self.labels = [int(os.path.basename(filename)[0]) for filename in self.filenames]
        self.transform = transform

    def __len__(self):
        #return size of dataset
        return len(self.filenames)

    def __getitem__(self, idx):
        #open image, apply transforms and return with label
        image = Image.open(self.filenames[idx])  # PIL image
        image = self.transform(image)
        return image, self.labels[idx]
  • Notice that when we return an image-label pair using __getitem__ we apply a transform on the image. These transformations are part of the torchvision.transforms package, which allows us to manipulate images easily. Consider the following composition of multiple transforms:
import torchvision.transforms as transforms

train_transformer = transforms.Compose([
    transforms.Resize(64),              # resize so the smaller edge is 64 (64x64 for square images)
    transforms.RandomHorizontalFlip(),  # randomly flip image horizontally
    transforms.ToTensor()])             # transform it into a PyTorch Tensor
  • When we apply self.transform(image) in __getitem__, we pass it through the above transformations before using it as a training example. The final output is a PyTorch Tensor. To augment the dataset during training, we also use the RandomHorizontalFlip transform when loading the image.
  • We can specify a similar eval_transformer for evaluation without the random flip. To load a Dataset object for the different splits of our data, we simply use:
train_dataset = SIGNSDataset(train_data_path, train_transformer)
val_dataset = SIGNSDataset(val_data_path, eval_transformer)
test_dataset = SIGNSDataset(test_data_path, eval_transformer)

Loading data batches

  • torch.utils.data.DataLoader provides an iterator that takes in a Dataset object and performs batching, shuffling and loading of the data. This is crucial when images are large and take time to load. In such cases, the GPU can be left idling while the CPU fetches the images from file and then applies the transforms.
  • In contrast, the DataLoader class (using multiprocessing when num_workers is greater than 0) fetches the data asynchronously and prefetches batches to be sent to the GPU. Initializing the DataLoader is quite easy:
train_dataloader = DataLoader(SIGNSDataset(train_data_path, train_transformer), 
                   batch_size=hyperparams.batch_size, shuffle=True)
  • We can then iterate through batches of examples as follows:
for train_batch, labels_batch in train_dataloader:
    # wrap Tensors in Variables
    train_batch, labels_batch = Variable(train_batch), Variable(labels_batch)

    # pass through model, perform backpropagation and updates
    output_batch = model(train_batch)
  • Applying transformations on the data loads them as PyTorch Tensors. We wrap them in PyTorch Variables before passing them into the model. The for loop ends after one pass over the data, i.e., after one epoch. It can be reused again for another epoch without any changes. We can use similar data loaders for validation and test data.
  • To read more on splitting the dataset into train/dev/test, see our tutorial on splitting datasets.
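  • To get a feel for how batching behaves without needing the image files, here is a small sketch that uses an in-memory TensorDataset of random tensors in place of SIGNSDataset. The dataset size of 10 and batch size of 4 are arbitrary assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# 10 fake 3x64x64 "images" with labels in [0, 5] (stand-in for SIGNSDataset).
images = torch.randn(10, 3, 64, 64)
labels = torch.randint(0, 6, (10,))
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=4, shuffle=True)

# One pass over the loader is one epoch; the last batch may be smaller.
batch_shapes = [tuple(image_batch.shape) for image_batch, label_batch in loader]
print(batch_shapes)
```

  • With 10 examples and a batch size of 4, an epoch yields two full batches and one batch of 2; passing drop_last=True to DataLoader would discard the incomplete one.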

Convolutional network model

  • Now that we’ve figured out how to load our images, let’s have a look at the pièce de résistance – the CNN model. As mentioned in the section on models in PyTorch, we first define the components of our model, followed by its functional form. Let’s have a look at the __init__ function for our model that takes in a 3x64x64 image:
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        #we define convolutional layers 
        self.conv1 = nn.Conv2d(in_channels = 3, out_channels = 32, kernel_size = 3, stride = 1, padding = 1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(in_channels = 32, out_channels = 64, kernel_size = 3, stride = 1, padding = 1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(in_channels = 64, out_channels = 128, kernel_size = 3, stride = 1, padding = 1)
        self.bn3 = nn.BatchNorm2d(128)

        #2 fully connected layers to transform the output of the convolution layers to the final output
        self.fc1 = nn.Linear(in_features = 8*8*128, out_features = 128)
        self.fcbn1 = nn.BatchNorm1d(128)
        self.fc2 = nn.Linear(in_features = 128, out_features = 6)
        self.dropout_rate = hyperparams.dropout_rate  # hyperparams: the loaded hyperparameters object
  • The first parameter to the convolutional filter nn.Conv2d is the number of input channels, the second is the number of output channels, and the third is the size of the square filter (3x3 in this case). Similarly, the batch normalisation layer takes as input the number of channels for 2D images and the number of features in the 1D case. The fully connected Linear layers take the input and output dimensions.

  • In this example, we explicitly specify each of the values. In order to make the initialisation of the model more flexible, you can pass in parameters such as image size to the __init__ function and use that to specify the sizes. You must be very careful when specifying parameter dimensions, since mismatches will lead to errors in the forward propagation. Let’s now look at the forward propagation:

    def forward(self, s):
        #we apply the convolution layers, followed by batch normalisation, 
        #maxpool and relu x 3
        s = self.bn1(self.conv1(s))        # batch_size x 32 x 64 x 64
        s = F.relu(F.max_pool2d(s, 2))     # batch_size x 32 x 32 x 32
        s = self.bn2(self.conv2(s))        # batch_size x 64 x 32 x 32
        s = F.relu(F.max_pool2d(s, 2))     # batch_size x 64 x 16 x 16
        s = self.bn3(self.conv3(s))        # batch_size x 128 x 16 x 16
        s = F.relu(F.max_pool2d(s, 2))     # batch_size x 128 x 8 x 8

        #flatten the output for each image
        s = s.view(-1, 8*8*128)  # batch_size x 8*8*128

        #apply 2 fully connected layers with dropout
        #(training=self.training turns dropout off after model.eval())
        s = F.dropout(F.relu(self.fcbn1(self.fc1(s))), 
            p=self.dropout_rate, training=self.training)    # batch_size x 128
        s = self.fc2(s)                                     # batch_size x 6

        return F.log_softmax(s, dim=1)
  • We pass the image through 3 layers of conv > bn > max_pool > relu, followed by flattening the image and then applying 2 fully connected layers. In flattening the output of the convolution layers to a single vector per image, we use s.view(-1, 8*8*128). Here the size -1 is implicitly inferred from the other dimension (batch size in this case). The output is a log_softmax over the 6 labels for each example in the batch. We use log_softmax since it is numerically more stable than first taking the softmax and then the log.
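  • The spatial sizes quoted in the comments follow from the standard output-size formula for convolution and pooling layers, floor((size + 2*padding - kernel) / stride) + 1. The sketch below checks the arithmetic for the layer settings used above; the helper name conv_out is hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size along one spatial dimension of a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 64
sizes = [size]
for _ in range(3):
    size = conv_out(size, kernel=3, stride=1, padding=1)  # 3x3 conv with padding 1 keeps the size
    size = conv_out(size, kernel=2, stride=2)             # 2x2 max-pool halves the size
    sizes.append(size)

flat_features = sizes[-1] * sizes[-1] * 128  # input size of the first fully connected layer
print(sizes, flat_features)
```

  • If you change the input resolution or the number of pooling steps, redoing this calculation tells you what to pass as in_features to the first Linear layer.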

  • And that’s it! We use an appropriate loss function (Negative Log Likelihood, since the output is already softmax-ed and log-ed) and train the model as discussed in the previous post. Remember, you can set a breakpoint using import pdb; pdb.set_trace() at any place in the forward function, examine the dimensions of variables, tinker around and diagnose what’s wrong. That’s the beauty of PyTorch :).


NLP: Named Entity Recognition (NER) tagging

Goals of this tutorial

  • Learn how to use PyTorch to load sequential data.
  • Define a recurrent neural network that operates on text (or more generally, sequential data).
  • Understand the key aspects of the code well-enough to modify it to suit your needs

Problem setup

  • We explore the problem of Named Entity Recognition (NER) tagging of sentences.
  • The task is to tag each token in a given sentence with an appropriate tag such as Person, Location, etc.
John   lives in New   York
  • Our dataset will thus need to load both the sentences and labels. We will store those in 2 different files, a sentences.txt file containing the sentences (one per line) and a labels.txt file containing the labels. For example:
# sentences.txt
John lives in New York
Where is John ?
# labels.txt
  • Here we assume that we ran the script that creates a vocabulary file in our /data directory. Running the script gives us one file for the words and one file for the labels. They will contain one token per line. For instance
# words.txt



Structure of the dataset

  • Download the original version on the Kaggle website.

  • Download the dataset: ner_dataset.csv on Kaggle and save it under the nlp/data/kaggle directory. Make sure you download the simple version ner_dataset.csv and NOT the full version ner.csv.

  • Build the dataset: Run the following script:

  • It will extract the sentences and labels from the dataset, split it into train / test / dev and save it in a convenient format for our model. Here is the structure of the data
  • If this errors out, check that you downloaded the right file and saved it in the right directory. If you have issues with encoding, try running the script with Python 2.7.

  • Build the vocabulary: For both datasets, data/small and data/kaggle you need to build the vocabulary, with:

python --data_dir  data/small


python --data_dir data/kaggle

Loading text data

  • In NLP applications, a sentence is represented by the sequence of indices of the words in the sentence. For example if our vocabulary is {'is':1, 'John':2, 'Where':3, '.':4, '?':5} then the sentence “Where is John ?” is represented as [3,1,2,5]. We read the words.txt file and populate our vocabulary:
vocab = {}
with open(words_path) as f:
    # one token per line; the line number is the token's index
    for i, l in enumerate(f.read().splitlines()):
        vocab[l] = i
  • In a similar way, we load a mapping tag_map from the labels in tags.txt to indices. Doing so gives us indices for labels in the range \(\text{[0, 1, …, NUM_TAGS-1]}\).

  • In addition to words read from English sentences, words.txt contains two special tokens: an UNK token to represent any word that is not present in the vocabulary, and a PAD token that is used as a filler token at the end of a sentence when one batch has sentences of unequal lengths.

  • We are now ready to load our data. We read the sentences in our dataset (either train, validation or test) and convert them to a sequence of indices by looking up the vocabulary:

train_sentences = []
train_labels = []

with open(train_sentences_file) as f:
    for sentence in f.read().splitlines():
        #replace each token by its index if it is in vocab
        #else use index of UNK
        s = [vocab[token] if token in vocab
             else vocab['UNK']
             for token in sentence.split(' ')]
        train_sentences.append(s)

with open(train_labels_file) as f:
    for sentence in f.read().splitlines():
        #replace each label by its index
        l = [tag_map[label] for label in sentence.split(' ')]
        train_labels.append(l)
  • We can load the validation and test data in a similar fashion.
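  • The lookup described above can be tried on the toy vocabulary from the text. This is a minimal sketch: the `encode` helper and the indices chosen for PAD and UNK are assumptions made for the example, not part of the original code.

```python
# Toy vocabulary from the text, extended with the two special tokens.
# The indices used for PAD and UNK here are assumptions for this example.
vocab = {'PAD': 0, 'is': 1, 'John': 2, 'Where': 3, '.': 4, '?': 5, 'UNK': 6}


def encode(sentence, vocab):
    """Map a whitespace-tokenized sentence to a list of vocabulary indices."""
    unk = vocab['UNK']
    # dict.get falls back to the UNK index for out-of-vocabulary tokens
    return [vocab.get(token, unk) for token in sentence.split(' ')]


known = encode("Where is John ?", vocab)
unknown = encode("Where is Mary ?", vocab)

print(known)    # [3, 1, 2, 5]
print(unknown)  # [3, 1, 6, 5]  ('Mary' is not in the vocabulary)
```

  • Using `dict.get` with a default is equivalent to the conditional expression in the loading loop, just more compact.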

Preparing a Batch

  • This is where it gets fun. When we sample a batch of sentences, the sentences usually do not all have the same length. Let’s say we have a batch of sentences batch_sentences that is a Python list of lists, with its corresponding batch_tags which has a tag for each token in batch_sentences. We convert them into a batch of PyTorch Variables as follows:
import numpy as np
import torch
from torch.autograd import Variable

#compute length of longest sentence in batch
batch_max_len = max([len(s) for s in batch_sentences])

#prepare a numpy array with the data, initializing the data with 'PAD' 
#and all labels with -1; initializing labels to -1 differentiates tokens 
#with tags from 'PAD' tokens
batch_data = vocab['PAD']*np.ones((len(batch_sentences), batch_max_len))
batch_labels = -1*np.ones((len(batch_sentences), batch_max_len))

#copy the data to the numpy array
for j in range(len(batch_sentences)):
    cur_len = len(batch_sentences[j])
    batch_data[j][:cur_len] = batch_sentences[j]
    batch_labels[j][:cur_len] = batch_tags[j]

#since all data are indices, we convert them to torch LongTensors
batch_data, batch_labels = torch.LongTensor(batch_data), torch.LongTensor(batch_labels)

#convert Tensors to Variables (since PyTorch 0.4 this wrapper is optional:
#Variable(t) simply returns a Tensor)
batch_data, batch_labels = Variable(batch_data), Variable(batch_labels)
  • A lot of things happened in the above code. We first calculated the length of the longest sentence in the batch. We then initialized NumPy arrays of dimension (num_sentences, batch_max_len) for the sentence and labels, and filled them in from the lists.
  • Since the values are indices (and not floats), PyTorch’s Embedding layer expects inputs to be of the Long type. We hence convert them to LongTensor.

  • After filling them in, we observe that the sentences that are shorter than the longest sentence in the batch have the special token PAD to fill in the remaining space. Moreover, the PAD tokens, introduced as a result of packaging the sentences in a matrix, are assigned a label of -1. Doing so differentiates them from other tokens that have label indices in the range \(\text{[0, 1, …, NUM_TAGS-1]}\). This will be crucial when we calculate the loss for our model’s prediction, and we’ll come to that in a bit.

  • In our code, we package the above code in a custom data_iterator function. Hyperparameters are stored in a data structure called “params”. We can then use the generator as follows:
#train_data contains train_sentences and train_labels
#params contains batch_size
train_iterator = data_iterator(train_data, params, shuffle=True)    

for _ in range(num_training_steps):
    batch_sentences, batch_labels = next(train_iterator)

    #pass through model, perform backpropagation and updates
    output_batch = model(batch_sentences)
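  • The original data_iterator is not shown here, so the following is only a sketch of what such a generator might look like. The layout of `train_data` (a dict with 'data' and 'labels' lists) and the `pad_ind` field on `params` are assumptions for illustration. Instead of filling NumPy arrays by hand, it pads with `torch.nn.utils.rnn.pad_sequence`, which produces the same padded matrices.

```python
import random
from types import SimpleNamespace

import torch
from torch.nn.utils.rnn import pad_sequence


def data_iterator(data, params, shuffle=False):
    """Yield (batch_data, batch_labels) LongTensors, padded per batch.

    Assumed (hypothetical) layout: data = {'data': [[int]], 'labels': [[int]]},
    with params.batch_size and params.pad_ind provided by the caller.
    """
    order = list(range(len(data['data'])))
    if shuffle:
        random.shuffle(order)

    # one pass over the data, batch_size sentences at a time
    for start in range(0, len(order), params.batch_size):
        idx = order[start:start + params.batch_size]
        sentences = [torch.tensor(data['data'][i], dtype=torch.long) for i in idx]
        tags = [torch.tensor(data['labels'][i], dtype=torch.long) for i in idx]

        # pad sentences with the PAD index and labels with -1
        batch_data = pad_sequence(sentences, batch_first=True,
                                  padding_value=params.pad_ind)
        batch_labels = pad_sequence(tags, batch_first=True, padding_value=-1)
        yield batch_data, batch_labels


train_data = {'data': [[3, 1, 2, 5], [2, 1], [3, 2, 5]],
              'labels': [[0, 0, 1, 0], [1, 0], [0, 1, 0]]}
params = SimpleNamespace(batch_size=2, pad_ind=0)

batches = list(data_iterator(train_data, params, shuffle=False))
first_data, first_labels = batches[0]

print(first_data.tolist())    # [[3, 1, 2, 5], [2, 1, 0, 0]]
print(first_labels.tolist())  # [[0, 0, 1, 0], [1, 0, -1, -1]]
```

  • This sketch makes a single pass over the data and then stops. A training loop that calls next() for a fixed number of steps would need to create a new iterator at the start of each epoch.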

Recurrent network model

  • Now that we have figured out how to load our sentences and tags, let’s have a look at the Recurrent Neural Network model. As mentioned in the section on tensors and variables, we first define the components of our model, followed by its functional form. Let’s have a look at the __init__ function for our model that takes in (batch_size, batch_max_len) dimensional data:
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, params):
        super(Net, self).__init__()

        #maps each token to an embedding_dim vector
        self.embedding = nn.Embedding(params.vocab_size, params.embedding_dim)

        #the LSTM takes the embedded sentence as input
        self.lstm = nn.LSTM(params.embedding_dim, params.lstm_hidden_dim, batch_first=True)

        #fc layer transforms the LSTM output for each token into tag scores
        self.fc = nn.Linear(params.lstm_hidden_dim, params.number_of_tags)
  • We use an LSTM for the recurrent network. Before running the LSTM, we first transform each word in our sentence to a vector of dimension embedding_dim. We then run the LSTM over this sentence. Finally, we have a fully connected layer that transforms the output of the LSTM for each token to a distribution over tags. This is implemented in the forward propagation function:
def forward(self, s):
    #apply the embedding layer that maps each token to its embedding
    s = self.embedding(s)   # dim: batch_size x batch_max_len x embedding_dim

    #run the LSTM along the sentences of length batch_max_len
    s, _ = self.lstm(s)     # dim: batch_size x batch_max_len x lstm_hidden_dim                

    #reshape the Variable so that each row contains one token
    s = s.view(-1, s.shape[2])  # dim: batch_size*batch_max_len x lstm_hidden_dim

    #apply the fully connected layer and obtain the output for each token
    s = self.fc(s)          # dim: batch_size*batch_max_len x num_tags

    return F.log_softmax(s, dim=1)   # dim: batch_size*batch_max_len x num_tags
  • The embedding layer adds an extra dimension to our input, which then has shape (batch_size, batch_max_len, embedding_dim). We run it through the LSTM, which gives an output of length lstm_hidden_dim for each token. In the next step, we open up the 3D Variable and reshape it such that we get the hidden state for each token, i.e., the new dimension is (batch_size*batch_max_len, lstm_hidden_dim). Here the -1 is implicitly inferred to be equal to batch_size*batch_max_len. The reason behind this reshaping is that the fully connected layer assumes a 2D input, with one example along each row.

  • After the reshaping, we apply the fully connected layer which gives a vector of NUM_TAGS for each token in each sentence. The output is a log_softmax over the tags for each token. We use log_softmax since it is numerically more stable than first taking the softmax and then the log.

  • All that is left is to compute the loss. But there’s a catch: we can’t use a torch.nn loss function straight out of the box because that would add the loss from the PAD tokens as well. Here’s where the power of PyTorch comes into play: we can write our own custom loss function!
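  • Before moving on to the loss, the shapes described above can be checked on a tiny batch. This is a sketch under stated assumptions: the hyperparameter values are hypothetical and chosen only to keep the example small, and `reshape` is used in place of `view` because it also works when the LSTM output is not contiguous in memory.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self, params):
        super().__init__()
        self.embedding = nn.Embedding(params.vocab_size, params.embedding_dim)
        self.lstm = nn.LSTM(params.embedding_dim, params.lstm_hidden_dim,
                            batch_first=True)
        self.fc = nn.Linear(params.lstm_hidden_dim, params.number_of_tags)

    def forward(self, s):
        s = self.embedding(s)           # batch_size x batch_max_len x embedding_dim
        s, _ = self.lstm(s)             # batch_size x batch_max_len x lstm_hidden_dim
        s = s.reshape(-1, s.shape[2])   # batch_size*batch_max_len x lstm_hidden_dim
        s = self.fc(s)                  # batch_size*batch_max_len x number_of_tags
        return F.log_softmax(s, dim=1)


# hypothetical hyperparameter values for illustration only
params = SimpleNamespace(vocab_size=10, embedding_dim=8,
                         lstm_hidden_dim=16, number_of_tags=3)
model = Net(params)

# 2 sentences padded to length 4 (index 0 plays the role of PAD here)
batch = torch.tensor([[3, 1, 2, 5], [2, 1, 0, 0]])
log_probs = model(batch)

print(log_probs.shape)                  # torch.Size([8, 3])

# exponentiating the log-probabilities gives a distribution over tags per token
row_sums = log_probs.exp().sum(dim=1)
```

  • Each of the 2*4 = 8 rows corresponds to one token position, including the padded ones, which is why the loss has to mask them out.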

Writing a custom loss function

  • In the section on loading data batches, we ensured that the labels for the PAD tokens were set to -1. We can leverage this to filter out the PAD tokens when we compute the loss. Let us see how:
def loss_fn(outputs, labels):
    #reshape labels to give a flat vector of length batch_size*seq_len
    labels = labels.view(-1)  

    #mask out 'PAD' tokens
    mask = (labels >= 0).float()

    #the number of tokens is the sum of elements in mask
    #(.item() extracts the Python number; .data[0] fails on 0-dim tensors in PyTorch >= 0.4)
    num_tokens = int(torch.sum(mask).item())

    #pick the values corresponding to labels and multiply by mask
    outputs = outputs[range(outputs.shape[0]), labels]*mask

    #cross entropy loss for all non 'PAD' tokens
    return -torch.sum(outputs)/num_tokens
  • The input labels has dimension (batch_size, batch_max_len) while outputs has dimension (batch_size*batch_max_len, NUM_TAGS). We compute a mask using the fact that all PAD tokens in labels have the value -1. We then compute the Negative Log Likelihood Loss (remember the output from the network is already softmax-ed and log-ed!) for all the non-PAD tokens. We can now compute derivatives by simply calling .backward() on the loss returned by this function.

  • Remember, you can set a breakpoint using import pdb; pdb.set_trace() at any place in the forward function, loss function or virtually anywhere and examine the dimensions of the Variables, tinker around and diagnose what’s wrong. That’s the beauty of PyTorch :).
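  • One way to sanity-check the masking is to compare it with the built-in `F.nll_loss`, whose `ignore_index` argument skips targets with a given value and averages over the remaining ones. The sketch below uses random log-probabilities and made-up labels. `masked_nll` is a restatement of the loss above written for this example, with the -1 labels clamped to 0 before indexing so that every index is a valid column (those entries are zeroed by the mask anyway).

```python
import torch
import torch.nn.functional as F


def masked_nll(outputs, labels):
    """Mean negative log-likelihood over tokens whose label is not -1."""
    labels = labels.view(-1)
    mask = (labels >= 0).float()
    num_tokens = int(mask.sum().item())
    # clamp so PAD positions index column 0; the mask then zeroes them out
    rows = torch.arange(outputs.shape[0])
    picked = outputs[rows, labels.clamp(min=0)] * mask
    return -picked.sum() / num_tokens


torch.manual_seed(0)
outputs = F.log_softmax(torch.randn(8, 3), dim=1)        # 8 tokens, 3 tags
labels = torch.tensor([[0, 2, 1, 0], [1, 0, -1, -1]])    # last two are PAD

custom = masked_nll(outputs, labels)
builtin = F.nll_loss(outputs, labels.view(-1), ignore_index=-1)

print(torch.allclose(custom, builtin))   # True
```

  • Writing the loss by hand remains useful when the masking or weighting needs to differ from what the built-in loss supports.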

Model summary

  • Printing the model prints a summary of the model including the different layers involved and their specifications.
from torchvision import models
model = models.vgg16()
print(model)
  • The output in this case would be something like the following:
  (features): Sequential (
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU (inplace)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU (inplace)
    (4): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU (inplace)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU (inplace)
    (9): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU (inplace)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU (inplace)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU (inplace)
    (16): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): ReLU (inplace)
    (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (20): ReLU (inplace)
    (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (22): ReLU (inplace)
    (23): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (25): ReLU (inplace)
    (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (27): ReLU (inplace)
    (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (29): ReLU (inplace)
    (30): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
  (classifier): Sequential (
    (0): Dropout (p = 0.5)
    (1): Linear (25088 -> 4096)
    (2): ReLU (inplace)
    (3): Dropout (p = 0.5)
    (4): Linear (4096 -> 4096)
    (5): ReLU (inplace)
    (6): Linear (4096 -> 1000)
  • To get a summary like the one tf.keras offers, use the pytorch-summary package. It reports more details about the model, including:
    • Name and type of all layers in the model.
    • Output shape for each layer.
    • Number of weight parameters of each layer.
    • The total number of trainable and non-trainable parameters of the model.
    • In addition, it also reports the following, which are not in the Keras summary:
      • Input size (MB)
      • Forward/backward pass size (MB)
      • Params size (MB)
      • Estimated Total Size (MB)
from torchvision import models
from torchsummary import summary

# Example for VGG16
vgg = models.vgg16()
summary(vgg, (3, 224, 224))
  • The output in this case would be something like the following:
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 224, 224]           1,792
              ReLU-2         [-1, 64, 224, 224]               0
            Conv2d-3         [-1, 64, 224, 224]          36,928
              ReLU-4         [-1, 64, 224, 224]               0
         MaxPool2d-5         [-1, 64, 112, 112]               0
            Conv2d-6        [-1, 128, 112, 112]          73,856
              ReLU-7        [-1, 128, 112, 112]               0
            Conv2d-8        [-1, 128, 112, 112]         147,584
              ReLU-9        [-1, 128, 112, 112]               0
        MaxPool2d-10          [-1, 128, 56, 56]               0
           Conv2d-11          [-1, 256, 56, 56]         295,168
             ReLU-12          [-1, 256, 56, 56]               0
           Conv2d-13          [-1, 256, 56, 56]         590,080
             ReLU-14          [-1, 256, 56, 56]               0
           Conv2d-15          [-1, 256, 56, 56]         590,080
             ReLU-16          [-1, 256, 56, 56]               0
        MaxPool2d-17          [-1, 256, 28, 28]               0
           Conv2d-18          [-1, 512, 28, 28]       1,180,160
             ReLU-19          [-1, 512, 28, 28]               0
           Conv2d-20          [-1, 512, 28, 28]       2,359,808
             ReLU-21          [-1, 512, 28, 28]               0
           Conv2d-22          [-1, 512, 28, 28]       2,359,808
             ReLU-23          [-1, 512, 28, 28]               0
        MaxPool2d-24          [-1, 512, 14, 14]               0
           Conv2d-25          [-1, 512, 14, 14]       2,359,808
             ReLU-26          [-1, 512, 14, 14]               0
           Conv2d-27          [-1, 512, 14, 14]       2,359,808
             ReLU-28          [-1, 512, 14, 14]               0
           Conv2d-29          [-1, 512, 14, 14]       2,359,808
             ReLU-30          [-1, 512, 14, 14]               0
        MaxPool2d-31            [-1, 512, 7, 7]               0
           Linear-32                 [-1, 4096]     102,764,544
             ReLU-33                 [-1, 4096]               0
          Dropout-34                 [-1, 4096]               0
           Linear-35                 [-1, 4096]      16,781,312
             ReLU-36                 [-1, 4096]               0
          Dropout-37                 [-1, 4096]               0
           Linear-38                 [-1, 1000]       4,097,000
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
Input size (MB): 0.57
Forward/backward pass size (MB): 218.59
Params size (MB): 527.79
Estimated Total Size (MB): 746.96
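  • The parameter counts in the table can be reproduced by hand. The sketch below assumes 3x3 convolutions with a bias term and float32 (4-byte) values. The helper names are made up for this example, and the layer sizes are read off the VGG16 listing above.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights (in * out * k * k) plus one bias per output channel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def linear_params(in_features, out_features):
    """Weights (in * out) plus one bias per output feature."""
    return in_features * out_features + out_features


# (in_channels, out_channels) of the 13 convolutional layers in VGG16
conv_channels = [(3, 64), (64, 64), (64, 128), (128, 128),
                 (128, 256), (256, 256), (256, 256),
                 (256, 512), (512, 512), (512, 512),
                 (512, 512), (512, 512), (512, 512)]
# (in_features, out_features) of the 3 fully connected layers
linear_sizes = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

first_conv = conv2d_params(3, 64, 3)
total_params = (sum(conv2d_params(i, o, 3) for i, o in conv_channels)
                + sum(linear_params(i, o) for i, o in linear_sizes))

# sizes in MB, assuming 4 bytes per float32 value
params_size_mb = total_params * 4 / 1024 ** 2
input_size_mb = 3 * 224 * 224 * 4 / 1024 ** 2

print(first_conv)                 # 1792
print(f"{total_params:,}")        # 138,357,544
print(round(params_size_mb, 2))   # 527.79
print(round(input_size_mb, 2))    # 0.57
```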