Fashion-MNIST

This project takes an in-depth look at machine learning techniques and how to use PyTorch, aimed at beginners. The data set of choice is the well-known Fashion-MNIST data set, which is composed of many images of fashion items (shirts, shoes, etc.). The data is labeled, meaning that for every image we know exactly what it should be (an image of a shirt is given the shirt label). We start from the very beginning, as if PyTorch has never been used before and you are just getting started.

Machine Learning Basics

We will be building a supervised learning machine learning model. Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. We will be training the model with labeled data and testing the model by how well it predicts the correct label.

Set up Python Virtual Environment

Some things that you need to install in your python environment if you do not have them already:

pytorch
numpy
matplotlib

I am running this on Python 3.9. Using Anaconda, I created a virtual environment (named py39-pytorch, you can name it whatever you'd like) and installed all dependencies as follows:

conda create -n py39-pytorch python=3.9
conda activate py39-pytorch
conda install pytorch torchvision -c pytorch 
conda install numpy
conda install matplotlib

Now that the dependencies are installed, let's import everything we will need.
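
A minimal set of imports that covers the steps in this notebook might look like the following (random_split, ToTensor, and the plotting imports are only needed if you follow the exact code sketches below):

import torch
from torch import nn
from torch.utils.data import DataLoader, random_split

import torchvision
from torchvision import datasets
from torchvision.transforms import ToTensor

import numpy as np
import matplotlib.pyplot as plt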

Getting Acquainted with the Data

Download the dataset from torchvision, a library that provides the FashionMNIST dataset among many other popular data sets. It offers both a training set and a testing set that we can grab. We specify the location to store the data: a directory named 'data' within the current working directory.
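
As a sketch, the download might look like this (the variable name train_and_valid_set is just the name used in this write-up; ToTensor() converts each image to a tensor with values in [0, 1]):

train_and_valid_set = datasets.FashionMNIST(
    root='data',          # store the data in ./data
    train=True,           # grab the 'train' split
    download=True,
    transform=ToTensor(), # convert images to normalized tensors
)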

We are going to ultimately have a training set, a validation set, and a testing set. The training set is the data we will use to train our model, and the validation set will be used to validate the model. As we train and validate, there is potential to over-fit the model to the data used for training and validation. That is where the testing set comes in handy. Of course, in machine learning, this is always a balancing act: depending on the size of your dataset, you may not have the luxury of enough samples to train, validate, and test. In any case, the training and validation sets will be made from the 'train' set from torchvision, which we will split in two, and the test set will be the 'test' set from torchvision.

The first step is to split the data into a training set and a validation set. The validation set will be used to assess accuracy during development, and the test set will be used to test the accuracy of the final results. PyTorch is able to split data using the function:

torch.utils.data.dataset.random_split(dataset, [num_train, num_valid]).

We will use an 80/20 split here. 80% of the data from the train_and_valid_set will be used for training, and 20% will be used for validation.
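
A sketch of that split, assuming the train_and_valid_set variable from above (the full 'train' split has 60,000 samples, so this gives 48,000 training and 12,000 validation samples):

num_train = int(0.8 * len(train_and_valid_set))    # 48000
num_valid = len(train_and_valid_set) - num_train   # 12000
train_set, valid_set = random_split(train_and_valid_set, [num_train, num_valid])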

The test set, as mentioned previously, is also available from torchvision. We can download it as well, setting train=False to get the test dataset.
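
A sketch of downloading the test split, mirroring the earlier call:

test_set = datasets.FashionMNIST(
    root='data',
    train=False,          # grab the test split instead of the train split
    download=True,
    transform=ToTensor(),
)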

Each sample is an image, stored as a matrix of values representing the gray-scale pixels, along with an integer label that identifies the item of clothing in the image. For clarity, we can map each label to the description (or clothing item) it represents.
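
The ten Fashion-MNIST labels and their descriptions can be stored in a simple lookup, for example:

labels_map = {
    0: 'T-shirt/top',
    1: 'Trouser',
    2: 'Pullover',
    3: 'Dress',
    4: 'Coat',
    5: 'Sandal',
    6: 'Shirt',
    7: 'Sneaker',
    8: 'Bag',
    9: 'Ankle boot',
}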

Let's take a quick look at one data point. Let's just grab a random data point (I am choosing the third sample here, just to make sure it is always the same when reloading the kernel) from the original data set.
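
A sketch of grabbing that sample (the names random_image and random_image_label are the ones reused later in this notebook):

random_image, random_image_label = train_and_valid_set[2]   # the third sample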

Let's look at the size of the image data:
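
For example:

print(random_image.shape)   # torch.Size([1, 28, 28])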

The shape attribute tells us we have 1 matrix (sometimes image data will have 3 matrices representing the 3 channels of an RGB image, but here we are in gray-scale, which is only one channel, hence one matrix) of size 28 x 28. This means there are 784 pixels in the image, or 784 elements making up the data matrix storing the pixel data. Let's look at what the data for an image looks like:

As expected, the image array is a matrix of values. The values are between 0 and 1, where 0 is black and 1 is white, and they make up the gray-scale image. Sometimes this data set is not normalized between 0 and 1, but the data loader did that for us. Let's actually show that image. We can use the imshow() function to do so. As said previously, there are only 784 pixels, so the image quality will likely be relatively poor.
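
A sketch of displaying the image with matplotlib (squeeze() drops the single channel dimension so imshow receives a 28 x 28 array):

plt.imshow(random_image.squeeze(), cmap='gray')
plt.show()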

Although blurry, we can clearly see the image is a trouser. Let's look at the integer label for this data point.

Let's take a quick look at more of the data. Let's plot all the images in one batch, which should be a total number of images equal to the batch size chosen previously. The torchvision.utils.make_grid function is great for showing many images at once. We will also use torch.utils.data.DataLoader to be able to iterate over the data. We are splitting up our data here into groups (or batches) of 50. This will come into play later when we begin training our model, but for now we will just use it to show 50 random images.
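
A sketch of what this could look like, assuming the train_set from the split above and a batch size of 50 chosen for visualization:

batch_size = 50
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)

# grab one batch of images and labels
images, labels = next(iter(train_loader))

# arrange the 50 images into a grid and display it
grid = torchvision.utils.make_grid(images, nrow=10)
plt.figure(figsize=(12, 7))
plt.imshow(grid.permute(1, 2, 0))   # move the channel dimension last for imshow
plt.show()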

We can now see all the images in one of the batches. We see 50 different fashion images. This means the variable holding all the labels for the batch should have length 50; let's confirm this:

The ultimate goal of this project is to investigate different machine learning techniques and determine the one with the best accuracy, while also investigating the hyperparameters that contribute to the success of the algorithms.

Building a Model

Some important notes:

The batch size is a hyperparameter of gradient descent that controls the number of training samples to work through before the model's internal parameters are updated.

The number of epochs is a hyperparameter of gradient descent that controls the number of complete passes through the training dataset.

The learning rate is the rate at which model parameters are updated at each batch/epoch. Smaller values mean the model learns relatively slowly. The learning rate is an essential hyperparameter in machine learning because it dictates how strongly your model adjusts to new observations versus how much it retains of what it has already learned.

Keep these definitions in mind as we go through this. These are the hyperparameters we will investigate and alter to control the learning and optimization process of our algorithm.

Let's begin by defining a neural network. There are some important concepts to understand in order to actually create this model, so let's take it slow. With 2D (matrix) data, we first have to flatten the matrix into a one-dimensional array of input features. We can do this using the nn.Flatten() module. To see it in action, let's see how it modifies the matrix of one image.
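
A sketch using the random_image from earlier (nn.Flatten keeps the first dimension as the batch dimension):

flatten = nn.Flatten()
flat_image = flatten(random_image)   # random_image has shape [1, 28, 28]
print(flat_image.shape)              # torch.Size([1, 784])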

We can see that when we flatten the matrix, we are left with an array of size 784, which is 28*28. Now that we have flattened data, we can create layers for the neural network. We begin with a linear layer, which in PyTorch is nn.Linear, a module that applies a linear transformation to the given feature data. You also have to specify the number of features you would like to output, which is essentially the number of nodes within that layer. This choice is something you can of course play with. For this example, we are using 20 nodes in this hidden layer.
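
A sketch of that hidden layer, assuming the flat_image from above:

layer1 = nn.Linear(in_features=28*28, out_features=20)
hidden1 = layer1(flat_image)
print(hidden1.shape)   # torch.Size([1, 20])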

We can see that when we pass layer1 the input features, which in this case is the flattened image, we get back an array whose size is the number of nodes we chose, so here the hidden layer output has size 20. When developing the neural network, we also want nonlinear activation functions mapping the input and output of each layer. We are going to use the nn.ReLU activation function, but you can play with others such as nn.PReLU or nn.ELU. These activation functions introduce nonlinearity into the model.
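
Applying the activation to the hidden layer output is a one-liner, for example (the name hidden1_activated is just for illustration):

hidden1_activated = nn.ReLU()(hidden1)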

Now we want to order all the layers we will ultimately have in our network. We will order them as: first layer, its activation function, then the output layer. If you want more than one hidden layer, you can add more layer/activation pairs to the order. An example of a stack of n+1 layers is shown below:

Input = flattened matrix (an array) of your feature data

Output = array of size equal to the number of classes

Here is an example of a small network with only one hidden layer and an output layer.
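
A minimal sketch of such a network using nn.Sequential, assuming 20 hidden nodes as above and 10 output classes (the name seq_model is just for illustration):

seq_model = nn.Sequential(
    nn.Flatten(),          # flatten the 28 x 28 image to 784 input features
    nn.Linear(28*28, 20),  # hidden layer with 20 nodes
    nn.ReLU(),             # nonlinear activation
    nn.Linear(20, 10),     # output layer: one value per class
)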

Now that we have a miniature network, we can run it on one image just to see what we get.
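
For example:

logits = seq_model(random_image)
print(logits)          # 10 raw scores, one per class
print(logits.shape)    # torch.Size([1, 10])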

The output from the small network is an array of size 10, as expected. The values are what we call logits: raw values in [-infinity, infinity] that ultimately represent the class predictions. However, because these are raw values, we need the nn.Softmax module to get the actual probabilities.

nn.Softmax takes in a dimension, which specifies the axis along which the values must sum to 1. Since we have a row-oriented array here -- which is often the case for flattened matrices -- the dimension is 1.
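
A sketch using the logits from above:

softmax = nn.Softmax(dim=1)
pred_probs = softmax(logits)
print(pred_probs)   # probabilities that sum to 1 along dim 1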

After calling the softmax function on the raw logits, we have an array of probabilities, scaled to values in [0, 1], showing the predicted probability for each class. We would then say the prediction for the image is the class with the greatest probability. The torch.max function takes in the softmax probabilities and the dimension, returning the maximum probability and the index of that probability. The index corresponds to the class prediction in this case. Of course this is completely random as the model has not been trained, but for the sake of example let's see what the prediction is.
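
For example:

max_prob, pred_class = torch.max(pred_probs, dim=1)
print(max_prob, pred_class)   # highest probability and the predicted class index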

The index with the greatest probability is index 2, which is the output label 'Pullover', although we know, since this is the random image we have been using, it is actually a T-shirt.

Let's go ahead and build a neural network using what we have learned!
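
A sketch of the full model, subclassing nn.Module; the layer sizes here (two hidden layers of 256 nodes) are assumptions chosen to match the base model described later in this notebook:

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 10),   # raw logits, no softmax (see the loss discussion below)
        )

    def forward(self, x):
        x = self.flatten(x)
        return self.linear_relu_stack(x)

model = NeuralNetwork()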

Alright! We have a neural network. Now what? Well, we have to train the network. We also want to make sure we can do things like assess the performance of the network. We are going to choose a loss function, in this case nn.CrossEntropyLoss(). This loss function is commonly used for classification because it is smooth and easily optimized with gradient descent. The goal, of course, is to minimize the error between the model's predictions and the true labels, and being able to minimize (optimize) that error with ease is essential for model performance. A lot of the work in machine learning models happens in gradient descent, where the loss is optimized. The nn.CrossEntropyLoss() function computes the softmax internally, so we need to pass it the raw logit values. This is why the neural network designed above does not apply nn.Softmax.

For further clarification and knowledge, there are a variety of loss functions out there, and you should choose one based on the data and the task you are attempting. nn.MSELoss, the mean squared error, is commonly used for regression. You could also use nn.NLLLoss, the negative log likelihood loss, which is commonly used for classification. As stated previously, nn.CrossEntropyLoss is often used for neural networks, and it combines nn.LogSoftmax and nn.NLLLoss.

An example to interpret nn.CrossEntropyLoss(): a loss of 0.22 means that, on average, your model is assigning the correct class a probability of around 80% (remember the cross entropy loss for a single sample is $-\log(\hat{y}_y)$, and $e^{-0.22} \approx 0.80$).

We can play around by creating the loss function and finding the loss on the random_image we have been dealing with. We will pass the entropy loss function the logits computed previously and the torch.LongTensor version of the random_image_label.
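
A sketch of that computation, using the logits from the small sequential network above and wrapping the label in a LongTensor so the shapes line up:

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, torch.LongTensor([random_image_label]))
print(loss.item())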

Now that we have an idea about what the loss function is doing, we need to try to optimize the loss. There are a variety of optimizers that can be used with PyTorch. Check out the optim package to see all the optimizers available. This choice is sometimes hard. There is the simple stochastic gradient descent optimizer optim.SGD, as well as optim.Adam (very popular for the FashionMNIST dataset) and optim.RMSprop.

When you begin to optimize the loss function, it is time to take into account the model parameters you will be updating and the optimizer's own hyperparameters, such as the learning rate. The optimizer you choose must be initialized with the model parameters. For this initial example, we will be using a learning rate of 0.003. The model, which is of type NeuralNetwork (the class we created previously), has a method inherited from nn.Module named parameters() that accesses these parameters for us, so we can hand them to the optimization algorithm.
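
A sketch of initializing an optimizer; plain stochastic gradient descent is assumed here, but optim.Adam would be set up the same way:

learning_rate = 0.003
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)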

The data downloaded from torchvision is now stored in the data directory located within the current working directory, so we can load it within the notebook using a dataloader. The data is a large set of images, which is important to keep in mind. In this case, I have loaded the data in batches, and the batch size is a hyperparameter of the machine learning process. The batch size can be changed if desired and will be discussed more later when we develop the model. I am showing the data being loaded in this way for the sake of example, although it will be changed when we begin to train the model and find the optimal hyperparameters.

Remember the training set has 48000 data points. This is what we have split into batches.
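
A sketch of the dataloaders, reloading the data with a new batch size; 128 is an assumption here, matching the value used in the epoch test later in this notebook:

batch_size = 128
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)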

We can now train and test how our network is performing. We are going to have a training loop and a validation loop.
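
A sketch of the two loops; the function names and the printed metrics are illustrative rather than the exact code used for the results below:

def train_loop(dataloader, model, loss_fn, optimizer):
    model.train()
    for X, y in dataloader:
        pred = model(X)             # forward pass: compute logits
        loss = loss_fn(pred, y)     # compare predictions to the true labels
        optimizer.zero_grad()
        loss.backward()             # backpropagate the loss
        optimizer.step()            # update the model parameters

def valid_loop(dataloader, model, loss_fn):
    model.eval()
    total_loss, correct = 0.0, 0
    with torch.no_grad():           # no gradients needed for evaluation
        for X, y in dataloader:
            pred = model(X)
            total_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(dim=1) == y).sum().item()
    accuracy = correct / len(dataloader.dataset)
    print(f"accuracy: {100 * accuracy:.1f}%, avg loss: {total_loss / len(dataloader):.3f}")
    return accuracy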

Let's go ahead and try it out!
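
For example, training for 10 epochs with the pieces defined above:

epochs = 10
for epoch in range(epochs):
    print(f"epoch {epoch + 1}")
    train_loop(train_loader, model, loss_fn, optimizer)
    valid_loop(valid_loader, model, loss_fn)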

The final model accuracy on the above trained model can be tested on the testing dataset:
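
For example, reusing the validation loop on the test loader:

valid_loop(test_loader, model, loss_fn)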

We just trained our model! We trained for 10 epochs, meaning the model made 10 complete passes through the training data to improve its predictions. With 10 epochs, we achieved 87.8% accuracy on the test set.

At this point, we now know how to build and implement a feed-forward fully connected neural network. The question now is, was this arbitrarily chosen network the best one? We can investigate this using model exploration, which is a common technique in machine learning. The idea is to determine the hyperparameters that maximize the validation accuracy. I go through some of this exploration throughout the rest of this notebook.

Model Exploration

Now that we have an idea about what we are trying to achieve here, let's modify some of our code, try different networks, and test and visualize the results. I am going to modify the train and validation loops to return interesting values for visualization.

Number of Epochs Test

Let's first start by testing how many epochs are optimal for training. The following code is going to train the model for 30 epochs and keep track of the prediction accuracy at each epoch.

Training this model for 30 epochs took about 5 minutes on my local machine (with a batch size of 128). This is a decent amount of time when you want to try new models over and over again. Let's plot the accuracy of the network as a function of epochs to determine how many epochs we should use.

The accuracy on the validation set doesn't really improve after about 10 epochs, so we will stick to that for testing some of the other aspects of the network. Let's now look at how the number of layers in a fully connected neural network (here with 256 nodes per layer) affects accuracy. The new network is defined below, where the number of layers can be chosen. Note, the base model above has 2 hidden layers.

Number of Layers Test
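
A sketch of what such a configurable network might look like (the class name and constructor arguments are illustrative, not the exact code used for the results below):

class DeepNetwork(nn.Module):
    def __init__(self, num_hidden_layers, hidden_size=256, num_classes=10):
        super().__init__()
        # first hidden layer maps the flattened image to hidden_size features
        layers = [nn.Flatten(), nn.Linear(28*28, hidden_size), nn.ReLU()]
        # add the remaining hidden layers
        for _ in range(num_hidden_layers - 1):
            layers += [nn.Linear(hidden_size, hidden_size), nn.ReLU()]
        # output layer produces raw logits, one per class
        layers.append(nn.Linear(hidden_size, num_classes))
        self.stack = nn.Sequential(*layers)

    def forward(self, x):
        return self.stack(x)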

Interestingly, the accuracy drastically decreased as the number of layers increased. This makes some sense, as we are not adding any additional information by stacking many identical layers, and deeper fully connected networks of this kind are also harder to optimize.

Let's try to find the optimal number of nodes. We will take our base network with 2 hidden layers, as this performed the best in the above test.

Number of Nodes Test

We can see that 2048 nodes does the best on the validation set, but the improvement over 256, 512, and 1024 nodes is marginal. 2048 nodes also took a long time to execute, so we will stick to 256 for now.

So far the best choices for the model are:

Let's look at optimization on this structure.

The batch size for the optimizer can be modified. Remember, the model parameters are updated after each batch.

Batch Size Test

From this, we can see that a batch size of 128 is optimal for validation accuracy, as well as computation time. A batch size of 1 took nearly an hour, which makes sense: with a batch size of 1, the parameters are updated after every single image.

Learning Rate Test

Best LR = 0.0007

A Final Test

Using the best hyperparameters, we can determine how well we did!

This project was built using tutorials in the PyTorch documentation:

https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html

https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html