Udacity Nanodegree Capstone — Dog Breed Classification

Parveen Khurana
9 min read · Apr 15, 2022
The image is taken from cdc.gov

Project Overview / Problem Introduction

As part of this project, a convolutional neural network (CNN) is trained to process real-world, user-supplied images, determine whether each one contains a dog or a human face, and then predict the dog breed that the dog or the human face most closely resembles.

Typically, CNNs are deep neural networks that require a massive amount of data and, consequently, a lot of compute and training time. To reach an acceptable level of accuracy, a pre-trained model can be leveraged: this reduces the training time while reusing the weights/parameters the model has already learned. This approach falls under the umbrella of transfer learning.

Project Goal

The intent is to train a dog breed classifier that detects whether the input image contains a dog face or a human face, predicts the breed when it is a dog, and returns the most closely resembling dog breed when it is a human face.

The model/algorithm is considered acceptable if it achieves at least 60% accuracy on the test set.

Strategy to solve the problem

In the Jupyter notebook template that Udacity provides, the strategy for solving this problem is as follows:

  • Step 0: Import Datasets
  • Step 1: Detect Humans
  • Step 2: Detect Dogs
  • Step 3: Create a CNN to Classify Dog Breeds (from Scratch)
  • Step 4: Use a CNN to Classify Dog Breeds (using Transfer Learning)
  • Step 5: Create a CNN to Classify Dog Breeds (using Transfer Learning)
  • Step 6: Write your Algorithm
  • Step 7: Test Your Algorithm

Before diving into each of the steps, let’s take a quick look at the metric used to evaluate the algorithm.


The metric used to evaluate the results, in this case, is the “accuracy” metric.

Here is a quick description of how accuracy is computed in machine learning: the trained model is used to make predictions, and the predictions are compared with the ground truth (the correct labels). The fraction of images/instances that the model identifies correctly is the accuracy.
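To make the definition concrete, here is a minimal sketch of the accuracy computation. The function name and the example labels are made up for illustration; they are not taken from the project notebook.

```python
import numpy as np


def accuracy(predicted_labels, true_labels):
    """Fraction of instances whose predicted label equals the ground truth."""
    predicted_labels = np.asarray(predicted_labels)
    true_labels = np.asarray(true_labels)
    if predicted_labels.shape != true_labels.shape:
        raise ValueError("predictions and labels must have the same shape")
    return float(np.mean(predicted_labels == true_labels))


# Hypothetical example: 5 images, labels are breed indices in the range 0..132
example_predictions = [3, 57, 12, 12, 90]
example_truth = [3, 57, 12, 40, 91]

# The first three predictions match the ground truth, the last two do not
example_accuracy = accuracy(example_predictions, example_truth)
print(f"accuracy: {example_accuracy:.0%}")
```

With one-hot encoded targets, the same idea applies after taking the argmax of each prediction and target row.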

Exploratory Data Analysis

The Udacity team has provided ~8300 labeled dog images where the label is one of the 133 possible dog breeds. The data is split into the train, test, and validation sets.

As this is a classification problem, it makes sense to look at how the images are distributed across the 133 classes/breeds.

The class counts show that the dataset is not perfectly balanced, but the good thing is that even the least-represented breed has a decent number (greater than 25) of training examples.
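A class distribution like this can be computed directly from one-hot encoded targets. The sketch below is illustrative only: the helper names are hypothetical, and the toy data with three made-up breeds stands in for the real training targets.

```python
import numpy as np


def class_counts(one_hot_targets):
    """Number of examples per class, given rows of shape (n_samples, n_classes)."""
    return np.asarray(one_hot_targets).sum(axis=0).astype(int)


def summarize_balance(one_hot_targets, class_names):
    """Report the least and most represented classes and the mean class size."""
    counts = class_counts(one_hot_targets)
    return {
        "min": (class_names[int(counts.argmin())], int(counts.min())),
        "max": (class_names[int(counts.argmax())], int(counts.max())),
        "mean": float(counts.mean()),
    }


# Toy stand-in data (assumption): 6 images spread over 3 made-up breeds
toy_names = ["breed_a", "breed_b", "breed_c"]
toy_labels = np.array([0, 0, 0, 1, 2, 2])
toy_targets = np.eye(len(toy_names))[toy_labels]  # one-hot encode the labels

toy_summary = summarize_balance(toy_targets, toy_names)
print(toy_summary)
```

On the real data, the same call would take the training targets and the list of breed names.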


Step 0: Import the Dog dataset

This step imports a dataset of dog images. The load_files function from the scikit-learn library is used to populate a few variables:

  • train_files, valid_files, test_files - NumPy arrays containing file paths to images
  • train_targets, valid_targets, test_targets - NumPy arrays containing one-hot encoded classification labels
  • dog_names - list of string-valued dog breed names for translating labels
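As a rough sketch of how these variables could be populated, the helper below wraps scikit-learn's load_files, which treats each sub-folder as one class. The function name, the directory names in the comments, and the use of np.eye for one-hot encoding are assumptions for illustration; the notebook's own code may differ.

```python
import numpy as np
from sklearn.datasets import load_files


def load_dataset(path, num_classes=133):
    """Return (file paths, one-hot targets, class names) for a folder-per-class dataset."""
    # load_content=False keeps only the file paths instead of reading the image bytes
    data = load_files(path, load_content=False)
    files = np.array(data["filenames"])
    # One-hot encode the integer labels by indexing into an identity matrix
    targets = np.eye(num_classes)[np.array(data["target"])]
    names = list(data["target_names"])
    return files, targets, names


# Hypothetical usage; the directory names are placeholders, not confirmed paths:
# train_files, train_targets, dog_names = load_dataset("dogImages/train")
# valid_files, valid_targets, _ = load_dataset("dogImages/valid")
# test_files, test_targets, _ = load_dataset("dogImages/test")
```

The class names come from the sub-folder names in sorted order, so each label index maps back to a breed name.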

Step 1: Detect Humans

This step uses a pre-trained face detector from OpenCV to find human faces in a given image.

In order to use any of the face detectors, the images must first be converted to grayscale. The detectMultiScale function executes the classifier stored in face_cascade and takes the grayscale image as a parameter.

Here, faces is the NumPy array of detected faces returned by detectMultiScale, where each row corresponds to a detected face.

Each detected face is a 1D array with four entries that specifies the bounding box of the detected face.

The first two entries in the array (extracted in the above code as x and y) specify the horizontal and vertical positions of the top left corner of the bounding box. The last two entries in the array (extracted here as w and h) specify the width and height of the box.
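To illustrate the bounding-box layout, the sketch below works on an array shaped like the detector output. It does not call OpenCV; the blank image and the two detections are made-up stand-ins, and the helper names are hypothetical.

```python
import numpy as np


def box_corners(faces):
    """Convert (x, y, w, h) rows to (x1, y1, x2, y2) corner coordinates."""
    faces = np.asarray(faces).reshape(-1, 4)
    corners = faces.copy()
    corners[:, 2] = faces[:, 0] + faces[:, 2]  # right edge = x + w
    corners[:, 3] = faces[:, 1] + faces[:, 3]  # bottom edge = y + h
    return corners


def crop_faces(image, faces):
    """Return one cropped sub-image per (x, y, w, h) bounding box."""
    crops = []
    for (x, y, w, h) in faces:
        # Image arrays are indexed [row, column], i.e. [vertical, horizontal]
        crops.append(image[y:y + h, x:x + w])
    return crops


# Made-up stand-ins: a blank 100x120 color image and two fake detections
image = np.zeros((100, 120, 3), dtype=np.uint8)
faces = np.array([[10, 20, 30, 40], [50, 5, 25, 25]])

corners = box_corners(faces)
crops = crop_faces(image, faces)
print(corners)
print([crop.shape for crop in crops])
```

Note that x and w index columns while y and h index rows, which is why the crop slices on y first.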

Step 2: Detect Dogs

When using TensorFlow as the backend, Keras CNNs require a 4D array (which we’ll also refer to as a 4D tensor) as input, with the following shape:

(nb_samples, rows, columns, channels)

Here, nb_samples corresponds to the total number of images (or samples), and rows, columns, and channels correspond to the number of rows, columns, and channels of each image, respectively.

The path_to_tensor function takes a string-valued file path to a color image as input and returns a 4D tensor suitable for supplying to a Keras CNN.

  • It first loads the image and resizes it to a square image that is 224×224 pixels.
  • The image is then converted to an array, which is then expanded to a 4D tensor by adding a leading sample dimension.
  • In this case, since we are dealing with color images, each image has three channels (RGB).
  • Likewise, since we are processing a single image (or sample), the returned tensor will always have the shape (1, 224, 224, 3).

The paths_to_tensor function takes a NumPy array of string-valued image paths as input and returns a 4D tensor with the shape:

(nb_samples, 224, 224, 3)

Here, nb_samples is the number of samples, or the number of images, in the supplied array of image paths.

Think of nb_samples as the number of 3D tensors (where each 3D tensor corresponds to a different image) in the dataset.
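The shape bookkeeping can be sketched with NumPy alone. The helpers below skip the image loading and resizing and start from arrays that are already 224×224×3; the function names and the zero-filled stand-in images are assumptions for illustration.

```python
import numpy as np


def image_to_tensor(image_array):
    """Add a leading sample axis: (224, 224, 3) -> (1, 224, 224, 3)."""
    return np.expand_dims(image_array, axis=0)


def images_to_tensor(image_arrays):
    """Stack per-image 4D tensors into one (nb_samples, 224, 224, 3) tensor."""
    return np.vstack([image_to_tensor(array) for array in image_arrays])


# Zero-filled stand-ins for five already-resized RGB images
fake_images = [np.zeros((224, 224, 3), dtype=np.float32) for _ in range(5)]

single = image_to_tensor(fake_images[0])
batch = images_to_tensor(fake_images)
print(single.shape)
print(batch.shape)
```

Stacking along the first axis is what turns several single-image tensors into one batch.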

Step 3: Create a CNN to classify Dog breeds

Now that there are functions for detecting humans and dogs in images, we need a way to predict breeds from images. In this step, images are pre-processed and a CNN architecture is defined to be trained on the dataset.

The intent here is just to reach at least 1% accuracy on the test set: there are 133 possible breeds/categories in the dataset, so a random guess gives the correct answer roughly 1 time in 133, which corresponds to an accuracy of less than 1%.

Model architecture:

There are 3 convolutional layers and 3 max-pooling layers in this architecture, followed by an average pooling layer and a dense layer at the end.

A kernel size of 2 and the “ReLU” activation are used in all of the convolutional layers.

The dense layer at the end is a fully connected layer that produces the output for the 133 breeds.
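An architecture along these lines could be sketched in Keras as follows. Only the layer types, the kernel size of 2, and the ReLU activation come from the description above. The filter counts (16, 32, 64), the "same" padding, the pool size, the choice of global average pooling, and the softmax output are assumptions for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_scratch_cnn(num_classes=133, input_shape=(224, 224, 3)):
    """Three conv + max-pool stages, then average pooling and a dense output layer."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        # Filter counts below are assumed, not taken from the article
        layers.Conv2D(16, kernel_size=2, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(32, kernel_size=2, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(64, kernel_size=2, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        # Collapse each feature map to a single value before the dense layer
        layers.GlobalAveragePooling2D(),
        layers.Dense(num_classes, activation="softmax"),
    ])


scratch_model = build_scratch_cnn()
scratch_model.summary()
```

Each max-pooling stage halves the spatial size, so a 224×224 input reaches the pooling layer as 28×28 feature maps.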

Train the model

During training, model checkpointing is used to save the model that attains the best validation loss.
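Here is a minimal sketch of that checkpointing pattern with Keras' ModelCheckpoint callback. The tiny random dataset, the toy model, the file name, the optimizer, the loss, and the epoch and batch settings are all assumptions chosen so the example runs quickly; they are not the project's settings.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def fit_and_restore_best(model, train, valid, weights_path, epochs=3, batch_size=20):
    """Train, save weights whenever validation loss improves, then reload the best ones."""
    checkpointer = keras.callbacks.ModelCheckpoint(
        filepath=weights_path,
        monitor="val_loss",
        save_best_only=True,      # overwrite the file only on improvement
        save_weights_only=True,
    )
    history = model.fit(
        train[0], train[1],
        validation_data=valid,
        epochs=epochs,
        batch_size=batch_size,
        callbacks=[checkpointer],
        verbose=0,
    )
    # Restore the weights from the epoch with the lowest validation loss
    model.load_weights(weights_path)
    return history


# Tiny random stand-in data: 40 images of size 8x8x3 spread over 4 classes
rng = np.random.default_rng(0)
x = rng.random((40, 8, 8, 3)).astype("float32")
y = np.eye(4)[rng.integers(0, 4, size=40)].astype("float32")

demo_model = keras.Sequential([
    keras.Input(shape=(8, 8, 3)),
    layers.GlobalAveragePooling2D(),
    layers.Dense(4, activation="softmax"),
])
demo_model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])

weights_path = os.path.join(tempfile.mkdtemp(), "weights.best.demo.weights.h5")
history = fit_and_restore_best(demo_model, (x[:30], y[:30]), (x[30:], y[30:]), weights_path)
```

Reloading the saved weights after training matters because the final epoch is not necessarily the one with the best validation loss.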

Load the model with the best validation loss

Test the model

Step 4: Use a CNN to Classify Dog Breeds (using Transfer Learning)

To reduce training time without sacrificing accuracy, the Udacity team has provided a snippet to show how to train a CNN using transfer learning.

In the following step, you will get a chance to use transfer learning to train your own CNN.

A pre-trained VGG-16 model is used, and a global average pooling layer and a fully connected layer are added on top of it.

The same process as in step 3 is repeated: the model is trained, the model with the best validation loss is stored, and it is then loaded again and tested on the test set.

With transfer learning and the learned weights in place, an accuracy of ~39% has been achieved on the test set.
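The head described here could look roughly like the sketch below. It assumes the new layers are trained on pre-computed VGG-16 features of shape (7, 7, 512), which is what VGG-16's last convolutional block produces for a 224×224 input. Whether the notebook feeds pre-computed features or the full network, and the softmax output, are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_transfer_head(feature_shape=(7, 7, 512), num_classes=133):
    """Global average pooling plus one dense layer on top of pre-trained features."""
    return keras.Sequential([
        keras.Input(shape=feature_shape),
        # Average each of the 512 feature maps down to a single number
        layers.GlobalAveragePooling2D(),
        # One fully connected layer with one output per breed
        layers.Dense(num_classes, activation="softmax"),
    ])


vgg16_head = build_transfer_head()
vgg16_head.summary()
```

Only the dense layer has trainable parameters here (512 × 133 weights plus 133 biases), which is why training such a head is fast compared with training a CNN from scratch.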

Step 5: Create a CNN to Classify Dog Breeds (using Transfer Learning)

This is the same as step 4, except that a different pre-trained model is used; in step 4, sample code was already provided by the Udacity team to demonstrate how to work with transfer learning.

I’ve used the pre-trained ResNet-50 model for this step.

The model is compiled and trained, the best model according to the validation loss is saved, and it is then loaded into memory and tested on the test set.

A test accuracy of 81% has been achieved with this model.

A function is then defined that uses this trained model to make predictions on new input.
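Such a prediction function boils down to taking the most probable class and mapping it back to a breed name. In the sketch below, predict_fn stands in for the trained model's prediction call, and the three breed names and the probabilities are made up for illustration.

```python
import numpy as np


def predict_breed(image_tensor, predict_fn, dog_names):
    """Return the breed name with the highest predicted probability."""
    probabilities = np.asarray(predict_fn(image_tensor))
    # argmax over the flattened (1, n_classes) output gives the class index
    return dog_names[int(np.argmax(probabilities))]


# Hypothetical stand-ins for the real breed list and the trained model
dog_names = ["Affenpinscher", "Afghan_hound", "Airedale_terrier"]


def fake_predict(image_tensor):
    # Pretend the model puts most of its probability mass on the second breed
    return np.array([[0.1, 0.7, 0.2]])


predicted = predict_breed(np.zeros((1, 224, 224, 3)), fake_predict, dog_names)
print(predicted)
```

In the real pipeline, predict_fn would run the image through the pre-trained network and the trained head.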

Step 6: Write your own algorithm

Here the intent is to make predictions for new input. The input is first passed through the dog detector (defined in step 2) to see if it indeed contains a dog. If not, it is passed through the human face detector (defined in step 1) to check if it contains a human face.

After the dog/human identification, its breed, or the closest breed in the case of a human face, is predicted.
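The control flow of this step can be sketched as a small function that receives the detectors and the breed predictor as arguments. The function name, the return format, and the lambda stand-ins below are assumptions for illustration; the real detectors work on image content, not file names.

```python
def classify_image(img_path, dog_detector, face_detector, predict_breed):
    """Return (kind, breed), where kind is 'dog', 'human', or 'neither'."""
    # Check for a dog first, as described above
    if dog_detector(img_path):
        return "dog", predict_breed(img_path)
    # Fall back to the human face detector
    if face_detector(img_path):
        return "human", predict_breed(img_path)
    # Neither detector fired, so no breed is predicted
    return "neither", None


# Toy stand-ins that only look at the file name
is_dog = lambda path: "dog" in path
is_human = lambda path: "human" in path
toy_breed = lambda path: "Beagle"

results = [
    classify_image(path, is_dog, is_human, toy_breed)
    for path in ["dog_1.jpg", "human_1.jpg", "cat_1.jpg"]
]
print(results)
```

Passing the detectors in as arguments keeps the decision logic easy to test without loading any model.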

Step 7: Test your Algorithm

A few images are run through the algorithm to see its performance. Here are the snippets from the test runs:

Here the input image is a human face, and it is correctly identified by the algorithm.

Here again, the input image is a human face, and it is correctly identified by the algorithm.

For input image 3, the input is neither a dog nor a human but a cat, and this is identified correctly by the algorithm.

For input image 4, the input is a dog, and this is identified correctly by the algorithm.

For input images 5 and 6, both inputs contain a dog face, which is correctly identified by the algorithm.


The test set accuracy achieved with the three models is summarized below:

  • CNN from scratch: ~1%
  • CNN from VGG16 (using transfer learning): ~40%
  • CNN from ResNet50 (using transfer learning): ~80%

The difference in accuracy across the three models comes from their different architectures and the different total number of parameters in each approach. Typically, more parameters increase accuracy, but only up to a point: beyond it, the model may start to overlearn from the training input and overfit, which can take a toll on accuracy on unseen data.

In this case, test accuracy has increased with each more advanced model, which suggests that overfitting is not the limiting factor here.


Deep Learning has really accelerated the intelligence of the systems and is now penetrating almost every field.

In this project, the idea was to use a convolutional neural network to train a classifier that identifies whether the input image contains a dog or a human face and predicts the closest breed.

A test accuracy of ~80% has been achieved by leveraging transfer learning with ~60 seconds of training time. This could be improved further by augmenting the dataset, or perhaps by trying a more complex architecture with more parameters that can learn more complex patterns.

Suggested Improvements

The following techniques could be tried to see how they affect model training and accuracy:

  • Augmenting the dataset
  • Experimenting with the Neural Network architecture: kernel size, number of layers, type of pooling, and so on
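As one possible starting point for the augmentation idea, the sketch below doubles a batch by adding horizontally flipped copies. It is a deliberately simple, NumPy-only illustration with a hypothetical function name and toy data; other transformations such as rotations or shifts would be natural next steps.

```python
import numpy as np


def augment_with_flips(images, targets):
    """Append a horizontally flipped copy of every image, keeping its label."""
    images = np.asarray(images)
    targets = np.asarray(targets)
    # Axis 2 is the columns axis of a (nb_samples, rows, columns, channels) tensor
    flipped = images[:, :, ::-1, :]
    return np.concatenate([images, flipped]), np.concatenate([targets, targets])


# Toy batch: 4 random 8x8 RGB images with one-hot labels over 3 classes
rng = np.random.default_rng(0)
toy_images = rng.random((4, 8, 8, 3))
toy_targets = np.eye(3)[[0, 1, 2, 0]]

aug_images, aug_targets = augment_with_flips(toy_images, toy_targets)
print(aug_images.shape, aug_targets.shape)
```

A mirrored dog is still the same breed, so the labels can be reused unchanged for the flipped copies.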