Machine Learning Foundations: Part 4 - Coding with Convolutional Neural Networks

Alex Alex 27 May

In Part 3 - Convolutions and pooling, you learned all about convolutions and how they can use filters to extract information from images. You also saw how to create pools that can reduce and compress your images without losing the vital information that was extracted by the filters. In this part, you're going to get hands-on and create your own convolutional neural networks, so let's get started.

In earlier articles, for the simple neural network that classified fashion items or handwritten digits, you defined a model architecture like this.

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation=tf.nn.relu),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

You use layers, primarily Dense layers for densely connected neurons. To use convolutions and pooling, you have the Conv2D and MaxPooling2D layers, like this.

model = tf.keras.models.Sequential([
  tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)),
  tf.keras.layers.MaxPooling2D(2, 2),
  tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(2,2),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])

They can be stacked on top of your dense network. You define a convolutional layer with a number of parameters. In this case, the 64 is the number of filters for this layer.

Remember that the filters will be randomly initialized, and then the best filters to match the pictures to their labels will be learned over time. The 3x3 is the size of the filter. Earlier, we saw filters for the current pixel and its immediate neighbors that were 3 by 3, and that's what we're defining here.

As before, we have an input shape, which is the shape of the images being fed in: 28x28 with a single color channel, because the images are grayscale. Similarly, the pooling is done with a layer, and the 2x2 defines the size of the chunks to pool. So in this case, every 4 pixels will become 1.

There are other pooling types too, such as average pooling (AveragePooling2D in Keras), but we'll focus on MaxPooling2D here. These layers can then be stacked on top of each other, so the results of the 64 filters from the first convolutional layer will each be pooled, then passed through a second set of 64 filters, and those results will, of course, get pooled again.
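To make the "4 pixels become 1" idea concrete, here is a minimal pure-Python sketch of 2x2 max pooling on a tiny 4x4 grid. The function name max_pool_2x2 and the sample values are made up for illustration; this is not how Keras implements MaxPooling2D internally, only the same idea on a small scale.

```python
def max_pool_2x2(image):
    """Return a 2x2 max-pooled copy of a 2D list, dropping any odd trailing row or column."""
    pooled_rows = len(image) // 2
    pooled_cols = len(image[0]) // 2
    pooled = []
    for r in range(pooled_rows):
        row = []
        for c in range(pooled_cols):
            # Gather the four pixels in this 2x2 chunk and keep the largest.
            chunk = [
                image[2 * r][2 * c], image[2 * r][2 * c + 1],
                image[2 * r + 1][2 * c], image[2 * r + 1][2 * c + 1],
            ]
            row.append(max(chunk))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 "image" used only for this illustration.
image = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 4],
]
pooled = max_pool_2x2(image)
print(pooled)  # [[4, 5], [6, 7]]
```

Each 2x2 chunk of the input collapses to its largest value, so the 4x4 grid becomes 2x2.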

So let's take a look at the model's summary so we can see how the data is changing as it goes through the network.

model.summary()

You'll see something like this.

Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 26, 26, 64)        640       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 64)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        36928     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 128)               204928    
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290      
=================================================================
Total params: 243,786
Trainable params: 243,786
Non-trainable params: 0

There's a lot going on here, so let's unpack it. First of all, the initial output probably looks weird. Our images are 28x28 and we have 64 filters, so we'd expect each filter's output to be 28x28, but it's 26x26. This looks like a bug, but it isn't, so let me explain why.

Consider a picture like this one of a very sleepy puppy.

On the left, I've zoomed into the top left of the picture so you can see the pixels. When doing a filter, you scan every pixel and take its neighbors. But what happens if we pick the top-left pixel?

It doesn't have any neighbors above it and it doesn't have any to the left. Similarly, the next pixel along doesn't have any neighbors on top, but it does have some on the left. It's not until you move one pixel down and one pixel in from the corner that you reach a pixel with neighbors on all sides.

So a 3-by-3 filter requiring a neighbor on all sides can't work on the pixels around the edges of the picture. You effectively have to remove one pixel from the top, bottom, left and right, and this reduces your dimensions by 2 on each axis. So a 28x28 becomes a 26x26, which you can see here.
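The border effect can be checked with a small pure-Python sketch that slides a kernel only over positions where it fits entirely inside the image. The function name convolve_valid is hypothetical, and the loop is a simplified single-channel illustration (no bias, no activation), not the Keras implementation.

```python
def convolve_valid(image, kernel):
    """Slide a square kernel over a 2D list, visiting only positions where it fits entirely."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Multiply the kernel with the patch whose top-left corner is (r, c) and sum.
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A 28x28 image of ones and a 3x3 kernel of ones, chosen only to make the sizes easy to check.
image = [[1] * 28 for _ in range(28)]
kernel = [[1] * 3 for _ in range(3)]
result = convolve_valid(image, kernel)
print(len(result), len(result[0]))  # 26 26
```

In general, a k-by-k filter applied this way turns an n-by-n image into (n - k + 1) by (n - k + 1), which is where 28 - 3 + 1 = 26 comes from.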

Each filter will learn 9 values for the filter coefficients, plus a bias, for a total of 10 parameters. So the 64 filters have 640 learnable parameters.
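The same arithmetic also explains the 36,928 parameters reported for the second convolutional layer, where each filter spans all 64 channels coming out of the previous layer. The helper below is a hypothetical function written for this illustration, not part of the Keras API.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights per filter (kernel_size * kernel_size * in_channels) plus one bias, times filters."""
    return (kernel_size * kernel_size * in_channels + 1) * filters


# First layer: 3x3 filters over 1 input channel.
first_conv = conv2d_params(kernel_size=3, in_channels=1, filters=64)
# Second layer: 3x3 filters over the 64 channels produced by the first layer.
second_conv = conv2d_params(kernel_size=3, in_channels=64, filters=64)
print(first_conv)   # 640
print(second_conv)  # 36928
```

Both values match the Param # column in the summary above.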

Our pooling reduces the dimensionality by half on each axis, so 26x26 will become 13x13, but no parameters are learned on this layer.

The 3x3 filter then reduces 13x13 to 11x11 by removing a pixel border, like before.

The MaxPooling2D layer halves that, rounding down, so we end up with 5x5 images. At this point, we have 64 filters and the images are 5x5, for 25 pixels each. Multiply all that out, and you get 1,600 values, which is what the Flatten layer produces. This set of 1,600 values can then be classified with a dense network, as before.

Now that you've seen how the layers work, let's look at code that updates your fashion classifier from last time to use convolutions as well as dense layers, and see how that improves computer vision accuracy.

Here's the deep neural network that you've already created for the fashion_mnist dataset.
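The whole walk through the summary can be reproduced with a few lines of arithmetic. This is a sketch with hypothetical helper names (conv_out, pool_out); the figures 640 and 36,928 are copied from the summary above rather than computed here.

```python
def conv_out(size, kernel=3):
    """Side length after a filter that needs a full neighborhood around each pixel."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Side length after pooling, rounding down."""
    return size // pool


size = 28
size = conv_out(size)   # 26
size = pool_out(size)   # 13
size = conv_out(size)   # 11
size = pool_out(size)   # 5

flattened = size * size * 64          # 5 * 5 * 64 values out of Flatten
dense_params = flattened * 128 + 128  # weights plus one bias per neuron
output_params = 128 * 10 + 10         # weights plus one bias per class
total = 640 + 36928 + dense_params + output_params

print(flattened)      # 1600
print(dense_params)   # 204928
print(output_params)  # 1290
print(total)          # 243786
```

Every number lines up with the Output Shape and Param # columns, including the total of 243,786.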

import tensorflow as tf

# Load Fashion MNIST: 60,000 training and 10,000 test images, each 28x28 grayscale.
mnist = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()

# Scale pixel values from 0-255 down to 0-1.
training_images = training_images / 255.0
test_images = test_images / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation=tf.nn.relu),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(training_images, training_labels, epochs=5)

# With metrics=['accuracy'], evaluate() returns the loss followed by the accuracy.
test_loss, test_accuracy = model.evaluate(test_images, test_labels)

And we can see that we have Flatten(), followed by a Dense with 128 neurons, followed by another Dense with 10 neurons, because we have 10 classes. When I run this, I'm just going to train for five epochs. Let's see how quick it is and let's see how accurate it is.

Epoch 1/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.4966 - accuracy: 0.8262
Epoch 2/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3742 - accuracy: 0.8649
Epoch 3/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3378 - accuracy: 0.8751
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3145 - accuracy: 0.8848
Epoch 5/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2958 - accuracy: 0.8905
313/313 [==============================] - 0s 1ms/step - loss: 0.3629 - accuracy: 0.8710

We can see that after five epochs, it's up to about 89% accuracy on the training set and about 87% accuracy on the test set, which is really strong performance, considering it's only been five epochs.

So now let's take a look at what happens with a convolutional neural network. Here you can see the model architecture.

import tensorflow as tf
print(tf.__version__)

mnist = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()

# Add a trailing channel dimension so each image is 28x28x1, then scale to 0-1.
training_images = training_images.reshape(60000, 28, 28, 1)
training_images = training_images / 255.0
test_images = test_images.reshape(10000, 28, 28, 1)
test_images = test_images / 255.0

model = tf.keras.models.Sequential([
  # Two rounds of convolution and pooling extract features from the image.
  tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)),
  tf.keras.layers.MaxPooling2D(2, 2),
  tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(2,2),
  # The same dense classifier as before works on the extracted features.
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
model.fit(training_images, training_labels, epochs=5)

# With metrics=['accuracy'], evaluate() returns the loss followed by the accuracy.
test_loss, test_accuracy = model.evaluate(test_images, test_labels)

We have the same Flatten(), Dense(), Dense() that we had earlier. But in this case, on top of that, we have a couple of convolutional layers, each with its associated MaxPooling2D layer. Note that the input shape is 28x28x1, because the convolutional layer expects each image to have three dimensions, with the last one holding the color channels (just 1 for grayscale). That means we have to reshape our training and test image arrays. The training images were 60,000 by 28x28, so they become 60,000 by 28x28x1, and the test images become 10,000 by 28x28x1 with that extra dimension added.
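As a quick check of what the reshape does, here is a small NumPy sketch using a stand-in batch of four blank images rather than the real dataset. The array names are made up for this example; passing -1 to reshape and using np.expand_dims are two equivalent ways to add the channel dimension without hard-coding the number of images.

```python
import numpy as np

# Stand-in for a small batch of grayscale images: 4 images of 28x28 pixels.
images = np.zeros((4, 28, 28), dtype=np.uint8)

# -1 lets NumPy infer the number of images from the array's total size.
reshaped = images.reshape(-1, 28, 28, 1)

# expand_dims appends the channel axis without naming any of the other sizes.
expanded = np.expand_dims(images, axis=-1)

print(reshaped.shape)  # (4, 28, 28, 1)
print(expanded.shape)  # (4, 28, 28, 1)
```

Either form leaves the pixel values untouched; only the shape changes.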

So now when I run it, it's going to compile. It's going to show me the model architecture. And it's going to start training. 

Epoch 1/5
1875/1875 [==============================] - 80s 43ms/step - loss: 0.4342 - accuracy: 0.8417
Epoch 2/5
1875/1875 [==============================] - 78s 42ms/step - loss: 0.2897 - accuracy: 0.8933
Epoch 3/5
1875/1875 [==============================] - 78s 41ms/step - loss: 0.2446 - accuracy: 0.9086
Epoch 4/5
1875/1875 [==============================] - 77s 41ms/step - loss: 0.2116 - accuracy: 0.9213
Epoch 5/5
1875/1875 [==============================] - 78s 41ms/step - loss: 0.1832 - accuracy: 0.9319
313/313 [==============================] - 4s 12ms/step - loss: 0.2580 - accuracy: 0.9071

And in this case, with only five epochs of training, it's gone up to about 93% on the training data and just under 91% on the test data. So we can see it's actually improved, although each epoch now takes about 78 seconds instead of 3. It's a significant step in the right direction.

Next: Part 5 - Classifying real-world images
