I hope you liked the picture of the rubber ducks.

In actuality, it’s pretty crazy how fast the human brain classifies images. We’re so complex, and dare I say, intelligent.

I hesitated to say intelligent because this is what the majority of my demographic does:

  • plays fortnite
  • vapes

Nonetheless, I think it’s amazing that when most humans look at the first picture, we reach the same conclusion: rubber ducks.

The human race often sucks at “reaching the same conclusion”. Just look at politics.

Yet when it comes to image classification, we almost always reach a consensus.

This picture of simple rubber ducks is actually really complex.

First of all, the yellow ducks are on a yellow shelf, so it’s hard to see them. The angle that this image is taken at also causes the sizes to vary. These ducks are made of 4 different components: eyes, body, head and beak. However, in some of the ducks near the back, we can only see their body, head and beak. The ducks towards the bottom right corner look like they have a beak and maybe a head. The actual features of the ducks also vary in size and shape (e.g. head size, beak size/orientation/shape).

This picture has tons of variables. Size, angle, orientation, etc.

But we conclude it’s a bunch of rubber ducks.

When we look at this image, we also see rubber ducks.

Except these ducks look significantly different than the first duckies. Yet we’d all say that both of these images contain rubber ducks.

As a human race, we’ve come to agree on what rubber ducks and other items look like (we have yet to agree upon a good approach to immigration or gun control, but let’s just admire the fact that we can agree on something).

How do we classify rubber ducks?

Well, most rubber ducks are yellow.

However, once in a while we might have a duck look like this:

or this:

So rubber ducks are mostly yellow, but can sometimes be red or blue. They also normally have eyes, a beak and a rounded head. The many images of ducks I’ve exposed you to have many similarities. And although they differ in things like eye colour, skin colour, size and beak shape, we are still able to classify them as rubber ducks.

Imagine having a computer do this. It would be very hard since there are so many variables to what a rubber duck truly is.

However, using artificial intelligence (a field of computer science giving machines the ability to learn like humans), we can replicate the human process to classify images, such as rubber ducks.

Using CNNs to Detect Images

Really creative title, I know.

Let me break some stuff down:

Artificial intelligence (but the cool kids call it AI) is basically a technology that makes machines smart by mimicking human brain structure.

The brain is composed of neurons which transmit signals to each other. We copy this form in AI using artificial neurons.

In the brain, we have a BUNCH of neurons, which form a network. In AI we do this artificially with an artificial neural network.

Copying human intelligence >> artificial intelligence

Copying neurons in the brain >> artificial neurons

Copying the brain’s neural networks >> artificial neural networks

See the trend?

We’re copying biological processes artificially. So how can we get a computer to classify images like the human brain? Artificial neural networks.

Now, artificial neural networks (or ANNs for short) are just a big umbrella term describing the structure of AI, i.e. neural nets. We can’t just use “an ANN” to classify images. That’s like saying you want ice cream. Well, just like there are a lot of different ice cream flavours (mint chip, mango, caramel…), there are a lot of ANNs. A full list can be found here.

I’m just gonna cut to the chase. A common neural net for classifying images is a CNN or convolutional neural network.

For more info on the structure of neural networks / how they work read here (scroll down to the section “breaking down how neural networks work”)

How does a CNN work????

Another creative title.

To us humans, the white portion of these images looks like the letter X. But computers take things literally, so they wouldn’t think both of these images are the letter X. Since the pixel values aren’t identical, computers would classify these as two different things.

However, a CNN would look at these two images and, rather than matching exact pixel values across the whole thing, compare parts of the images. CNNs break down images into features.

In the example of our beloved rubber ducks, we can break them down into features: the beak, body, eyes and head. We can even be more specific. In a CNN, we can have a layer dedicated to examining all these features in order to classify a duck.
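To make the “computers take things literally” point concrete, here is a minimal sketch in plain Python. The two tiny 5x5 images are made up for illustration (1 is a white pixel, -1 is a black pixel); they are not the images from this article.

```python
# Hypothetical 5x5 image of an X: 1 = white pixel, -1 = black pixel.
x_original = [
    [ 1, -1, -1, -1,  1],
    [-1,  1, -1,  1, -1],
    [-1, -1,  1, -1, -1],
    [-1,  1, -1,  1, -1],
    [ 1, -1, -1, -1,  1],
]

# Same X, but with one extra white pixel on the lower-left stroke.
x_variant = [row[:] for row in x_original]
x_variant[3][0] = 1

# A literal, pixel-by-pixel comparison says these are different images.
identical = x_original == x_variant

# Count how many of the 25 pixels actually agree.
matching = sum(
    a == b
    for row_a, row_b in zip(x_original, x_variant)
    for a, b in zip(row_a, row_b)
)

print(identical)       # False
print(matching, "/ 25 pixels match")
```

A human would call both of these an X without blinking, yet an exact comparison fails because of a single pixel.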

We store a bunch of features in the CNN. If it was for classifying ducks, we would store features like “beak”, “eyes”, “head” and “body”.

Once we have a bunch of features, we would do something called “feature mapping”

Feature mapping is when we observe the image in parts, looking for a specific feature. We do this by using a filter.

We test for multiple features throughout the image, then, using some math, we score how well each part of the image matches the feature. We then record these match scores as numbers, one per position.

E.g. For the feature (represented by the middle square), in the image (first square), we can determine where the feature is based on the match scores outlined in the third square, aka the feature map (where 1.00 means the feature is a 100% match and -1.00 means the feature is a 0% match). In order to determine the values on the feature map, there is a series of mathematical steps, but that’s an article for another time.

The feature map (surprisingly) maps out where the feature occurs.

We compile a bunch of these feature maps, based on different features, to understand the features in an image.

This process of turning an image + the filter into a feature map is called creating a convolutional layer.
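If you’re curious what sliding a filter over an image might look like in code, here is a rough sketch in plain Python. It assumes one common way of scoring a match (multiply the filter and the image patch pixel by pixel, then average), which gives 1.00 for a perfect match and -1.00 when no pixel matches. The function name `feature_map`, the 5x5 image and the diagonal filter are all made up for illustration, and real libraries do this far more efficiently.

```python
def feature_map(image, feature):
    """Slide `feature` over `image` and score the match at every position."""
    fh, fw = len(feature), len(feature[0])
    out_h = len(image) - fh + 1
    out_w = len(image[0]) - fw + 1
    result = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0
            for i in range(fh):
                for j in range(fw):
                    # +1 when the pixels agree, -1 when they disagree
                    total += image[top + i][left + j] * feature[i][j]
            # Average over the filter size so scores fall between -1 and 1
            row.append(round(total / (fh * fw), 2))
        result.append(row)
    return result


# Hypothetical 5x5 image of an X: 1 = white pixel, -1 = black pixel.
image = [
    [ 1, -1, -1, -1,  1],
    [-1,  1, -1,  1, -1],
    [-1, -1,  1, -1, -1],
    [-1,  1, -1,  1, -1],
    [ 1, -1, -1, -1,  1],
]

# Hypothetical feature: a diagonal stroke going from top-left to bottom-right.
diagonal = [
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
]

fmap = feature_map(image, diagonal)
for row in fmap:
    print(row)
```

The top-left and bottom-right corners of the resulting 3x3 map score 1.0, because that is exactly where the diagonal stroke sits in the X.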

We can create a Max Pooling Layer

Now, these feature maps have lots of numbers. Tons of numbers can stress me and the computer out. However, thankfully we can simplify these feature maps!!!!

We do this through Pooling.

No! Not that type of pooling

Not this kind either!

Ahhhhh… this kind of pooling!

We use the same flashlight technique, to simplify the feature map.

Using a Max() function, we go through different areas of the map (in the image above we follow a 2x2 method, below we go through 3x3 areas) — this area value can vary. While we go through each area, we look for the largest value. Whatever the largest value is, we record that number on the max pooled layer.

Here’s another GIF to help illustrate that we’re taking the max value from each input section, and putting that value in the output, pooled layer.

The max pooling layer simplifies a feature map and makes it smaller.
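Here is a small sketch of max pooling in plain Python, assuming a 2x2 window that moves 2 steps at a time. The function name `max_pool` and the numbers in the example feature map are made up for illustration.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep only the largest value from each size x size window."""
    pooled = []
    for top in range(0, len(feature_map), stride):
        row = []
        for left in range(0, len(feature_map[0]), stride):
            # Collect every value inside the current window
            # (windows at the edge may be smaller than size x size)
            window = [
                value
                for map_row in feature_map[top:top + size]
                for value in map_row[left:left + size]
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map
fmap = [
    [ 0.77, -0.11,  0.11,  0.33],
    [-0.11,  1.00, -0.11,  0.33],
    [ 0.11, -0.11,  1.00, -0.33],
    [ 0.33,  0.33, -0.33,  0.55],
]

pooled = max_pool(fmap)
print(pooled)  # [[1.0, 0.33], [0.33, 1.0]]
```

The 4x4 map shrinks to 2x2, but the strongest matches (the 1.00 values) survive.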

We can also create a ReLU layer


ReLU = Rectified Linear Unit


It’s a function that looks like this:

ReLU is an activation function, just like sigmoid, but the math is different.

In ReLU, every negative number, no matter its value, turns into 0. But every positive number stays the same. E.g. -321 → 0 and 321 → 321, or -67 → 0 and 67 → 67.

When we create a ReLU layer, we take the un-simplified (unpooled) feature map and pass it through the ReLU function. All the negative values turn to 0, and all the positive values stay the same. We create a layer with no negative values.
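In code, ReLU is about as simple as it gets. A minimal sketch in plain Python (the names `relu` and `relu_layer` and the example feature map are made up for illustration):

```python
def relu(value):
    """Negative numbers become 0, everything else stays the same."""
    return max(0, value)


def relu_layer(feature_map):
    """Apply ReLU to every value in a feature map."""
    return [[relu(value) for value in row] for row in feature_map]


print(relu(-321), relu(321))  # 0 321
print(relu(-67), relu(67))    # 0 67

# Hypothetical 2x2 feature map
fmap = [
    [ 0.77, -0.11],
    [-0.33,  1.00],
]
print(relu_layer(fmap))  # [[0.77, 0], [0, 1.0]]
```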

Note* the feature map layer can also be referred to as the convolution layer.

Once we have 3 layers (convolution, ReLU, pooling) we stack them together. This makes the output of one layer the input of the next.

The convolution layer is where we scan a feature across the image to create a feature map. This map (the output) then becomes the input for the next layer, in this case, ReLU. The output of the ReLU layer becomes the input of the pooling layer.

To make our CNN more accurate, we can repeat layers:

Each time an image goes through a convolution layer, it gets more filtered. Every time it goes through a pooling layer, it gets smaller.
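To see the stacking and shrinking in action, here is a rough sketch in plain Python that chains convolution → ReLU → pooling and then runs the whole stack twice. Everything here is an assumption for illustration: the function names, the 9x9 X image, and especially reusing the same filter in the second round (a real CNN would learn different filters for each layer).

```python
def convolve(image, feature):
    """Score how well `feature` matches at every position of `image`."""
    fh, fw = len(feature), len(feature[0])
    return [
        [
            sum(
                image[top + i][left + j] * feature[i][j]
                for i in range(fh)
                for j in range(fw)
            ) / (fh * fw)
            for left in range(len(image[0]) - fw + 1)
        ]
        for top in range(len(image) - fh + 1)
    ]


def relu_layer(fmap):
    """Turn every negative value into 0."""
    return [[max(0, value) for value in row] for row in fmap]


def max_pool(fmap, size=2):
    """Keep the largest value from each size x size window."""
    return [
        [
            max(v for r in fmap[top:top + size] for v in r[left:left + size])
            for left in range(0, len(fmap[0]), size)
        ]
        for top in range(0, len(fmap), size)
    ]


def conv_block(image, feature):
    """One stack: the output of each layer is the input of the next."""
    return max_pool(relu_layer(convolve(image, feature)))


def shape(grid):
    return (len(grid), len(grid[0]))


# Hypothetical 9x9 image of an X: 1 on the two diagonals, -1 elsewhere.
size = 9
image = [
    [1 if r == c or r + c == size - 1 else -1 for c in range(size)]
    for r in range(size)
]
feature = [
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
]

once = conv_block(image, feature)
twice = conv_block(once, feature)
print(shape(image), shape(once), shape(twice))  # (9, 9) (4, 4) (1, 1)
```

Each pass through the stack leaves us with a smaller grid of numbers than we started with.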

Then, in the end, we take our fully filtered and shrunk values and compile them into what’s called a fully connected (or dense) layer.

Creating the connected layer

We make a connected layer for each of our possible outputs. If we were deciding between Xs and Os, we would have one connected layer for Xs and another for Os. (In most libraries, these are written as the output nodes of a single fully connected layer, one node per possible output, each with its own set of weights.)

If we were deciding between cats and dogs, we would have one connected layer for cats and another for dogs.

If we were creating a CNN for all the letters in the alphabet, we would have one connected layer for each of the 26 letters (26 connected layers).

… Get the point?

Let’s use the example of classifying Xs vs Os.

These are the final connected layers for X vs O. Both of these layers stress different values, and have weights connected to each pixel in the connected layer.

So, using this prediction model, when we get a new set of numbers, we compare it to the connected layer based on the weights.

  • The thickness of the line determines how high of a weight is given to each pixel. At first these weights are random, but as the machine learns, it will readjust these values.

The CNN would calculate the probability for each outcome (X, O) based on weights and math. It would come up with a prediction score for each outcome, and the outcome with the highest score is the one the neural network will choose.

In this case, the neural net gives X a 0.92 score and O a 0.51 score.

These votes create a fully connected layer. It takes in a bunch of inputs and determines weighted outputs.

In order to improve the accuracy of the net, we can stack many fully connected layers (or you can call them “voting” layers). Based on the inputs, the computer will determine the probability of each of the outputs (in this case X or O).
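Here is a tiny sketch of the voting idea in plain Python. Each input value votes for X and for O, and its weight decides how much that vote counts. The input values, the weights and the function names `vote` and `classify` are all made up for illustration, so the scores won’t match the 0.92 and 0.51 from the example above.

```python
def vote(values, weights):
    """Weighted sum: each value's vote counts as much as its weight."""
    return sum(value * weight for value, weight in zip(values, weights))


def classify(values, weights_per_outcome):
    """Score every outcome and pick the one with the highest score."""
    scores = {
        outcome: vote(values, weights)
        for outcome, weights in weights_per_outcome.items()
    }
    winner = max(scores, key=scores.get)
    return winner, scores


# Hypothetical pooled feature map, flattened into a single list
values = [1.0, 0.55, 0.55, 1.0]

# Hypothetical weights. In a real network these start out random
# and get adjusted as the machine learns.
weights_per_outcome = {
    "X": [0.4, 0.1, 0.1, 0.4],
    "O": [0.1, 0.4, 0.4, 0.1],
}

winner, scores = classify(values, weights_per_outcome)
print(winner, scores)
```

With these made-up weights, the values that matter most for an X are the strongest ones, so X wins the vote.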

Voila! A CNN. This is what it will do:

We just uncovered the mystery of what’s inside the CNN black box… I mean it’s really a white box… but we’re still going to call it a black box.

We can think of this compilation of layers as ingredients to a fancy chocolate chip cookie. The input image can be the basic, boring ingredients no one really cares about (flour, sugar, butter… what are cookies even made out of?). But the different layers are where the fun kicks in. Convolution layer = chocolate chips, ReLU = caramel chunks, pooling layer = marshmallows, fully connected (voting layer) = sprinkles. Bomb cookie. Bomb convolutional neural network.

This is the overall structure of a CNN:

When we set up a CNN, we don’t actually need to know all the nitty-gritty (weights for the fully connected layer, features for the convolutional layer, etc).

We get these numbers and values through processes called back-propagation and gradient descent. These will be discussed in more detail in an upcoming article.
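As a small teaser, here is a bare-bones sketch of gradient descent in plain Python. It is not a CNN: it learns a single weight so that weight × input matches the target, which is the same “nudge the weights to reduce the error” idea at its smallest scale. The function name `train_weight`, the data, the learning rate and the starting weight of 0 (instead of a random one, so the result is repeatable) are all assumptions for illustration.

```python
def train_weight(inputs, targets, weight=0.0, learning_rate=0.1, steps=50):
    """Learn one weight so that weight * input gets close to the target."""
    for _ in range(steps):
        # Slope of the mean squared error with respect to the weight
        gradient = sum(
            2 * (weight * x - t) * x for x, t in zip(inputs, targets)
        ) / len(inputs)
        # Step in the direction that makes the error smaller
        weight -= learning_rate * gradient
    return weight


# Hypothetical data where the "right" weight is 2 (target = 2 * input)
learned = train_weight([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(learned, 4))  # 2.0
```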

Recap Steps for a CNN

  1. Do feature mapping → create a convolutional layer
  2. Simplify your feature map using max pooling
  3. Create a ReLU layer using the feature map
  4. Create a fully connected layer
  5. Connect all of these layers → CNN

Now, computers and humans will be able to classify rubber ducks.

If you enjoyed this, definitely clap + stay tuned for more coming up!

17 yo building better maternal healthcare in developing countries.