An Introduction to Deep Learning

Book Summary: Chapter 1 of Deep Learning by Ian Goodfellow, Yoshua Bengio, Aaron Courville

Dickson Wu
12 min read · Sep 1, 2021

Humans have long dreamed of creating intelligent machines. The Ancient Greeks dreamed of gods creating such artificial life. But after thousands of years, we have become those gods.

Artificial Intelligence (AI) is now a booming field that aims to solve the world's biggest problems! It has countless applications and research directions. Early AI focused on tasks that were hard for humans — tasks that could be described by formal mathematical rules.

Take chess for example. IBM's Deep Blue beat Garry Kasparov back in 1997. But chess isn't that complicated: it has a clearly defined environment and clearly defined rules. You can encode those rules into a program and it can work out the rest by itself.

But then AI moved on to tasks that are easy for humans → That was the real challenge. How can you mathematically tell the difference between a picture of a bee and a picture of a three? It's automatic for us, but machines have a really hard time with it. That's the challenge: encoding these intuitive rules into machines.

And that's what this book is about — solving problems that are easy for people but hard to formalize mathematically. More specifically, we solve them in a manner where we never hand-specify the rules — we simply let the network learn them by itself.

The network first extracts simple concepts, then stacks those simpler concepts into more and more complex ones, until eventually we get a program that can solve the problem. If we draw it out, it ends up being a deep, multi-layer graph. Hence the name of this approach: Deep Learning.

There are other approaches to solving these tasks. One is the knowledge base approach — where you try to hard-code knowledge of the world into machines — but none of these projects achieved major success.

The failure of the knowledge base approach suggests that we need to let machines roam and learn on their own — that's called Machine Learning (ML). Simple ML algorithms like logistic regression or naive Bayes can recommend cesarean deliveries, or determine whether emails are legitimate or spam.

But these simple algorithms depend heavily on the representation of the data. Take cesarean deliveries — the prediction depends on the information the doctor feeds into the system. Each piece of information the doctor provides is a feature. Logistic regression can figure out how those features relate to the outcome, but it can't extract the features itself. If you gave it a raw image it would just go ¯\_(ツ)_/¯
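To make that concrete, here's a minimal sketch of the idea — the feature names, toy data, and use of scikit-learn are my own illustrative assumptions, not from the book:

```python
# Minimal sketch: logistic regression on hand-crafted features (toy data).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row = one patient, described by features a doctor might provide,
# e.g. [maternal_age, fetal_heart_rate, hours_in_labor] (hypothetical).
X = np.array([
    [25, 140, 6],
    [31, 120, 14],
    [39, 110, 20],
    [28, 150, 4],
])
y = np.array([0, 0, 1, 0])  # 1 = recommend cesarean (made-up labels)

model = LogisticRegression()
model.fit(X, y)

# The model learns how the features relate to the outcome...
print(model.predict([[35, 115, 18]]))
# ...but it has no mechanism for turning raw pixels into features like these.
```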

So the model depends not only on which features it gets, but also on how those features are represented. The better we select the features, and the better we represent the data, the easier it is for the model to learn:

The power of how you represent your data

And we can do that for many tasks — like speaker identification, where a useful feature is an estimate of the size of the speaker's vocal tract. But for some other tasks, it's not so clear.

Say we wanted to identify a car in a picture. You could say cars have wheels, but how do you extract a wheel from raw pixels? The wheel could have sunlight glinting off it, a shadow falling across it, or something partially blocking it. It would be a nightmare to encode those rules by hand.

So let's take the human out of this step too. Get the machine to learn not just the mapping from representations to outputs, but also how to extract those representations from the data. That's called representation learning, and it lets models adapt to new tasks much faster and more robustly.

One algorithm is called an Autoencoder:

AutoEncoder

We have an encoder that takes the original image and compresses it down into a small representation, and a decoder that takes that compressed information and scales it back up into the original image. The compressed representation is meant to capture the factors of variation in the data. But what do these factors of variation represent?

It depends. These factors often aren't high-level abstractions of the data — rather, they're latent factors that affect the image itself. For example: if you trained an autoencoder on cats and dogs, the factors of variation won't be "cat" or "dog", or "lots of fur" versus "little fur". Those are high-level abstractions that we humans see. The factors of variation will sit at a lower level.
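Here's a minimal sketch of that encoder/decoder structure in PyTorch — the layer sizes, 28×28 inputs, and reconstruction loss are my own illustrative assumptions:

```python
# Minimal autoencoder sketch: compress an image, then reconstruct it.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=28 * 28, code_dim=32):
        super().__init__()
        # Encoder: squash the image down to a small code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),
        )
        # Decoder: rebuild the original image from the code.
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)       # the compressed representation
        return self.decoder(code)    # the reconstruction

model = AutoEncoder()
x = torch.rand(8, 28 * 28)                  # a toy batch of "images"
loss = nn.functional.mse_loss(model(x), x)  # train it to reproduce its input
```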

But the problem is that there are so many factors of variation in the real world. We have to disentangle the factors and discard the ones that don't matter. This, combined with the fact that it's hard to get high-level factors, means it can be nearly as hard to obtain a good representation as to learn the task itself — so on its own it's not always that useful to us.

Deep Learning to the rescue! Like we mentioned before, deep learning tackles the problem of extracting the representation by stacking simpler concepts together to iteratively get more and more complex results. (Lines → Corners → Shapes → Head → Human!)

A core example of a DL model is the multilayer perceptron (MLP) → It's just simple mathematical functions stacked on top of each other again and again, where each layer extracts a higher-level representation from the output of the previous layer.

Deep Learning
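As a quick sketch (the layer sizes and 10-class output are my own illustrative choices), an MLP really is just functions composed layer by layer:

```python
# Minimal MLP sketch: simple functions stacked on top of each other.
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # layer 1: low-level features
    nn.Linear(256, 128), nn.ReLU(),   # layer 2: combinations of those features
    nn.Linear(128, 10),               # layer 3: scores for 10 classes
)

x = torch.rand(1, 784)                # a toy flattened 28x28 image
scores = mlp(x)                       # each layer builds on the previous one
print(scores.shape)                   # torch.Size([1, 10])
```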

Another way to think about it is that deep learning is all about being… deep. A layer just executes lots of instructions in parallel. The deeper your network, the more instructions you can run in sequence → which makes it more powerful, because later instructions can build on the results of earlier ones.

This means that earlier layers aren't necessarily higher abstractions of the input data; they could instead be storing state information that helps the rest of the program.

How you actually count the depth of a model is debated.

Different ways of counting depth

But it really doesn't matter. Bottom line — deep learning is about models that learn deeper compositions of functions, and learn more of them, than traditional machine learning.

A beautiful diagram of the sets of Machine Learning
Flow Chart of different sets of ML. Shaded boxes = learned

Who Should Read this Book?

Students + software engineers who want to learn about ML! The book is split into 3 parts. Part I = basic math + ML concepts. Part II = Established Deep Learning algorithms. Part III = Research.

Feel free to skip the chapters that aren’t relevant to you!

Structure of the Book

Historical Trends in Deep Learning:

Instead of going through a step by step historical walkthrough of deep learning, let’s go through some key trends:

  • It has a long and rich history. Going by many names, increasing and decreasing in popularity
  • Deep learning has become more useful as the amount of available training data has increased
  • Model sizes have grown over time as hardware and software infrastructure has improved
  • Accuracy and complexity of tasks solved is increasing with respect to time

The Many Names and Changing Fortunes of Neural Networks:

It started in the 1940s! Deep learning is actually an old field that has had its summers and winters. It was known as Cybernetics in the 1940s-1960s, then Connectionism in the 1980s-1990s, and now Deep Learning from 2006 onwards.

Early learning algorithms were intended to mimic biological learning — hence the name artificial neural networks. They're inspired by biological brains, but they're not meant to be realistic models of them.

There are 2 main reasons why we model them after biological brains: A) The brain is proof that intelligent behaviour is possible, and we're essentially reverse engineering it to duplicate its functionality. B) It could help us start answering questions about human intelligence.

The earliest models were simple linear models. We have inputs x = (x_1, …, x_n) and weights w = (w_1, …, w_n), and they're combined as a weighted sum: y = x_1·w_1 + … + x_n·w_n. This can be written as y = f(x, w). This first wave is what's known as cybernetics.

This worked — the model could distinguish between 2 categories by checking whether the output was positive or negative. But how do we learn the weights? At first they were tuned by hand, but later came the adaptive linear element (ADALINE), which learned them automatically.

Those 2 ideas have shaped machine learning greatly — f(x, w) is the ancestor of modern linear models, while ADALINE's training rule is a special case of stochastic gradient descent, whose variants are the dominant optimizers today.
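Here's a tiny ADALINE-flavoured sketch of that idea — tweaking the weights of y = f(x, w) with stochastic gradient descent on the squared error. The data, learning rate, and NumPy setup are my own toy choices:

```python
# ADALINE-style sketch: a linear model y = x·w trained with SGD (toy data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 examples, 3 input features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w                          # targets from a hidden linear rule

w = np.zeros(3)                         # start with arbitrary weights
lr = 0.1                                # learning rate (step size)

for epoch in range(20):
    for x_i, y_i in zip(X, y):          # "stochastic": one example at a time
        error = x_i @ w - y_i           # prediction minus target
        w -= lr * error * x_i           # gradient step on the squared error

print(w)                                # approaches [2.0, -1.0, 0.5]
```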

But these linear models had limitations — most famously, they can't learn the XOR function → when critics pointed this out, it triggered the first winter.
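To see why depth helps here, here's a small sketch showing that a single hidden layer can express XOR while no linear model can — the hand-picked weights and the use of ReLU are my own illustrative choices:

```python
# XOR sketch: no linear model fits this table, but one hidden ReLU layer can.
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def xor_mlp(x1, x2):
    # Two hidden units with hand-picked weights...
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1)
    # ...combined so that the output reproduces XOR on {0, 1} inputs.
    return h1 - 2 * h2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_mlp(a, b))   # 0 0->0, 0 1->1, 1 0->1, 1 1->0
```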

<side_tangent>

Modern machine learning doesn't take as much inspiration from neuroscience. Instead, it leans more on the idea of multiple levels of composition. But why isn't it taking more inspiration from neuroscience? Because we simply don't know enough about the brain.

Neuroscience does give us inspiration and proof that some things can work. For example: having a single algorithm that can handle many tasks. Ferrets are able to learn to "see" with the auditory part of the brain if visual signals are rewired there. This suggests that one algorithm may be able to solve many different tasks.

This insight has led to more fusion between fields like natural language processing, computer vision, and reinforcement learning. They're still separate, but research groups often study several of them simultaneously.

Neuroscience also gives us guidelines. The Neocognitron was inspired by the mammalian visual system — and it became the basis of convolutional neural networks. Modern neural networks use the rectified linear unit, but we used to have more complicated versions of it that were inspired by brains.

Neuroscience is a good source of inspiration — but it's not a rigid rule. Real brains don't compute like rectified linear units, yet making units more brain-like doesn't seem to grant improvements. Biology has inspired architectures, but biological learning (how the brain actually updates itself) hasn't influenced our training algorithms much.

The media likes to play up how deep learning simulates the brain — that's not correct. Neuroscience inspires deep learning, but deep learning also takes inspiration from linear algebra, probability, information theory and numerical optimization. Some researchers like taking inspiration from nature, others don't.

But there is a field that tries to simulate the brain called computational neuroscience!

</side_tangent>

But summer came again, and this time around we have connectionism (or parallel distributed processing). Inspiration came from cognitive science, which tried to ground models of cognition in neural network implementations. The central idea of connectionism is that a large number of simple, connected units can give rise to intelligent behaviour.

One concept that arose was the distributed representation. Imagine we had a model that had to distinguish between cars, trucks and birds, each of which comes in 3 colours: red, blue, green. The traditional way of doing it is to create 9 classes: red car, blue car, green car, red truck… But the problem is that each neuron must then learn both the object and the colour → why not simplify it down to 6 classes (3 objects and 3 colours), so that each neuron focuses on just one concept? A sketch of this follows below.
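Here's that sketch (the two-headed layout and layer sizes are my own illustration, not code from the book): instead of 9 output neurons, one per (colour, object) pair, use two small heads that share the same features:

```python
# Distributed representation sketch: separate "object" and "colour" heads
# instead of one output neuron per (colour, object) combination.
import torch
import torch.nn as nn

class TwoHeadClassifier(nn.Module):
    def __init__(self, feature_dim=64):
        super().__init__()
        self.backbone = nn.Linear(3 * 32 * 32, feature_dim)  # shared features
        self.object_head = nn.Linear(feature_dim, 3)   # car / truck / bird
        self.colour_head = nn.Linear(feature_dim, 3)   # red / blue / green

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        return self.object_head(h), self.colour_head(h)

model = TwoHeadClassifier()
x = torch.rand(1, 3 * 32 * 32)           # a toy flattened RGB image
object_scores, colour_scores = model(x)  # 3 + 3 outputs instead of 3 x 3 = 9
```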

Another awesome concept from this era is backpropagation, which is still the dominant way of training deep models. Sequence models (recurrent networks) also arose around this time, and LSTMs were invented.

This second wave of deep learning died down because too many startups were making ambitious claims — the research didn't match expectations, so investors were disappointed. Plus other areas of ML, like kernel machines and graphical models, were making lots of advances. Winter came.

But the Canadian Institute for Advanced Research, through its Neural Computation and Adaptive Perception (NCAP) initiative, and research groups led by Geoffrey Hinton, Yoshua Bengio and Yann LeCun kept deep learning alive.

People thought deep networks were too hard to train — but they work fine. It was mainly computational power that was lacking.

But 2006 is when the third summer started. It began with breakthroughs that eventually beat other approaches to ML: advances in strategies for training deep networks, improving generalization, and going deeper. This summer has yet to end {let's hope it never does!}

The beginning of the third summer focused on unsupervised learning techniques — but today we mostly use supervised learning on large labelled datasets to train our models.

Increasing Dataset Sizes:

Dataset Sizes

Back then, it took real expertise to train a model well — only experts could do it. But the amount of skill required decreases as the amount of training data increases — which is exactly what has happened!

The algorithms are almost the same; what has changed is that training has become simpler and the benchmark datasets have become bigger, which boosts the performance of our models. And more data keeps arriving thanks to the increasing digitization of the world — we generate a ton of data.

The rough rule of thumb is that around 5,000 labelled examples per category will yield acceptable performance, while datasets with at least 10,000,000 labelled examples will let a model match or exceed human performance. Working with smaller datasets, and making use of unlabeled data, is an active area of research.

MNIST: “The drosophila of machine learning” — Geoffrey Hinton

Increasing Model Sizes:

Let’s compare our models to nature!

In terms of connections per neuron, we hold our own — biological neurons aren't especially densely connected, and for years now our models have had a number of connections per neuron within an order of magnitude of even mammalian brains.

Average connections per neuron
Labels for the diagram above

But in terms of the number of neurons — our models are tiny!

The models we have today have fewer neurons than a frog! Although the size of models is expected to double every 2.4 years thanks to faster computers + more memory + bigger datasets. The trend looks set to continue for decades — meaning artificial networks won't match the human brain's neuron count until at least the 2050s. And since biological neurons are more complicated than artificial ones, we might need even more neurons than that.
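As a rough back-of-the-envelope check of that extrapolation (the starting sizes here are my own ballpark assumptions, not the book's figures):

```python
# Back-of-the-envelope check of the "2050s" extrapolation (ballpark numbers).
import math

neurons_now = 1e7          # a large artificial network today (rough guess)
neurons_human = 1e11       # neurons in the human brain (rough figure)
doubling_period = 2.4      # years per doubling, per the trend above

doublings = math.log2(neurons_human / neurons_now)   # ~13.3 doublings needed
print(2021 + doublings * doubling_period)            # lands in the 2050s
```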

{This year (2021), Google trained a 1.6 Trillion parameter model. But that’s parameters — not neurons. The brain has 100 trillion synapses.}

In retrospect, it's kind of obvious why models before the third summer couldn't do anything sophisticated — they had fewer neurons than a leech!

Increasing Accuracy, Complexity and Real-World Impact:

Early on, networks were restricted to very simple tasks, like classifying low-resolution, tightly cropped images where the object was made more apparent. Nowadays we can feed in huge high-resolution images with no cropping needed. Early networks could only classify between 2 classes — later ones can handle thousands.

ImageNet is an example of a giant dataset, with over 20,000 categories in total (the ILSVRC competition uses a 1,000-class subset). Before 2012, the state-of-the-art top-5 error rate was 26.1% → CNNs brought it down to 15.3%, crushing the competition. In 2016 it was 3.6%. In 2021 it's 1.2% for top-5 error, and 9.8% for top-1 error.

ImageNet Top-5 Error
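In case "top-5 error" is unfamiliar: a prediction counts as correct if the true class is among the model's five highest-scoring guesses. A quick sketch of the metric (toy scores, NumPy; the function name is mine):

```python
# Sketch of the top-k error metric used on ImageNet (toy data).
import numpy as np

def top_k_error(scores, labels, k=5):
    # scores: (n_examples, n_classes) model outputs; labels: true class ids.
    top_k = np.argsort(scores, axis=1)[:, -k:]   # the k best guesses per example
    hits = [label in guesses for guesses, label in zip(top_k, labels)]
    return 1.0 - np.mean(hits)                   # fraction of examples missed

scores = np.random.rand(4, 1000)                 # 4 fake predictions, 1,000 classes
labels = np.array([3, 17, 256, 999])
print(top_k_error(scores, labels, k=5))
```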

For speech recognition, results had stagnated through the 2000s. But when deep learning came around, error rates were cut in half. Image segmentation and image classification also achieved spectacular successes!

Models are also able to solve increasingly complex tasks: sequence-to-sequence learning (machine translation), learning simple programs (like sorting lists), and reinforcement learning (human-level performance on Atari games).

Top companies are using deep learning, software infrastructure is advancing, and it has also contributed to other fields of science.

To summarize — Deep learning has a rich history of winters and summers. It’s constantly being inspired by fields like neuroscience, statistics and applied math. This last summer has been booming thanks to powerful computers, datasets, and techniques to train deep networks to achieve state-of-the-art results everywhere! The future is full of challenges and opportunities to push the state of the art!

Thanks for reading! I’m Dickson, an 18-year-old ML Engineer @Rapyuta Robotics who’s excited to impact billions 🌎

If you want to follow along on my journey, you can join my monthly newsletter, check out my website, and connect on LinkedIn or Twitter 😃
