CS230 • Introduction to Deep Learning
- Overview
- The Growth of Deep Learning Research
- Why Now?
- Deep Learning
- Logistic Regression as a Neural Network
- Python and Vectorization
- Citation
Overview
-
The core ideas of deep learning, and the algorithms behind it, have been around for decades. What changed is scale: as neural networks were trained on more data, they began to perform much better than traditional machine learning algorithms. With advances in GPU computing and the vast amount of data we now have available, training larger neural networks has become easier than ever before. Given enough data, larger neural networks have been shown to outperform traditional machine learning algorithms on many tasks.
-
The following figure shows how larger neural networks, given sufficient data, outperform both smaller networks and traditional machine learning methods.
-
Artificial intelligence (AI) can be broken down into several subfields: deep learning (DL), machine learning (ML), probabilistic graphical models (PGM), planning agents, search algorithms, knowledge representation (KR), and game theory. Among these, the only subfields that have dramatically improved in performance in recent years are deep learning and machine learning.
-
The following figure illustrates how the performance of DL and ML has exploded relative to other subfields of AI such as PGMs, planning agents, search algorithms, knowledge representation, and game theory.
The Growth of Deep Learning Research
-
The rise of artificial intelligence is not confined to computer science departments. The number of AI papers published annually has grown faster than that of computer science overall, with researchers from other fields such as physics, chemistry, astronomy, and materials science actively contributing to AI.
-
The following figure shows the rapid increase of AI-related papers, outpacing general computer science publications.
- The keyword “Neural Network” has also seen an exponential rise in research publications, particularly in the 2010s. The following figure illustrates the rapid increase in papers published with the keyword “Neural Network,” demonstrating the surge of interest in deep learning.
-
Between 2014 and 2017, the number of Scopus papers on Neural Networks had a compound annual growth rate of 37%. This surge was especially visible in machine learning and computer vision research. Deep learning has since reshaped the frontiers of machine learning, natural language processing, and computer vision.
-
Applications of deep learning have already permeated our everyday lives. Conversational assistants such as Siri or Alexa, face verification for unlocking phones, self-driving car perception systems, itinerary mapping, sentiment analysis, and machine translation all rely on deep learning.
Why Now?
-
The boom of AI in the last decade is largely driven by three factors:
- Digitization: An explosion of available data due to sensors, online behavior, digital communication, and large-scale datasets.
- Computation: Advances in GPUs, TPUs, and distributed training have enabled efficient large-scale optimization.
- Algorithms: Advances in neural architectures (e.g., CNNs, RNNs, Transformers) and training techniques.
-
At its core, machine learning is about learning a function that maps data to labels and then using that function to make predictions on new data. Unlike linear regression, which is limited to fitting simple hyperplanes, deep learning scales to massive data by leveraging multiple nonlinear transformations and layered representations.
Deep Learning
What is a Neural Network?
The aim of a neural network is to learn representations that best predict the output \(y\), given a set of features as the input \(x\).
- The following figure shows the simplest possible neural network, with a single input feature (size of a home) and a single output (price).
-
Here, the neuron is the component that tries to learn a function mapping \(x \rightarrow y\). Because this example only has one neuron, it is the simplest case. We can build more complex neural networks by stacking neurons. For example, if instead of just the size of a home we also had the number of bedrooms, the zip code, or local wealth indicators, we would represent these as additional inputs to the network.
-
The following figure illustrates how intermediate connections between inputs may themselves encode useful features (such as family size, walkability, and schooling). Put simply, intermediate neurons represent higher-level abstractions of the raw input features.
Neural networks work well because we do not need to explicitly hand-engineer these intermediate values; instead, they are learned automatically. When every unit in one layer is connected to every unit in the next, we call this a fully connected (dense) layer.
The following figure shows a standard neural network consisting of an input layer, hidden layer, and output layer, all fully connected.
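As a minimal sketch of the fully connected network described above, the forward pass of the housing example could be written in NumPy as follows. The feature values, layer sizes, random weights, and the ReLU nonlinearity are illustrative assumptions, not details taken from the course figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input: one home described by 4 features
# (size, bedrooms, zip code index, wealth indicator) as a column vector.
x = np.array([[2100.0], [3.0], [7.0], [0.6]])    # shape (4, 1)

# Hidden layer with 3 units: every input feeds every hidden unit.
W1 = rng.standard_normal((3, 4)) * 0.01           # shape (3, 4)
b1 = np.zeros((3, 1))

# Output layer with 1 unit (the predicted price).
W2 = rng.standard_normal((1, 3)) * 0.01           # shape (1, 3)
b2 = np.zeros((1, 1))


def relu(z):
    """Elementwise max(0, z); an assumed choice of nonlinearity."""
    return np.maximum(0.0, z)


hidden = relu(W1 @ x + b1)    # intermediate features, shape (3, 1)
price = W2 @ hidden + b2      # untrained prediction, shape (1, 1)
```

The weights here are random, so the output is meaningless until the network is trained; the point is only that a dense layer is a matrix multiplication followed by a nonlinearity.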
Applications of Neural Networks
-
Almost all of the hype around machine learning has centered around supervised learning, where we train a model on input-output pairs. Neural networks have enabled breakthrough applications across many domains. Below are representative examples, many of which have been explored by students in research and applied projects.
-
Sign Language Detection:
- Task: Given an image of a hand showing a number (0–5) in sign language, predict the number.
- Application: Sign translation and assistive technologies.
- The following figure shows a neural network trained to classify hand gestures into sign language digits (0–5).
-
The Happy House (Sentiment-based filtering):
- Task: A playful application where only smiling people are let inside.
- Application: A simplified demonstration of sentiment analysis from images.
- The following figure demonstrates an image-based sentiment analysis system where only smiling individuals are positively classified.
-
Face Recognition:
- Task: Given an image of a person, predict their identity.
- Application: Authentication, surveillance, and photo tagging.
- The following figure shows face recognition, where a neural network outputs an identity prediction given a face image.
-
Object Detection for Autonomous Driving:
- Task: Detect and classify objects such as pedestrians, traffic signs, and vehicles in real-time.
- Application: Perception systems in self-driving cars (e.g., YOLOv2).
-
The following figure shows a neural network detecting multiple objects in an image, a core requirement for autonomous driving.
- The following figure demonstrates car detection for autonomous driving perception tasks, powered by convolutional neural networks.
-
Sports Analytics: Goalkeeper Shot Prediction:
- Task: Predict the optimal region where a soccer player should shoot the ball to maximize the chance of scoring or assisting.
- Application: Sports strategy and analytics.
- The following figure shows a neural network predicting optimal shot placement for a soccer goalkeeper scenario.
-
Art Generation (Neural Style Transfer):
- Task: Generate an image by combining the content of one picture with the artistic style of another.
- Application: Creative AI in digital art and media.
- The following figure illustrates neural style transfer, where content and artistic style are merged into a new synthetic image.
-
Music Generation:
- Task: Train a sequence model to generate music.
- Application: Algorithmic composition and sound design.
- The following figure shows music generation using a recurrent or sequence model that learns temporal patterns.
-
Text Generation:
- Task: Given a large corpus of text (e.g., Shakespeare poems), generate new text in a similar style.
- Application: Creative writing, chatbots, and dialogue systems.
- The following figure illustrates text generation, where a sequence model produces text stylistically similar to Shakespeare.
-
Sentiment Analysis and Emoji Prediction:
- Task: Map sentences to emojis representing their sentiment.
- Application: Messaging apps, predictive emoji, and smart keyboards.
- The following figure demonstrates a sentiment analysis task where the output is an emoji matching the text’s sentiment.
-
Machine Translation:
- Task: Translate text between languages.
- Application: Breaking language barriers, international communication.
- The following figure shows machine translation, one of the flagship applications of deep learning in natural language processing.
-
Trigger Word Detection:
- Task: Build a system that activates when a specific word is spoken (e.g., “activate,” “Hey Siri”).
- Application: Voice assistants like Siri, Alexa, and Cortana.
- The following figure demonstrates trigger word detection for wake-word systems in speech assistants.
Architectures of Neural Networks
-
As we have already seen, neural networks can vary significantly in architecture. While real estate or online advertising tasks may use fully connected layers, photo tagging often uses convolutional neural networks (CNNs), sequence modeling may use recurrent neural networks (RNNs), and autonomous driving requires hybrid custom models.
-
The following figure shows the basic structure of different neural network architectures, including CNNs, RNNs, and hybrid models for specialized tasks.
-
Neural networks can be applied to both structured and unstructured data. Structured data may take the form of database entries, while unstructured data may be audio, images, or text. Historically, tasks on unstructured data, such as image recognition, have been difficult for computers but easy for humans. Neural networks have substantially narrowed this gap.
-
The following figure contrasts structured datasets (e.g., relational databases) with unstructured datasets (e.g., images, speech, text), where deep learning has been transformative.
Logistic Regression as a Neural Network
- Logistic regression can be interpreted as a simple neural network with no hidden layers. While the model is limited in capacity, it forms the foundation for understanding more complex neural architectures.
- By grounding the mathematics of logistic regression in both theory and real-world applications, we see how this simple model serves as the conceptual gateway to modern deep learning approaches.
Notation
-
Each training set is composed of training examples. We denote \(m_{\text{train}}\) as the number of training examples in the training set, and \(m_{\text{test}}\) as the number of examples in the testing set.
-
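The \(i^{th}\) training example is defined as:
\[(x^{(i)}, y^{(i)}), \quad x^{(i)} \in \mathbb{R}^{n_x}, \quad y^{(i)} \in \{0, 1\}\]
- We can represent the entire dataset of \(m\) examples compactly as:
\[X = [x^{(1)}, x^{(2)}, \dots, x^{(m)}] \in \mathbb{R}^{n_x \times m}, \quad Y = [y^{(1)}, y^{(2)}, \dots, y^{(m)}] \in \mathbb{R}^{1 \times m}\]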
Binary Classification
-
A binary classifier assigns inputs into two categories, typically yes (1) or no (0). For example, an image of a cat may be classified as cat (1) or non-cat (0).
-
The following figure shows how colored images can be represented using red, green, and blue channels. Each pixel is encoded as a tuple storing RGB values.
- For an image of size \(n \times m\) pixels, we define the input as a column vector stacking the three channels:
\[x \in \mathbb{R}^{n_x}, \quad n_x = n \times m \times 3\]
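The stacking step can be sketched in NumPy as follows. The 64x64 image size and the random pixel values are assumptions made for illustration, as is the choice to place all red values first, then green, then blue.

```python
import numpy as np

# Hypothetical 64x64 RGB image with integer pixel values in [0, 255].
height, width = 64, 64
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)

# Move the channel axis first so each color plane stays together,
# then flatten everything into a single column vector.
x = image.transpose(2, 0, 1).reshape(-1, 1)

n_x = x.shape[0]   # number of input features: 64 * 64 * 3
```

Any fixed ordering of the pixels works, as long as the same ordering is used for every training and test image.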
Logistic Regression Model
- Given input features \(x \in \mathbb{R}^{n_x}\), we define the prediction as:
\[\hat{y} = \sigma(w^T x + b)\]
- where \(w \in \mathbb{R}^{n_x}\), \(b \in \mathbb{R}\), and the sigmoid function is:
\[\sigma(z) = \frac{1}{1 + e^{-z}}\]
-
This ensures \(0 < \hat{y} < 1\), so the output can be interpreted as the probability that \(y = 1\).
-
The following figure shows the sigmoid function, mapping real-valued inputs into the interval \((0, 1)\).
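A minimal sketch of the prediction step, assuming the standard logistic form \(\hat{y} = \sigma(w^T x + b)\) with \(\sigma(z) = 1 / (1 + e^{-z})\). The helper name `predict_proba` and the parameter values are hypothetical, and the sigmoid is not hardened against overflow for very negative inputs.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


def predict_proba(w, b, x):
    """Return y_hat = sigmoid(w^T x + b) for one column vector x."""
    return sigmoid(w.T @ x + b).item()


# Hypothetical parameters and input with n_x = 3 features.
w = np.array([[0.5], [-1.0], [2.0]])
b = 0.1
x = np.array([[1.0], [2.0], [0.5]])

# w^T x + b = 0.5 - 2.0 + 1.0 + 0.1 = -0.4, so y_hat is slightly below 0.5.
y_hat = predict_proba(w, b, x)
```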
Logistic Regression Cost Function
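- We optimize a cost function that penalizes incorrect predictions. The loss for a single example is:
\[\mathcal{L}(\hat{y}, y) = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right)\]
- The overall cost function is the average over all \(m\) training examples:
\[J(w, b) = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)})\]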
-
This function is convex in \(w\) and \(b\), meaning any local minimum is also a global minimum.
-
The following figure shows the binary classifier’s loss functions for cases when \(y=1\) and \(y=0\).
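The loss and cost can be sketched as below, assuming the standard binary cross-entropy loss \(-(y \log \hat{y} + (1 - y) \log(1 - \hat{y}))\) for logistic regression. The function names and the example predictions are hypothetical.

```python
import numpy as np


def cross_entropy_loss(y_hat, y):
    """Per-example loss: -(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


def cost(A, Y):
    """Average loss over the m examples stored in (1, m) arrays A and Y."""
    return float(np.mean(cross_entropy_loss(A, Y)))


# Hypothetical predictions and labels for m = 4 examples.
A = np.array([[0.9, 0.2, 0.7, 0.5]])
Y = np.array([[1, 0, 1, 0]])

J = cost(A, Y)
```

When \(y = 1\) only the \(-\log \hat{y}\) term is active, and when \(y = 0\) only \(-\log(1 - \hat{y})\) is, which matches the two curves the figure describes.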
Gradient Descent
-
We minimize the cost using gradient descent:
\[w := w - \alpha \frac{\partial J}{\partial w}, \quad b := b - \alpha \frac{\partial J}{\partial b}\]
- where \(\alpha\) is the learning rate.
-
The gradients can be derived as:
\[\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^m (a^{(i)} - y^{(i)})x_j^{(i)}, \quad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)} - y^{(i)})\]
- where \(a^{(i)} = \hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)\).
-
The following figure illustrates a convex cost function where gradient descent converges to the global minimum.
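Putting the update rule and the gradients together, a minimal batch gradient descent loop might look like the sketch below. The function name, the learning rate, the iteration count, and the toy dataset are all assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_logistic_regression(X, Y, alpha=0.1, num_iters=1000):
    """Batch gradient descent; X has shape (n_x, m), Y has shape (1, m)."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iters):
        A = sigmoid(w.T @ X + b)        # predictions, shape (1, m)
        dZ = A - Y                      # a^(i) - y^(i) for every example
        dw = (X @ dZ.T) / m             # dJ/dw, shape (n_x, 1)
        db = float(np.sum(dZ)) / m      # dJ/db
        w -= alpha * dw
        b -= alpha * db
    return w, b


# Hypothetical toy data: the label is 1 when the single feature is positive.
X = np.array([[-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
Y = np.array([[0, 0, 0, 1, 1, 1]])

w, b = train_logistic_regression(X, Y)
predictions = (sigmoid(w.T @ X + b) > 0.5).astype(int)
```

Because the cost is convex, a small enough learning rate moves the parameters steadily toward the global minimum regardless of the starting point.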
Computation Graphs and Backpropagation
-
A computation graph represents the sequence of operations required to compute the output. This also makes it easier to calculate derivatives using backpropagation.
-
For example, given:
\[J(a, b, c) = 3(a + bc)\]
- The following figure shows a computation graph for \(J(a, b, c) = 3(a + bc)\), allowing systematic forward and backward propagation.
-
The forward pass is:
- Compute \(u = bc\)
- Compute \(v = a + u\)
- Compute \(J = 3v\)
-
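The following figure shows a computation graph for \(J(a, b, c) = 3(a + bc)\) with the intermediate values \(u\) and \(v\) and the output \(J\).
- The backward pass uses the chain rule:
\[\frac{\partial J}{\partial v} = 3, \quad \frac{\partial J}{\partial a} = \frac{\partial J}{\partial v} \frac{\partial v}{\partial a}, \quad \frac{\partial J}{\partial u} = \frac{\partial J}{\partial v} \frac{\partial v}{\partial u}, \quad \frac{\partial J}{\partial b} = \frac{\partial J}{\partial u} \frac{\partial u}{\partial b}, \quad \frac{\partial J}{\partial c} = \frac{\partial J}{\partial u} \frac{\partial u}{\partial c}\]
- Thus:
\[\frac{\partial J}{\partial a} = 3, \quad \frac{\partial J}{\partial b} = 3c, \quad \frac{\partial J}{\partial c} = 3b\]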
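The forward and backward passes for this graph can be sketched in plain Python. The function names and the input values are hypothetical; the derivatives follow from applying the chain rule to \(u = bc\), \(v = a + u\), and \(J = 3v\).

```python
def forward(a, b, c):
    """Forward pass through J = 3 * (a + b * c), keeping intermediates."""
    u = b * c
    v = a + u
    J = 3 * v
    return u, v, J


def backward(a, b, c):
    """Backward pass: apply the chain rule from J back to the inputs."""
    dJ_dv = 3.0              # J = 3v
    dJ_da = dJ_dv * 1.0      # v = a + u, so dv/da = 1
    dJ_du = dJ_dv * 1.0      # v = a + u, so dv/du = 1
    dJ_db = dJ_du * c        # u = bc, so du/db = c
    dJ_dc = dJ_du * b        # u = bc, so du/dc = b
    return dJ_da, dJ_db, dJ_dc


# Hypothetical input values.
a, b, c = 5.0, 3.0, 2.0
u, v, J = forward(a, b, c)      # u = 6, v = 11, J = 33
grads = backward(a, b, c)       # (dJ/da, dJ/db, dJ/dc) = (3, 6, 9)
```

Each backward step reuses the derivative computed one node closer to the output, which is what makes backpropagation efficient on larger graphs.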
From Logistic Regression to Student Projects
-
While logistic regression itself is simple, it provides a foundation for more complex models that power real-world projects. Many student projects demonstrate how classification, regression, and structured prediction tasks extend directly from logistic regression principles.
-
Coloring Black & White Pictures with Deep Learning:
- Task: Given grayscale input, predict the missing color channels.
- Relation: Can be framed as per-pixel regression of the color values, or as per-pixel classification over a discretized set of colors.
- The following figure shows a project where grayscale photos are automatically colorized using supervised learning.
-
Predicting the Price of an Object from a Picture:
- Task: Given an image of a bike, predict its price.
- Relation: Combines regression with interpretability: the model learns to focus on discriminative regions such as the extra wheels on kids’ bikes.
- The following figure shows how a deep learning model predicts bike prices by attending to visually discriminative regions.
-
Image-to-Image Translation with Conditional GANs:
- Task: Generate a map from a satellite image.
- Relation: Builds on the same binary classification idea as logistic regression: in adversarial training, the discriminator network must classify images as “real” or “fake”.
- The following figure shows generated map images using different architectures (baseline, U-Net, ResNet variants) from input satellite images.
-
LeafNet: Tree Species Identification:
- Task: Predict the species of a tree given a leaf photograph.
- Relation: A standard supervised classification problem, extending logistic regression to multiclass neural networks.
- The following figure shows LeafNet, a student project that predicts tree species from leaf photographs.
Python and Vectorization
- One of the major challenges in deep learning is the size of the datasets involved. Modern applications, such as image-to-image translation, speech recognition, or machine translation, may require millions of training examples. To train on such data efficiently, we need to eliminate slow, explicit loops and instead rely on vectorized operations that leverage optimized libraries like NumPy, as well as GPU parallelism.
Vectorization
-
The aim of vectorization is to remove explicit for-loops in code. Since deep learning requires repetitive operations over massive datasets, vectorization ensures that computations run efficiently at scale.
-
Consider logistic regression, where we compute
\[z = w^T x + b\]
- A naive Python implementation might compute this with an explicit loop:
z = 0
for j in range(nx):
    z += w[j] * x[j]  # accumulate one term of the dot product
z += b
- This is inefficient. Instead, we use NumPy’s optimized function:
z = np.dot(w, x) + b
- This call runs in compiled, optimized code under the hood (typically a BLAS routine), which can take advantage of SIMD instructions and multiple cores.
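To see the difference on a given machine, the two versions can be timed side by side. This is a rough sketch: the vector length of 100,000 is an arbitrary assumption, and the measured times depend on the hardware and NumPy build, so no specific speedup is claimed here.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
nx = 100_000                     # hypothetical number of features
w = rng.standard_normal(nx)
x = rng.standard_normal(nx)
b = 0.5

# Explicit Python loop.
start = time.perf_counter()
z_loop = 0.0
for j in range(nx):
    z_loop += w[j] * x[j]
z_loop += b
loop_seconds = time.perf_counter() - start

# Vectorized dot product.
start = time.perf_counter()
z_vec = np.dot(w, x) + b
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, np.dot: {vec_seconds:.6f}s")
```

Both versions compute the same value up to floating-point rounding; only the time taken differs.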
Vectorizing Logistic Regression
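- For logistic regression, the activation for the \(i^{th}\) example is:
\[a^{(i)} = \sigma(z^{(i)}), \quad z^{(i)} = w^T x^{(i)} + b\]
- Stacking all training examples as columns, we define the data matrix
\[X = [x^{(1)}, x^{(2)}, \dots, x^{(m)}] \in \mathbb{R}^{n_x \times m}\]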
-
The vectorized computation is:
\[Z = w^T X + b, \quad A = \sigma(Z)\]
- where \(\sigma\) is applied elementwise.
-
In NumPy:
Z = np.dot(w.T, X) + b
A = 1 / (1 + np.exp(-Z))
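- This replaces an explicit loop over all \(m\) examples and \(n_x\) features with a single optimized matrix multiplication. The operation count is still \(O(mn_x)\), but the work runs in compiled code rather than in the Python interpreter.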
Vectorizing Logistic Regression’s Gradient Computation
-
The gradients of the cost function are:
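\[\frac{\partial J}{\partial w} = \frac{1}{m} X (A - Y)^T, \quad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)} - y^{(i)})\]
- where \(A = [a^{(1)}, \dots, a^{(m)}]\) and \(Y = [y^{(1)}, \dots, y^{(m)}]\) are \(1 \times m\) row vectors.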
-
This leads to the vectorized NumPy implementation:
dZ = A - Y
dw = np.dot(X, dZ.T) / m
db = np.sum(dZ) / m
- Thus, instead of looping through all \(m\) training examples, we calculate gradients with a single matrix multiplication.
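One way to gain confidence in a vectorized implementation is to compare it against a slow reference written with explicit loops. The sketch below does this for the gradients; the function names, the dataset sizes, and the random data are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gradients_loop(w, b, X, Y):
    """Reference version with explicit loops over examples and features."""
    n_x, m = X.shape
    dw = np.zeros((n_x, 1))
    db = 0.0
    for i in range(m):
        z_i = b
        for j in range(n_x):
            z_i += w[j, 0] * X[j, i]
        dz_i = sigmoid(z_i) - Y[0, i]
        for j in range(n_x):
            dw[j, 0] += X[j, i] * dz_i
        db += dz_i
    return dw / m, db / m


def gradients_vectorized(w, b, X, Y):
    """Same gradients computed with matrix operations."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)
    dZ = A - Y
    return np.dot(X, dZ.T) / m, np.sum(dZ) / m


rng = np.random.default_rng(0)
n_x, m = 3, 50                         # hypothetical sizes
X = rng.standard_normal((n_x, m))
Y = rng.integers(0, 2, size=(1, m))
w = rng.standard_normal((n_x, 1))
b = 0.2

dw_loop, db_loop = gradients_loop(w, b, X, Y)
dw_vec, db_vec = gradients_vectorized(w, b, X, Y)
```

The two results should agree up to floating-point rounding, which can be checked with `np.allclose`.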
Broadcasting
-
Vectorization in NumPy is often paired with broadcasting, which automatically expands arrays to match dimensions without explicit duplication.
-
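Suppose we want to normalize each column of a matrix \(A \in \mathbb{R}^{3 \times 4}\) by the column sums, so that each entry is divided by the sum of its column:
\[\tilde{A}_{ij} = \frac{A_{ij}}{\sum_{k=1}^{3} A_{kj}}\]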
-
In NumPy:
s = np.sum(A, axis=0).reshape(1,4)
A_normalized = A / s
- Here, s has shape (1,4), but NumPy automatically broadcasts it across the rows of A to match shape (3,4).
-
Other examples:
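- Scalar broadcast: a scalar combined with an array of shape (m,n) is applied to every element.
- Row vector broadcast: an array of shape (1,n) combined with one of shape (m,n) is repeated down the m rows.
- Column vector broadcast: an array of shape (m,1) combined with one of shape (m,n) is repeated across the n columns.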
-
This eliminates the need for manually replicating matrices.
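The three broadcasting cases, along with the column normalization from earlier in this section, can be sketched as follows. The matrices and their values are made up for illustration, and `keepdims=True` is used here as an alternative to the explicit `reshape`.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])                 # shape (2, 3)

# Scalar broadcast: the scalar is applied to every element.
plus_scalar = M + 100

# Row vector broadcast: shape (1, 3) is repeated down the 2 rows.
row = np.array([[100, 200, 300]])
plus_row = M + row

# Column vector broadcast: shape (2, 1) is repeated across the 3 columns.
col = np.array([[100], [200]])
plus_col = M + col

# Column normalization: divide each entry by the sum of its column.
A = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])      # shape (3, 4)
s = np.sum(A, axis=0, keepdims=True)      # shape (1, 4)
A_normalized = A / s                      # every column now sums to 1
```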
A Note on NumPy Vectors
- A subtle but important distinction is between rank-1 arrays and explicit matrices.
a = np.random.randn(5)
print(a.shape) # (5,)
- This is neither a row vector nor a column vector. Its transpose a.T does nothing. For clarity in deep learning, it is preferable to explicitly define column vectors:
a = np.random.randn(5,1) # shape (5,1)
- This ensures dot products and broadcasting behave as expected.
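The difference shows up as soon as the array is used in a product. A short sketch, with variable names chosen here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

a = rng.standard_normal(5)        # rank-1 array, shape (5,)
inner_only = np.dot(a, a.T)       # a.T is still shape (5,), so this is a scalar

col = a.reshape(5, 1)             # explicit column vector, shape (5, 1)
outer = np.dot(col, col.T)        # (5, 1) times (1, 5) gives a (5, 5) matrix
inner = np.dot(col.T, col)        # (1, 5) times (5, 1) gives a (1, 1) matrix

# A cheap guard against silently carrying a rank-1 array forward.
assert col.shape == (5, 1)
```

With the rank-1 array there is no way to obtain the outer product by transposing, whereas the explicit column vector gives both the inner and outer products with predictable shapes.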
Why Vectorization Matters in Practice
-
Without vectorization, training modern neural networks would be computationally infeasible.
- Student projects like LeafNet or Satellite Map Translation require processing thousands of high-resolution images. Vectorization allows such models to train in hours instead of weeks.
- Industrial applications like machine translation or speech recognition operate on billions of words or hours of audio. Without vectorization and GPU acceleration, such training would be impossible.
-
At a deeper level, every neural network operation — convolution, recurrent computation, or attention — is ultimately implemented as vectorized linear algebra operations (matrix multiplications, tensor contractions).
- Thus, vectorization is the cornerstone of modern deep learning scalability.
Citation
If you found our work useful, please cite it as:
@article{Chadha2020IntroToDeepLearning,
title = {Introduction to Deep Learning},
author = {Chadha, Aman},
journal = {Distilled Notes for Stanford CS230: Deep Learning},
year = {2020},
note = {\url{https://aman.ai}}
}