Overview

  • Neural networks form the foundation of modern deep learning. They are powerful function approximators capable of learning complex, nonlinear relationships from data. Unlike classical machine learning methods such as linear regression or logistic regression, which are limited by their simple functional forms, neural networks can represent highly flexible mappings between inputs and outputs by stacking multiple layers of computation.

  • At the core, a neural network consists of:

    • Architecture: the design of the model, including the number of layers, number of neurons per layer, and choice of activation functions.
    • Parameters: the weights and biases that are learned from data during training.
    • Hyperparameters: design choices such as learning rate, optimization algorithm, and number of iterations that influence how parameters are updated.
  • Training a neural network involves two key processes:

    1. Forward propagation: inputs are passed through the layers of the network to produce an output (e.g., a prediction).
    2. Backward propagation: gradients of the loss function with respect to the parameters are computed and used to update the weights and biases.
  • This iterative process allows the network to minimize its loss function and improve predictive accuracy.

  • The intuition behind deep learning can be built up gradually:

    • Starting from binary classification problems such as cat vs. not-cat,
    • Extending to multi-class classification with one-hot or multi-hot labels,
    • Exploring how deeper layers capture more complex features through encodings,
    • And applying these concepts to practical tasks such as day–night classification, face verification, style transfer, and trigger-word detection.

Deep learning intuition

  • A model for deep learning is defined by two key components: architecture and parameters.

  • The architecture is the algorithmic design we choose, such as logistic regression, linear regression, shallow neural networks, or deeper networks with many hidden layers.
  • The parameters are the weights and biases that the model learns in order to transform an input into the correct output.

  • Schematically: Input + Architecture + Parameters = Output

  • Things that can be tuned in a model:

    • Activation function
    • Optimizer
    • Hyperparameters
    • Loss function
    • Input format
    • Output format
    • Model architecture
  • The following figure situates the model and its tunable components (architecture, parameters, loss, activation functions, etc.) within the machine learning pipeline.

Multi-class classifiers and encoding schemes

  • Suppose we want to expand beyond binary classification. For example, instead of predicting whether an image is “cat” or “not cat”, we want to classify among three categories: cat, dog, and giraffe.

  • In binary classification, the weight vector is

    \[w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{n_x} \end{bmatrix},\]
    • where \(n_x\) is the number of features.
  • The following figure demonstrates an example binary classification use-case:

  • For multi-class classification, we extend the weights into a matrix of dimension \(n_x \times 3\):
\[w = \begin{bmatrix} \cdots & \cdots & \cdots \\ \cdots & \cdots & \cdots \\ \end{bmatrix}_{cats \;\;\; dogs \;\;\; giraffes}.\]
  • For the labels \(y\), we can choose different encodings:

    1. Integer encoding: \(y \in \{0,1,2\}\), where 0 = cat, 1 = dog, 2 = giraffe.
    2. One-hot encoding: represent as a vector with one entry set to 1, others 0. For example, a cat image would be:
    \[[1, 0, 0]\]
    3. Multi-hot encoding: useful if an image contains multiple classes simultaneously. For example, an image with both cat and dog would be:
    \[[1, 1, 0]\]
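  • The three encodings can be sketched in a few lines of plain Python. The `CLASSES` list and the function names are illustrative choices, not part of any library.

```python
# Hypothetical class list, matching the cat/dog/giraffe example.
CLASSES = ["cat", "dog", "giraffe"]

def integer_encode(label):
    """Map a class name to its integer index."""
    return CLASSES.index(label)

def one_hot_encode(label):
    """Vector with a single 1 at the index of the class."""
    return [1 if name == label else 0 for name in CLASSES]

def multi_hot_encode(labels):
    """Vector with a 1 for every class present in the image."""
    present = set(labels)
    return [1 if name in present else 0 for name in CLASSES]

print(integer_encode("giraffe"))         # 2
print(one_hot_encode("cat"))             # [1, 0, 0]
print(multi_hot_encode(["cat", "dog"]))  # [1, 1, 0]
```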

Activation functions in multi-class settings

  • Sigmoid: Each neuron’s output is independent; multiple classes may be predicted simultaneously if their probabilities exceed the threshold.
  • Softmax: Outputs are dependent, summing to 1. This gives a probability distribution across classes and is suitable when each image has exactly one class.
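  • A small sketch of the difference, using made-up scores for the three classes. The 0.5 threshold is an assumption made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.5, -1.0]  # hypothetical scores for cat, dog, giraffe

# Sigmoid: each class is judged on its own, so several can pass the threshold.
independent = [sigmoid(s) for s in scores]
predicted_multi = [p > 0.5 for p in independent]

# Softmax: the outputs compete and form a distribution that sums to 1.
distribution = softmax(scores)
predicted_single = distribution.index(max(distribution))
```

  • Here the sigmoid outputs flag both cat and dog, while softmax commits to the single most likely class.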

Encodings in deep networks

  • In deeper networks, earlier layers detect simple features like edges, while deeper layers capture more abstract representations such as eyes, noses, and eventually entire faces. These compressed, information-rich representations are called encodings.

  • The following figure illustrates how encodings emerge at different layers of the network, becoming progressively more abstract.

Day’n’night classification

  • Our goal is to create an image classifier that predicts whether a photo taken outdoors was captured during the day (0) or during the night (1).

Data

  • A robust model requires a sufficient dataset. From the cat classification example earlier, we estimated that around 10,000 labeled images were needed to achieve good accuracy. Since the day–night classification task is of similar difficulty, we begin with about 10,000 labeled examples.

  • These images can easily be obtained from online datasets or image repositories.
  • The dataset should be split into training and testing sets. A common split is 80% training, 20% testing.
  • Importantly, the split must be stratified, meaning both day and night images are represented proportionally in both training and test sets.
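  • One way to implement a stratified 80/20 split is to shuffle and cut each class separately. This is a minimal sketch using only the standard library; the 6,000/4,000 class balance is a hypothetical example.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), splitting every class in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Hypothetical labels: 6,000 day images (0) and 4,000 night images (1).
labels = [0] * 6000 + [1] * 4000
train_idx, test_idx = stratified_split(labels)
```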

Input

  • We want the smallest image size that still allows both humans and the model to distinguish day from night. By comparing human recognition performance, we find that images of resolution 64 × 64 × 3 pixels are sufficient.

  • The following figure shows an example input image for the day-and-night classification task, demonstrating the visual differences the model must learn to distinguish.

Output

  • The output is binary:

    • 0 \(\rightarrow\) day
    • 1 \(\rightarrow\) night
  • Since the label is either 0 or 1 and the prediction should be a probability in (0, 1), the sigmoid function is appropriate for the final activation:

\[\sigma(z) = \frac{1}{1 + e^{-z}}.\]

Architecture

  • Given the relatively simple nature of the task, a shallow neural network (with one hidden layer) should suffice. However, as seen in similar tasks such as cat classification, a 4-layer network can also provide high accuracy if desired.

Loss function

  • Because the last layer uses a sigmoid activation, the natural choice for the loss is the cross-entropy (logistic) loss, defined as:

    \[L(\hat{y}, y) = - \big[ y \log(\hat{y}) + (1-y) \log(1-\hat{y}) \big],\]
    • where \(y\) is the true label (0 or 1) and \(\hat{y}\) is the predicted probability.
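  • This formulation penalizes confident misclassifications heavily: for a true label of 1, the loss grows without bound as \(\hat{y} \to 0\). This pushes the network toward honest probability estimates rather than overconfident guesses.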
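  • To see how sharply the loss reacts to confident mistakes, the sketch below evaluates it for three predictions on a night image (true label 1). The `eps` clipping is an implementation detail added here to avoid taking the log of zero.

```python
import math

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Logistic loss for one example; eps guards against log(0)."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

# The true label is 1 (night) in all three cases.
confident_correct = binary_cross_entropy(0.99, 1)  # about 0.01
uncertain = binary_cross_entropy(0.50, 1)          # about 0.69
confident_wrong = binary_cross_entropy(0.01, 1)    # about 4.61
```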

Face verification

  • Our next task is more complex: a school wants to use face verification to validate student IDs in facilities such as dining halls, gyms, and pools. In this setup, a student swipes their ID card, and the system checks if the face captured by the camera matches the stored face image in the database.

Data

  • The school maintains a dataset of one labeled picture per student, associated with their names. This dataset will form the reference database.

  • The following figure shows an example of a stored ID image of a student.

  • The following figure shows an example of a current camera capture of a student requesting access.

Input

  • Compared to day–night classification, face verification is a much harder task. The model must be robust to:

    • changes in facial pose,
    • variations in background lighting,
    • natural changes such as facial hair or glasses,
    • and outdated ID photos.
  • To capture sufficient detail, we choose a higher input resolution: 412 × 412 × 3 pixels.

Output

  • The output of the model is binary:

    • 1 \(\rightarrow\) ID verified (same person)
    • 0 \(\rightarrow\) ID not verified (different person)
  • A sigmoid activation at the final layer is again a good choice.

Why pixel-by-pixel comparison fails

  • A naïve approach would be to compute the pixel-by-pixel distance between the ID photo and the camera capture, predicting “verified” if the distance is less than a threshold. However, this fails due to:

    • sensitivity to lighting differences,
    • changes in appearance (make-up, beard),
    • or outdated photos.
  • Instead, we use a deep network to learn encodings: a compressed representation of each face in vector space.

Architecture

  • The following figure illustrates how a deep network maps faces into encodings. Encodings of the same person cluster close together, while those of different people are far apart.

  • We compare encodings using a distance metric. If the distance is below a threshold, the prediction is “ID verified.”

  • To train such a model, we use triplet training:

    • Anchor (A): the true identity image,
    • Positive (P): another image of the same person,
    • Negative (N): an image of a different person.
  • The model minimizes the distance between A and P while maximizing the distance between A and N.

  • The following figure shows the triplet setup used for training face verification models.

Loss function

  • The objective is
\[L = \|Enc(A) - Enc(P)\|_2^2 - \|Enc(A) - Enc(N)\|_2^2 + \alpha,\]
  • where:

    • \(Enc(X)\) denotes the encoding of image \(X\),
    • \(\alpha\) is a margin hyperparameter that prevents the trivial solution of collapsing all encodings to zero.
  • By minimizing this function, the network learns to pull encodings of the same person together and push encodings of different people apart.
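  • A minimal NumPy sketch of the objective exactly as written above. The two-dimensional encodings and the margin value 0.2 are made up for illustration. Many implementations additionally clip the loss at zero, which this sketch leaves out to stay close to the formula.

```python
import numpy as np

def triplet_loss(enc_a, enc_p, enc_n, alpha=0.2):
    """Squared-distance triplet objective as written above (no clipping at zero)."""
    positive_distance = np.sum((enc_a - enc_p) ** 2)
    negative_distance = np.sum((enc_a - enc_n) ** 2)
    return positive_distance - negative_distance + alpha

# Hypothetical encodings: positive close to the anchor, negative far away.
anchor = np.array([0.0, 1.0])
positive = np.array([0.1, 0.9])
negative = np.array([1.0, 0.0])
good = triplet_loss(anchor, positive, negative)  # 0.02 - 2.0 + 0.2 = -1.78

# If every encoding collapses to zero, both distances vanish and the loss equals alpha.
collapsed = triplet_loss(np.zeros(2), np.zeros(2), np.zeros(2))  # 0.2
```

  • The collapsed solution scores worse than the well-separated one, which is the role the margin plays.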

Final model

  • The following figure shows the overall architecture of the face verification system, where inputs (anchor, positive, negative) are passed through the deep network, and gradients are computed with respect to the triplet loss.

Face recognition

  • In the previous section, our task was face verification: validating whether the student in front of the camera matched the ID being presented. Now, we move to a harder task: face recognition.

  • Here, the model is not told who the student claims to be. Instead, the model must identify the person directly from the image by comparing it to the database of stored student faces.

Data

  • Unlike verification (which needs one stored image per student), recognition requires multiple images of each person in the dataset. This is essential so the model can learn to produce robust encodings that generalize across variations in pose, lighting, and appearance.

  • For training, we rely on large public face datasets, which contain multiple photos per person.

  • The following figure illustrates the recognition goal: the model must match a query image to the correct identity in the database.

Input and encodings

  • As in face verification, the input resolution is high (e.g., 412 × 412 × 3) to preserve facial detail. The deep network processes the image into an encoding vector, which can be compared across individuals.

  • The goal is to ensure:

    • encodings of the same person cluster closely,
    • encodings of different people remain far apart.

Training with triplets

  • The same triplet training approach applies here: anchor, positive, and negative examples are used to structure the encoding space.

  • This ensures the model is not just learning a pixel-matching function but instead constructing a representation of faces in a latent embedding space.

  • Once the encodings are trained, face recognition can be implemented as a database search problem:

    1. Given an input face image, compute its encoding vector.
    2. Compare it against all encodings in the student database.
    3. Find the nearest encoding(s) using a distance metric such as Euclidean distance.
  • A common algorithm for this lookup is k-nearest neighbors (k-NN). A brute-force scan of a database of \(n\) encodings takes \(O(n)\) distance computations per query.
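  • The search steps above can be sketched as a brute-force nearest-neighbor lookup (k = 1). The names, the two-dimensional encodings, and the rejection threshold are hypothetical; a real system would use the encodings produced by the trained network.

```python
import numpy as np

def recognize(query, database, threshold):
    """Return (name, distance) of the closest stored encoding, or (None, distance) if too far."""
    names = list(database)
    encodings = np.stack([database[name] for name in names])  # shape (n, d)
    distances = np.linalg.norm(encodings - query, axis=1)     # one pass over n entries
    best = int(np.argmin(distances))
    if distances[best] > threshold:
        return None, float(distances[best])
    return names[best], float(distances[best])

# Hypothetical database of stored encodings.
database = {
    "alice": np.array([0.0, 1.0]),
    "bob": np.array([1.0, 0.0]),
    "carol": np.array([1.0, 1.0]),
}
name, dist = recognize(np.array([0.9, 0.1]), database, threshold=0.5)
```

  • The threshold lets the system answer "unknown person" instead of always returning the closest match.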

Example: clustering faces

  • Suppose we have thousands of unlabeled photos of 20 different people stored on a phone. By encoding each image, we can cluster them using k-means. This groups images of the same person together, even without explicit labels, enabling automatic organization of photo libraries.
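  • A compact sketch of k-means on encodings, assuming three people and synthetic two-dimensional encodings rather than real face encodings. Farthest-point initialization is a choice made here to keep the example deterministic; it is not the only option.

```python
import numpy as np

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with farthest-point initialization."""
    centroids = [points[0]]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen centroid.
        nearest = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(points[int(np.argmax(nearest))])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iterations):
        # Assignment step: each point joins its closest centroid.
        dists = np.sum((points[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Synthetic encodings: 10 photos each of 3 hypothetical people.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.concatenate([c + 0.1 * rng.standard_normal((10, 2)) for c in centers])
labels, _ = kmeans(points, k=3)
```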

Art generation (Neural Style Transfer)

  • The task of neural style transfer is to take a content image and re-render it in the artistic style of a style image. The resulting output preserves the semantic structure of the content while adopting the textures, colors, and brushstrokes of the style.

Data

  • Unlike supervised learning tasks, neural style transfer does not require a labeled dataset. We simply need a content image (the image whose structure we wish to preserve) and a style image (the image whose artistic qualities we want to mimic).

  • The following figure shows the content image and the style image provided as inputs for neural style transfer.

  • The following figure shows a generated image, where the content has been repainted using the artistic style of the style image.

Input and output

  • Input: content image \(C\) and style image \(S\).
  • Output: generated image \(G\) that minimizes a style-content loss.

Architecture

  • To achieve this, we rely on a pretrained convolutional neural network (CNN) trained on a large dataset such as ImageNet. CNNs learn feature hierarchies where:

    • early layers capture low-level features like edges and contours,
    • deeper layers capture high-level representations like object parts and textures.
  • The following figure illustrates how pretrained convolutional neural networks can separate content information from style information at different layers.

  • This allows us to separate content representation from style representation.

  • When the content image is forward propagated through the CNN, we extract its content encoding. When the style image is passed through, we compute its style encoding, often represented by a Gram matrix, which captures feature correlations across channels.
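  • The Gram matrix itself is a short computation. The sketch below assumes the activations of one layer arrive as an array of shape (channels, height, width); random numbers stand in for real CNN features.

```python
import numpy as np

def gram_matrix(features):
    """features: array of shape (channels, height, width) from one CNN layer."""
    channels = features.shape[0]
    flat = features.reshape(channels, -1)  # each row is one channel, flattened
    return flat @ flat.T                   # (channels, channels) correlations

# Hypothetical activations: 3 channels on a 4 x 4 feature map.
rng = np.random.default_rng(0)
features = rng.standard_normal((3, 4, 4))
G = gram_matrix(features)
```

  • Entry (i, j) sums the product of channels i and j over all spatial positions, so the matrix records which features fire together but not where they fire.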

Loss function

  • The optimization problem is to generate an image \(G\) such that its content matches the content of \(C\) and its style matches the style of \(S\).

  • The total loss is the sum of a content loss and a style loss (in practice, each term is usually scaled by its own weight):

    \[L = \| Content_C - Content_G \|_2^2 + \| Style_S - Style_G \|_2^2\]
    • where:

      • \(Content_C\) = content features of the content image,
      • \(Content_G\) = content features of the generated image,
      • \(Style_S\) = style features (Gram matrix) of the style image,
      • \(Style_G\) = style features of the generated image.
  • Importantly, unlike typical deep learning tasks, we are not updating network weights here — instead, we iteratively update the pixels of the generated image \(G\) to minimize this loss.

Iterative optimization

  • The process is:

    1. Initialize \(G\) (e.g., as random noise or the content image).
    2. Forward propagate \(G\) through the CNN.
    3. Compute the loss relative to \(C\) and \(S\).
    4. Backpropagate the gradients, but apply them to the pixels of \(G\) instead of network parameters.
    5. Iterate until convergence.
  • The following figure shows the final architecture of the style transfer process, combining content and style losses to generate a stylized image.

2.1.5 Trigger word detection

  • Another practical application of deep learning is trigger word detection, where the goal is to identify whether a specific keyword (e.g., “activate” or “active”) appears within a short audio clip. This type of model is used in voice assistants like Siri, Alexa, or Google Assistant, where the device must wake up when a trigger word is spoken.

Data

  • We require a large set of 10-second audio clips, with a balanced distribution of:

    • positive words (e.g., “activate”),
    • negative words (all other words not equal to the trigger),
    • background noise (segments where no words are spoken).
  • Diversity in accents, genders, ages, and speaking conditions is essential to make the model robust.

  • A clever strategy for data generation is to synthesize clips:

    1. Collect independent recordings of positive and negative words, as well as background noise samples.
    2. Overlay these words on the background noise at random times.
    3. Automatically generate labels by inserting a 1 in the output sequence at the position(s) where the trigger word occurs.
  • This approach avoids the need for tedious manual labeling of each audio clip.
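  • The synthesis steps can be sketched as follows. Short constant arrays stand in for real recordings, and labels are kept per audio sample for simplicity; a real pipeline would label spectrogram timesteps instead. The function name and the retry limit are illustrative.

```python
import numpy as np

def synthesize_clip(background, words, rng):
    """Overlay (waveform, is_trigger) pairs on background noise at random,
    non-overlapping offsets; return the mixed clip and a 0/1 label per sample."""
    clip = background.copy()
    labels = np.zeros(len(background), dtype=int)
    occupied = np.zeros(len(background), dtype=bool)
    for waveform, is_trigger in words:
        for _ in range(100):  # bounded number of attempts to find a free slot
            start = int(rng.integers(0, len(background) - len(waveform) + 1))
            end = start + len(waveform)
            if not occupied[start:end].any():
                break
        else:
            continue  # no free slot found, so skip this word
        clip[start:end] += waveform
        occupied[start:end] = True
        if is_trigger:
            labels[start:end] = 1  # labels come for free from the insertion position
    return clip, labels

rng = np.random.default_rng(0)
background = 0.01 * rng.standard_normal(1000)  # stand-in for a noise recording
trigger = np.ones(100)                         # stand-in for the trigger word
other = -np.ones(100)                          # stand-in for a negative word
clip, labels = synthesize_clip(background, [(trigger, True), (other, False)], rng)
```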

Input

  • The model input is a 10-second audio waveform, preprocessed into spectrogram features. The resolution (sample rate, window size, stride) should be chosen based on the minimum viable resolution at which humans can reliably recognize words.

Output

  • Several output strategies exist:
  1. Binary classification of the entire clip:

    • \(y = 0\) if the word does not appear,
    • \(y = 1\) if it does.
    • Requires very large datasets, since the model must capture all possible positions of the trigger word.
  2. Single positive pulse output:

    • Output sequence of 0’s with a single 1 at the moment the trigger word occurs.
  3. Extended pulse output (preferred):

    • Instead of a single 1, the output has a short window of consecutive 1’s during the trigger word, making the training data less imbalanced and more robust.
  • The following figure illustrates this labeling strategy: green = trigger word, red = non-trigger words, black = silence.

Architecture

  • Because trigger words occur in temporal sequences, the natural choice of model is a recurrent neural network (RNN) or one of its modern variants (LSTM, GRU). These architectures capture time dependencies across the audio sequence.

Loss function

  • Since the task is binary classification at each timestep, we use binary cross-entropy loss:
\[L = - \frac{1}{T} \sum_{t=1}^{T} \left[ y_t \log(\hat{y}_t) + (1-y_t)\log(1-\hat{y}_t) \right],\]
  • where:

    • \(T\) = total number of timesteps in the audio sequence,
    • \(y_t \in \{0,1\}\) = true label at time \(t\),
    • \(\hat{y}_t\) = predicted probability at time \(t\).
  • Alternative losses, such as triplet loss (as in face recognition), can be experimented with in more advanced setups.

Key challenge: data generation

  • The success of a trigger word detection system often depends more on clever data collection and labeling than on the model architecture itself. Automated synthesis (overlaying words on noise) is a powerful way to scale training data while maintaining accurate labels.

App implementation

  • So far, we have explored various deep learning applications — from image classification to face recognition, style transfer, and trigger word detection. A natural next step is deployment: how do we deliver these models in real-world settings?

  • Let’s consider a simple but realistic case: a cat classifier app. The model has already been trained to determine whether an input image contains a cat. Now, we want to integrate it into a phone app.

Two deployment strategies

  • There are two common ways to deploy deep learning models in apps:
  1. Server-based implementation

    • The app captures an image and sends it to a remote server.
    • The server holds the model (architecture + parameters), performs inference, and returns the prediction to the app.

    Advantages:

    • The app remains lightweight, since the heavy model runs on the server.
    • Model updates are simple: retraining or replacing the model on the server automatically propagates to all users.
  2. On-device implementation

    • The entire model (architecture + parameters) is stored locally on the device.
    • Inference runs directly on the device hardware.

    Advantages:

    • Faster predictions, since the app avoids server communication.
    • Works offline, enabling predictions without an internet connection.

Tradeoffs

  • Server-based solutions suit large models and frequent updates but depend on connectivity.
  • On-device solutions ensure low latency and offline operation but require optimizations to compress models for limited device memory and compute.

  • This tradeoff between scalability and efficiency is central to applied AI deployment. As neural network models continue to grow, techniques such as model quantization, knowledge distillation, and edge-optimized architectures (e.g., MobileNet) have become critical to making on-device inference practical.

2.2 Shallow neural networks

  • While deep learning often involves stacking many layers of neurons, it is useful to first study shallow neural networks — networks with only a single hidden layer. These models allow us to develop the core machinery of forward propagation, backward propagation, and gradient descent before extending to deeper networks.

  • The following figure shows a shallow neural network with one hidden layer of three neurons and a single output neuron.

2.2.1 Neural network overview

  • Each layer in a neural network computes two key steps:

    1. A linear transformation using weights and biases.
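    2. A nonlinear activation function applied elementwise to the result (sigmoid, for example, squashes it into the range between 0 and 1).
  • For the \(i^{th}\) layer, the linear step takes the previous layer’s activations as input (with \(a^{[0]} = x\) for the first layer):

    \[z^{[i]} = W^{[i]} a^{[i-1]} + b^{[i]}\]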
  • The nonlinear activation is then:

    \[a^{[i]} = \sigma(z^{[i]})\]
    • where \(\sigma(\cdot)\) is the activation function.
  • The following figure shows how each neuron combines inputs with learned weights and bias, then passes them through a nonlinear activation.

2.2.2 Neural network representations

  • We denote the input layer as \(a^{[0]} = x\). Hidden layers refer to neurons whose activations are not directly observed in the training data. Importantly, when counting network depth, we include hidden and output layers but not the input layer.

  • Thus, in a shallow network with one hidden layer and one output layer, the network is considered 2 layers deep.

Computing a shallow network’s output

  • Consider a hidden layer with 3 neurons and an output layer with 1 neuron. For the hidden layer, we compute:
\[z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = \sigma(z^{[1]})\]
  • For the output layer, we compute:
\[z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = \sigma(z^{[2]})\]
  • Computing each hidden neuron’s output one at a time is inefficient in code. Instead, we use vectorization: stacking the neurons’ weight vectors as the rows of \(W^{[1]}\) lets all hidden-layer activations be computed with a single matrix multiplication.

  • The following figure shows a shallow network’s computation flow in compact form.

2.2.4 Vectorization across multiple examples

  • Suppose we have \(m\) training examples stored in a data matrix:
\[X = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix}\]
  • We can vectorize forward propagation as:
\[Z^{[1]} = W^{[1]} X + b^{[1]}, \quad A^{[1]} = \sigma(Z^{[1]})\] \[Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[2]} = \sigma(Z^{[2]})\]
  • This avoids for-loops over training examples, making implementations highly efficient.
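  • A NumPy sketch of the vectorized forward pass, checked against a loop over individual examples. The layer sizes and random values are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Vectorized forward pass; X holds one example per column."""
    Z1 = W1 @ X + b1   # b1 has shape (n_h, 1) and broadcasts across the m columns
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    return A2

rng = np.random.default_rng(0)
n_x, n_h, m = 2, 3, 5  # hypothetical sizes
X = rng.standard_normal((n_x, m))
W1, b1 = rng.standard_normal((n_h, n_x)), np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((1, n_h)), np.zeros((1, 1))

A2 = forward(X, W1, b1, W2, b2)  # shape (1, m)

# Same result, computed one example (column) at a time.
looped = np.hstack([forward(X[:, [i]], W1, b1, W2, b2) for i in range(m)])
```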

Activation functions

  • So far, we have primarily used the sigmoid function:
\[\sigma(z) = \frac{1}{1+e^{-z}}\]
  • The following figure shows the sigmoid curve, mapping any real number into the range (0,1).

  • An alternative is the hyperbolic tangent function:

\[\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\]
  • This function outputs values in (-1,1), centering activations around zero, which often improves training convergence.

  • The following figure shows the \(\tanh(z)\) function, which is a scaled and shifted version of the sigmoid.

  • However, both sigmoid and tanh suffer from vanishing gradients at extreme values. To overcome this, the Rectified Linear Unit (ReLU) is widely used:

\[ReLU(z) = \max(0, z)\]
  • The following figure shows how ReLU grows linearly for positive inputs while setting negative inputs to zero.

  • A further variant is the leaky ReLU, which avoids zeroing out all negative inputs by allowing a small slope:
\[LeakyReLU(z) = \max(0.01z, z)\]
  • The following figure illustrates the leaky ReLU function, where negative values retain a small gradient.

  • In practice:

    • ReLU is most commonly used in hidden layers.
    • Sigmoid or softmax are typically used in the final layer, depending on whether the task is binary or multi-class classification.

Why use nonlinear activation functions?

  • If we were to use only linear activations, then regardless of how many layers we stack, the entire network reduces to a single linear transformation. For example:

    \[a^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[2]} = W^{[2]} a^{[1]} + b^{[2]}\]
  • Expanding this shows that the composition of linear layers is still linear. Thus, nonlinear activations are essential for enabling networks to approximate complex functions.
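  • Substituting the first layer into the second makes the collapse explicit. The symbols \(W'\) and \(b'\) are names introduced here for the combined parameters:

    \[a^{[2]} = W^{[2]}\big(W^{[1]} x + b^{[1]}\big) + b^{[2]} = \big(W^{[2]} W^{[1]}\big) x + \big(W^{[2]} b^{[1]} + b^{[2]}\big) = W' x + b'\]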

  • The following figure illustrates a network with only linear activations, which collapses into a single linear model (equivalent to linear regression, or to logistic regression if only the output unit keeps a sigmoid).

Derivatives of activation functions

  • When training neural networks, backpropagation requires the derivatives of activation functions with respect to their inputs. These derivatives determine how error signals flow backward through the network, guiding weight updates.

Sigmoid derivative

  • Recall the sigmoid function:

    \[\sigma(z) = \frac{1}{1 + e^{-z}}\]
  • Its derivative can be written elegantly in terms of itself:

    \[\sigma'(z) = \sigma(z) \big(1 - \sigma(z)\big)\]
  • This property makes sigmoid derivatives efficient to compute during backpropagation.

Hyperbolic tangent derivative

  • The hyperbolic tangent function is:

    \[\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\]
  • Its derivative is:

    \[\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)\]
  • This shows that the gradient diminishes as \(|z|\) becomes large (saturation), similar to the sigmoid.

ReLU derivative

  • The ReLU function is:

    \[ReLU(z) = \max(0, z)\]
  • Its derivative is simple:

    \[ReLU'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \end{cases}\]
  • At \(z = 0\), the derivative is undefined, but in practice we assign it either 0 or 1, since the choice rarely affects learning.

Leaky ReLU derivative

  • The leaky ReLU is:

    \[LeakyReLU(z) = \max(0.01z, z)\]
  • Its derivative is:

    \[LeakyReLU'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0.01 & \text{if } z < 0 \end{cases}\]
  • This avoids the “dying ReLU problem,” where neurons get stuck outputting only 0’s and stop learning.

  • The following figure shows the slope of leaky ReLU in the negative region, preventing gradient death.
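  • The four derivatives above can be written directly in NumPy and checked against a finite-difference approximation. The test points are chosen away from \(z = 0\), where ReLU and leaky ReLU have a kink; the helper `numerical_derivative` is introduced here only for the check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)        # assigns 0 at z = 0

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

def d_leaky_relu(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)  # assigns the small slope at z = 0

def numerical_derivative(f, z, h=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

z = np.array([-3.0, -0.5, 0.5, 3.0])    # avoids the kink at z = 0
```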

Why derivatives matter

  • If the derivative is too small (as in saturated sigmoid/tanh), weight updates vanish, slowing or halting learning (the vanishing gradient problem).
  • If the derivative is too large, updates can explode, destabilizing training (the exploding gradient problem).
  • ReLU and leaky ReLU mitigate vanishing gradients by maintaining strong gradients for positive inputs, which is one reason why they dominate modern deep learning architectures.

Gradient descent for neural networks

  • Training a neural network means adjusting its parameters (weights and biases) to minimize a loss function. This is done using gradient descent, where derivatives computed via backpropagation guide parameter updates.

Cost function

  • For binary classification, a common cost function is the cross-entropy loss:

    \[J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)})\]
    • with
    \[L(\hat{y}, y) = -\big[y \log(\hat{y}) + (1-y)\log(1-\hat{y})\big]\]
    • where:

      • \(y\) is the true label.
      • \(\hat{y} = a^{[2]}\) is the predicted output of the network.

Gradient descent update rule

  • For each parameter, we update:

    \[W^{[\ell]} := W^{[\ell]} - \alpha \, dW^{[\ell]}, \quad b^{[\ell]} := b^{[\ell]} - \alpha \, db^{[\ell]}\]
    • where \(\alpha\) is the learning rate, and \(dW^{[\ell]}, db^{[\ell]}\) are derivatives of the cost with respect to parameters in layer \(\ell\).

Computational graph perspective

  • We can visualize the flow of information as a computational graph, with forward propagation moving activations forward, and backpropagation moving derivatives backward.

  • The following figure shows a computational graph for logistic regression, where red arrows denote forward propagation and blue arrows denote backpropagation.

  • In a two-layer network, forward propagation computes:

    \[z^{[1]} = W^{[1]}x + b^{[1]}, \quad a^{[1]} = \sigma(z^{[1]})\] \[z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}, \quad a^{[2]} = \sigma(z^{[2]})\]
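  • Backpropagation then flows backward using the chain rule. Written in vectorized form over all \(m\) examples (uppercase \(A^{[\ell]}\), \(Z^{[\ell]}\), and \(Y\) stack the per-example columns, and the sums run over those columns):

    \[dZ^{[2]} = A^{[2]} - Y\]
    \[dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}, \quad db^{[2]} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[2](i)}\]
    \[dZ^{[1]} = W^{[2]T} dZ^{[2]} \ast g^{[1]'}(Z^{[1]})\]
    \[dW^{[1]} = \frac{1}{m} dZ^{[1]} X^T, \quad db^{[1]} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[1](i)}\]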

    where \(\ast\) denotes elementwise multiplication.

  • The following figure shows the computational graph for a two-layer neural network, with cached intermediate variables \(z^{[1]}\), \(z^{[2]}\), etc., making backward propagation efficient.

Key insights

  • Forward propagation computes activations \(a^{[\ell]}\) layer by layer.
  • Backpropagation uses cached values to compute gradients efficiently.
  • Gradient descent iteratively updates weights until the cost function converges.
  • Together, these steps define the training loop of neural networks.
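  • Putting the pieces together, a full training loop for a two-layer network fits in a few dozen lines of NumPy. This is a sketch under stated assumptions: the hidden activation is tanh, the data is a toy logical-OR problem, and the learning rate, hidden size, and iteration count are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, p):
    Z1 = p["W1"] @ X + p["b1"]
    A1 = np.tanh(Z1)                          # hidden activation (a choice made here)
    Z2 = p["W2"] @ A1 + p["b2"]
    A2 = sigmoid(Z2)
    return A2, (Z1, A1)                       # cache intermediates for backprop

def cost(A2, Y):
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def backward(X, Y, p, A2, cache):
    _, A1 = cache
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (p["W2"].T @ dZ2) * (1 - A1 ** 2)   # tanh'(Z1) = 1 - A1^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def train(X, Y, n_h=3, alpha=0.5, iterations=2000, seed=0):
    rng = np.random.default_rng(seed)
    p = {"W1": 0.01 * rng.standard_normal((n_h, X.shape[0])),
         "b1": np.zeros((n_h, 1)),
         "W2": 0.01 * rng.standard_normal((1, n_h)),
         "b2": np.zeros((1, 1))}
    history = []
    for _ in range(iterations):
        A2, cache = forward(X, p)             # forward propagation
        history.append(cost(A2, Y))
        grads = backward(X, Y, p, A2, cache)  # backpropagation
        for name in p:                        # gradient descent update
            p[name] -= alpha * grads[name]
    return p, history

# Toy data (hypothetical): the label is 1 when either input is 1 (logical OR).
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 1.0]])
params, history = train(X, Y)
```

  • Plotting `history` is a quick way to confirm that the cost falls over the iterations.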

Random initialization

  • The initialization of weights in a neural network is crucial. If done incorrectly, the network may fail to learn useful representations.

The symmetry breaking problem

  • Suppose we initialize the weight matrix with all zeros:
\[W^{[1]} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}\]
  • In this case, every neuron in the layer computes the same function. More generally, if two neurons start with the same weights, their outputs will always be the same, and during training they will receive identical gradients. As a result, they will continue to evolve identically.

  • This makes multiple neurons redundant, reducing the model’s capacity. This issue is the symmetry problem, and avoiding it is called symmetry breaking.

Random small weights

  • To avoid symmetry, we initialize weights randomly:

    \[W^{[1]} = 0.01 \times \text{np.random.randn}(2, 2)\]
  • The factor 0.01 ensures that the initial values are small. This is important because activation functions such as sigmoid and tanh saturate for large inputs, producing near-zero gradients. Starting with small weights keeps activations near zero, where derivatives are meaningful and training progresses faster.

  • Bias terms, on the other hand, do not suffer from symmetry issues. Thus, biases can safely be initialized as zeros:

    \[b^{[1]} = \text{np.zeros}((2,1))\]
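  • The effect of symmetry can be checked numerically. The sketch below computes the hidden-layer weight gradient for a tiny network with tanh hidden units and zero biases (assumptions made for brevity), first with identical rows and then with small random weights. The constant 0.5 is used instead of zero so the identical gradients are visibly non-zero.

```python
import numpy as np

def hidden_gradient(W1, W2, X, Y):
    """dW1 for a 2-layer net with tanh hidden units, sigmoid output, and zero biases."""
    A1 = np.tanh(W1 @ X)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1)))
    dZ2 = A2 - Y
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    return dZ1 @ X.T / X.shape[1]

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 8))     # hypothetical inputs
Y = (X[0:1] > 0).astype(float)      # hypothetical labels

# Identical starting weights: both hidden neurons receive the same gradient row.
same = hidden_gradient(np.full((2, 2), 0.5), np.full((1, 2), 0.5), X, Y)

# Small random weights: the gradient rows differ, so the neurons can diverge.
W1 = 0.01 * rng.standard_normal((2, 2))
W2 = 0.01 * rng.standard_normal((1, 2))
different = hidden_gradient(W1, W2, X, Y)
```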

Modern initialization strategies

  • In practice, scaling factors better than 0.01 are often used, depending on the activation function:

    • Xavier initialization (Glorot and Bengio, 2010): designed for sigmoid and tanh activations. Weights are drawn from a distribution with variance proportional to \(1 / n_{\text{in}}\).
    • He initialization (He et al., 2015): designed for ReLU activations. Weights are drawn with variance proportional to \(2 / n_{\text{in}}\).
  • These methods ensure that gradients neither vanish nor explode as they propagate through layers.
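  • Both schemes amount to rescaling standard normal draws. The sketch below uses the variances quoted above with a Gaussian distribution; the function name, the scheme labels, and the layer sizes are illustrative, and other variants of these schemes exist.

```python
import numpy as np

def init_weights(n_out, n_in, scheme, rng):
    """Gaussian weights with variance 1/n_in ("xavier") or 2/n_in ("he")."""
    variance = {"xavier": 1.0 / n_in, "he": 2.0 / n_in}[scheme]
    return rng.standard_normal((n_out, n_in)) * np.sqrt(variance)

rng = np.random.default_rng(0)
W_xavier = init_weights(256, 512, "xavier", rng)  # for a tanh or sigmoid layer
W_he = init_weights(256, 512, "he", rng)          # for a ReLU layer
```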

  • The following figure illustrates how poor initialization can lead to redundancy between neurons, while random initialization breaks symmetry and enables learning.

Deep neural networks

  • So far, we have worked with shallow networks (a single hidden layer). In practice, many modern architectures rely on deep neural networks (DNNs), which stack multiple hidden layers to extract increasingly complex representations.

Notation

  • \(L\): the number of layers in the network.
  • \(n^{[\ell]}\): the number of neurons in layer \(\ell\).
  • \(a^{[\ell]}\): the activations at layer \(\ell\).
  • \(W^{[\ell]}, b^{[\ell]}\): weights and biases at layer \(\ell\).
  • \(z^{[\ell]}\): the pre-activation (linear output) at layer \(\ell\), before applying the activation function \(g^{[\ell]}\).

  • For example, in the following figure, the network has \(L = 4\), with different widths across layers.

Forward propagation in a deep network

  • Generalizing from shallow networks, the propagation step for one training example is:

    \[z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}\] \[a^{[\ell]} = g^{[\ell]}(z^{[\ell]})\]
    • for \(\ell = 1, \dots, L\).
  • For the entire dataset (matrix form):

    \[Z^{[\ell]} = W^{[\ell]} A^{[\ell-1]} + b^{[\ell]}\] \[A^{[\ell]} = g^{[\ell]}(Z^{[\ell]})\]
    • where \(A^{[0]} = X\) is the input data, and \(b^{[\ell]}\) is broadcast across the \(m\) columns (one per training example).
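
  • The matrix form maps directly onto a loop over layers. The following is a minimal sketch, assuming ReLU hidden layers and a sigmoid output layer; the dictionary keys `W1`, `b1`, ... and the layer sizes are illustrative choices.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params, L):
    """Forward pass through L layers; returns the output and per-layer caches."""
    A = X  # A^{[0]} = X
    caches = []
    for l in range(1, L + 1):
        W, b = params[f"W{l}"], params[f"b{l}"]
        Z = W @ A + b          # b has shape (n_l, 1) and broadcasts over columns
        caches.append((A, Z))  # store A^{[l-1]} and Z^{[l]}
        A = sigmoid(Z) if l == L else relu(Z)
    return A, caches

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 2, 1]   # n^{[0]}, ..., n^{[3]}
L = len(layer_sizes) - 1
params = {}
for l in range(1, L + 1):
    params[f"W{l}"] = 0.01 * rng.standard_normal((layer_sizes[l], layer_sizes[l - 1]))
    params[f"b{l}"] = np.zeros((layer_sizes[l], 1))

X = rng.standard_normal((3, 5))  # 3 features, m = 5 examples
AL, caches = forward(X, params, L)
print(AL.shape)  # (1, 5)
```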

Getting dimensions right

  • One common source of mistakes is mismatching matrix dimensions. As a rule of thumb:

    • If \(A^{[\ell-1]} \in \mathbb{R}^{n^{[\ell-1]} \times m}\), where \(m\) is the number of training examples, then

      • \[W^{[\ell]} \in \mathbb{R}^{n^{[\ell]} \times n^{[\ell-1]}}\]
      • \[b^{[\ell]} \in \mathbb{R}^{n^{[\ell]} \times 1}\]
      • \[Z^{[\ell]}, A^{[\ell]} \in \mathbb{R}^{n^{[\ell]} \times m}\]
  • The following figure illustrates a walkthrough of matrix dimensions in a 2-input, 3-hidden-unit example.
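
  • A quick way to catch dimension mistakes is to assert the expected shapes in code. This sketch uses a 2-input, 3-hidden-unit layer with an assumed batch of \(m = 4\) examples; the all-ones values are placeholders chosen only to make the result easy to verify.

```python
import numpy as np

m = 4            # number of training examples (assumed for illustration)
n0, n1 = 2, 3    # 2 inputs, 3 hidden units

A0 = np.ones((n0, m))    # A^{[0]}: (n0, m)
W1 = np.ones((n1, n0))   # W^{[1]}: (n1, n0)
b1 = np.zeros((n1, 1))   # b^{[1]}: (n1, 1), broadcast over the m columns

Z1 = W1 @ A0 + b1        # (n1, n0) @ (n0, m) -> (n1, m)
assert Z1.shape == (n1, m)
print(Z1.shape)  # (3, 4)
```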

Why deep representations?

  • Deeper layers capture progressively more complex features:

    • Early layers detect low-level structures (edges, curves).
    • Mid-level layers detect parts (eyes, ears).
    • Higher layers detect objects (faces).
  • This hierarchical feature extraction mirrors human perception.

  • The following figure shows how deeper layers analyze increasingly complex features.

Example: logic trees

  • Consider the boolean function:

    \[y = x_1 \lor (x_2 \land x_3)\]
  • A deep tree can represent this efficiently with few nodes, while a shallow tree needs exponentially many nodes to cover all combinations.

  • The following figure compares deep and shallow logic tree representations of the same function, \(y = x_1 \lor (x_2 \land x_3)\): (Left) few nodes are required for a deeper tree, and (Bottom) exponentially more nodes are required for a shallow tree, since it must test every possible combination of \(x_1\), \(x_2\), and \(x_3\). Deeper trees can therefore represent the same information as shallow trees with fewer nodes.

  • This demonstrates the expressive efficiency of deeper architectures.
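
  • The contrast can be made concrete in a few lines. In this sketch, `deep` composes two gates, while `shallow` is an assumed stand-in for a flat representation that stores every input combination mapping to true; for \(n\) inputs such a table has \(2^n\) rows to consider.

```python
from itertools import product

def deep(x1, x2, x3):
    """Two composed gates: an AND whose result feeds an OR."""
    return x1 or (x2 and x3)

# Flat alternative: enumerate all 2^3 input combinations and keep the true ones.
all_rows = list(product([False, True], repeat=3))
true_rows = [bits for bits in all_rows if deep(*bits)]

def shallow(x1, x2, x3):
    """Look the input up in the enumerated table."""
    return (x1, x2, x3) in true_rows

print(len(all_rows), len(true_rows))  # 8 5
```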

Forward and backward propagation

  • Training deep networks follows the same pattern as shallow ones:

    1. Forward propagation: compute activations layer by layer.
    2. Cache intermediate values: store \(z^{[\ell]}\) and \(a^{[\ell-1]}\) for each layer.
    3. Backward propagation: compute gradients with respect to each parameter using the chain rule.

Forward propagation

  • For layer \(\ell\):

    \[z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}\] \[a^{[\ell]} = g^{[\ell]}(z^{[\ell]})\]
    • where \(g^{[\ell]}\) is the activation function (sigmoid, ReLU, tanh, etc.).

Backward propagation

  • Given \(dA^{[\ell]}\) (the derivative of the cost with respect to activations at layer \(\ell\)), we compute:

    \[dZ^{[\ell]} = dA^{[\ell]} \ast g^{[\ell]'}(Z^{[\ell]})\] \[dW^{[\ell]} = \frac{1}{m} dZ^{[\ell]} (A^{[\ell-1]})^T\] \[db^{[\ell]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[\ell](i)}\] \[dA^{[\ell-1]} = (W^{[\ell]})^T dZ^{[\ell]}\]
    • where \(\ast\) denotes elementwise multiplication and the sum in \(db^{[\ell]}\) runs over the \(m\) training examples (the columns of \(dZ^{[\ell]}\)).
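
  • These four formulas translate almost line by line into NumPy. The sketch below assumes a tanh activation for the layer; the function name `layer_backward`, the `g_prime` argument, and the random stand-in for the upstream gradient are illustrative.

```python
import numpy as np

def tanh_prime(z):
    """Derivative of tanh, evaluated elementwise."""
    return 1 - np.tanh(z) ** 2

def layer_backward(dA, cache, W, g_prime):
    """Backward step for one layer, given dA^{[l]} and the cached (A^{[l-1]}, Z^{[l]})."""
    A_prev, Z = cache
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                         # elementwise product
    dW = dZ @ A_prev.T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m   # sum over the m examples
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((4, 5))   # n^{[l-1]} = 4, m = 5
W = rng.standard_normal((3, 4))        # n^{[l]} = 3
b = rng.standard_normal((3, 1))
Z = W @ A_prev + b
dA = rng.standard_normal((3, 5))       # stand-in for the upstream gradient

dA_prev, dW, db = layer_backward(dA, (A_prev, Z), W, tanh_prime)
print(dA_prev.shape, dW.shape, db.shape)  # (4, 5) (3, 4) (3, 1)
```

  • Each gradient has the same shape as the quantity it corresponds to, which is a useful sanity check when implementing the backward pass.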

Two-layer example

  • The following figure shows forward propagation across two layers (top path), with intermediate variables cached, and backward propagation (bottom path) computing parameter updates.

Generalization to \(L\) layers

  • For an \(L\)-layer network:

    • Forward propagation runs sequentially from layer 1 to \(L\).
    • Backward propagation runs in reverse order, from \(L\) down to 1, applying the above formulas.
  • This recursive structure makes implementation modular: one forward function and one backward function per layer type.
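
  • Putting the two loops together gives a complete training step. The following is a minimal sketch under stated assumptions: tanh hidden layers, a sigmoid output with binary cross-entropy cost, Xavier-style initialization, plain gradient descent, and a synthetic linearly separable dataset. All function names, layer sizes, and the learning rate are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_sizes, rng):
    params = {}
    for l in range(1, len(layer_sizes)):
        scale = np.sqrt(1.0 / layer_sizes[l - 1])
        params[f"W{l}"] = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * scale
        params[f"b{l}"] = np.zeros((layer_sizes[l], 1))
    return params

def forward(X, params, L):
    A, caches = X, []
    for l in range(1, L + 1):                 # layers 1, ..., L
        Z = params[f"W{l}"] @ A + params[f"b{l}"]
        caches.append((A, Z))                 # cache A^{[l-1]} and Z^{[l]}
        A = sigmoid(Z) if l == L else np.tanh(Z)
    return A, caches

def backward(AL, Y, params, caches, L):
    grads, m = {}, Y.shape[1]
    dZ = AL - Y                               # sigmoid output with cross-entropy
    for l in range(L, 0, -1):                 # layers L, ..., 1
        A_prev, _ = caches[l - 1]
        grads[f"dW{l}"] = dZ @ A_prev.T / m
        grads[f"db{l}"] = np.sum(dZ, axis=1, keepdims=True) / m
        if l > 1:
            dA_prev = params[f"W{l}"].T @ dZ
            Z_prev = caches[l - 2][1]
            dZ = dA_prev * (1 - np.tanh(Z_prev) ** 2)
    return grads

def cost(AL, Y):
    eps = 1e-12                               # avoids log(0)
    return float(-np.mean(Y * np.log(AL + eps) + (1 - Y) * np.log(1 - AL + eps)))

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0:1] + X[1:2] > 0).astype(float)       # synthetic, linearly separable labels
layer_sizes = [2, 8, 4, 1]
L = len(layer_sizes) - 1
params = init_params(layer_sizes, rng)

alpha = 0.1                                   # learning rate
costs = []
for _ in range(300):
    AL, caches = forward(X, params, L)
    costs.append(cost(AL, Y))
    grads = backward(AL, Y, params, caches, L)
    for l in range(1, L + 1):
        params[f"W{l}"] -= alpha * grads[f"dW{l}"]
        params[f"b{l}"] -= alpha * grads[f"db{l}"]

print(costs[0], costs[-1])
```

  • The backward loop only needs the values cached during the forward loop, which is why the forward pass stores \(A^{[\ell-1]}\) and \(Z^{[\ell]}\) for every layer.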

Parameters vs. Hyperparameters

  • When training deep networks, it is important to distinguish between parameters and hyperparameters.

Parameters

  • Parameters are the quantities learned during training:

    • Weights \(W^{[\ell]}\)
    • Biases \(b^{[\ell]}\)
  • They are updated at each iteration via gradient descent (or a variant such as Adam or RMSProp).

Hyperparameters

  • Hyperparameters, on the other hand, are design choices made before training. They control how learning proceeds, but are not directly updated by the training algorithm. Examples include:

    • Learning rate \(\alpha\)
    • Number of layers \(L\)
    • Number of units per layer \(n^{[\ell]}\)
    • Choice of activation functions (ReLU, sigmoid, tanh, etc.)
    • Number of iterations or epochs
    • Batch size
    • Optimization algorithm (SGD, Adam, RMSProp)
    • Regularization strength (e.g., \(\lambda\) in L2 regularization, dropout rates)

Tuning hyperparameters

  • Hyperparameters have a large impact on performance. Since no closed-form solution exists for the “best” settings, they are typically tuned experimentally:

    • Grid search: systematically trying combinations of values.
    • Random search: sampling values from distributions, which is often more efficient than grid search (Bergstra & Bengio, 2012).
    • Bayesian optimization: probabilistic modeling of hyperparameter space.
    • Automated ML (AutoML) approaches.
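
  • The first two strategies can be sketched in a few lines. Everything specific here is an assumption for illustration: `validation_score` is a hypothetical stand-in for "train a model and return its validation score", and sampling the learning rate on a log scale is a common convention rather than something stated in these notes.

```python
import itertools
import numpy as np

def validation_score(learning_rate, n_hidden):
    """Hypothetical objective; a real one would train and evaluate a model."""
    return -((np.log10(learning_rate) + 2.0) ** 2) - 0.001 * (n_hidden - 64) ** 2

# Grid search: try every combination of the listed values.
grid_lr = [1e-4, 1e-3, 1e-2, 1e-1]
grid_hidden = [16, 64, 256]
grid_trials = list(itertools.product(grid_lr, grid_hidden))
best_grid = max(grid_trials, key=lambda t: validation_score(*t))

# Random search: same budget of 12 trials, but values are sampled.
rng = np.random.default_rng(0)
random_trials = [
    (10 ** rng.uniform(-4, -1), int(rng.integers(16, 257)))  # log-uniform lr, integer width
    for _ in range(len(grid_trials))
]
best_random = max(random_trials, key=lambda t: validation_score(*t))

print(best_grid)
print(best_random)
```

  • With the same number of trials, random search tries 12 distinct learning rates, whereas the grid tries only 4; this is one intuition for why it can be more efficient when only a few hyperparameters matter.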

Summary

  • Parameters: learned automatically (weights, biases).
  • Hyperparameters: set outside the training loop, either by hand or by a search procedure (learning rate, architecture, etc.).

Connection to the brain

  • The term neural network originates from early attempts to mimic the brain. However, while inspired by neuroscience, modern deep learning architectures are only loosely connected to how biological neurons function.

Biological neurons

  • A biological neuron receives electrical signals through its dendrites.
  • If the total input crosses a threshold, the neuron “fires,” sending an output spike through the axon to other neurons.
  • Learning in the brain is thought to occur via mechanisms such as synaptic plasticity, where synapse strengths adapt based on activity.

Artificial neurons

  • In contrast, artificial neurons are simple mathematical abstractions:

    \[z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}, \quad a^{[\ell]} = g^{[\ell]}(z^{[\ell]})\]
  • Inputs are aggregated linearly.
  • An activation function \(g\) (ReLU, sigmoid, tanh, etc.) introduces nonlinearity.
  • Parameters \(W^{[\ell]}, b^{[\ell]}\) are learned via backpropagation, an algorithm with no known direct biological equivalent.

The loose analogy

  • Both systems involve networks of interconnected units passing signals.
  • Both exhibit emergent hierarchical representations: simple features in early layers, complex concepts in deeper layers.
  • However, there is no evidence that the brain computes gradients or runs anything resembling stochastic gradient descent.

Takeaway

  • Neural networks are brain-inspired, not brain-replicating. They borrow terminology and a high-level metaphor, but the actual computational mechanisms differ significantly. As Yann LeCun has emphasized, modern deep learning is a powerful engineering tool, not a neuroscientific model of cognition.

Citation

If you found our work useful, please cite it as:

@article{Chadha2020NeuralNetworks,
  title   = {Neural Networks},
  author  = {Chadha, Aman},
  journal = {Distilled Notes for Stanford CS230: Deep Learning},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}