Full-Cycle Deep Learning Projects

  • Managing a deep learning project is a multidisciplinary process that requires careful attention across the entire lifecycle, from idea conception to long-term maintenance. While academic literature often focuses narrowly on architectures, training algorithms, and benchmark performance, real-world systems demand a broader perspective.
  • As Andrew Ng and others in industry have emphasized (Amodei et al., 2016), the system view is often more critical than model novelty.

The Seven Stages of a Project

  • Most projects, regardless of domain, can be divided into seven stages:

    1. Project selection
    2. Data acquisition
    3. Deep learning model design
    4. Model training
    5. Model testing
    6. Deployment
    7. Maintenance
  • These stages are not linear but iterative: data issues discovered during testing may require revisiting data acquisition, deployment challenges may necessitate architectural changes, and maintenance often triggers new rounds of retraining.

Choosing a Project

  • A good project balances interest, impact, and feasibility, where feasibility depends largely on data availability and domain knowledge:

    • Interest: Does the project sustain long-term curiosity and motivation?
    • Impact: How will it affect people’s lives or business outcomes?
    • Data: Are sufficient datasets available, or can they be collected at reasonable cost and speed?
    • Domain knowledge: Does the team bring unique expertise to the task?
  • Feasibility is often the most subtle and crucial consideration. In practice, feasibility can be evaluated using human-level performance as a reference point. If humans can perform a task reliably under the same input conditions, it is reasonable to expect a neural network can approach or surpass that performance given enough data and tuning.

  • The following figure illustrates this principle in the context of trigger word detection, where the overall task is decomposed into modular subtasks. Specifically, the trigger word detection problem is decomposed into two stages: voice activity detection (VAD) and trigger word detection, demonstrating modular system design.

Figure 3.1.1: Breaking a problem involving trigger word detection into two parts: voice activity detection (VAD) and trigger word detection.

Illustrative Examples

  • Two canonical examples from industry practice highlight these principles:

    1. Trigger Word Detection (Amazon Alexa, Siri, etc.):

      • Goal: Detect whether a spoken word such as “Alexa” has been uttered.
      • Decomposition: First detect whether speech is present (voice activity detection, VAD), then determine whether the word matches the trigger.
      • Early Prototyping: Even a small dataset of a few hundred samples is enough to start iterating.
    2. Door Lock Face Recognition:

      • Goal: Unlock a door when the system recognizes the registered user’s face.
      • Framing: A binary classification task — given two images, determine if they belong to the same person.
      • Practical Challenge: Starting with a small dataset (collected over a few days) helps quickly identify real-world pitfalls, such as poor performance under different lighting conditions.
  • Both examples emphasize a general development philosophy: start simple, iterate fast, and document everything.

Practical Considerations and Early Prototyping

  • Small-scale experiments (few hundred to thousand examples) can expose hidden difficulties such as accent variability in speech or lighting mismatches in vision.
  • Simple baselines first: For example, non-machine-learning heuristics for VAD (e.g., amplitude thresholds) or activity detection (e.g., frame differencing for motion detection at a door) often provide robust and energy-efficient baselines before moving to full neural solutions.
  • Documentation: Tracking model architectures, hyperparameters, and results across iterations ensures reproducibility and facilitates error analysis later.

  • This prototype-first, optimize-later philosophy is consistent with agile methods and avoids premature optimization.
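  • As a rough illustration of the amplitude-threshold baseline mentioned above, the following sketch flags frames of a waveform as active when their RMS amplitude exceeds a threshold. The function names, frame length, threshold, and toy signal are all illustrative assumptions, not values from the original notes.

```python
import math

def frame_rms(samples, frame_len):
    """Split a waveform into non-overlapping frames and return each frame's RMS amplitude."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms

def detect_voice_activity(samples, frame_len=4, threshold=0.1):
    """Flag a frame as active when its RMS amplitude exceeds the (assumed) threshold."""
    return [r > threshold for r in frame_rms(samples, frame_len)]

# Toy waveform: near-silence, a louder burst, near-silence again.
signal = [0.0, 0.01, -0.01, 0.0, 0.5, -0.6, 0.4, -0.5, 0.0, 0.02, -0.01, 0.0]
flags = detect_voice_activity(signal)
```

  • A heuristic like this gives a reference point that any neural VAD should beat before its extra cost is justified.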

Setting up a Machine Learning Application

  • Once a project has been selected, the next step is to set up the machine learning pipeline: partitioning data, establishing metrics, and preparing the system for robust training and evaluation.

Splitting Data into Training, Development, and Test Sets

  • Deep learning practice requires dividing data into subsets that serve different purposes:

  • Training set: used to fit model parameters.
  • Development (validation) set: used to tune hyperparameters and make design choices.
  • Test set: used only at the end, to provide an unbiased estimate of performance.

  • The following figure illustrates this partitioning process of dividing the available data into training, development, and testing sets.

Figure 3.2.1: The training, development, and testing sets in a machine learning application.
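  • A minimal sketch of such a partition, assuming the examples can be shuffled freely (no temporal or grouped structure). The 80/10/10 proportions and the function name are illustrative assumptions.

```python
import random

def split_dataset(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve out dev and test sets; the remainder is training data."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test

train, dev, test = split_dataset(range(1000))
```

  • Fixing the seed keeps the split reproducible across experiments, so dev-set comparisons between models remain meaningful.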

Bias, Variance, and Error Analysis

  • Performance must be diagnosed in terms of bias and variance:

  • High bias (underfitting): The model is too simple to capture underlying patterns.
  • High variance (overfitting): The model performs well on training data but poorly on unseen data.
  • Balanced tradeoff: Ideally, both bias and variance are low.

  • Error analysis, guided by these categories, helps determine whether the next step should be more data, more regularization, or a different architecture.

Learning Curves

  • Learning curves are a practical diagnostic tool. By plotting training and development error as a function of training set size, one can identify whether a model is limited by bias or variance.

  • If both training and dev error are high → high bias (underfitting).
  • If training error is low but dev error is high → high variance (overfitting).
  • If both are low → good fit.
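  • The rules above can be sketched as a small helper. The tolerance and the use of a target error (for example, an estimate of human-level error) are assumptions for illustration; real projects choose these per task.

```python
def diagnose(train_error, dev_error, target_error=0.0, tolerance=0.02):
    """Classify a fit from training and dev error; thresholds are illustrative assumptions."""
    high_bias = (train_error - target_error) > tolerance      # training error far from target
    high_variance = (dev_error - train_error) > tolerance     # large train/dev gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "good fit"

print(diagnose(0.15, 0.16))   # both errors high, small gap
print(diagnose(0.01, 0.11))   # low training error, large gap
```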

  • The following figure illustrates the bias–variance tradeoff in machine learning models. It shows how bias and variance errors affect model performance, with underfitting and overfitting representing opposite ends of the spectrum.

Figure 3.2.2: Bias and variance tradeoffs in supervised learning.

  • The following figure shows an example of a model with both high variance and high bias.

Figure 3.2.3: High variance and high bias.

Regularizing a Neural Network

  • When training deep networks, a common challenge is overfitting, where the model performs well on training data but poorly on unseen data. Regularization introduces constraints or noise to prevent overfitting and improve generalization.

Regularization

  • Consider logistic regression with a cost function:
\[J(w, b) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)})\]
  • Adding an L2 regularization term yields:
\[J(w, b) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \|w\|_2^2\]
  • This modification penalizes large weights, effectively discouraging overly complex models. L1 regularization is another option, which encourages sparsity.

  • For multi-layer networks, the regularized cost function generalizes to:

\[J(W^{[1]}, b^{[1]}, ..., W^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{\ell=1}^L \|W^{[\ell]}\|_F^2\]
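  • The regularized cost can be computed directly from its definition. The sketch below assumes binary cross-entropy as the per-example loss \(L\) and represents each weight matrix as a list of rows; both choices, and the toy numbers, are assumptions for illustration.

```python
import math

def cross_entropy(y_hat, y):
    """Binary cross-entropy loss for one example (assumed choice of L)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def l2_regularized_cost(y_hats, ys, weight_matrices, lam):
    """Average loss plus (lambda / 2m) times the sum of squared Frobenius norms."""
    m = len(ys)
    data_cost = sum(cross_entropy(p, y) for p, y in zip(y_hats, ys)) / m
    frob_sq = sum(w * w for W in weight_matrices for row in W for w in row)
    return data_cost + lam / (2 * m) * frob_sq

W1 = [[1.0, -2.0], [0.5, 0.0]]
W2 = [[3.0, 1.0]]
cost_plain = l2_regularized_cost([0.9, 0.2], [1, 0], [W1, W2], lam=0.0)
cost_reg = l2_regularized_cost([0.9, 0.2], [1, 0], [W1, W2], lam=0.1)
```

  • Note that only the weight matrices enter the penalty; the biases are left unregularized, matching the formula above.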

Why Regularization Reduces Overfitting

  • If the weights connected to a neuron are very small, the neuron’s contribution is negligible, simplifying the model.

  • The following figure shows how reducing the magnitude of weights diminishes the influence of a neuron, effectively simplifying the hypothesis class and reducing overfitting risk.

Figure 3.3.1: If the weights connected to a neuron are near 0, the neuron contributes little to the hypothesis, making the model simpler.

  • When weights shrink, pre-activations stay close to zero, where activation functions such as the hyperbolic tangent behave almost linearly. Each layer then acts nearly like a linear map, which reduces model complexity.

  • The following figure illustrates how the hyperbolic tangent function is nearly linear around the origin, making it less expressive when weights are small.

Figure 3.3.2: The hyperbolic tangent function is nearly linear near 0.

Dropout Regularization

  • Dropout randomly removes hidden neurons during training, forcing the network to not rely too heavily on any single feature (Srivastava et al., 2014).

  • The following figure shows a fully connected 4-layer neural network before dropout is applied.

Figure 3.3.3: A 4-layer neural network with 3 hidden layers of 4 neurons each.

  • During dropout, some neurons are removed, leaving only a subset active.

  • The following figure illustrates the same neural network after dropout, where several hidden neurons are removed.

Figure 3.3.4: A simplified neural network after dropout removes some hidden neurons.

  • At test time, all neurons are used, but their outputs are scaled to account for dropout during training.
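  • The sketch below uses the "inverted dropout" variant, which is a swap relative to the description above: surviving activations are rescaled by \(1 / \text{keep\_prob}\) during training, so no scaling is needed at test time. The expected activation is the same either way. The function name and `keep_prob` value are assumptions.

```python
import random

def dropout_forward(activations, keep_prob, rng, training=True):
    """Inverted dropout: zero each unit with prob 1 - keep_prob, rescale survivors by 1 / keep_prob."""
    if not training:
        return list(activations)  # test time: all units active, no extra scaling
    out = []
    for a in activations:
        keep = rng.random() < keep_prob
        out.append(a / keep_prob if keep else 0.0)
    return out

rng = random.Random(0)
a = [1.0] * 10000
dropped = dropout_forward(a, keep_prob=0.8, rng=rng)
mean_train = sum(dropped) / len(dropped)   # close to 1.0 in expectation
test_out = dropout_forward(a, keep_prob=0.8, rng=rng, training=False)
```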

Other Regularization Methods

  • When new data is unavailable, data augmentation artificially increases training diversity. For images, this can involve rotations, crops, color shifts, and reflections.

  • The following figure demonstrates examples of data augmentation techniques such as rotation, translation, and color perturbations applied to image inputs.

Figure 3.3.5: Data augmentation examples applied to images.

  • Another method is early stopping, where training halts before overfitting occurs, trading off full convergence for generalization.
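  • One common way to implement early stopping is a patience rule on the dev loss. The sketch below is a simplified version under that assumption; the `patience` parameter and the recorded losses are hypothetical.

```python
def early_stopping_epoch(dev_losses, patience=2):
    """Return the best epoch index, stopping once dev loss has not improved for `patience` epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # training would halt here
    return best_epoch

# Dev loss improves, then rises for two epochs; the later 0.5 is never reached.
best = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.5])
```

  • In practice the parameters saved at the best epoch are restored after stopping.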

Setting Up an Optimization Problem

  • Optimization is at the heart of training neural networks. Good optimization practices help avoid slow convergence, exploding/vanishing gradients, or poor generalization.

Normalizing Inputs

  • To speed up gradient descent, we normalize inputs by centering the mean at 0 and setting variance to 1.

  • The following figure shows the original distribution of features before normalization, with mean centered at 5 and variance 2.25.

Figure 3.4.1: Normal distribution of data centered at 5 with variance 2.25.

  • The following figure illustrates the same dataset after centering, shifting the mean to 0.

Figure 3.4.2: Centered data with mean shifted to 0.

  • The following figure shows the dataset after scaling variance to 1, completing the normalization process.

Figure 3.4.3: Normalized data after centering and variance scaling.

  • The following figure compares contour plots for gradient descent on unnormalized vs normalized data, demonstrating that normalization creates more circular contours and accelerates convergence.

Figure 3.4.4: Contour plots for normalized (left) vs unnormalized (right) data.
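  • The centering and scaling steps can be sketched for a single feature as follows. The toy data are chosen (as an assumption) to have mean 5 and variance 2.25, like the figures above. The statistics are estimated on the training set and then reused for test inputs, which is standard practice.

```python
import math

def fit_normalizer(column):
    """Estimate the mean and standard deviation of one feature on the training set."""
    m = len(column)
    mu = sum(column) / m
    var = sum((x - mu) ** 2 for x in column) / m
    return mu, math.sqrt(var)

def normalize(column, mu, sigma):
    """Center to mean 0 and scale to variance 1 using the given statistics."""
    return [(x - mu) / sigma for x in column]

train_feature = [3.5, 6.5, 3.5, 6.5]           # mean 5, variance 2.25
mu, sigma = fit_normalizer(train_feature)
train_norm = normalize(train_feature, mu, sigma)
test_norm = normalize([8.0], mu, sigma)         # reuse the training statistics
```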

Vanishing and Exploding Gradients

  • Deep networks can suffer from extremely small or large gradients depending on weight initialization.

  • The following figure illustrates a 7-layer neural network, where signal propagation across many layers can lead to vanishing or exploding gradients if initialization is poorly chosen.

Figure 3.4.5: 7-layer neural network.

  • If weights are consistently smaller than 1 in magnitude \((<1)\), activations shrink exponentially with depth and vanish; if larger \((>1)\), they grow exponentially and explode, destabilizing training.

Weight Initialization for Deep Neural Networks

  • To prevent vanishing/exploding gradients, weights are initialized with variance depending on the number of inputs:

  • Xavier initialization (Glorot & Bengio, 2010) for tanh:

    \[\sigma^2 = \frac{1}{n^{[\ell-1]}}\]
  • He initialization (He et al., 2015) for ReLU:

    \[\sigma^2 = \frac{2}{n^{[\ell-1]}}\]
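  • A minimal sketch of both schemes, assuming weights are drawn from a zero-mean Gaussian with the variances above. The function name and layer sizes are illustrative assumptions.

```python
import math
import random

def init_weights(n_in, n_out, scheme, rng):
    """Draw an n_out x n_in weight matrix from a zero-mean Gaussian with the stated variance."""
    variance = {"xavier": 1.0 / n_in, "he": 2.0 / n_in}[scheme]
    std = math.sqrt(variance)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
W = init_weights(n_in=400, n_out=100, scheme="he", rng=rng)
flat = [w for row in W for w in row]
sample_var = sum(w * w for w in flat) / len(flat)   # should be near 2 / 400 = 0.005
```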

Gradient Numerical Approximations

  • Numerical gradient checks use finite differences to approximate derivatives.

  • One-sided limit:

\[f'(\theta) = \lim_{\epsilon \to 0} \frac{f(\theta + \epsilon) - f(\theta)}{\epsilon}\]
  • Two-sided limit (preferred):
\[f'(\theta) = \lim_{\epsilon \to 0} \frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}\]
  • The following figure visualizes the two-sided derivative approximation using finite differences.

Figure 3.4.6: Two-sided limit derivative definition.

Gradient Checking

  • Gradient checking compares backpropagated derivatives with numerical approximations to detect bugs. Regularization terms must be included, but dropout cannot be used due to randomness.
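  • The check can be sketched on a toy cost whose gradient is known in closed form. The cost function, the step size `eps`, and the relative-difference measure are assumptions chosen for illustration; a very small relative difference suggests the analytic gradient is correct.

```python
def numerical_gradient(f, theta, eps=1e-5):
    """Two-sided finite-difference approximation of the gradient of f at theta."""
    grads = []
    for i in range(len(theta)):
        plus = list(theta)
        plus[i] += eps
        minus = list(theta)
        minus[i] -= eps
        grads.append((f(plus) - f(minus)) / (2 * eps))
    return grads

def relative_difference(analytic, numeric):
    """Norm of the difference divided by the sum of the norms."""
    num = sum((a - n) ** 2 for a, n in zip(analytic, numeric)) ** 0.5
    den = sum(a * a for a in analytic) ** 0.5 + sum(n * n for n in numeric) ** 0.5
    return num / den

# Toy cost J(theta) = theta_0^2 + 3 * theta_0 * theta_1 and its hand-derived gradient.
def cost(theta):
    return theta[0] ** 2 + 3 * theta[0] * theta[1]

def cost_grad(theta):
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0]]

theta = [1.5, -2.0]
diff = relative_difference(cost_grad(theta), numerical_gradient(cost, theta))
```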

Optimization Algorithms

  • Efficient optimization methods allow neural networks to converge faster and avoid poor local regions. This section covers gradient descent variants and adaptive optimization techniques.

Mini-Batch Gradient Descent

  • Instead of using all training examples (batch gradient descent), we split the data into mini-batches to speed up training and improve generalization.

  • The following figure compares the cost progression for batch gradient descent (smooth convergence) and mini-batch gradient descent (noisier, but faster).

Figure 3.5.1: Cost progression for batch gradient descent (left) and mini-batch gradient descent (right).

  • Mini-batch size is a hyperparameter. Extreme cases:

    • Mini-batch = \(m\) → batch gradient descent
    • Mini-batch = 1 → stochastic gradient descent
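  • A sketch of the partitioning step, assuming the data are reshuffled before being cut into batches. The function name and toy data are illustrative.

```python
import random

def make_mini_batches(X, Y, batch_size, seed=0):
    """Shuffle example indices, then cut them into consecutive mini-batches."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batches.append(([X[i] for i in idx], [Y[i] for i in idx]))
    return batches

X = [[float(i)] for i in range(10)]
Y = [i % 2 for i in range(10)]
batches = make_mini_batches(X, Y, batch_size=4)   # sizes 4, 4, 2
```

  • Setting `batch_size=len(X)` recovers batch gradient descent, and `batch_size=1` recovers stochastic gradient descent.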

Exponentially Weighted Averages

  • Exponentially weighted averages smooth noisy signals by weighting recent values more heavily.

  • The following figure shows the raw dataset that exponential smoothing will be applied to.

Figure 3.5.2: Dataset for exponentially weighted average.

  • The following figure shows the exponentially weighted average (red curve), which smooths the dataset.

Figure 3.5.3: Exponentially weighted average curve as a joined plot (red).

  • The following figure demonstrates how different values of \(\beta\) affect smoothness, with \(\beta = 0.01\) (green), 0.6 (purple), and 0.95 (orange).

Figure 3.5.4: \(\beta = 0.01\) (green), 0.6 (purple), 0.95 (orange).

  • The following figure illustrates the effect of weighting, where scaling factors decrease exponentially with time.

Figure 3.5.5: Exponential decay scaling factors applied to data.

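  • The following figure compares the approximation of the time constant \(\tau = \frac{1}{1 - \beta}\) (red) with the real time constant \(\tau = \log_{\beta}{\frac{1}{e}} = -\frac{1}{\ln \beta}\) (blue), i.e., the number of steps after which a value's weight has decayed to \(1/e\).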

Figure 3.5.6: Approximation vs real time constant for exponential decay.

Bias Correction for Exponentially Weighted Averages

  • Initially, exponential averages underestimate values because early terms are biased towards zero. Bias correction rescales the averages for better accuracy.

  • The following figure compares bias-corrected weighted averages (purple) against uncorrected ones (red).

Figure 3.5.8: Bias-corrected weighted average (purple) vs uncorrected (red).
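  • The notes do not spell out the recurrence, so the sketch below assumes the standard form \(v_t = \beta v_{t-1} + (1 - \beta)\theta_t\) with \(v_0 = 0\), and the standard correction \(v_t / (1 - \beta^t)\). On a constant signal, the uncorrected average starts far too low while the corrected one matches the signal immediately.

```python
def ewa(values, beta, bias_correction=False):
    """Exponentially weighted average, optionally divided by (1 - beta^t)."""
    v = 0.0
    out = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

data = [10.0] * 5
raw = ewa(data, beta=0.9)                              # starts near 1.0, climbs slowly
corrected = ewa(data, beta=0.9, bias_correction=True)  # stays near 10.0 throughout
```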

Gradient Descent with Momentum

  • Momentum accelerates learning in relevant directions and dampens oscillations.

  • The following figure shows the contour plot of gradient descent paths with momentum, highlighting smoother and faster convergence compared to plain gradient descent.

Figure 3.5.9: Contour plot showing path of gradient descent with momentum.
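  • A sketch of one common formulation, assumed here: the velocity is an exponentially weighted average of gradients, and parameters move along the velocity. The elongated quadratic cost and the hyperparameter values are illustrative assumptions.

```python
def momentum_step(theta, grad, velocity, lr=0.1, beta=0.9):
    """Update the velocity as a weighted average of gradients, then step along it."""
    new_v = [beta * v + (1 - beta) * g for v, g in zip(velocity, grad)]
    new_theta = [t - lr * v for t, v in zip(theta, new_v)]
    return new_theta, new_v

# Gradient of the elongated bowl J = 0.5 * (theta_0^2 + 25 * theta_1^2).
def grad_fn(theta):
    return [theta[0], 25.0 * theta[1]]

theta, velocity = [5.0, 1.0], [0.0, 0.0]
for _ in range(200):
    theta, velocity = momentum_step(theta, grad_fn(theta), velocity, lr=0.05)
```

  • Averaging gradients cancels the back-and-forth component along the steep axis while the consistent component along the shallow axis accumulates.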

RMSProp

  • RMSProp rescales gradients by their recent magnitudes, reducing oscillations and stabilizing training.

  • The following figure illustrates oscillations in gradient descent due to imbalanced gradient magnitudes across parameters.

Figure 3.5.10: Oscillation in gradient directions (bias vs weight).
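  • A sketch of the RMSProp update under its usual formulation (assumed here): keep an exponentially weighted average of squared gradients and divide each gradient by its square root. The hyperparameter values and toy gradient are illustrative.

```python
import math

def rmsprop_step(theta, grad, sq_avg, lr=0.01, beta=0.9, eps=1e-8):
    """Scale each gradient component by the root of its running mean square."""
    new_sq = [beta * s + (1 - beta) * g * g for s, g in zip(sq_avg, grad)]
    new_theta = [t - lr * g / (math.sqrt(s) + eps)
                 for t, g, s in zip(theta, grad, new_sq)]
    return new_theta, new_sq

theta, sq_avg = [1.0, 1.0], [0.0, 0.0]
grad = [0.01, 100.0]   # very different gradient magnitudes per parameter
new_theta, sq_avg = rmsprop_step(theta, grad, sq_avg)
steps = [t - n for t, n in zip(theta, new_theta)]   # nearly equal step sizes
```

  • Despite a 10,000-fold difference in gradient magnitude, both parameters move by almost the same amount, which is what damps the oscillation along steep directions.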

Adam Optimization Algorithm

  • Adam (Kingma & Ba, 2015) combines RMSProp and momentum. Default hyperparameters include:

    • \(\beta_1 = 0.9\)
    • \(\beta_2 = 0.999\)
    • \(\epsilon = 10^{-8}\)
  • Adam adapts learning rates dynamically for each parameter, making it one of the most widely used optimizers today.
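  • A sketch of a single Adam update with the defaults above, following the commonly published form with bias-corrected first and second moments. The toy cost \(\theta^2\) and the learning rate used in the loop are assumptions for illustration.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum on gradients, RMS scaling, and bias correction."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

theta, m, v = [3.0], [0.0], [0.0]
history = []
for t in range(1, 5001):
    grad = [2.0 * theta[0]]          # gradient of theta^2
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
    history.append(theta[0])
```

  • With bias correction, the very first step has magnitude close to the learning rate regardless of the gradient's scale.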

Learning Rate Decay

  • Large steps near minima prevent convergence. Learning rate decay helps by gradually reducing the step size during training.

  • The following figure shows how gradient descent may bounce around the minimum if learning rate is not decayed.

Figure 3.5.11: Gradient descent may take large steps near the minimum, bouncing around without converging.
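  • The notes do not fix a particular schedule. One common choice, assumed here, is \(\alpha = \alpha_0 / (1 + \text{decay\_rate} \cdot \text{epoch})\); the parameter names and values are illustrative.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """Inverse-time decay: the step size shrinks as epochs accumulate."""
    return alpha0 / (1.0 + decay_rate * epoch)

schedule = [decayed_learning_rate(0.2, 1.0, epoch) for epoch in range(4)]
```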

Local Optima

  • In high-dimensional parameter spaces, local minima are rare. Instead, saddle points and flat plateaus are more common challenges for optimization. This explains why modern optimization algorithms focus on accelerating escape from such plateaus rather than avoiding local minima.

Hyperparameter Tuning

  • Hyperparameter tuning is essential in deep learning, since performance depends strongly on values like learning rate, momentum, hidden units, and mini-batch size. The search for good hyperparameters can significantly influence training efficiency and final accuracy.

Tuning Process

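  • Hyperparameters should be explored systematically. Instead of grid search, random search is often more effective (Bergstra & Bengio, 2012): every trial draws a fresh value for each hyperparameter, so the few hyperparameters that matter most are tested at many more distinct values than a grid of the same size allows.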

  • The following figure compares grid search (left), which samples evenly but inefficiently, with random search (right), which provides better coverage of the hyperparameter space.

Figure 3.6.1: (a) Grid search vs (b) random search for hyperparameters.

  • A coarse-to-fine strategy allows researchers to focus search resources on promising regions after broad exploration.

  • The following figure shows the coarse-to-fine narrowing process, where initial random samples guide subsequent fine-grained search in relevant ranges.

Figure 3.6.2: Coarse-to-fine narrowing in hyperparameter search.

Using an Appropriate Scale

  • Some hyperparameters, like learning rate, vary over several orders of magnitude. Therefore, it is better to sample on a logarithmic scale rather than a linear scale.

  • The following figure compares linear scale binning (top), which wastes resolution in unimportant regions, with logarithmic scale binning (bottom), which better captures orders of magnitude.

Figure 3.6.3: Linear scale binning (top) vs logarithmic scale binning (bottom).

  • For parameters close to 1 (e.g., momentum terms \(\beta\)), tuning is often done on \(1 - \beta\) rather than \(\beta\) directly, since small changes near 1 can make large differences.
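  • Both ideas can be sketched by sampling an exponent uniformly and exponentiating. The ranges below (learning rate in \([10^{-4}, 1]\), \(1 - \beta\) in \([10^{-3}, 10^{-1}]\)) and the function names are illustrative assumptions.

```python
import random

def sample_learning_rate(rng, low_exp=-4, high_exp=0):
    """Sample uniformly in log10 space, so each decade is equally likely."""
    return 10 ** rng.uniform(low_exp, high_exp)

def sample_beta(rng, low_exp=-3, high_exp=-1):
    """Sample 1 - beta on a log scale, giving beta between 0.9 and 0.999."""
    return 1 - 10 ** rng.uniform(low_exp, high_exp)

rng = random.Random(0)
lrs = [sample_learning_rate(rng) for _ in range(1000)]
betas = [sample_beta(rng) for _ in range(1000)]
smallest_decade = sum(1 for lr in lrs if lr < 1e-3)   # roughly a quarter of samples
```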

Hyperparameter Tuning in Practice: Pandas vs Caviar

  • Two common approaches:

    • Pandas approach: Carefully adjust hyperparameters while training one model at a time (useful in resource-constrained settings).
    • Caviar approach: Train many models in parallel with different hyperparameters and select the best-performing one (requires substantial compute).

Batch Normalization

  • Batch normalization improves training stability by normalizing activations within the network. This technique has been shown to accelerate convergence, reduce sensitivity to initialization, and sometimes even reduce the need for other regularization methods (Ioffe & Szegedy, 2015).

Normalizing Activations in a Network

  • Traditionally, we normalize inputs \(a^{[0]}\) before training parameters \((W^{[1]}, b^{[1]})\). Batch normalization extends this idea by normalizing the linear activations \(z^{[\ell]}\) within hidden layers.

  • For a mini-batch of activations

\[\{z^{[\ell](1)}, z^{[\ell](2)}, \dots, z^{[\ell](m)}\},\]
  • we compute the mean and variance:
\[\mu^{[\ell]} = \frac{1}{m} \sum_{i=1}^{m} z^{[\ell](i)}\] \[\sigma^{2[\ell]} = \frac{1}{m} \sum_{i=1}^{m} \left( z^{[\ell](i)} - \mu^{[\ell]} \right)^2\]
  • Normalization step:

    \[z_{\text{norm}}^{[\ell](i)} = \frac{z^{[\ell](i)} - \mu^{[\ell]}}{\sqrt{\sigma^{2[\ell]} + \epsilon}}\]
    • with small constant \(\epsilon \approx 10^{-8}\) to avoid division by zero.

    • Finally, we apply scale and shift:

    \[\tilde{z}^{[\ell](i)} = \gamma^{[\ell]} z_{\text{norm}}^{[\ell](i)} + \beta^{[\ell]}\]
    • where \(\gamma^{[\ell]}\) and \(\beta^{[\ell]}\) are learnable parameters.
  • The following figure shows how each node now incorporates three computations: the linear activation, batch normalization, and the non-linear activation.

Figure 3.7.1: With batch normalization each node computes linear activation, batch-normalized activation, and non-linear activation.
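  • The normalization, scale, and shift steps can be sketched for a single hidden unit over one mini-batch. The function name and toy values are assumptions. The last lines also illustrate a property of mean subtraction: adding a constant offset to every pre-activation leaves the output unchanged.

```python
import math

def batch_norm_forward(z_batch, gamma, beta, eps=1e-8):
    """Normalize one unit's pre-activations over a mini-batch, then scale and shift."""
    m = len(z_batch)
    mu = sum(z_batch) / m
    var = sum((z - mu) ** 2 for z in z_batch) / m
    z_norm = [(z - mu) / math.sqrt(var + eps) for z in z_batch]
    z_tilde = [gamma * zn + beta for zn in z_norm]
    return z_tilde, mu, var

z = [1.0, 2.0, 3.0, 4.0]
z_tilde, mu, var = batch_norm_forward(z, gamma=2.0, beta=0.5)

# A constant offset is removed by the mean subtraction.
shifted, _, _ = batch_norm_forward([zi + 7.0 for zi in z], gamma=2.0, beta=0.5)
max_gap = max(abs(a - b) for a, b in zip(z_tilde, shifted))
```

  • After the transform, the batch has mean \(\beta\) and standard deviation approximately \(\gamma\), here 0.5 and 2.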

Fitting Batch Norm in a Neural Network

  • For layer \(\ell\), the sequence of operations becomes:
\[z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}\] \[\tilde{z}^{[\ell]} = \gamma^{[\ell]} \frac{z^{[\ell]} - \mu^{[\ell]}}{\sqrt{\sigma^{2[\ell]} + \epsilon}} + \beta^{[\ell]}\]
  • Notice that \(b^{[\ell]}\) becomes redundant: subtracting the batch mean cancels any constant offset, and the learnable shift \(\beta^{[\ell]}\) takes over its role.

  • Thus, the parameters per layer reduce to \(W^{[\ell]}, \gamma^{[\ell]}, \beta^{[\ell]}\).

  • During gradient descent, updates are:

    \[W^{[\ell]} := W^{[\ell]} - \alpha \, dW^{[\ell]}\] \[\gamma^{[\ell]} := \gamma^{[\ell]} - \alpha \, d\gamma^{[\ell]}\] \[\beta^{[\ell]} := \beta^{[\ell]} - \alpha \, d\beta^{[\ell]}\]
    • where \(\alpha\) is the learning rate.

Batch Norm at Test Time

  • At test time, we cannot compute \(\mu^{[\ell]}\) and \(\sigma^{2[\ell]}\) from the batch, since inference is often on single samples. Instead, we use exponentially weighted averages estimated during training.

  • For test sample \(z^{[\ell](i)}\):

    \[z_{\text{norm}}^{[\ell](i)} = \frac{z^{[\ell](i)} - \mu^{[\ell]}_{\text{EMA}}}{\sqrt{\sigma^{2[\ell]}_{\text{EMA}} + \epsilon}}\]
    • where \(\mu^{[\ell]}_{\text{EMA}}, \sigma^{2[\ell]}_{\text{EMA}}\) are moving averages across mini-batches during training.
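  • A sketch of how the moving averages might be maintained during training and used at inference. The averaging coefficient `momentum`, the initial values (0 for the mean, 1 for the variance), and the repeated batch statistics are assumptions for illustration.

```python
import math

def update_running_stats(running_mu, running_var, batch_mu, batch_var, momentum=0.9):
    """Exponentially weighted update of the statistics after each training mini-batch."""
    running_mu = momentum * running_mu + (1 - momentum) * batch_mu
    running_var = momentum * running_var + (1 - momentum) * batch_var
    return running_mu, running_var

def batch_norm_inference(z, running_mu, running_var, gamma, beta, eps=1e-8):
    """Normalize a single test-time value with the stored moving averages."""
    z_norm = (z - running_mu) / math.sqrt(running_var + eps)
    return gamma * z_norm + beta

running_mu, running_var = 0.0, 1.0
for _ in range(200):   # pretend every training mini-batch had mean 2.5 and variance 1.25
    running_mu, running_var = update_running_stats(running_mu, running_var, 2.5, 1.25)

out = batch_norm_inference(2.5, running_mu, running_var, gamma=2.0, beta=0.5)
```

  • Because the input equals the learned running mean, the normalized value is close to zero and the output is close to the shift \(\beta\).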

Multi-class Classification

  • So far, we have focused on binary classification problems. Many practical problems, however, require distinguishing among multiple classes (e.g., recognizing digits, animals, or objects). Neural networks extend naturally to multi-class classification.

Softmax Regression

  • Suppose we have \(C\) classes. The final layer of the neural network will have \(C\) output neurons, each corresponding to the probability of one class.

  • The following figure illustrates a multi-class classification setup where labels map to categories: 0 → cat, 1 → dog, 2 → lion, 3 → zebra.

Figure 3.8.1: Multi-class classification with labels 0 → cat, 1 → dog, 2 → lion, 3 → zebra.

  • For input \(X\), the last layer produces logits \(z^{[L]}\). We exponentiate each component:
\[t_i = e^{z^{[L]}_i}, \quad i \in \{1, \dots, C\}\]
  • Then normalize to obtain probabilities:

    \[a^{[L]}_i = \frac{t_i}{\sum_{j=1}^C t_j}\]
    • where the vector \(a^{[L]}\) contains all class probabilities.
  • The following figure shows how each neuron in the final layer corresponds to the probability of one class, forming a probability distribution across categories.

Figure 3.8.2: Each neuron in the final layer corresponds to the probability of a class.

Example

  • Suppose:
\[z^{[L]} = \begin{bmatrix} 5 \\ 2 \\ -1 \\ 3 \end{bmatrix}\]
  • Exponentiating gives:
\[t = \begin{bmatrix} e^5 \\ e^2 \\ e^{-1} \\ e^3 \end{bmatrix} \approx \begin{bmatrix} 148.4 \\ 7.4 \\ 0.4 \\ 20.1 \end{bmatrix}\]
  • The sum is:
\[\sum_{j=1}^{4} t_j \approx 176.3\]
  • So the probabilities are:
\[a^{[L]} \approx \begin{bmatrix} 0.842 \\ 0.042 \\ 0.002 \\ 0.114 \end{bmatrix}\]
  • Thus, the model predicts class 0 (cat) with highest probability.
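  • The worked example can be reproduced with a short sketch. Subtracting the maximum logit before exponentiating is a numerical-stability step added here as an assumption; it does not change the resulting probabilities.

```python
import math

def softmax(logits):
    """Exponentiate shifted logits and normalize so the outputs sum to 1."""
    shift = max(logits)                       # stability trick, not in the text above
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([5.0, 2.0, -1.0, 3.0])
predicted = probs.index(max(probs))           # index of the most probable class
print([round(p, 3) for p in probs], predicted)
```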

Training a Softmax Classifier

  • For a given true label \(y\), the loss is defined as the cross-entropy loss (Bishop, 2006):
\[L(\hat{y}, y) = - \sum_{j=1}^C y_j \log \hat{y}_j\]
  • This penalizes the model heavily if it assigns low probability to the correct class.

  • For training, we use the average across \(m\) training examples:

\[J = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)})\]
  • Backpropagation through the softmax layer yields a simple form:

    \[\frac{\partial J}{\partial z^{[L]}} = \hat{y} - y\]
    • which is similar in spirit to binary logistic regression.
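  • The gradient formula can be checked numerically for a single example. The sketch below compares \(\hat{y} - y\) against two-sided finite differences of the cross-entropy loss; the logits, the one-hot label, and the step size are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, y_onehot):
    """Cross-entropy between a one-hot label and the softmax of the logits."""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs))

logits = [5.0, 2.0, -1.0, 3.0]
y = [0, 0, 0, 1]
analytic = [p - t for p, t in zip(softmax(logits), y)]   # y_hat - y

eps = 1e-6
numeric = []
for i in range(len(logits)):
    plus = list(logits)
    plus[i] += eps
    minus = list(logits)
    minus[i] -= eps
    numeric.append((cross_entropy(plus, y) - cross_entropy(minus, y)) / (2 * eps))

max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
```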

Citation

If you found our work useful, please cite it as:

@article{Chadha2020OptimizingStructuringNNs,
  title   = {Optimizing and Structuring Neural Networks},
  author  = {Chadha, Aman},
  journal = {Distilled Notes for Stanford CS230: Deep Learning},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}