Overview

  • In previous chapters, we introduced the fundamental building blocks of neural networks and applied them directly to vectorized image data. While such models can capture simple patterns, they do not scale well to large, high-dimensional images. For example, a $1000 \times 1000 \times 3$ image has 3 million input features. Feeding such an input directly into a fully connected layer would require millions of parameters, leading to both computational intractability and a severe risk of overfitting.

  • To overcome these limitations, convolutional neural networks (CNNs) introduce a new paradigm for processing image data. Instead of treating every pixel independently, CNNs exploit two key properties of natural images:

  1. Local spatial correlation

    • Nearby pixels are strongly correlated and collectively form meaningful patterns such as edges, corners, and textures.
    • By restricting connections to local neighborhoods, CNNs capture these local structures efficiently.
  2. Translation invariance

    • Objects can appear at different positions within an image, but they share similar visual features.
    • CNNs leverage parameter sharing, where the same filter is applied across all positions, allowing feature detectors to generalize throughout the image.

Core principles of CNNs

  • At the heart of CNNs are three essential ideas:

    • Sparse connectivity: Each neuron connects only to a small local region of the input (the receptive field), unlike in fully connected layers.
    • Parameter sharing: The same convolutional filter (weights) is used across the entire input, drastically reducing the number of learnable parameters.
    • Hierarchical feature learning: Stacking multiple layers allows the network to build increasingly abstract features, from simple edges to textures to object parts, and finally to entire objects.

Motivation for CNN layers

  • CNNs are composed of specialized layers that each play a role in feature extraction and representation:

    • Convolutional layers apply learned filters to detect patterns such as edges and textures.
    • Activation functions (e.g., ReLU) introduce non-linearity, allowing networks to learn complex mappings.
    • Pooling layers downsample feature maps to reduce dimensionality and encourage robustness to small translations.
    • Fully connected layers (typically near the output) combine abstract features for classification or regression tasks.
  • This modular structure enables CNNs to achieve state-of-the-art performance in tasks such as image recognition, object detection, and segmentation.

Edge detection

  • Edge detection is a foundational operation in both classical image processing and modern convolutional neural networks (CNNs). Intuitively, edges correspond to spatial locations where the image intensity varies rapidly; mathematically, these are regions where spatial derivatives of the image are large in magnitude. Detecting such structure is useful because edges delineate object boundaries, reveal texture, and provide robust, low-level cues that downstream models can exploit.

  • From a signal-processing viewpoint, an image is a discrete function $I:\mathbb{Z}^2 \to \mathbb{R}$ (grayscale) or $\mathbb{R}^3$ (RGB). A linear, shift-invariant operator on images can be implemented by a discrete convolution with a small filter (kernel) $K\in\mathbb{R}^{f\times f}$. For a 2D image and a single filter, the valid discrete convolution at location $(i,j)$ is

    \[S(i,j) \;=\; (I \ast K)(i,j) \;=\; \sum_{u=0}^{f-1}\sum_{v=0}^{f-1} I(i+u,\,j+v)\,K(u,v),\]
    • where we use computer-vision convention (no kernel flip) so this is strictly cross-correlation; classical convolution flips $K$ both horizontally and vertically. In practice, the distinction is immaterial for learning because filters are learned jointly with the sign/orientation convention; nonetheless, it is good to be precise. With an input of spatial size $n_h\times n_w$ and a square filter of size $f\times f$, the spatial size of the valid output is $(n_h-f+1)\times (n_w-f+1)$.
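
To make this concrete, here is a minimal NumPy sketch of the valid cross-correlation above (the function and variable names are ours, for illustration only); it slides an $f\times f$ filter over a grayscale image and confirms the $(n_h-f+1)\times(n_w-f+1)$ output size.

```python
import numpy as np

def valid_cross_correlation(image, kernel):
    """Valid 'convolution' in the computer-vision convention (no kernel flip)."""
    n_h, n_w = image.shape
    f = kernel.shape[0]
    out = np.zeros((n_h - f + 1, n_w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise product of the filter with the underlying patch, then sum
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

image = np.random.rand(6, 6)                 # stand-in for a grayscale image
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)
S = valid_cross_correlation(image, vertical_edge)
print(S.shape)                               # (4, 4) == (6 - 3 + 1, 6 - 3 + 1)
```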

Why do filters detect edges?

  • If we approximate spatial derivatives by finite differences, then small, signed filters that compute differences between neighboring pixels produce large responses on intensity transitions and near-zero responses in flat regions. Classical, hand-crafted derivative filters such as Prewitt, Sobel, and Scharr implement this idea and differ chiefly in how they weight central versus peripheral pixels to trade off noise suppression for localization Prewitt (1970), Sobel and Feldman (1968), Scharr (2000).

  • The following figure introduces the local computation performed by a 2D filter: a sliding, elementwise product between the filter coefficients and the underlying image patch, followed by a sum. Larger (positive or negative) sums indicate stronger local agreement with the pattern encoded by the filter.

Figure 5.1.1: A convolution over a matrix is the element-wise product between a filter and an equally sized patch of the matrix, summed to a single value. The value from the convolution between the filter and the upper left-hand corner of the matrix corresponds to the top left entry in the result of the convolution.

  • The following figure catalogs common vertical and horizontal edge-detection filters. Each kernel emphasizes a different derivative approximation and smoothing scheme; for instance, Sobel incorporates a central row/column weighting that improves noise robustness relative to a simple difference, while Scharr’s coefficients are optimized for better rotational symmetry and isotropy.

Figure 5.1.3: Different types of horizontal and vertical filters that are used for edge detection

  • The following figure illustrates the qualitative effect of applying horizontal and vertical edge filters to a natural image. Dark pixels indicate strong positive responses (edges aligned with the filter’s preferred orientation), white pixels indicate strong negative responses (edges with opposite orientation), and mid-gray indicates weak or no detected edge. Such responses are often combined (e.g., by magnitude $\sqrt{G_x^2 + G_y^2}$) or thresholded to produce binary edge maps as in the classical Canny detector Canny (1986).

Figure 5.1.2: Resulting images after applying a horizontal and vertical filter

Practical considerations

  1. Normalization and dynamic range. Because convolution outputs are sums of products, their magnitudes can vary widely. It is common to normalize filter responses (e.g., by dividing by a constant or applying batch/instance normalization in a CNN) to stabilize downstream processing.

  2. Orientation selectivity. A single vertical (or horizontal) derivative kernel responds strongly to edges with that orientation. To capture edges at arbitrary orientations, one can use a bank of oriented filters (e.g., rotations of Sobel/Scharr), or compute gradient magnitude and orientation via $G_x$ and $G_y$, where $G_x = I \ast K_x$ and $G_y = I \ast K_y$. The gradient magnitude $\|\nabla I\|_2 = \sqrt{G_x^2 + G_y^2}$ and orientation $\theta = \arctan2(G_y, G_x)$ summarize edge strength and direction (a minimal sketch follows this list).

  3. Nonlinearity and thresholding. Classical pipelines apply non-maximum suppression and hysteresis thresholding (as in Canny (1986)) to thin edges and maintain connectivity. In CNNs, instead of hand-designed post-processing, learned nonlinearities (e.g., ReLU) and later layers learn to utilize raw filter responses.

  4. From hand-crafted to learned filters. In CNNs, filters are not fixed Sobel or Scharr kernels; they are learned via gradient descent to minimize a task loss. Nevertheless, it is a recurring empirical observation that the first convolutional layer of trained CNNs often learns edge- and orientation-selective filters reminiscent of Gabor-like derivatives LeCun et al. (1998), Krizhevsky et al. (2012), Simonyan and Zisserman (2015).

  5. Computational cost. For an input of size $n_h\times n_w$ and $C$ channels, convolving with $C'$ filters of spatial size $f\times f$ costs $O(n_h n_w \, C f^2 C')$ multiply–adds (for unit stride and same padding). This is a key motivator for architectural choices later in the chapter (e.g., 1×1 convolutions and separable filters) that reduce cost while preserving representational power.
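
As a companion to item 2 above, the following sketch (assuming NumPy and SciPy; the image here is a random stand-in) computes $G_x$, $G_y$, gradient magnitude, and orientation with explicit Sobel kernels.

```python
import numpy as np
from scipy.signal import correlate2d

# Sobel derivative kernels K_x and K_y (computer-vision convention, no flip)
K_x = np.array([[-1, 0, 1],
                [-2, 0, 2],
                [-1, 0, 1]], dtype=float)
K_y = K_x.T

image = np.random.rand(64, 64)            # stand-in for a grayscale image
G_x = correlate2d(image, K_x, mode='same')
G_y = correlate2d(image, K_y, mode='same')

magnitude = np.sqrt(G_x**2 + G_y**2)      # edge strength at each pixel
orientation = np.arctan2(G_y, G_x)        # edge direction in radians
```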

Mathematical summary

  • Given grayscale input $I\in\mathbb{R}^{n_h\times n_w}$, a filter $K\in\mathbb{R}^{f\times f}$, padding $p$, and stride $s$, the output spatial size is:
\[n_h^{\text{out}} \;=\; \left\lfloor \frac{n_h + 2p - f}{s} \right\rfloor + 1,\qquad n_w^{\text{out}} \;=\; \left\lfloor \frac{n_w + 2p - f}{s} \right\rfloor + 1.\]
  • For RGB input $I\in\mathbb{R}^{n_h\times n_w\times 3}$ and a filter bank $K\in\mathbb{R}^{f\times f\times 3\times C'}$, each output channel $c'$ is:
\[S(\cdot,\cdot,c') \;=\; \sum_{c=1}^{3} I(\cdot,\cdot,c) \ast K(\cdot,\cdot,c,c').\]
  • When using horizontal and vertical derivative filters $K_x$ and $K_y$ (e.g., Sobel), define:
\[G_x = I \ast K_x,\qquad G_y = I \ast K_y,\qquad \|\nabla I\|_2 = \sqrt{G_x^2 + G_y^2},\qquad \theta = \arctan2(G_y, G_x).\]
  • These quantities underlie classical edge maps and also provide intuition for what the earliest layers of CNNs are learning.

Connections to learning

  • In learned CNNs, the objective is to optimize filter coefficients $\{K\}$ by minimizing a task loss $\mathcal{L}(\Theta)$ over parameters $\Theta$. The gradient of the loss with respect to $K$ is obtained via backpropagation, which uses the fact that convolution is linear and its adjoint is correlation with the flipped kernel. This makes derivative computation efficient and enables end-to-end learning of edge-, texture-, and part-selective filters tuned to the dataset and task.

Takeaways

  • We formalized edge detection as local, linear operations implemented by small spatial filters. We reviewed classical derivative filters (Prewitt, Sobel, Scharr) and connected their behavior to gradient magnitude and orientation. We previewed how CNNs generalize these fixed filters by learning them from data, while retaining the same computational building block: discrete convolution (or correlation) with small kernels.

  • The following figure highlights the mechanics of patchwise multiply–accumulate; the second figure catalogs standard edge kernels; the third shows their qualitative effect on a real image. All three figures will reappear conceptually when we discuss learned filters in the first layer of CNNs.

  • The following figure demonstrates the sliding-window multiply–accumulate that produces each output pixel from a local image patch and a filter; this is the atomic computation that underlies every convolutional layer. Put simply, the value from the convolution between the filter and the upper left-hand corner of the matrix corresponds to the top left entry in the result of the convolution.

  • The following figure presents a toolbox of vertical and horizontal edge-detection kernels (standard, Sobel, and Scharr), each offering a different trade-off between noise suppression and localization of the edge response.

  • The following figure compares the original image to the responses of horizontal and vertical edge detectors, illustrating how positive and negative filter responses correspond to opposite edge orientations and how uniform regions yield near-zero responses.

Padding

  • As we stack multiple spatial convolutions, unpadded (valid) operations progressively shrink the spatial extent of the feature maps. This erosion complicates deep architectures, discards boundary information, and makes it difficult to align activations across layers. Padding remedies these issues by augmenting the input with additional border pixels before applying the convolution. The standard choice in modern CNNs is zero-padding, though other boundary conditions (reflect, replicate) are sometimes preferable to reduce edge artifacts Dumoulin & Visin (2016), Goodfellow, Bengio, Courville (2016).

Output sizing with padding and stride

  • Consider an input $a \in \mathbb{R}^{n_h \times n_w \times n_c}$ convolved with a bank of $n_c'$ filters $W \in \mathbb{R}^{f \times f \times n_c \times n_c'}$ using integer stride $s$ and symmetric zero padding of width $p$ on all sides. The output spatial dimensions are
\[n_h^{\text{out}} \;=\; \left\lfloor \frac{n_h + 2p - f}{s} \right\rfloor + 1,\qquad n_w^{\text{out}} \;=\; \left\lfloor \frac{n_w + 2p - f}{s} \right\rfloor + 1.\]
  • When $s=1$ and we desire to preserve spatial size (the common same convolution), solve $n_h^{\text{out}}=n_h$ for $p$:
\[n_h = \left\lfloor n_h + 2p - f \right\rfloor + 1 \;\Longrightarrow\; p = \frac{f-1}{2}.\]
  • Thus, same convolution requires odd $f$ so that $p$ is integral; e.g., $f=3 \Rightarrow p=1$, $f=5 \Rightarrow p=2$.
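
A minimal helper (ours, not from any library) encodes the output-size formula and confirms that $p=(f-1)/2$ preserves spatial size for odd $f$:

```python
def conv_output_size(n, f, p, s):
    """Spatial output size for input size n, filter f, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

# 'same' convolution: p = (f - 1) / 2 for odd f keeps the size unchanged
assert conv_output_size(32, f=3, p=1, s=1) == 32
assert conv_output_size(32, f=5, p=2, s=1) == 32
# 'valid' convolution (p = 0) shrinks the map
assert conv_output_size(32, f=3, p=0, s=1) == 30
```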

Why padding matters in deep nets

  1. Spatial alignment across depth. With $p=\tfrac{f-1}{2}$ and $s=1$, all layers share the same spatial grid, simplifying skip connections (as in ResNets) and multi-scale fusion (as in U-Net) without ad hoc cropping.

  2. Information preservation at boundaries. Valid convolutions discard border pixels that participate in fewer receptive fields. Padding ensures boundary evidence influences activations at early depths instead of being systematically underrepresented.

  3. Effective receptive field control. With padding, the nominal receptive field after $L$ same convolutions of sizes $f_1,\dots,f_L$ with unit stride is:

    \[R_L \;=\; 1 + \sum_{\ell=1}^{L} (f_\ell - 1),\]
    • growing linearly with depth while keeping feature-map size fixed.
  4. Compatibility with pooling/striding. When later layers downsample (via strides or pooling), preserving size beforehand yields cleaner divisibility and avoids off-by-one effects.

Choices of boundary handling

  • Let $B$ denote the padded band. Common schemes include:
\[\text{zero (constant)}:\; B=0;\quad \text{reflect}:\; B(i)=a(\text{mirror}(i));\quad \text{replicate}:\; B(i)=a(\text{clamp}(i)).\]
  • Zero-padding is ubiquitous in classification CNNs; reflect/replicate can reduce halo artifacts in dense prediction tasks (e.g., segmentation, super-resolution). Some libraries also support circular padding, which corresponds to a discrete torus and matches convolution in the discrete Fourier domain.
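
These boundary schemes map directly onto NumPy's padding modes; the sketch below is illustrative (NumPy's 'edge' mode corresponds to replicate and 'wrap' to circular):

```python
import numpy as np

a = np.arange(9, dtype=float).reshape(3, 3)
zero      = np.pad(a, pad_width=1, mode='constant', constant_values=0)  # zero padding
reflect   = np.pad(a, pad_width=1, mode='reflect')   # mirror without repeating the border
replicate = np.pad(a, pad_width=1, mode='edge')      # repeat the border pixel
circular  = np.pad(a, pad_width=1, mode='wrap')      # wrap around (discrete torus)
```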

Connection to gradient flow

  • During backpropagation, padding influences how gradients accumulate near borders. With zero-padding, gradients at border positions depend only on interior activations that used the padded band; reflect/replicate propagate boundary information more symmetrically. While learned networks typically adapt to the chosen scheme, consistent padding choices across training and inference are important to avoid distribution shift Dumoulin & Visin (2016).

  • The following figure explains how adding zeros around the image recovers the original spatial size after convolution, illustrating the same-convolution setting where $p=\tfrac{f-1}{2}$.

Figure 5.2.1: Adding padding to our image can produce an output of the same size as our input

  • Worked example: preserving size with $f=3$:

    • Let $n_h=n_w=H$, $f=3$, $s=1$, $p=1$. Then,
    \[n^{\text{out}} = \left\lfloor \frac{H + 2\cdot 1 - 3}{1} \right\rfloor + 1 = \left\lfloor H - 1 \right\rfloor + 1 = H,\]
    • confirming size preservation. Stacking $L$ such layers keeps the map $H\times H$ while increasing the effective receptive field to $R_L = 1 + 2L$.
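
As a quick numerical check (a sketch assuming PyTorch; the channel counts are arbitrary), stacking $L$ same convolutions with $f=3$, $p=1$, $s=1$ keeps the spatial size fixed while the nominal receptive field grows as $1+2L$:

```python
import torch
import torch.nn as nn

H, L = 32, 4
x = torch.randn(1, 3, H, H)       # (batch, channels, height, width)
layers, in_ch = [], 3
for _ in range(L):
    layers.append(nn.Conv2d(in_ch, 8, kernel_size=3, stride=1, padding=1))  # 'same' conv
    in_ch = 8
y = nn.Sequential(*layers)(x)     # shape check only; nonlinearities omitted
print(y.shape)                    # torch.Size([1, 8, 32, 32]) -- spatial size preserved
print(1 + 2 * L)                  # nominal receptive field R_L = 9
```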

Practical notes

  1. Odd kernels are convenient. Odd $f$ centers the kernel on a distinct pixel, simplifies padding to $p=\tfrac{f-1}{2}$, and yields symmetric receptive fields.

  2. Padding and normalization. When using BatchNorm or LayerNorm, be aware that zero-padded bands can slightly bias statistics near borders; large batches and random cropping mitigate this.

  3. Implementation detail. Many frameworks implement cross-correlation (no kernel flip) but still expose a “convolution” API. Shapes and padding formulas above remain valid as written in computer-vision convention.

Strided Convolutions

  • Thus far, we have assumed stride length $s=1$: the convolution kernel moves one pixel at a time horizontally or vertically across the input. Strided convolution generalizes this by moving the filter by steps of $s>1$. This modification reduces the spatial resolution of the output feature maps while enlarging the effective receptive field, much like downsampling or subsampling in classical signal processing Dumoulin & Visin (2016).

Mathematical formulation

  • Given input size $n_h \times n_w$, padding $p$, filter size $f$, and stride $s$, the output feature-map dimensions are
\[n_h^{\text{out}} = \left\lfloor \frac{n_h + 2p - f}{s} \right\rfloor + 1, \qquad n_w^{\text{out}} = \left\lfloor \frac{n_w + 2p - f}{s} \right\rfloor + 1.\]
  • When $s=2$, the filter is evaluated at every other spatial position, so each output dimension is roughly halved (up to boundary effects). Thus strided convolutions simultaneously extract features and reduce resolution.

Interpretation

  1. Connection to pooling. A stride-$s$ convolution with filter size $f$ and learned weights can be seen as a learned form of downsampling. Unlike max pooling or average pooling (which have fixed aggregation functions), strided convolutions learn the aggregation via filter parameters.

  2. Computational advantage. Larger strides reduce the number of sliding positions and thus the multiply–add count. For example, with stride $s=2$, the number of convolution positions is reduced by roughly a factor of four compared to stride 1.

  3. Enlarged receptive field. Because each successive layer covers more of the original input, strided convolutions allow deeper networks to capture long-range dependencies without an explosion in filter size.

Constraints and valid filters

  • When using stride $s > 1$, only those filter placements that fully lie inside the padded input are considered. Partial overlaps at the boundary are not permitted. This restriction means certain configurations of input size, filter size, and stride can leave unused border pixels. In practice, architectures are designed so that dimensions divide evenly (e.g., with powers of 2).

  • The following figure illustrates this restriction: filters that run past the boundary are invalid and are excluded from computation. Only complete filter placements that fit entirely within the image grid contribute to the output feature map.

Figure 5.3.1: An invalid filter placement: filters that run over the side of the image are not used; only filters that cover a complete $f\times f$ grid contribute to the output

Worked example

  • Suppose an input of size $7\times 7$, with $f=3$, $p=0$, $s=2$. Then,
\[n_h^{\text{out}} = \left\lfloor \frac{7 - 3}{2} \right\rfloor + 1 = \left\lfloor 2 \right\rfloor + 1 = 3.\]
  • Thus the output is $3 \times 3$. Compared to the $5 \times 5$ output that would result with stride 1, we have reduced resolution while maintaining feature extraction.
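
The same numbers can be reproduced directly (a sketch assuming PyTorch; the filter weights are random since only the output shape matters here):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 7, 7)            # 7x7 single-channel input
w = torch.randn(1, 1, 3, 3)            # one 3x3 filter
out = F.conv2d(x, w, stride=2, padding=0)
print(out.shape)                       # torch.Size([1, 1, 3, 3])
out_s1 = F.conv2d(x, w, stride=1, padding=0)
print(out_s1.shape)                    # torch.Size([1, 1, 5, 5])
```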

Connections to modern architectures

  • Early CNNs (e.g., LeNet, AlexNet) relied on pooling for downsampling; later designs often replaced pooling with stride-2 convolutions.
  • Strided convolutions appear in generative models as well, though typically in transposed form (a.k.a. fractionally strided convolutions or deconvolutions).
  • Many state-of-the-art architectures alternate between stride-1 convolutions for feature extraction and stride-2 convolutions for resolution reduction, mimicking the pyramid-like progression from fine detail to coarse semantics.

Cross-Correlation vs. Convolution

  • Up to now, we have described the convolution operation in the computer-vision sense: sliding a filter across the image, multiplying, and summing without flipping the filter. Strictly speaking, this operation is cross-correlation, not convolution, according to the definition in classical signal processing.

Classical convolution

  • For a 2D image $I$ and a filter $K$ of size $f \times f$, the mathematical convolution is defined as
\[S(i,j) \;=\; (I \ast K)(i,j) \;=\; \sum_{u=0}^{f-1} \sum_{v=0}^{f-1} I(i+u,\,j+v)\, K(f-1-u,\, f-1-v).\]
  • Notice that the filter $K$ is flipped both horizontally and vertically before multiplication. This flipping is intrinsic to the definition of convolution in linear systems and ensures certain algebraic properties, such as associativity and commutativity with respect to shifts, which are essential in Fourier analysis.

Cross-correlation

  • In contrast, the operator typically used in computer vision (and implemented in deep learning frameworks such as TensorFlow and PyTorch) is:

    \[S(i,j) \;=\; (I \star K)(i,j) \;=\; \sum_{u=0}^{f-1} \sum_{v=0}^{f-1} I(i+u,\, j+v)\, K(u,v),\]
    • where the filter is not flipped. This operator is called cross-correlation in signal processing.
  • Despite the formal difference, in practice it makes no difference for deep learning: filters are learned by gradient descent, so whether we flip them or not, the model simply learns the appropriate parameters. Thus, by convention, the community continues to call this operation “convolution,” even though it is mathematically cross-correlation.

  • The following figure illustrates the key difference between convolution and cross-correlation: the true convolution requires flipping the filter horizontally and vertically, whereas cross-correlation uses the filter as-is.

Figure 5.4.1: With a traditional convolution operation, we first flip the filter horizontally and vertically

Implications

  1. Terminological mismatch. In deep learning, “convolution” almost always means cross-correlation. This is why comparing CNNs to classical signal-processing texts can be confusing.

  2. Fourier domain. Convolution corresponds to multiplication in the Fourier domain, whereas cross-correlation corresponds to multiplication with the conjugated filter spectrum. This distinction matters in analysis but not in training CNNs, since filters are not pre-specified.

  3. Implementation. Most libraries optimize cross-correlation because it avoids flipping and is slightly more efficient. If true convolution is required (e.g., to simulate physical processes), the filter must be explicitly flipped.
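
The relationship is easy to verify numerically: flipping the kernel along both spatial axes turns cross-correlation into true convolution (a minimal sketch using SciPy; the arrays are random placeholders):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

I = np.random.rand(5, 5)
K = np.random.rand(3, 3)

conv = convolve2d(I, K, mode='valid')                    # true convolution (flips K internally)
corr_flipped = correlate2d(I, np.flip(K), mode='valid')  # cross-correlation with explicitly flipped K
assert np.allclose(conv, corr_flipped)
```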

Convolutions over Volume

  • Up to this point, we have considered grayscale images represented as 2D arrays of pixel intensities. However, most real-world images are colored and represented in three channels: red, green, and blue (RGB). This requires us to generalize the convolution operator from two dimensions to three dimensions (volume).
  • Convolutions over volume allow CNNs to process colored images and general multi-channel inputs by treating channels jointly within each filter. With multiple filters, CNNs produce rich, multi-dimensional feature representations that capture a wide variety of patterns.

Convolution with channels

  • Let the input be $I \in \mathbb{R}^{n_h \times n_w \times n_c}$, where $n_h$ and $n_w$ are the spatial dimensions, and $n_c$ is the number of channels (for RGB, $n_c=3$). A filter for such an input must also span the channel dimension. Specifically, a filter has size $f \times f \times n_c$, where $f$ is the spatial size. Each slice of the filter operates on one channel of the input, and the results are summed across channels:
\[S(i,j) = (I \ast K)(i,j) = \sum_{c=1}^{n_c} \sum_{u=0}^{f-1} \sum_{v=0}^{f-1} I(i+u,\, j+v,\, c)\, K(u,v,c).\]
  • Thus, each output value is influenced by all channels in the input patch. This ensures that color interactions are preserved.

  • The following figure illustrates convolutions of an RGB image – how a single RGB filter applies separately to each channel of the image, before combining results into a single activation map.
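
A minimal NumPy sketch (function and variable names are ours) makes the channel summation explicit: a single $f \times f \times n_c$ filter produces one 2D activation map from an RGB input.

```python
import numpy as np

def conv_volume(I, K):
    """Valid convolution of an (n_h, n_w, n_c) input with one (f, f, n_c) filter."""
    n_h, n_w, n_c = I.shape
    f = K.shape[0]
    out = np.zeros((n_h - f + 1, n_w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise product over the full f x f x n_c block, summed across channels
            out[i, j] = np.sum(I[i:i + f, j:j + f, :] * K)
    return out

rgb = np.random.rand(6, 6, 3)
filt = np.random.rand(3, 3, 3)
print(conv_volume(rgb, filt).shape)   # (4, 4) -- a single 2D activation map
```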

Multiple filters and feature maps

  • In practice, we do not use a single filter but a bank of filters. Each filter is trained to detect a different pattern or feature in the input: edges, textures, color contrasts, or higher-level motifs. Suppose we apply $n_c'$ such filters. Then the output volume is

    \[\text{Output size: } n_h^{\text{out}} \times n_w^{\text{out}} \times n_c',\]
    • where $n_c'$ is the number of filters. Each slice along the depth of this output corresponds to the response of one learned filter. In other words, while a single convolution reduces across input channels, multiple filters extend along a new output-channel dimension.
  • The following figure demonstrates this stacking process, where each learned filter produces one feature map, and the set of all feature maps form a 3D output tensor.

Figure 5.5.2: Using a convolution with $n_c'$ filters produces an output with $n_c'$ channels, each channel corresponding to the result of one filter's convolution

Interpretation

  1. Channel mixing. By spanning all input channels, convolutional filters learn cross-channel correlations (e.g., red–green differences that correspond to color edges).

  2. Depth as features. The number of filters $n_c'$ determines the depth of the feature representation. Shallow CNNs may start with dozens of filters; modern deep networks can employ hundreds or thousands at later layers.

  3. Learned hierarchies. In early layers, filters often capture primitive patterns (edges, color blobs). In deeper layers, filters respond to increasingly complex features (textures, object parts, semantic regions) Krizhevsky et al., 2012.

Computational cost

  • For a single filter, the number of parameters is $f \times f \times n_c$. With $n_c'$ filters, the total number of learnable weights is $f \times f \times n_c \times n_c'$. Each output activation requires $O(f^2 \cdot n_c)$ multiply–adds. This growth in parameter size motivates architectural innovations such as 1×1 convolutions (introduced later in Section 5.9.5) that reduce dimensionality while preserving representational power.
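
For concreteness, a back-of-the-envelope calculation (the layer sizes below are illustrative choices, not values from the text):

```python
# cost of one convolutional layer (unit stride, 'same' padding), pure arithmetic
n_h, n_w = 224, 224                              # example spatial size
f, n_c, n_c_out = 3, 3, 64                       # example filter and channel counts
weights = f * f * n_c * n_c_out                  # learnable weights: 1,728
mult_adds = n_h * n_w * n_c * f * f * n_c_out    # ~86.7 million multiply-adds
print(weights, mult_adds)
```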

One-Layer Convolutional Network

  • Having defined convolutions over volume, we now integrate them into the neural network framework. A convolutional layer combines linear convolution operations with bias terms and a non-linear activation function, forming the fundamental computational block of convolutional neural networks (CNNs).

Linear activation

  • For a filter bank $W^{[l]} \in \mathbb{R}^{f \times f \times n_c^{[l-1]} \times n_c^{[l]}}$ at layer $l$, the pre-activation produced by filter $c'$ at a given spatial location is

    \[z^{[l]}(i,j,c') = \sum_{c=1}^{n_c^{[l-1]}} \sum_{u=0}^{f-1} \sum_{v=0}^{f-1} a^{[l-1]}(i+u, j+v, c)\, W^{[l]}(u,v,c,c') + b^{[l]}(c'),\]
    • where:

      • $a^{[l-1]}$ is the activation volume from the previous layer,
      • $b^{[l]}(c')$ is the bias term associated with filter $c'$,
      • $z^{[l]}(i,j,c')$ is the scalar linear activation before nonlinearity.

Non-linear activation

  • To introduce non-linearity and enable hierarchical feature extraction, we apply an elementwise activation function $g(\cdot)$. A common choice is the Rectified Linear Unit (ReLU):
\[a^{[l]}(i,j,c') = g\big(z^{[l]}(i,j,c')\big) = \max(0, z^{[l]}(i,j,c')).\]
  • Thus, the convolutional layer’s output is a 3D volume of shape

    \[n_h^{[l]} \times n_w^{[l]} \times n_c^{[l]},\]
    • where $n_c^{[l]}$ is the number of filters at that layer.
  • The following figure illustrates the forward pass through a one-layer convolutional network: convolutional filters act as learned weights, bias is added, and activations are passed through ReLU.

Figure 5.6.1: One layer of a convolutional neural network
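
The same forward pass can be written in a few lines (a sketch assuming PyTorch; the channel counts and input size are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)                       # a^{[l-1]}: one RGB input of size 32x32
conv = nn.Conv2d(in_channels=3, out_channels=8,
                 kernel_size=3, padding=1)          # holds W^{[l]} and b^{[l]}
z = conv(x)                                         # linear activation z^{[l]} (conv + bias)
a = F.relu(z)                                       # a^{[l]} = max(0, z^{[l]})
print(a.shape)                                      # torch.Size([1, 8, 32, 32])
```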

Dimensions recap

  • For layer $l$:

    • Filter size: $f^{[l]}$
    • Padding: $p^{[l]}$
    • Stride: $s^{[l]}$
    • Number of filters: $n_c^{[l]}$
  • If the input to layer $l$ has dimensions $n_h^{[l-1]} \times n_w^{[l-1]} \times n_c^{[l-1]}$, then the output dimensions are

\[n_h^{[l]} = \left\lfloor \frac{n_h^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} \right\rfloor + 1,\] \[n_w^{[l]} = \left\lfloor \frac{n_w^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} \right\rfloor + 1,\] \[n_c^{[l]} = \text{number of filters}.\]

Parameter count

  • The number of learnable parameters in this layer is:
\[\underbrace{f^{[l]} \times f^{[l]} \times n_c^{[l-1]}}_{\text{per filter}} \times n_c^{[l]} \;+\; n_c^{[l]} \quad \text{(bias terms)}.\]
  • This is typically much smaller than in a fully connected layer, since the filter size $f^{[l]}$ is small and parameter sharing applies across spatial locations.
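
This count is easy to verify against a framework's own bookkeeping (a sketch assuming PyTorch; the sizes match the $5\times5\times3$, 6-filter example used later in this chapter):

```python
import torch.nn as nn

f, n_c_prev, n_c_l = 5, 3, 6
conv = nn.Conv2d(n_c_prev, n_c_l, kernel_size=f)         # bias=True by default
n_params = sum(p.numel() for p in conv.parameters())
assert n_params == f * f * n_c_prev * n_c_l + n_c_l      # 5*5*3*6 + 6 = 456
```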

Intuition

  • Each filter acts as a feature detector, responding strongly to specific patterns (edges, textures, color contrasts).
  • Applying the filter across the whole input exploits parameter sharing: the same filter detects the same pattern anywhere in the image.
  • Sparse connectivity means each output depends only on a local region (the receptive field), making the computation efficient and spatially localized.

Pooling Layers

  • Pooling layers are a crucial component of convolutional neural networks (CNNs). They provide a way to reduce the spatial size of feature maps while preserving the most important information, thereby controlling overfitting, reducing computational cost, and introducing a degree of translation invariance. Unlike convolutional layers, pooling layers contain no learnable parameters—they apply a fixed aggregation function over local neighborhoods.

Mechanics of pooling

  • A pooling layer is defined by:

    • Filter size $f$: the spatial extent of the pooling window (e.g., $f=2$).
    • Stride $s$: how far the window moves across the input.
  • For each $f \times f$ window of the input, the pooling operation computes a single summary statistic, producing a downsampled output.

  • The two most common forms are:

    1. Max pooling
    \[a^{[l]}(i,j,c) = \max_{0 \le u,v < f} a^{[l-1]}(s i + u, \, s j + v, \, c).\]
    • It outputs the maximum activation in each window.
    2. Average pooling
    \[a^{[l]}(i,j,c) = \frac{1}{f^2} \sum_{u=0}^{f-1} \sum_{v=0}^{f-1} a^{[l-1]}(s i + u, \, s j + v, \, c).\]
    • It outputs the mean activation in each window.
  • The following figure compares the two methods on a small example.

Figure 5.7.1: 2D pooling with the hyperparameters $f=2$ and $s=2$
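
Both operations are available as standard layers; a minimal sketch (assuming PyTorch; the tensor sizes are arbitrary) shows that $f=2$, $s=2$ pooling halves the spatial dimensions while leaving the channel count unchanged:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 4, 4)
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
print(max_pool(x).shape)   # torch.Size([1, 3, 2, 2]) -- channels unchanged
print(avg_pool(x).shape)   # torch.Size([1, 3, 2, 2])
```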

Intuition

  • Max pooling keeps the strongest feature response, acting like a detector: “did this feature appear in this region?”
  • Average pooling captures the overall presence of features, but can blur strong signals.
  • Empirically, max pooling is preferred because it tends to preserve discriminative features more effectively, especially in classification tasks.

3D pooling

  • For multi-channel inputs, pooling is applied independently on each channel. Thus, the number of channels remains the same before and after pooling:
\[n_c^{[l]} = n_c^{[l-1]}.\]
  • Each channel produces its own downsampled activation map.

Advantages of pooling

  1. Dimensionality reduction: Reduces the size of feature maps, lowering computation in subsequent layers.
  2. Translation invariance: A small shift in the input often yields the same pooled output, making models more robust to small translations.
  3. No parameters to learn: Pooling is deterministic, making it simple and efficient.
  4. Hierarchical abstraction: By shrinking representations, pooling helps higher layers focus on more abstract features.

Modern perspective

  • While pooling was dominant in early CNNs (e.g., LeNet, AlexNet, VGG), some newer architectures (e.g., ResNets, Vision Transformers) reduce reliance on pooling, instead using strided convolutions for downsampling. Nevertheless, pooling remains an important concept and is still used in many architectures.

Why We Use Convolutions

  • To fully appreciate convolutional neural networks (CNNs), it is helpful to compare them with fully connected (dense) networks applied directly to image data. At first glance, one might attempt to classify images by flattening all pixels into a vector and feeding them into a fully connected layer. However, this approach quickly becomes computationally infeasible and prone to overfitting as image sizes increase. Convolutions address these issues through parameter sharing and sparse connectivity.

  • Convolutions allow CNNs to leverage the structure of images: local connectivity, translational invariance, and feature reuse. This is the reason convolutional architectures dominate computer vision tasks, from digit recognition (LeNet) to large-scale image classification (AlexNet, VGG, ResNet, Inception).

Fully connected baseline

  • Consider an image of size $32 \times 32 \times 3$ (height × width × channels). Flattening yields a vector of length $3072$. Suppose we wish to map this input to an output volume of size $28 \times 28 \times 6 = 4704$.

  • Using a fully connected layer, the weight matrix would have

\[3072 \times 4704 \approx 14.5 \, \text{million parameters}.\]
  • Such a network is not only computationally expensive, but also very likely to overfit without enormous amounts of data.

Convolutional alternative

  • Now suppose instead we use a convolutional layer with:

    • Filter size $f = 5 \times 5$,
    • Input depth $n_c = 3$,
    • Number of filters $n_c' = 6$.
  • Then the parameter count is:

    \[5 \times 5 \times 3 \times 6 = 450 \, \text{parameters},\]
    • plus 6 bias terms.
  • This is a reduction from roughly 14.5 million down to 456 learnable parameters.
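
The comparison is pure arithmetic and can be reproduced in two lines (matching the numbers above):

```python
# fully connected vs. convolutional parameter counts for the example above
fc_params   = 3072 * 4704          # 14,450,688 (~14.5 million)
conv_params = 5 * 5 * 3 * 6 + 6    # 456 (450 weights + 6 biases)
print(fc_params, conv_params)
```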

Two key ideas

  1. Parameter sharing

    • A filter is applied across the entire image, reusing the same weights at every spatial location.
    • This drastically reduces the number of parameters and allows the model to detect the same feature (e.g., an edge) anywhere in the image.
  2. Sparsity of connections

    • Each output value depends only on a small local region of the input (the receptive field).
    • Unlike fully connected layers, there is no direct dependence on all pixels simultaneously.
    • This reflects the locality principle in natural images: nearby pixels are more strongly correlated than distant ones.

Implications

  • Statistical efficiency: Fewer parameters mean fewer examples are required to train effectively.
  • Generalization: Because features are detected across the whole image, CNNs generalize better to new positions and contexts.
  • Scalability: Convolutional layers can handle large input images without exploding parameter counts.

Classic Networks

  • Designing the architecture of a neural network is often an empirical process. While the principles of convolution, pooling, and nonlinearity guide design, many breakthroughs have come from carefully constructed architectures that pushed performance on benchmark datasets. In this section, we review several historically important CNN architectures that shaped the field: LeNet-5, AlexNet, VGG-16, ResNet, and Inception (GoogLeNet).

5.9.1 LeNet-5

  • LeNet-5, introduced by LeCun et al., 1998, was one of the earliest convolutional neural networks, designed primarily for character recognition (e.g., MNIST digits).

  • Architecture: Alternating convolutional and subsampling (average pooling) layers, followed by fully connected layers.
  • Parameters: About 60,000 learnable parameters — small by modern standards.
  • Key idea: Increase the number of feature maps while reducing spatial resolution, so deeper layers capture more abstract features.
  • Historical impact: Demonstrated the viability of CNNs long before the era of massive GPUs and datasets.

  • The following figure shows the original LeNet-5 architecture, which consists of convolutional, subsampling, and fully connected layers, ending in a 10-class output.

5.9.2 AlexNet

  • AlexNet, introduced by Krizhevsky et al., 2012, marked the modern deep learning revolution by winning the 2012 ImageNet competition by a large margin.

  • Architecture: Similar to LeNet but much deeper, with 5 convolutional layers and 3 fully connected layers.
  • Parameters: ~60 million.
  • Key innovations:

    • Use of the ReLU activation function (faster convergence than sigmoid/tanh).
    • GPU-based training to scale computation.
    • Dropout for regularization.
  • Impact: Sparked a surge in research toward deep learning and CNNs for computer vision.

  • The following figure illustrates the AlexNet architecture, which expanded LeNet’s ideas into a much deeper and wider model.

5.9.3 VGG-16

  • VGG-16, introduced by Simonyan & Zisserman, 2015, became influential due to its simplicity and depth.

    • Architecture: 16 layers deep, using only $3 \times 3$ convolution filters with stride 1 and $2 \times 2$ max pooling with stride 2.
    • Parameters: ~138 million, making it one of the largest networks of its time.
    • Key insight: Stacking many small filters can approximate larger receptive fields while reducing parameters and improving generalization.
    • Impact: Provided a clean, uniform design that became a standard backbone in many applications (e.g., object detection, style transfer).
  • The following figure presents the layer progression of VGG-16, from input to convolution, pooling, and fully connected layers, ending in classification.

5.9.4 ResNet

  • Residual Networks (ResNets), introduced by He et al., 2015, solved a critical challenge in deep learning: training very deep networks.

  • Problem: As networks became deeper, adding layers often led to degradation—higher training error, not just overfitting.
  • Solution: Introduce skip connections (identity mappings) so information and gradients can flow across layers.
  • Residual block:

    \[a^{[l+2]} = g(z^{[l+2]} + a^{[l]}),\]
    • where the input is added to the output of two stacked layers (a minimal code sketch appears after the figures below).
  • Impact: Enabled training of networks with 50, 101, or even 152 layers, leading to major breakthroughs in ImageNet and beyond.

  • The following figures illustrate the difference between plain networks and residual networks, showing how skip connections help prevent degradation as depth increases.

Figure 5.9.4: Plain network.
Figure 5.9.5: Residual block.
Figure 5.9.6: Error rates for plain networks vs. ResNets (He et al., 2015).
Figure 5.9.7: Comparison of a 34-layer ResNet vs. 34-layer plain net and VGG-19 (He et al., 2015).
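
A residual block matching the equation above can be sketched in a few lines (assuming PyTorch; batch normalization is omitted for brevity and the channel count is an illustrative choice, not taken from a specific figure):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two stacked 3x3 convolutions with an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, a):
        z = self.conv2(F.relu(self.conv1(a)))   # z^{[l+2]}
        return F.relu(z + a)                    # a^{[l+2]} = g(z^{[l+2]} + a^{[l]})

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)               # torch.Size([1, 64, 56, 56])
```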

5.9.5 1×1 Convolution

  • While a $1 \times 1$ convolution may seem trivial in 2D, in multi-channel inputs it has significant utility:

    • Each filter spans all channels but covers only one spatial location.
    • Output is a weighted combination of channels, enabling dimensionality reduction or expansion.
    • This allows deeper models with fewer parameters and faster computation.
  • This idea, often called “Network-in-Network,” was popularized by Lin et al., 2013.

  • The following figure shows how a $1 \times 1$ convolution operates on input channels, producing new channel combinations.
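
A minimal sketch (assuming PyTorch; the channel counts are illustrative) shows a $1 \times 1$ convolution reducing 192 channels to 32 while leaving the spatial grid untouched:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 192, 28, 28)                     # 192 input channels
reduce = nn.Conv2d(192, 32, kernel_size=1)          # 1x1 conv: mixes channels only
print(reduce(x).shape)                              # torch.Size([1, 32, 28, 28])
print(sum(p.numel() for p in reduce.parameters()))  # 192*32 + 32 = 6176 parameters
```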

5.9.6 Inception Network (GoogLeNet)

  • Introduced by Szegedy et al., 2015, the Inception architecture (nicknamed GoogLeNet) aimed to let the model learn the best hyperparameters automatically rather than manually choosing filter sizes.

  • Key idea: At each layer, apply filters of multiple sizes ($1 \times 1$, $3 \times 3$, $5 \times 5$) as well as max pooling, then concatenate their outputs.
  • Challenge: Large filters (e.g., $5 \times 5$) are computationally expensive.
  • Solution: Precede them with $1 \times 1$ convolutions for dimensionality reduction, drastically reducing cost (see the sketch after the figures below).
  • Impact: Showed how networks could “decide” which receptive field size works best at each layer.

  • The following figures illustrate the inception module (naïve vs. dimension-reduced version) and the full GoogLeNet architecture.

Figure 5.9.9: Inception layer (Szegedy et al., 2015).
Figure 5.9.10: GoogLeNet architecture (Szegedy et al., 2015).
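
A simplified inception module along these lines might look as follows (a sketch assuming PyTorch; the branch channel counts are illustrative, not GoogLeNet's actual configuration):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pooling branches, concatenated along channels."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 32, kernel_size=1),   # 1x1 reduction
                                nn.Conv2d(32, 64, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, kernel_size=1),   # 1x1 reduction
                                nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        # all branches preserve the spatial size, so their outputs can be concatenated
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192)(x).shape)   # torch.Size([1, 192, 28, 28]): 64+64+32+32 channels
```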

  • As a humorous aside, the authors cited a meme (from the film Inception) as inspiration for the network’s name.

  • Figure 5.9.11: Meme cited in the original Inception paper as inspiration for the name.

Summary

  • LeNet-5: Demonstrated early CNN feasibility.
  • AlexNet: Sparked the deep learning revolution with large-scale image classification.
  • VGG-16: Showed that deep, uniform architectures with small filters are effective.
  • ResNet: Addressed the degradation (and vanishing-gradient) problem with skip connections, enabling very deep models.
  • Inception: Innovated by combining multiple filter sizes and dimensionality reduction.
  • Each of these architectures introduced fundamental ideas that remain central in modern CNN design.

Competitions and Benchmarks

  • Convolutional neural networks (CNNs) have evolved alongside large-scale competitions and benchmarks, which provided standardized datasets and clear performance metrics. These benchmarks drove the development of increasingly sophisticated architectures and training strategies. The most influential has been the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which annually evaluated classification and detection performance on millions of labeled images.

Benchmark-driven progress

  • ImageNet (ILSVRC, 2010–2017): A dataset with 1.2 million training images and 1,000 classes. Success on ImageNet became the gold standard for computer vision models.
  • CNN architectures such as AlexNet (2012), VGG (2014), GoogLeNet (2014), and ResNet (2015) each established new state-of-the-art performance on ImageNet, often with error-rate reductions of several percentage points.
  • Beyond ImageNet, other datasets such as CIFAR-10/100, COCO, and Pascal VOC have served as benchmarks for smaller-scale tasks and object detection.

Test-time strategies

  • To gain competitive advantage on benchmarks, researchers have employed methods that are rarely used in practice due to computational cost:

    1. Ensembling

      • Multiple trained models (often 3–15 CNNs) are evaluated on the same test data.
      • Predictions are combined either by majority vote (for classification) or by averaging (for regression).
      • While this often improves performance, it multiplies inference cost and memory requirements.
    2. Multi-crop evaluation

      • Instead of passing a single crop of the input image, multiple crops (e.g., center, corners, and mirrored versions) are used.
      • Each crop is fed through the model, and the results are averaged.
      • This reduces sensitivity to cropping and improves robustness, but significantly increases inference time.
  • The following figure illustrates the 10-crop technique, where an image is mirrored and multiple different crops are taken from both the original and mirrored versions.

Figure 5.10.1: 10-crop includes mirroring an image and taking different croppings of both the image and its mirroring.
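
Schematically, ensembling and multi-crop evaluation both reduce to averaging predictions over several forward passes; the sketch below (assuming PyTorch; `models` and `crops` are hypothetical placeholders for trained networks and preprocessed image crops) illustrates the idea:

```python
import torch

def ensemble_multicrop_predict(models, crops):
    """Average softmax predictions over an ensemble of models and a set of image crops."""
    probs = []
    with torch.no_grad():
        for model in models:
            for crop in crops:                      # e.g., 10-crop: 5 crops x {original, mirrored}
                logits = model(crop.unsqueeze(0))   # add a batch dimension
                probs.append(torch.softmax(logits, dim=1))
    return torch.stack(probs).mean(dim=0)           # averaged class probabilities

# toy demo: a trivial 'model' and random crops, just to exercise the function
toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 1000))
crops = [torch.randn(3, 224, 224) for _ in range(10)]
print(ensemble_multicrop_predict([toy_model], crops).shape)   # torch.Size([1, 1000])
```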

Practical considerations

While ensembling and multi-crop evaluations are powerful, they are computationally expensive and rarely deployed in real-world production systems, where latency and efficiency are critical. Instead, practitioners favor:

  • Single models optimized for inference speed.
  • Data augmentation during training (rather than test-time cropping).
  • Model compression and quantization to reduce runtime cost.

Citation

If you found our work useful, please cite it as:

@article{Chadha2020CNNs,
  title   = {Convolutional Neural Networks},
  author  = {Chadha, Aman},
  journal = {Distilled Notes for Stanford CS230: Deep Learning},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}