Primers • Clustering
- Overview of Clustering
- Types of Clustering
- FAQ: What are some methods that do not require a predefined \(k\) value for clustering?
- Hierarchical Clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- OPTICS (Ordering Points To Identify the Clustering Structure)
- Mean-Shift Clustering
- Affinity Propagation
- Gaussian Mixture Models (GMM) with Model Selection
- Self-Organizing Maps (SOMs)
- Spectral Clustering with Eigen Gap Analysis
- Birch (Balanced Iterative Reducing and Clustering Using Hierarchies)
- Autoencoders for Clustering
- Comparison of Approaches
- Comparison: k-Nearest Neighbors vs. k-Means Clustering
- Citation
Overview of Clustering
- Clustering is an unsupervised machine learning technique used to group data points into clusters based on their similarity or dissimilarity. It identifies patterns or structures in a dataset without predefined labels, making it particularly useful in exploratory data analysis. Applications include market segmentation, image segmentation, document categorization, anomaly detection, and more.
Types of Clustering
- Clustering techniques vary widely, but they generally aim to reveal the underlying structure of data by grouping similar items together. Some methods focus on flat partitions, while others explore hierarchical relationships among data points.
- Partitioning Clustering: These methods divide data into non-overlapping subsets (clusters) such that each data point belongs to exactly one cluster. K-means and K-medoids are popular examples of this type, where the goal is to optimize an objective function like minimizing within-cluster variance.
- Hierarchical Clustering: This approach organizes data into a tree-like structure, where clusters are nested within larger clusters. It can be further categorized into:
- Agglomerative Clustering (Bottom-Up): Starts with each data point as its own cluster and merges clusters iteratively.
- Divisive Clustering (Top-Down): Starts with all data points in a single cluster and splits them recursively into smaller clusters.
- Density-Based Clustering: These techniques, like DBSCAN and OPTICS, group data points based on density. Clusters are formed where the data is dense, and points in low-density regions are considered noise.
- Grid-Based Clustering: This method divides the data space into a grid structure and forms clusters based on the density of points within the grid cells. Examples include STING and CLIQUE.
- Model-Based Clustering: Assumes that data is generated from a mixture of underlying probability distributions. Gaussian Mixture Models (GMMs) are a typical example, where each cluster corresponds to a Gaussian distribution.
Partitioning Clustering
- Partitioning clustering is a popular approach in unsupervised learning used to divide a dataset into \(k\) distinct, non-overlapping clusters. Each data point is assigned to exactly one cluster, with the goal of optimizing a specific objective function, such as minimizing the sum of squared distances between data points and their respective cluster centers. This method is particularly effective when the number of clusters is known or can be reasonably estimated.
- Partitioning clustering is highly versatile, making it suitable for a wide range of datasets. Different techniques within this framework offer unique advantages. For instance, K-means is favored for its simplicity and computational efficiency, while K-medoids is more robust to outliers. Additionally, fuzzy clustering methods, such as Fuzzy C-means, are useful for capturing overlaps and handling uncertainty in cluster assignments. The choice of method depends on the characteristics of the data and the specific objectives of the clustering task.
Overview
- Approach: Flat clustering (as opposed to hierarchical methods).
- Objective: Assign data points to clusters to optimize a criterion, typically the within-cluster variance or dissimilarity.
- Examples:
- K-means Clustering: Uses centroids as cluster representatives.
- K-medoids Clustering (PAM): Uses medoids (actual data points) as representatives.
- Variants include fuzzy clustering (e.g., Fuzzy C-means) where points can belong to multiple clusters with varying degrees.
K-Means Clustering
Algorithm
- Initialization:
- Select \(k\), the number of clusters.
- Randomly initialize \(k\) cluster centroids or choose them from the data points.
- Assignment Step:
- Assign each data point \(x_i\) to the nearest cluster based on a distance metric (typically Euclidean distance): \(\text{Cluster}(x_i) = \arg\min_{j} \|x_i - \mu_j\|^2\) where \(\mu_j\) is the centroid of the \(j\)-th cluster.
- Update Step:
- Recalculate the centroids of each cluster as the mean of all points assigned to the cluster: \(\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x\)
- Repeat:
- Alternate between the assignment and update steps until centroids converge or a stopping criterion is met (e.g., maximum iterations or minimal change in centroids).
Advantages
- Simple and fast, with a time complexity of \(O(n \cdot k \cdot t)\), where \(n\) is the number of data points, \(k\) is the number of clusters, and \(t\) is the number of iterations.
- Works well for compact, spherical clusters.
Disadvantages
- Requires specifying \(k\) in advance.
- Sensitive to outliers and initialization.
- Assumes clusters are convex and isotropic (spherical).
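- To make the loop above concrete, below is a minimal NumPy sketch of the assignment/update iteration (an illustrative implementation, not a library API); it initializes centroids from random data points, which is exactly the initialization sensitivity noted above.

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: initialization, assignment, and update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # init from data points
    for _ in range(n_iter):
        # assignment step: nearest centroid under Euclidean distance
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # update step: centroid = mean of assigned points (old centroid kept if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centroids - centroids) < tol:     # convergence check
            break
        centroids = new_centroids
    return centroids, labels
```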
K-Medoids Clustering (PAM)
- K-medoids, or Partitioning Around Medoids, is similar to K-means but uses actual data points (medoids) as cluster representatives, making it more robust to outliers.
Algorithm
- Initialization:
- Select \(k\) data points as the initial medoids.
- Assignment Step:
- Assign each data point to the cluster represented by the nearest medoid.
- Update Step:
- For each cluster, choose the data point that minimizes the sum of distances to all other points in the cluster as the new medoid.
- Repeat:
- Alternate between assignment and update steps until medoids stabilize.
Advantages
- More robust to outliers than K-means.
- Does not assume cluster shapes.
Disadvantages
- Computationally expensive for large datasets due to pairwise distance calculations.
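- As a rough illustration of the alternating steps above (a simplified sketch rather than the full PAM swap search), the code below precomputes the pairwise distance matrix, which is exactly where the quadratic cost noted above comes from; the function name and toy data are for illustration only.

```python
import numpy as np

def k_medoids(X, k=2, n_iter=100, seed=0):
    """Simplified alternating k-medoids sketch (not the full PAM swap procedure)."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances: the O(n^2) cost mentioned above
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)        # assignment step
        new_medoids = medoids.copy()
        for j in range(k):                               # update step
            members = np.flatnonzero(labels == j)
            if len(members):
                # the member with the smallest total within-cluster distance becomes the medoid
                new_medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)

X = np.array([[1, 1], [2, 2], [5, 5], [6, 6]], dtype=float)
medoids, labels = k_medoids(X, k=2)
print(X[medoids], labels)
```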
Fuzzy C-Means Clustering
- Unlike K-means or K-medoids, Fuzzy C-means allows data points to belong to multiple clusters with varying degrees of membership.
Algorithm
- Initialization:
- Select the number of clusters \(c\) and the fuzziness exponent \(m > 1\), and initialize the membership matrix \(U\), where \(U_{ij}\) represents the membership degree of point \(x_i\) in cluster \(j\).
- Membership Update:
- Update membership degrees based on cluster centroids and distance:
\(U_{ij} = \frac{1}{\sum_{k=1}^{c} \left(\frac{\|x_i - \mu_j\|}{\|x_i - \mu_k\|}\right)^{\frac{2}{m-1}}}\)
- where \(m > 1\) controls the fuzziness.
- Centroid Update:
- Update centroids using the weighted average: \(\mu_j = \frac{\sum_{i=1}^{n} U_{ij}^m x_i}{\sum_{i=1}^{n} U_{ij}^m}\)
- Repeat:
- Iterate until convergence.
Advantages
- Captures uncertainty and overlapping clusters.
- Flexible for non-spherical clusters.
Disadvantages
- Sensitive to initialization and requires selecting both the fuzziness exponent \(m\) and the number of clusters \(c\).
- Slower than K-means for large datasets.
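- The membership and centroid updates above translate almost directly into NumPy. The sketch below is a bare-bones illustration (random membership initialization, fixed iteration budget); the choices of \(c\), \(m\), and the tolerance are assumptions, not prescribed values.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal Fuzzy C-means sketch following the update equations above."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                    # rows of U sum to 1
    for _ in range(n_iter):
        W = U ** m                                       # fuzzified memberships
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]   # weighted-average centroids
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                            # avoid division by zero
        ratio = d[:, :, None] / d[:, None, :]            # ||x_i - mu_j|| / ||x_i - mu_k||
        U_new = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
        if np.linalg.norm(U_new - U) < tol:
            U = U_new
            break
        U = U_new
    return centroids, U
```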
Comparison of Partitioning Methods
Feature | K-Means | K-Medoids | Fuzzy C-Means |
---|---|---|---|
Cluster Representative | Centroid (mean) | Medoid (actual point) | Weighted average |
Robustness | Sensitive to outliers | Robust to outliers | Handles overlapping |
Speed | Fast (\(O(nkt)\)) | Slower (\(O(k(n-k)^2)\)) | Moderate |
Cluster Shape | Spherical | Arbitrary | Arbitrary |
Output | Hard partitions | Hard partitions | Soft partitions |
Applications
- K-Means:
- Image segmentation.
- Market segmentation.
- Document clustering.
- K-Medoids:
- Gene expression analysis.
- Anomaly detection.
- Fuzzy C-Means:
- Medical imaging.
- Weather pattern analysis.
Hierarchical Clustering
- Hierarchical clustering, one of the key clustering techniques, can be performed in two primary ways: Agglomerative Clustering (bottom-up) and Divisive Clustering (top-down). Let’s explore both methods, their algorithms, mathematical formulations, and use cases, along with examples.
Agglomerative Clustering
Overview
- Agglomerative clustering starts with each data point as its own cluster and iteratively merges clusters until a stopping criterion is met (e.g., a desired number of clusters).
- Approach: Bottom-up
- Process:
- Treat each data point as a singleton cluster.
- Compute pairwise similarity (or dissimilarity) between clusters.
- Merge the two most similar clusters.
- Repeat until the termination condition is met.
Algorithm
- Initialization: Start with \(n\) clusters, where \(n\) is the number of data points.
- Distance Matrix Computation:
- Compute the distance \(d(C_i, C_j)\) between all pairs of clusters \(C_i\) and \(C_j\).
- Common metrics:
- Euclidean Distance: \(d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}\)
- Manhattan Distance: \(d(x, y) = \sum_{i=1}^{p} |x_i - y_i|\)
- Cosine Similarity: \(\text{similarity}(x, y) = \frac{x \cdot y}{\|x\| \|y\|}\)
- Cluster Merging:
- Merge the two clusters with the smallest distance (or the largest similarity).
- Update the distance matrix to reflect the newly formed cluster.
- Use linkage criteria to determine cluster distance:
- Single Linkage (Minimum Distance): \(d(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)\)
- Complete Linkage (Maximum Distance): \(d(C_i, C_j) = \max_{x \in C_i, y \in C_j} d(x, y)\)
- Average Linkage: \(d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)\)
- Centroid Linkage: \(d(C_i, C_j) = d(\mu(C_i), \mu(C_j))\) where \(\mu(C)\) is the centroid of cluster \(C\).
- Repeat: Continue merging until the stopping criterion is satisfied.
Example
Data
- Points: \(A = (1, 1), B = (2, 2), C = (5, 5), D = (6, 6)\)
Steps
- Initial Distance Matrix (Euclidean Distance):
 | A | B | C | D |
---|---|---|---|---|
A | 0 | 1.41 | 5.66 | 7.07 |
B | 1.41 | 0 | 4.24 | 5.66 |
C | 5.66 | 4.24 | 0 | 1.41 |
D | 7.07 | 5.66 | 1.41 | 0 |
- Merge Closest Clusters: merge \(A\) and \(B\) (distance 1.41).
- Update Distance Matrix: Using single linkage, recompute the distances from the merged cluster \(\{A, B\}\) to the remaining clusters.
 | AB | C | D |
---|---|---|---|
AB | 0 | 4.24 | 5.66 |
C | 4.24 | 0 | 1.41 |
D | 5.66 | 1.41 | 0 |
- Repeat Until All Clusters Merged.
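- For reference, SciPy reproduces this single-linkage example in a few lines; this is a minimal sketch assuming the four points above, with the cut threshold of 3 chosen by inspecting the merge distances (1.41, 1.41, 4.24).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The four points A, B, C, D from the example above
X = np.array([[1, 1], [2, 2], [5, 5], [6, 6]], dtype=float)

# Single-linkage agglomerative clustering on Euclidean distances
Z = linkage(X, method="single", metric="euclidean")
print(Z)  # each row: [cluster i, cluster j, merge distance, new cluster size]

# Cutting the dendrogram at distance 3 yields the two clusters {A, B} and {C, D}
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)
```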
Advantages
- Does not require the number of clusters beforehand.
- Suitable for hierarchical relationships.
Disadvantages
- Computationally expensive (\(O(n^3)\)).
- Sensitive to noisy data and outliers.
Divisive Clustering
Overview
- Divisive clustering starts with all data points in a single cluster and recursively splits clusters until a stopping criterion is met.
- Approach: Top-down
- Process:
- Treat all data points as one cluster.
- Split the cluster into two sub-clusters using a splitting criterion.
- Repeat recursively for each sub-cluster.
Algorithm
- Initialization: Start with a single cluster containing all data points.
- Splitting:
- Choose the cluster to split.
- Divide into two clusters based on a measure like:
- K-means clustering.
- Principal Component Analysis (PCA) to project and split.
- Repeat: Continue splitting until a termination condition is met:
- A fixed number of clusters.
- Minimum intra-cluster variance.
Example
Data
- Points: \(A = (1, 1), B = (2, 2), C = (5, 5), D = (6, 6)\)
Steps
- Start with \(\{A, B, C, D\}\).
- Split using k-means:
- Cluster 1: \(\{A, B\}\)
- Cluster 2: \(\{C, D\}\).
- Recurse:
- \(\{A, B\}\) splits into \(A\) and \(B\).
- \(\{C, D\}\) splits into \(C\) and \(D\).
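- Divisive clustering is rarely a single library call, so the sketch below emulates it by recursively bisecting clusters with 2-means, mirroring the k-means splitting criterion mentioned above; the `min_size` stopping rule is an assumption, and recursing all the way to singletons (as in this toy example) is only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_split(X, min_size=1):
    """Recursively bisect a cluster with 2-means until clusters shrink to min_size points."""
    if len(X) <= min_size:
        return [X]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    return [c for lab in (0, 1) for c in divisive_split(X[labels == lab], min_size)]

X = np.array([[1, 1], [2, 2], [5, 5], [6, 6]], dtype=float)
for i, cluster in enumerate(divisive_split(X)):
    print(f"cluster {i}: {cluster.tolist()}")
```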
Advantages
- Can be more efficient than agglomerative for large datasets.
- Handles global structure better.
Disadvantages
- Computationally expensive; an exhaustive search over splits is \(O(2^n)\) in the worst case, so practical implementations rely on heuristic splits (e.g., 2-means).
- Requires splitting criteria, which may not always be straightforward.
Comparison
Feature | Agglomerative Clustering | Divisive Clustering |
---|---|---|
Approach | Bottom-up | Top-down |
Initialization | Start with \(n\) clusters | Start with 1 cluster |
Merging/Splitting | Merge closest clusters | Split clusters recursively |
Computational Cost | Higher for large datasets | More scalable for large datasets |
Cluster Formation | Fine-grained at start | Coarse-grained at start |
Algorithmic Flexibility | Easier to implement, multiple linkage criteria | Requires effective splitting strategy |
Key Equations
- Distance Calculation:
- \(d(x, y) = \sqrt{\sum (x_i - y_i)^2}\) (Euclidean Distance).
- Linkage Criteria:
- Single Linkage: \(d_{\text{min}} = \min_{i,j} d(x_i, y_j)\).
- Complete Linkage: \(d_{\text{max}} = \max_{i,j} d(x_i, y_j)\).
- Cluster Variance (for Divisive):
- \(\text{Intra-cluster variance} = \sum_{i=1}^k \sum_{x \in C_i} \|x - \mu(C_i)\|^2\).
Use Cases
- Agglomerative Clustering:
- Social network analysis.
- Gene expression analysis.
- Customer segmentation.
- Divisive Clustering:
- Document classification.
- Large-scale hierarchical data organization.
Summary
Agglomerative Clustering (Bottom-Up)
- Starts with each data point as its own cluster.
- Iteratively merges the closest pairs of clusters until a single cluster (or a predefined number of clusters) remains.
- Common linkage criteria:
- Single Linkage: Minimum distance between points in two clusters.
- Complete Linkage: Maximum distance between points in two clusters.
- Average Linkage: Average distance between points in two clusters.
Divisive Clustering (Top-Down)
- Begins with all data points in one cluster.
- Recursively splits clusters into smaller groups based on dissimilarity.
- Less common than agglomerative clustering due to its computational complexity.
K-Means Clustering
Overview
- Clustering is a versatile tool in data analysis, used to group data points into meaningful subsets based on their characteristics. Among the various clustering methods, K-means is a widely used partitioning-based approach. It divides \(n\) data points into \(k\) clusters, assigning each point to the cluster with the nearest mean. This method aims to minimize intra-cluster variance, resulting in compact and well-separated clusters.
- Despite its simplicity and efficiency, K-means is often best complemented by other techniques to accommodate the diverse characteristics and complexities of different datasets.
Algorithm Description
- The K-means algorithm consists of the following steps:
- Initialization: Randomly initialize \(k\) centroids (cluster centers) or select \(k\) random points from the dataset.
- Assignment: Assign each data point to the nearest centroid based on the Euclidean distance: \(d(x, c) = \sqrt{\sum_{i=1}^{p} (x_i - c_i)^2}\) where \(x\) is the data point, \(c\) is the centroid, and \(p\) is the number of features.
- Update: Recompute the centroids of the clusters as the mean of all points assigned to each cluster: \(c_j = \frac{1}{|C_j|} \sum_{x \in C_j} x\) where \(C_j\) is the set of points in cluster \(j\), and \(|C_j|\) is the number of points in \(C_j\).
- Repeat: Iterate steps 2 and 3 until the centroids stabilize (i.e., no significant change in their positions) or a maximum number of iterations is reached.
Objective Function
- The goal of K-means is to minimize the within-cluster sum of squares (WCSS):
\(J = \sum_{j=1}^{k} \sum_{x \in C_j} \|x - c_j\|^2\)
- where \(J\) represents the total variance within clusters.
Parameters
- \(k\): Number of clusters.
- Initialization method for centroids.
- Distance metric (Euclidean distance is standard but other metrics like Manhattan distance can be used).
- Maximum number of iterations.
Advantages
- Simple to implement and computationally efficient.
- Works well with compact, spherical clusters.
- Scales to large datasets.
Disadvantages
- Sensitive to the initial placement of centroids; poor initialization can lead to suboptimal solutions.
- Assumes clusters are spherical and equally sized.
- Struggles with non-linear separability and varying cluster densities.
- Requires specifying \(k\), which may not be known in advance.
Example: Applying K-Means Clustering
Dataset
- Consider a 2D dataset with the following points: \((1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)\).
Step-by-Step Execution
- Initialization: Let \(k = 2\). Randomly initialize centroids at \(c_1 = (1, 1)\) and \(c_2 = (5, 7)\).
- Initial centroids: \(c_1 = (1, 1), \ c_2 = (5, 7)\)
- Assignment: Compute the Euclidean distance of each point to the centroids and assign points to the nearest centroid. The Euclidean distance for each point to both centroids is calculated as follows:
- For point \((1, 1)\): \(d((1, 1), c_1) = \sqrt{(1-1)^2 + (1-1)^2} = 0\) \(d((1, 1), c_2) = \sqrt{(1-5)^2 + (1-7)^2} = \sqrt{16 + 36} = \sqrt{52}\)
- Repeat the above for all points in the dataset:
- Point \((1.5, 2)\): Closest to \(c_1\)
- Point \((3, 4)\): Closest to \(c_1\)
- Point \((5, 7)\): Closest to \(c_2\)
- Point \((3.5, 5)\): Closest to \(c_2\)
- Point \((4.5, 5)\): Closest to \(c_2\)
- Point \((3.5, 4.5)\): Closest to \(c_2\)
- Update: Recalculate the centroids as the mean of the points assigned to each cluster:
- For \(c_1\): \(c_1 = \text{mean of points assigned to } c_1 = \frac{(1, 1) + (1.5, 2) + (3, 4)}{3} = (1.833, 2.333)\)
- For \(c_2\): \(c_2 = \text{mean of points assigned to } c_2 = \frac{(5, 7) + (3.5, 5) + (4.5, 5) + (3.5, 4.5)}{4} = (4.125, 5.375)\)
- Repeat: Reassign points to the nearest updated centroids and recalculate the centroids again. Continue until centroids converge or maximum iterations are reached.
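- The walkthrough can be checked with scikit-learn by passing the initial centroids explicitly; this is a small verification sketch, with `n_init=1` so that only the single initialization above is used.

```python
import numpy as np
from sklearn.cluster import KMeans

# The seven points from the example above
X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])

# Start from the same initial centroids as the example
init_centroids = np.array([[1.0, 1.0], [5.0, 7.0]])
km = KMeans(n_clusters=2, init=init_centroids, n_init=1, max_iter=100).fit(X)

print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # final centroids after convergence
print(km.inertia_)          # within-cluster sum of squares (WCSS), the objective J
```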
FAQ: What are some methods that do not require a predefined \(k\) value for clustering?
- In clustering, most algorithms require the user to specify the number of clusters (\(k\)), but there are several methods that do not rely on a predefined \(k\). These methods determine the number of clusters dynamically or use a different approach. Below are some clustering techniques that do not require a predefined \(k\):
Hierarchical Clustering
- Description: Hierarchical clustering constructs a hierarchy of clusters in a recursive manner. It begins by treating each data point as an individual cluster (agglomerative) or starts with all data points in a single cluster and progressively splits them (divisive). The resulting hierarchy is visualized in a dendrogram.
- No Need for \(k\): Instead of needing \(k\) upfront, the dendrogram allows users to determine the appropriate number of clusters post hoc by setting a threshold on the linkage distance.
- Examples:
- Agglomerative clustering (merging clusters bottom-up).
- Divisive clustering (splitting clusters top-down).
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Description: DBSCAN groups points that are closely packed together based on a specified density threshold and marks points in sparse regions as noise. It is especially effective for identifying clusters of arbitrary shapes.
- Parameters: Requires two key inputs:
- \(\epsilon\): Maximum radius for a neighborhood.
- \(minPts\): Minimum number of points required to form a dense cluster.
- No Need for \(k\): Clusters are formed dynamically based on density conditions without requiring the number of clusters to be predefined.
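- A minimal scikit-learn sketch on a synthetic two-moons dataset, where density-based clustering handles the non-convex shapes that k-means cannot; the `eps` and `min_samples` values here are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: clusters of arbitrary shape plus a little noise
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                       # label -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {np.sum(labels == -1)}")
```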
OPTICS (Ordering Points To Identify the Clustering Structure)
- Description: OPTICS builds upon DBSCAN to address its sensitivity to the \(\epsilon\) parameter. It orders data points in a way that captures cluster structures of varying densities and avoids strict reliance on a single density threshold.
- No Need for \(k\): The algorithm produces a reachability plot, enabling identification of clusters with varying densities, without a predefined \(k\).
Mean-Shift Clustering
- Description: Mean-shift identifies clusters by iteratively shifting each data point toward the region of highest density within a defined radius. It converges to local maxima of the density function, effectively partitioning the data.
- Parameters:
- Bandwidth: Determines the size of the neighborhood considered for density estimation.
- No Need for \(k\): The algorithm adapts to the data, converging on a set of centroids that correspond to the clusters, dynamically determining the number of clusters.
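- A short scikit-learn sketch: `estimate_bandwidth` picks a bandwidth from the data (the `quantile` value is an arbitrary illustrative choice), and MeanShift then discovers the number of clusters on its own.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic data with three Gaussian blobs, purely for illustration
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)

bandwidth = estimate_bandwidth(X, quantile=0.2)      # data-driven bandwidth estimate
ms = MeanShift(bandwidth=bandwidth).fit(X)

print(f"number of clusters found: {len(ms.cluster_centers_)}")
```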
Affinity Propagation
- Description: This method relies on message passing between data points to identify exemplars, which act as cluster centers. Data points are assigned to clusters based on the identified exemplars.
- Parameters:
- Preference: A parameter influencing the number of exemplars; it can be tuned based on data properties.
- No Need for \(k\): The number of clusters is discovered automatically during the exemplar identification process.
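- A minimal scikit-learn sketch; leaving `preference=None` defaults it to the median pairwise similarity, and the number of clusters then emerges from the message passing rather than being specified.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic data purely for illustration
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

ap = AffinityPropagation(preference=None, random_state=0).fit(X)
print(f"clusters found: {len(ap.cluster_centers_indices_)}")
```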
Gaussian Mixture Models (GMM) with Model Selection
- Description: GMM assumes data is generated from a mixture of Gaussian distributions. It estimates the parameters of these distributions and assigns probabilities to data points for belonging to each component.
- Model Selection:
- Uses information criteria such as Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC) to evaluate models with different numbers of components.
- No Need for Predefined \(k\): The model selection process determines the optimal number of components (clusters).
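- A compact sketch of BIC-based model selection with scikit-learn; the candidate range of component counts (1 through 9) is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with four Gaussian blobs, purely for illustration
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Fit GMMs with 1..9 components and keep the count with the lowest BIC
bics = [GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X).bic(X)
        for k in range(1, 10)]

best_k = int(np.argmin(bics)) + 1
print(f"BIC-selected number of components: {best_k}")
```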
Self-Organizing Maps (SOMs)
- Description: SOMs are a type of neural network that map high-dimensional data onto a lower-dimensional grid. The grid topology imposes a structure on the data, enabling cluster identification.
- No Need for \(k\): The topology of the grid itself influences the clustering outcome, dynamically revealing clusters without explicitly requiring \(k\).
Spectral Clustering with Eigen Gap Analysis
- Description: Spectral clustering uses the eigenvalues of a similarity matrix derived from the data to partition it into clusters. It leverages graph theory principles to create natural divisions in the data.
- Dynamic \(k\): The eigen gap analysis method identifies the number of clusters by observing significant drops in the sorted eigenvalue spectrum.
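- A rough NumPy sketch of eigen gap analysis: build a similarity matrix, form the normalized graph Laplacian, and look for the largest jump among its smallest eigenvalues; the RBF kernel width and the number of eigenvalues inspected are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

# Synthetic data with three well-separated blobs, purely for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# RBF similarity matrix and symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
W = rbf_kernel(X, gamma=0.5)
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

# Eigen gap: the largest jump between consecutive small eigenvalues suggests k
eigvals = np.sort(np.linalg.eigvalsh(L))
gaps = np.diff(eigvals[:10])
k_estimate = int(np.argmax(gaps)) + 1
print(f"eigen-gap estimate of k: {k_estimate}")
```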
Birch (Balanced Iterative Reducing and Clustering Using Hierarchies)
- Description: Birch uses a hierarchical clustering tree (CF-tree) structure to cluster large datasets incrementally and dynamically. It is particularly efficient for handling massive datasets.
- No Need for \(k\): Clusters are formed dynamically in the initial stage without requiring \(k\), although it can optionally refine clusters if \(k\) is specified later.
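- In scikit-learn, setting `n_clusters=None` skips Birch's optional global refinement step, so the CF-tree's subclusters are returned without a predefined \(k\); the `threshold` value below is an illustrative choice.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# A larger synthetic dataset, purely for illustration
X, _ = make_blobs(n_samples=1000, centers=5, random_state=7)

# n_clusters=None: no final global clustering step, so no k needs to be specified
brc = Birch(threshold=0.8, n_clusters=None).fit(X)
print(f"subclusters found: {len(brc.subcluster_centers_)}")
```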
Autoencoders for Clustering
- Description: Autoencoders are neural networks designed to learn compressed representations of input data. These representations capture essential features that can be used for clustering using secondary algorithms such as DBSCAN or hierarchical clustering.
- Dynamic \(k\): The latent space produced by the autoencoder often reveals intrinsic cluster structures, making it possible to identify clusters without a predefined \(k\).
Comparison of Approaches
Clustering Method | Description | Key Parameters | How \(k\) is Determined | Strengths | Weaknesses |
---|---|---|---|---|---|
Hierarchical Clustering | Builds a hierarchy of clusters visualized in a dendrogram. | Linkage method, distance metric | Determined post hoc by cutting the dendrogram at a desired distance threshold. | No need for predefining \(k\); interpretable dendrogram. | Computationally expensive for large datasets; sensitive to noise. |
DBSCAN | Groups densely packed points and identifies noise. | \(\epsilon\) (neighborhood radius), \(minPts\) | Clusters formed dynamically based on density conditions. | Identifies clusters of arbitrary shapes; handles noise. | Sensitive to \(\epsilon\) and \(minPts\); struggles with varying densities. |
OPTICS | Orders points to identify clusters with varying densities. | \(\epsilon\) (optional), \(minPts\) | Clusters inferred from reachability plot without fixed \(k\). | Handles varying densities; less sensitive to \(\epsilon\) than DBSCAN. | Computationally intensive; interpretation of reachability plot required. |
Mean-Shift Clustering | Iteratively shifts data points toward regions of highest density. | Bandwidth (radius for density estimation) | Centroids determined dynamically based on data density. | No predefined \(k\); handles arbitrary cluster shapes. | Bandwidth selection is critical; can be computationally expensive. |
Affinity Propagation | Uses message passing to identify exemplars as cluster centers. | Preference (affects number of exemplars) | Automatically identifies the number of clusters during message-passing process. | No \(k\) required; adaptable to data structure. | High memory and computation cost; sensitive to preference parameter. |
GMM with Model Selection | Fits data to a mixture of Gaussian distributions, selecting the best model using information criteria. | Initial guesses for parameters | Model selection (BIC/AIC) identifies optimal number of components (clusters). | Flexible probabilistic clustering; works well for elliptical clusters. | Assumes Gaussian distribution; sensitive to initialization. |
Self-Organizing Maps (SOMs) | Maps high-dimensional data to a grid, revealing cluster structures. | Grid topology (e.g., size, shape) | Clusters emerge dynamically based on grid topology and data distribution. | Effective for dimensionality reduction and clustering; visual insights from the grid. | Requires expertise in setting grid parameters; less precise than other methods for irregularly shaped clusters. |
Spectral Clustering (Eigen Gap) | Uses graph theory to partition data, with clusters inferred from eigenvalue spectrum analysis. | Similarity matrix construction parameters | Eigen gap analysis determines the number of clusters dynamically. | Handles non-convex clusters; robust for complex data structures. | Computationally expensive for large datasets; requires constructing a similarity graph. |
Birch | Uses CF-tree structure for efficient clustering of large datasets. | Threshold for branching/merging | Dynamically forms clusters during initial phase; \(k\) optional for refinement. | Scalable for large datasets; can incorporate \(k\) refinement. | Struggles with non-spherical clusters and datasets with significant noise. |
Autoencoders for Clustering | Neural network learns compressed latent space, revealing intrinsic cluster structures. | Network architecture, training parameters | Latent space often reveals clusters dynamically; secondary clustering methods used. | Captures complex patterns in high-dimensional data; flexible combination with clustering methods. | Requires extensive tuning; dependent on secondary clustering method for final cluster assignment. |
Comparison: k-Nearest Neighbors vs. k-Means Clustering
- Here is a detailed comparative analysis of k-Nearest Neighbors and k-Means Clustering:
Aspect | k-Nearest Neighbors (kNN) | k-Means Clustering |
---|---|---|
Type of Algorithm | Supervised learning algorithm (classification or regression). | Unsupervised learning algorithm (clustering). |
Purpose | Predicts the label or value for a given data point based on its neighbors. | Groups data into \(k\) clusters based on similarity. |
Input Requirements | Labeled data for training (requires both features and target values). | Unlabeled data (only features are required). |
Working Mechanism | Finds the \(k\)-nearest points in the training dataset to a given query point and uses their labels to make predictions. | Iteratively partitions data into \(k\) clusters by minimizing intra-cluster variance. |
Distance Metric | Typically uses Euclidean distance, but other metrics like Manhattan or Minkowski can also be used. | Uses Euclidean distance (or other metrics) to assign points to the nearest centroid. |
Output | Predicted label (classification) or value (regression). | Cluster assignments for each data point. |
Training Phase | No explicit training; kNN is a lazy learner and computes distances during prediction. | Training involves multiple iterations to adjust centroids and assign clusters. |
Prediction Phase | Involves computing distances from the query point to all training points. | Assigns new data points to the nearest cluster based on the trained centroids. |
Scalability | Not scalable; high computational cost for large datasets due to distance calculations during prediction. | Scalable with optimizations; faster for large datasets after training. |
Parameters | Number of neighbors (\(k\)) and distance metric. | Number of clusters (\(k\)) and initialization of centroids. |
Sensitivity to Parameters | Sensitive to the choice of \(k\); inappropriate \(k\) can lead to overfitting or underfitting. | Sensitive to the choice of \(k\) and initialization of centroids; poor initialization can lead to suboptimal clustering. |
Interpretability | Intuitive; directly uses neighboring points for prediction. | Less intuitive; requires interpreting clusters. |
Handling of Outliers | Outliers can strongly influence predictions by affecting the nearest neighbors. | Outliers can distort cluster centroids and lead to poor clustering. |
Applications | Classification (e.g., image recognition, fraud detection) and regression. | Clustering (e.g., customer segmentation, document classification). |
Strengths | Simple, effective, and easy to implement; no need for explicit training. | Works well for clustering large datasets; discovers inherent structure. |
Weaknesses | Computationally expensive during prediction; performance decreases with irrelevant features. | Sensitive to the choice of \(k\); may converge to local minima. |
Key Takeaways
- kNN: A supervised algorithm used for prediction tasks, often with low computational overhead in the training phase but expensive at prediction time.
- k-Means: An unsupervised algorithm used to discover groupings in data, efficient for partitioning compact clusters but sensitive to initialization.
- Both algorithms rely heavily on distance metrics, making feature scaling (e.g., normalization or standardization) critical for their performance.
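- To make the scaling point concrete, here is a small sketch with made-up two-feature data (income in dollars, age in years): without standardization, the income column would dominate every Euclidean distance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [annual income, age] on very different scales
X = np.array([[45000, 25], [52000, 32], [120000, 41],
              [115000, 38], [48000, 29], [118000, 45]])

# Standardize features before the distance-based step so income does not dominate age
pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=0))
print(pipe.fit_predict(X))   # cluster labels after scaling
```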
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledClustering,
title = {Clustering},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}