Gaussian Mixture Models Explained
1. Overview

1.1. What are Gaussian Mixture Models?

Gaussian Mixture Models are probabilistic models that express a probability distribution as a combination of multiple Gaussian (normal) distributions, each contributing with a certain weight.

1.2. Original Motivation

Designed to capture heterogeneous or multi-modal data by breaking down a complex distribution into simpler, interpretable Gaussian components.

1.3. Relevant Papers

Karl Pearson (1894): Early work on fitting mixtures of normal distributions.

Dempster, Laird, and Rubin (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm": Introduced the EM algorithm, which made maximum-likelihood estimation of mixture models computationally feasible and remains the standard fitting approach.

Duda & Hart (1973): Their textbook spread the use of GMMs in clustering and classification.

1.4. Current Status

GMMs are still extensively applied and continuously supported in modern statistics and machine learning libraries (such as those of R and Python). Ongoing research continues to refine both methodology and applications.

1.5. Relevance

Although deep learning has overtaken GMMs in areas like large-scale speech recognition, they remain valuable as interpretable, efficient, and lightweight models for density estimation and clustering, and as components in larger ML pipelines.

1.6. Applications (Historical & Modern)

Speech recognition: Before deep learning, GMM–HMM systems were the backbone of acoustic modeling.

Computer vision: Widely used for background subtraction and motion detection through pixelwise Gaussian mixtures.

General clustering & density modeling: Found in diverse scientific and engineering domains such as biomedical imaging, astronomy, and anomaly detection.

2. Probabilistic Modeling with GMMs for Clustering

GMMs are popular for data clustering.

2.1. What is Clustering?

Clustering is a form of unsupervised learning.
The goal is to divide multi-dimensional data into clusters, or groups of data points. While this post series is about the task of generating data, understanding data clustering is very helpful, mainly because it demonstrates the importance of probabilistic modeling, as we will see when comparing the K-means algorithm and GMMs for clustering.

(Figure: unclustered data vs. the true clusters.)

2.2. K-Means Clustering

An iterative algorithm used for clustering (not probabilistic). Suppose we have a dataset \(\mathbf{x} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}\), where \(N\) is the number of observations and \(D\) is the dimension of each sample \(\mathbf{x}_i\). The goal is to divide the data into \(K\) clusters. For that, the K-means algorithm introduces \(K\) vectors \(\mathbf{m}_k\) of dimensionality \(D\), where each vector \(\mathbf{m}_k\) represents the mean of cluster \(k\). The algorithm then finds the means \(\mathbf{m}_k\) and the associated data points such that the distance of each data point to the mean of its cluster is minimal. The association is encoded by an indicator \(r_{ik} \in \{0, 1\}\), where \(r_{ik}\) equals 1 only for the selected cluster and 0 otherwise. The parameters \(\mathbf{m}_k\) and \(r_{ik}\) are found by solving this optimization problem:

\[ (r, \mathbf{m}) = \arg\min_{r, \mathbf{m}} \sum_{i}^{N} \sum_{k}^{K} r_{ik} \|\mathbf{x}_i - \mathbf{m}_k\|^2 \]

2.3. Mixture Models

The simplest form of latent variable models (probabilistic). We assume there are \(K\) components that, combined, determine the distribution of the data \(\mathbf{X}\). There is a discrete latent random variable \(Z\) with realizations \(z \in \{1, 2, \dots, K\}\), where each value \(z\) corresponds to a component of the mixture model. This latent variable simply states that the sample \(\mathbf{x}_i\) belongs to cluster \(k\). As discussed in post 1, we use probability for modeling, so we consider \(P(z = k)\) to be the probability of picking the \(k\)-th component.
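This latent-variable view translates directly into a generative recipe: first pick a component \(z\) with probability \(P(z = k)\), then draw \(\mathbf{x}\) from that component. A minimal numpy sketch of this two-step sampling for a 1-D Gaussian mixture (the weights, means, and standard deviations are illustrative assumptions, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D mixture: P(z=k), component means, component std devs
weights = np.array([0.5, 0.3, 0.2])   # must sum to 1
means = np.array([-4.0, 0.0, 5.0])
stds = np.array([1.0, 0.5, 2.0])

def sample_mixture(n):
    # Step 1: pick a component z for each sample with probability P(z=k)
    z = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw x from the chosen Gaussian component
    return rng.normal(means[z], stds[z]), z

x, z = sample_mixture(10_000)
# The empirical component frequencies approach the mixture weights
print(np.bincount(z) / len(z))
```

Note that the component index \(z\) is discarded at the end: only \(\mathbf{x}\) is observed, which is exactly what makes \(Z\) a latent variable.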
These are also the weights: the contribution of each component to the mixture, with \(\sum_{k=1}^{K} P(z = k) = 1\). Each component can be a different model, for example a Gaussian or a Bernoulli, and each model has learnable parameters that define its distribution, for example \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) in the case of a Gaussian, or \(p\) (the probability of success) in the case of a Bernoulli. From a probabilistic perspective, a cluster corresponds to a mixture component. The data distribution is a mixture (weighted summation) of models. Mixture models are represented by the following equation:

\[ p(\mathbf{x}_i|\boldsymbol{\theta}) = \sum_{k=1}^{K} p_Z(k)\, p_{X|Z}(\mathbf{x}_i|k, \boldsymbol{\theta}) \]

Here \(\boldsymbol{\theta}\) denotes the learned parameters, \(K\) is a hyperparameter indicating the estimated number of clusters (not learned), and \(p_{X|Z}(\mathbf{x}|k, \boldsymbol{\theta})\) is the probability of observing \(\mathbf{x}\) given that it was generated from the specific component \(k\), i.e., for \(z = k\).

2.4. GMMs

GMMs use the multivariate Gaussian distribution as the component model:

\[ \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\text{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)}{\sqrt{(2\pi)^D|\boldsymbol{\Sigma}|}} \]

The corresponding equation is:

\[ p(\mathbf{x}_i|\boldsymbol{\theta}) = \sum_{k=1}^{K} p_Z(k)\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]

To simplify notation, we write \(\boldsymbol{\theta} = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}\}\), where \(\boldsymbol{\pi}\) collects the weights \(\pi_k = p_Z(k)\). These are the learnable parameters.

2.5. Transition from K-Means to GMM and Role of Probabilistic Modeling

In this context, GMMs are probabilistic models usually presented as a generalization of the K-means algorithm. Because K-means is not probabilistic, the resulting clusters can be unintuitive and, in many cases, wrong. That is why K-means is considered a rigid (hard) clustering method, while GMMs allow for soft clustering by adding variance to the model.
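The soft assignment can be made concrete by computing the posterior \(P(z = k|\mathbf{x}_i)\) via Bayes' rule, a standard result not derived in the text above. A minimal pure-numpy sketch for a 1-D GMM (the parameter values are illustrative assumptions):

```python
import numpy as np

# Illustrative 1-D GMM parameters theta = {mu, sigma, pi} (assumed values)
pis = np.array([0.5, 0.5])      # mixture weights pi_k
mus = np.array([-2.0, 2.0])     # component means
sigmas = np.array([1.0, 1.0])   # component std devs

def normal_pdf(x, mu, sigma):
    # Univariate Gaussian density N(x | mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def soft_assignments(x):
    """Posterior P(z=k | x) for each point: the soft cluster membership."""
    # pi_k * N(x_i | mu_k, sigma_k^2) for every point/component pair
    joint = pis * normal_pdf(x[:, None], mus, sigmas)
    # Normalize over components (Bayes' rule denominator)
    return joint / joint.sum(axis=1, keepdims=True)

print(soft_assignments(np.array([-2.0, 0.0, 2.0])))
# A point midway between the two means gets weight 0.5 for each cluster,
# whereas K-means would assign it entirely to one cluster.
```

Points near a component mean get a membership close to 1 for that component, while ambiguous points split their membership, which is exactly the behavior hard \(r_{ik} \in \{0, 1\}\) assignments cannot express.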
The mathematical derivation of the transition from K-means to GMM by including a parameter for the variance is as follows:

\[\begin{align*}
(r, \mathbf{m}) &= \arg\min_{r, \mathbf{m}} \sum_{i}^{N} \sum_{k}^{K} r_{ik}\|\mathbf{x}_i - \mathbf{m}_k\|^2 \\
&= \arg\max_{r, \mathbf{m}} \sum_{i}^{N} \sum_{k}^{K} r_{ik}\left(-\tfrac{1}{2}\sigma^{-2}\|\mathbf{x}_i - \mathbf{m}_k\|^2\right) + \text{const.} \\
&= \arg\max_{r, \mathbf{m}} \prod_{i} \sum_{k} r_{ik} \exp\left(-\tfrac{1}{2}\sigma^{-2}\|\mathbf{x}_i - \mathbf{m}_k\|^2\right) / Z \\
&= \arg\max_{r, \mathbf{m}} \prod_{i} \sum_{k} r_{ik}\, \mathcal{N}(\mathbf{x}_i; \mathbf{m}_k, \sigma^2 I) \\
&= \arg\max_{r, \mathbf{m}} p(\mathbf{x}|\mathbf{m}, r)
\end{align*}\]

Line 2: We added a minus sign to go from minimization to maximization, then multiplied by the positive constant \(\frac{1}{2\sigma^2}\), which does not change the solution (the location of the optimum). The const. term represents any parts of the Gaussian PDF that do not depend on \((r, \mathbf{m})\). It is