Author name: Jasser Kraiem

Academic

Gaussian Mixture Models Explained

1. Overview

1.1. What are Gaussian Mixture Models?
Gaussian Mixture Models are probabilistic models that express a probability distribution as a combination of multiple Gaussian (normal) distributions, each contributing with a certain weight.

1.2. Original Motivation
Designed to capture heterogeneous or multi-modal data by breaking a complex distribution down into simpler, interpretable Gaussian components.

1.3. Relevant Papers
- Karl Pearson (1894): early work on fitting mixtures of normal distributions.
- Dempster, Laird, and Rubin (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm": introduced the EM algorithm, which made maximum-likelihood estimation of mixture models computationally feasible and remains the standard fitting approach.
- Duda & Hart (1973): their textbook spread the use of GMMs in clustering and classification.

1.4. Current Status
GMMs are still extensively applied and well supported in modern statistics and machine-learning software (in both R and Python). Ongoing research continues to refine methodology and applications.

1.5. Relevance
Although deep learning has overtaken GMMs in areas like large-scale speech recognition, they remain valuable as interpretable, efficient, and lightweight models for density estimation and clustering, and as components in larger ML pipelines.

1.6. Applications (Historical & Modern)
- Speech recognition: before deep learning, GMM–HMM systems were the backbone of acoustic modeling.
- Computer vision: widely used for background subtraction and motion detection through pixel-wise Gaussian mixtures.
- General clustering & density modeling: found in diverse scientific and engineering domains such as biomedical imaging, astronomy, and anomaly detection.

2. Probabilistic Modeling with GMMs for Clustering

GMMs (Gaussian Mixture Models) are popular for data clustering.

2.1. What is Clustering?
Clustering is a form of unsupervised learning.
The goal is to divide multi-dimensional data into clusters, i.e., groups of data points. While this post series is about the task of generating data, understanding data clustering is very helpful: it demonstrates the importance of probabilistic modeling, as we will see when comparing the K-means algorithm and GMMs for clustering.

(Figure: unclustered data vs. the true clusters.)

2.2. K-Means Clustering
An iterative, non-probabilistic algorithm used for clustering. Suppose we have a dataset \( \mathbf{x} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\} \) with \( N \) observations, where each sample \( \mathbf{x}_i \) has dimension \( D \). The goal is to divide the data into \( K \) clusters. For that, the K-means algorithm introduces \( K \) vectors \( \boldsymbol{\mu}_k \) of dimensionality \( D \), where each vector \( \boldsymbol{\mu}_k \) represents the mean of cluster \( k \). The algorithm then finds the \( \boldsymbol{\mu}_k \) and the associated data points such that the distance of each data point to its cluster mean \( \boldsymbol{\mu}_k \) is minimal. The association is encoded by an indicator \( r_{ik} \in \{0, 1\} \), which equals 1 only for the selected cluster and 0 otherwise. The parameters \( \boldsymbol{\mu}_k \) and \( r_{ik} \) are found by solving this optimization problem:

\[ (r, \boldsymbol{\mu}) = \arg\min_{r, \boldsymbol{\mu}} \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 \]

2.3. Mixture Models
The simplest form of latent variable models (probabilistic). We assume there are \( K \) mixture components that, combined, determine the distribution of the data \( \mathbf{X} \). There is a discrete latent random variable \( Z \) with realizations \( z \in \{1, 2, \dots, K\} \), where each value \( z \) corresponds to a component of the mixture model. This latent variable simply states that the sample \( \mathbf{x}_i \) belongs to cluster \( k \). As discussed in post 1, we use probability for modeling, so we consider \( P(z=k) \) to be the probability of picking the \( k \)-th component.
These are also the weights: the contribution of each component to the mixture, with \( \sum_{k=1}^{K} P(z=k) = 1 \). Each component can be a different model, for example a Gaussian or a Bernoulli. Each model has learnable parameters that define its distribution, for example \( \boldsymbol{\mu} \) and \( \boldsymbol{\Sigma} \) in the case of a Gaussian, or \( p \) (the probability of success) in the case of a Bernoulli. From a probabilistic perspective, a cluster corresponds to a mixture component. The data distribution is a mixture (weighted sum) of models, represented by the following equation:

\[ p(\mathbf{x}_i \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} p_Z(k)\, p_{X \mid Z}(\mathbf{x}_i \mid k, \boldsymbol{\theta}) \]

Here \( \boldsymbol{\theta} \) are the learned parameters, \( K \) is a hyperparameter indicating the assumed number of clusters (not learned), and \( p_{X \mid Z}(\mathbf{x} \mid k, \boldsymbol{\theta}) \) is the probability of observing \( \mathbf{x} \) given that it was generated from a specific component \( k \), i.e., \( z = k \).

2.4. GMMs
GMMs use the multivariate Gaussian distribution as the component model:

\[ f_{\mathbf{X}}(x_1, \dots, x_D) = \frac{\exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)}{\sqrt{(2\pi)^D |\boldsymbol{\Sigma}|}} \]

The corresponding mixture equation is:

\[ p(\mathbf{x}_i \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]

where \( \pi_k = p_Z(k) \). To simplify notation, \( \boldsymbol{\theta} = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}\} \); these are the learnable parameters.

2.5. Transition from K-Means to GMM and the Role of Probabilistic Modeling
In this context, GMMs are probabilistic models usually presented as a generalization of the K-means algorithm. Because K-means is not probabilistic, it assigns each point to exactly one cluster, which can produce unintuitive and in many cases wrong clusters (for example, when clusters overlap or have very different spreads). That is why K-means is considered a rigid (hard) clustering method, while GMMs allow for soft clustering by adding variance to the equation, as shown in the equation below.
The corresponding equation is:

\[ p(\mathbf{x}_i \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]

The mathematical derivation of the transition from K-means to GMM, by including a parameter for variance, is as follows:

\[\begin{align*}
(r, \boldsymbol{\mu}) &= \arg\min_{r, \boldsymbol{\mu}} \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 \\
&= \arg\max_{r, \boldsymbol{\mu}} \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \left(-\frac{1}{2\sigma^2} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2\right) + \text{const.} \\
&= \arg\max_{r, \boldsymbol{\mu}} \prod_{i} \sum_{k} r_{ik} \exp\left(-\frac{1}{2\sigma^2} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2\right) \big/ Z \\
&= \arg\max_{r, \boldsymbol{\mu}} \prod_{i} \sum_{k} r_{ik}\, \mathcal{N}(\mathbf{x}_i;\, \boldsymbol{\mu}_k, \sigma^2 I) \\
&= \arg\max_{r, \boldsymbol{\mu}} p(\mathbf{x} \mid \boldsymbol{\mu}, r)
\end{align*}\]

Line 2: we added a minus sign to go from minimization to maximization, then scaled by the positive constant \( \frac{1}{2\sigma^2} \), which does not change the solution (the location of the maximum). The const. term represents the parts of the Gaussian log-density that do not depend on \( (r, \boldsymbol{\mu}) \); it is absorbed into the normalizing constant \( Z \) after exponentiating in line 3.
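To make the hard-vs-soft distinction concrete, here is a minimal sketch using scikit-learn. The two-blob toy data and all parameter choices are my own illustrative assumptions, not from the post; the point is only that `KMeans` returns a single label per point while `GaussianMixture.predict_proba` returns per-component responsibilities.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Toy data: two overlapping Gaussian blobs (illustrative assumption).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[3.0, 0.0], scale=1.0, size=(200, 2)),
])

# K-means: hard assignments, each point belongs to exactly one cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
hard_labels = kmeans.labels_            # shape (400,), values in {0, 1}

# GMM: soft assignments, each point gets a responsibility per component.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
resp = gmm.predict_proba(X)             # shape (400, 2), each row sums to 1

# A point halfway between the blobs is ambiguous: K-means still commits
# to one cluster, while the GMM responsibilities are close to 0.5 / 0.5.
midpoint = np.array([[1.5, 0.0]])
print(kmeans.predict(midpoint))         # a hard 0 or 1
print(gmm.predict_proba(midpoint))      # two responsibilities near 0.5 each
```

The responsibilities are exactly the probabilistic quantity that K-means throws away by forcing \( r_{ik} \in \{0, 1\} \).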


Probabilistic Generative Models Overview

This post is the first and introductory post in the series From Probabilistic Modeling to Generative Modeling. It is my attempt to reproduce the knowledge I gained from learning about probabilistic generative models. In this entry, I break down the general concepts needed to understand five probabilistic generative models: Gaussian Mixture Models (GMM), Variational Autoencoders (VAE), Normalizing Flows (NF), Generative Adversarial Networks (GAN), and Diffusion Models (DM), which will be explained in the next posts.

In this series, these models will be presented in a logical order that I find more intuitive, rather than following a strict categorization. However, I will highlight the characteristics that can be used to differentiate between them for a clearer and deeper understanding. Before starting, it is necessary to understand why generative modeling is useful.

1 – Why Probabilistic Generative Modeling?

To answer this question, we need to understand the difference between discriminative (predictive) and generative models. Let us consider the famous task of image classification using deep neural networks that predict the class of an image: cat or dog. Szegedy et al. (2013) showed that adding noise to the images can result in false predictions.

Original image: \( P(y=\text{cat} \mid x) = 0.95 \), \( P(y=\text{dog} \mid x) = 0.05 \)
Noisy image: \( P(y=\text{cat} \mid x) = 0.1 \), \( P(y=\text{dog} \mid x) = 0.9 \)

The classifier predicts the probabilities of the labels \( y \) given the images \( x \), which means it learns the conditional probability \( p(y \mid x) \). How can adding a limited amount of noise that barely affects the signal result in false predictions? This shows that discriminative models do not really understand the image; they only capture patterns that are useful for making predictions. Once those patterns change even slightly, you get nonsense predictions.
According to the Deep Learning book: "Classification algorithms can take an input from such a rich high-dimensional distribution and summarize it with a categorical label—what object is in a photo, what word is spoken in a recording, what topic a document is about. The process of classification discards most of the information in the input and produces a single output (or a probability distribution over values of that single output). The classifier is also often able to ignore many parts of the input. For example, when recognizing an object in a photo, it is usually possible to ignore the background of the photo."

However, again quoting from the same book: "The goal of deep learning is to scale machine learning to the kinds of challenges needed to solve artificial intelligence." In other words, we need models that understand the reality represented by an image: models able to capture the structure of images and have a semantic understanding of what they represent, including the whole environment depicted in the image. The book gives concrete examples of tasks where this requirement is fundamental, such as density estimation, denoising, missing-value imputation, and, most relevant to this post series, sampling.

2 – Why Probability?

"Probability theory is nothing but common sense reduced to computation." — Pierre Laplace (1812).

Probability is the mathematical toolkit we use to deal with uncertainty. It provides the rules and axioms required to quantify uncertain events. As already mentioned, being able to express uncertainty is fundamental for AI systems to make reliable decisions. According to the Deep Learning book, there are two main goals for applying probability theory in AI applications:
- Start from the laws of probability to specify how the system should behave, and design it accordingly.
- Use probability and statistics as analysis tools to understand the decisions taken by AI systems.

2.1 – Bayesian vs. Frequentist Approach

There are two approaches to interpreting probability: frequentist and Bayesian. The frequentist interpretation focuses on long-run frequencies of events. For example: if I flip a fair coin infinitely many times, half of the flips will be heads. Probability is thus the long-run frequency of an outcome over repeated trials. The Bayesian interpretation is about degrees of belief. From the Bayesian view, the same example would mean: "we believe the coin is equally likely to land heads or tails on the next toss." So it is about information rather than repeated trials. A better example to differentiate between them: "The probability that a candidate will win the election is 60%." This concerns a single occurrence of the event (winning the election), so the frequentist interpretation is not well suited here. For more details, refer to Machine Learning: A Probabilistic Perspective. The main point is that the Bayesian approach is used to quantify uncertainty, which is why it is more appropriate for probabilistic generative models.

2.2 – Most Relevant Rules Needed for This Series

Here we focus on the rules needed for the higher-level concepts and their applications. If you are interested in a deeper understanding, I recommend this YouTube video from the Machine Learning groups at the University of Tübingen.

2.2.1 – Sum Rule

\[ P(X) = P(X, Y) + P(X, \bar{Y}) \]

- \( P(X) \): probability distribution of the random variable \( X \)
- \( P(X, Y) \): joint probability distribution of \( X \) and \( Y \)
- \( \bar{Y} = \mathcal{E} \setminus Y \), where \( \mathcal{E} \) is the space of events

The sum rule can be used to eliminate some random variables.

2.2.2 – Marginalization

\[ P(X) = \sum_{y \in \mathcal{Y}} P(X, y) \quad \text{if the distribution is discrete} \]
\[ P(X) = \int_{\mathcal{Y}} P(X, y)\, dy \quad \text{if the distribution is continuous} \]

Marginalization gives you the probability distribution of \( X \) when you ignore \( Y \).
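As a quick sanity check, marginalization can be verified on a small discrete joint distribution. The table below is made up purely for illustration; the only requirement is that its entries sum to 1.

```python
import numpy as np

# Joint distribution P(X, Y) over X in {0, 1} (rows) and Y in {0, 1} (cols).
# Illustrative numbers only.
joint = np.array([
    [0.10, 0.30],   # P(X=0, Y=0), P(X=0, Y=1)
    [0.20, 0.40],   # P(X=1, Y=0), P(X=1, Y=1)
])

# Marginalization: sum out Y to get P(X).
p_x = joint.sum(axis=1)                 # [0.4, 0.6]

# Sum rule with a binary Y: P(X) = P(X, Y=1) + P(X, Y=0),
# where Y=0 plays the role of the complement of Y=1.
p_x_sum_rule = joint[:, 1] + joint[:, 0]

print(p_x)            # [0.4 0.6]
print(p_x_sum_rule)   # same values: the two rules agree
```

For a binary variable the sum rule and marginalization coincide, which is exactly what the two print statements show.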
2.2.3 – Conditional Probability

\[ P(Y \mid X) = \frac{P(X, Y)}{P(X)} \]

- \( P(Y \mid X) \): conditional probability of \( Y \) given \( X \)
- \( P(X) \): marginal probability

2.2.4 – Product Rule

\[ P(X, Y) = P(Y \mid X)\, P(X) = P(X \mid Y)\, P(Y) \]

The product rule lets you express joint distributions using conditional distributions.

2.2.5 – Law of Total Probability

\[ P(X) = \sum_{i=1}^{n} P(X \mid Y_i)\, P(Y_i), \quad \text{where the } Y_i \text{ are disjoint and cover the event space} \]

2.2.6 – Bayes' Theorem

\[ P(Y_i \mid X) = \frac{P(X \mid Y_i)\, P(Y_i)}{\sum_{j=1}^{n} P(X \mid Y_j)\, P(Y_j)} = \frac{P(X \mid Y_i)\, P(Y_i)}{P(X)} \]

- \( P(Y_i \mid X) \): posterior probability of \( Y_i \) given \( X \)
- \( P(X \mid Y_i) \): likelihood of \( X \) under \( Y_i \)
- \( P(Y_i) \): prior probability of \( Y_i \)
- \( \sum_{j=1}^{n} P(X \mid Y_j)\, P(Y_j) \): normalization term
- \( P(X) \): marginal probability of \( X \) (the evidence)

2.2.7 – Bayes' Theorem in Learning

\[ P(X \mid D) = \frac{P(D \mid X)\, P(X)}{\sum_{j=1}^{n} P(D \mid X_j)\, P(X_j)} = \frac{P(D \mid X)\, P(X)}{P(D)} \]

- \( P(X \mid D) \): posterior
- \( P(X) \): prior
- \( P(D \mid X) \): likelihood of the data \( D \) given \( X \)
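Bayes' theorem is easy to verify numerically. The following is a hypothetical two-hypothesis example (the priors and likelihoods are invented for illustration): the evidence comes from the law of total probability, and dividing by it normalizes the posterior.

```python
import numpy as np

# Hypothetical example: two disjoint hypotheses Y_1, Y_2.
prior = np.array([0.5, 0.5])        # P(Y_i)
likelihood = np.array([0.8, 0.2])   # P(X | Y_i) for an observed event X

# Evidence via the law of total probability: P(X) = sum_j P(X|Y_j) P(Y_j).
evidence = np.sum(likelihood * prior)       # 0.5

# Bayes' theorem: posterior P(Y_i | X) = P(X|Y_i) P(Y_i) / P(X).
posterior = likelihood * prior / evidence   # [0.8, 0.2]

print(evidence)    # 0.5
print(posterior)   # [0.8 0.2], sums to 1
```

With equal priors, the posterior is simply the normalized likelihood, which is why it lands back on \([0.8, 0.2]\) here.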
