The Spectral Mixture (SM) Kernel for Gaussian Processes

Author: Christoph Würsch, ICE, Eastern Switzerland University of Applied Sciences, OST

The Gaussian Process as a Distribution over Functions

In the realm of machine learning, most models are parametric; they are defined by a set of parameters \boldsymbol{\theta} (e.g., the weights of a neural network), and our goal is to find the optimal \boldsymbol{\theta} based on the data. Gaussian Processes (GPs) [1] offer a fundamentally different, non-parametric perspective. Instead of learning parameters for a specific function, a Gaussian Process places a probability distribution directly over the space of all possible functions. This "distribution over functions" provides a powerful and principled framework for Uncertainty Quantification (UQ). A GP does not just yield a single point prediction y^*; it provides an entire Gaussian distribution \mathcal{N}(\mu^*, \sigma^{2*}) for any new input \mathbf{x}^*, where \mu^* is the most likely prediction and \sigma^{2*} explicitly quantifies the model's uncertainty about that prediction. Formally, a Gaussian Process is defined as follows:

Definition A Gaussian Process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A GP is fully specified by a mean function m(\mathbf{x}) and a covariance function k(\mathbf{x}, \mathbf{x}'):

f(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\right)

The mean function m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})] represents the expected value of the function at input \mathbf{x}. For simplicity, this is often assumed to be zero, m(\mathbf{x}) = 0, after centering the data. The true "heart" of the GP is the covariance function, or kernel, k(\mathbf{x}, \mathbf{x}') = \text{Cov}(f(\mathbf{x}), f(\mathbf{x}')). The kernel's role is to encode our prior assumptions about the function's properties, such as its smoothness, stationarity, or periodicity. It defines the "similarity" between the function's output at two different points, \mathbf{x} and \mathbf{x}'. If two points are "similar" according to the kernel, their function values are expected to be highly correlated.

The Kernel as a High-Dimensional Covariance Matrix

The definition of a GP elegantly connects the infinite-dimensional concept of a "function" to the finite-dimensional mathematics we use for computation. Given a finite dataset of N input points, \mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}, the joint distribution of the corresponding function values \mathbf{f} = [f(\mathbf{x}_1), \dots, f(\mathbf{x}_N)]^T is, by definition, a high-dimensional multivariate Gaussian. The covariance matrix \mathbf{K} of this multivariate Gaussian is constructed by evaluating the kernel function at every pair of points in our dataset:

\mathbf{K} = \begin{pmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & k(\mathbf{x}_1, \mathbf{x}_2) & \dots & k(\mathbf{x}_1, \mathbf{x}_N) \\ k(\mathbf{x}_2, \mathbf{x}_1) & k(\mathbf{x}_2, \mathbf{x}_2) & \dots & k(\mathbf{x}_2, \mathbf{x}_N) \\ \vdots & \vdots & \ddots & \vdots \\ k(\mathbf{x}_N, \mathbf{x}_1) & k(\mathbf{x}_N, \mathbf{x}_2) & \dots & k(\mathbf{x}_N, \mathbf{x}_N) \end{pmatrix}

Thus, our prior belief about the function f at the points \mathbf{X} is expressed as:

p(\mathbf{f} | \mathbf{X}) = \mathcal{N}(\mathbf{f} | \mathbf{0}, \mathbf{K})

The choice of kernel k entirely determines the N \times N covariance matrix \mathbf{K}, which in turn defines the properties of all functions sampled from this prior.
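
To make this concrete, here is a minimal NumPy sketch (the helper name rbf_kernel and all settings are illustrative, not from the original text) that builds \mathbf{K} with an RBF kernel and draws a few sample functions from the prior \mathcal{N}(\mathbf{0}, \mathbf{K}):

import numpy as np

def rbf_kernel(X1, X2, sigma_f=1.0, lengthscale=1.0):
    # Pairwise squared Euclidean distances between all input points
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-0.5 * sq_dists / lengthscale**2)

X = np.linspace(0, 10, 100)[:, None]              # N = 100 input points
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))      # N x N covariance matrix (plus jitter)
f_samples = np.random.multivariate_normal(np.zeros(len(X)), K, size=5)   # 5 prior draws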

Figure: Gaussian Process regression showing aleatoric uncertainty where there is data and epistemic uncertainty in data gaps.

The RBF Kernel: The "Workhorse" of GPs

While countless kernels exist, the most common and arguably most important is the Radial Basis Function (RBF) kernel, also known as the Squared Exponential or Gaussian kernel. Its enduring popularity stems from its simplicity and its strong, yet often reasonable, prior assumption: smoothness. The RBF kernel is defined as:

k_{\text{RBF}}(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\left( -\frac{||\mathbf{x}_i - \mathbf{x}_j||^2}{2l^2} \right)

This kernel is defined by two hyperparameters:

  • Signal Variance (\sigma_f^2): This is an amplitude parameter. It controls the total variance of the function, scaling the covariance values up or down. It represents the maximum expected variation from the mean.
  • Lengthscale (l): This is the most critical parameter. It defines the "characteristic distance" over which the function's values are correlated.

The RBF kernel is stationary (it depends only on the distance ||\mathbf{x}_i - \mathbf{x}_j||, not on the absolute locations) and infinitely smooth. A small lengthscale l means the correlation drops off quickly, allowing the function to vary rapidly. A large l means that even distant points are correlated, resulting in a very smooth, slowly varying function. When we use an RBF kernel, we are placing a prior on functions that we believe to be smooth. The core idea of GP regression, which we will explore in the following sections, is to take this high-dimensional Gaussian prior \mathcal{N}(\mathbf{0}, \mathbf{K}), built from our RBF kernel, and condition it on our finite set of noisy observations \mathcal{D} = \{(\mathbf{X}, \mathbf{y})\}. This conditioning step updates our prior belief, yielding a posterior Gaussian Process that provides both the mean prediction and, crucially for UQ, the predictive uncertainty at any unobserved location.
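
The conditioning step itself only involves standard Gaussian identities. As a sketch (reusing the hypothetical rbf_kernel helper above and an assumed observation noise variance noise_var):

def gp_posterior(X_train, y_train, X_test, kernel, noise_var=0.1):
    # Gram matrices between training and test inputs
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = kernel(X_train, X_test)
    K_ss = kernel(X_test, X_test)
    # Standard Gaussian conditioning: posterior mean and covariance
    mu = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mu, cov          # the diagonal of cov gives the predictive variance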

The Limitations of Standard Kernels

In the application of Gaussian Processes (GPs), we have seen that the choice of kernel k(\mathbf{x}, \mathbf{x}') is paramount. It encodes all our prior assumptions about the function f(\mathbf{x}) we are modeling. A popular and widely used kernel is the Radial Basis Function (RBF) kernel, also known as the Squared Exponential:

k_{\text{RBF}}(\tau) = \sigma_f^2 \exp\left( -\frac{\tau^2}{2l^2} \right)

where \tau = ||\mathbf{x} - \mathbf{x}'|| is the distance between inputs, \sigma_f^2 is the signal variance, and l is the characteristic lengthscale. The RBF kernel is "universal" in the sense that it can approximate any continuous function given enough data. However, its implicit assumptions are very strong. The RBF kernel is infinitely smooth and stationary. Its primary assumption is that "nearness" in the input space implies "nearness" in the output space. This makes it an excellent choice for interpolation and smoothing.

Its weakness, however, is revealed in extrapolation and pattern discovery. Consider the famous Mauna Loa CO_2 dataset, which exhibits a clear long-term upward trend combined with a strong annual periodicity.

  • An RBF kernel can model the long-term trend, but it will fail to capture the periodicity. When asked to extrapolate, its prediction will revert to the mean.
  • A Periodic kernel can capture the seasonality, but it must be multiplied by an RBF kernel to allow for local variations, and crucially, the user must specify the period (e.g., 1 year) in advance (a sketch of this manual composition follows below).

This motivates the need for a kernel that does not just apply a predefined structure (like "smooth" or "periodic with period P") but can discover complex, quasi-periodic, and multi-scale patterns directly from the data.
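
For contrast with what follows, such a hand-built quasi-periodic kernel might look like this in GPyTorch (a minimal sketch; the period of 1.0 is an illustrative choice that the user must supply):

import gpytorch

# Quasi-periodic kernel built by hand: the user fixes the period a priori.
periodic = gpytorch.kernels.PeriodicKernel()
periodic.period_length = 1.0                      # e.g., 1 year, chosen in advance
local = gpytorch.kernels.RBFKernel()              # lets the seasonal shape drift slowly
quasi_periodic = gpytorch.kernels.ScaleKernel(periodic * local)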

The Spectral Mixture (SM) Kernel [2]

The Spectral Perspective: Bochner’s Theorem

The key to developing a more expressive kernel lies in the frequency domain. A fundamental result in mathematics, Bochner’s Theorem, provides the bridge.

Theorem (Bochner, 1959) A complex-valued function k(\tau) on \mathbb{R}^D is the covariance function of a stationary Gaussian Process if and only if it is the Fourier transform of a non-negative, finite measure S(\boldsymbol{\omega}).

For a stationary kernel k(\boldsymbol{\tau}) where \boldsymbol{\tau} = \mathbf{x} - \mathbf{x}', this relationship is:

k(\boldsymbol{\tau}) = \int_{\mathbb{R}^D} S(\boldsymbol{\omega}) e^{i 2\pi \boldsymbol{\omega}^T \boldsymbol{\tau}} d\boldsymbol{\omega} \quad \text{and} \quad S(\boldsymbol{\omega}) = \int_{\mathbb{R}^D} k(\boldsymbol{\tau}) e^{-i 2\pi \boldsymbol{\omega}^T \boldsymbol{\tau}} d\boldsymbol{\tau}

S(\boldsymbol{\omega}) is known as the spectral density (or power spectrum) of the kernel. It describes the "power" or "strength" of the function at different frequencies \boldsymbol{\omega}. Let's re-examine our standard kernels through this lens:

  • RBF Kernel: The spectral density of k_{\text{RBF}}(\tau) is also a Gaussian, centered at zero frequency: S_{\text{RBF}}(\omega) \propto \exp\left( -\frac{l^2 \omega^2}{2} \right). This mathematically confirms our intuition: the RBF kernel is a strong low-pass filter. It assumes all the function's power is concentrated at low frequencies (i.e., the function is smooth).
  • Periodic Kernel: The spectral density of a purely periodic kernel is a series of delta functions \delta(\omega) at the fundamental frequency and its harmonics. This is an extremely rigid structure.

The problem is clear: standard kernels have fixed, simple spectral densities. If the true function’s spectrum is complex (e.g., a mix of several periodicities), these kernels will fail to capture it.
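
This can be verified numerically: evaluating an RBF kernel on a grid of lags and taking its discrete Fourier transform produces a single Gaussian bump at zero frequency. A rough NumPy sketch (assuming an evenly spaced lag grid and an illustrative lengthscale):

import numpy as np

tau = np.linspace(-50, 50, 2001)                  # grid of lags
l = 2.0
k_rbf = np.exp(-tau**2 / (2 * l**2))              # RBF kernel values k(tau)

# Discrete approximation of the spectral density S(omega)
S = np.abs(np.fft.fftshift(np.fft.fft(k_rbf)))
freqs = np.fft.fftshift(np.fft.fftfreq(len(tau), d=tau[1] - tau[0]))
# S peaks at freqs == 0 and decays like a Gaussian: a low-pass filter.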

The Spectral Mixture (SM) Kernel

The groundbreaking idea of the Spectral Mixture (SM) kernel, introduced by Wilson and Adams (2013) [2], is this: Instead of assuming a fixed spectral density, let’s learn it from the data.

How can we model an arbitrary, non-negative spectral density S(\omega)? We can use a Gaussian Mixture Model (GMM), which is a universal approximator for densities. The SM kernel proposes to model the spectral density S(\omega) as a sum of Q Gaussian components. To ensure the resulting kernel k(\tau) is real-valued, the spectrum must be symmetric, S(\omega) = S(-\omega). We thus define the spectral density as a mixture of Q pairs of symmetric Gaussians:

S_{\text{SM}}(\omega) = \sum_{q=1}^Q \frac{w_q}{2} \left[ \mathcal{N}(\omega | \mu_q, v_q) + \mathcal{N}(\omega | -\mu_q, v_q) \right]

where each component q has a weight w_q, a mean frequency \mu_q, and a spectral variance v_q. Now, we apply Bochner's theorem and compute the inverse Fourier transform of S_{\text{SM}}(\omega) to find the kernel k_{\text{SM}}(\tau). We use the standard Fourier transform pair:

\mathcal{F}^{-1}\left[ \mathcal{N}(\omega | \mu, v) \right](\tau) \propto \exp(-2\pi^2 v \tau^2) \cdot \exp(i 2\pi \mu \tau)

By applying this to our symmetric GMM and using Euler's formula (\cos(x) = \frac{e^{ix} + e^{-ix}}{2}), the sum of the q-th pair of complex exponentials simplifies beautifully into a cosine function. This yields the Spectral Mixture (SM) Kernel for 1D inputs (\tau = x - x'):

Spectral Mixture (SM) Kernel (1D)

k_{\text{SM}}(\tau) = \sum_{q=1}^Q w_q \cdot \underbrace{\exp(-2\pi^2 v_q \tau^2)}_{\text{RBF-like Component}} \cdot \underbrace{\cos(2\pi \mu_q \tau)}_{\text{Periodic Component}}
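
To see where the cosine comes from, write out the q-th symmetric pair explicitly; the imaginary parts cancel and the two complex exponentials merge:

\frac{w_q}{2}\, e^{-2\pi^2 v_q \tau^2} \left( e^{i 2\pi \mu_q \tau} + e^{-i 2\pi \mu_q \tau} \right) = w_q\, e^{-2\pi^2 v_q \tau^2} \cos(2\pi \mu_q \tau)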

This kernel is a linear combination of Q components, where each component is a product of an RBF-like kernel and a periodic cosine kernel. The power of the SM kernel comes from the interpretability of its hyperparameters. For each of the Q components, we learn:

  • Weight (w_q): The component's overall contribution (amplitude) to the final covariance.
  • Mean Frequency (\mu_q): The center of the component in the frequency domain. This directly controls the periodicity: the period is P_q = 1 / \mu_q.
  • Spectral Variance (v_q): The variance of the component in the frequency domain. This controls the lengthscale l_q in the time domain (l_q \propto 1/\sqrt{v_q}).

This structure allows the kernel to model diverse patterns:

  • A smooth trend: A component with \mu_q \approx 0 (zero frequency) and small v_q (long lengthscale) behaves like a standard RBF kernel and captures slowly varying structure.
  • A strong, stable periodicity: A component with \mu_q > 0 and very small v_q (long lengthscale) behaves like \cos(2\pi \mu_q \tau).
  • A quasi-periodic function: A component with \mu_q > 0 and moderate v_q > 0 models a wave that slowly "decays" or changes shape over time.
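
Written out directly, the 1D SM kernel takes only a few lines of code. A minimal NumPy sketch, in which the arrays weights, means, and variances stand in for w_q, \mu_q, and v_q:

import numpy as np

def sm_kernel_1d(x1, x2, weights, means, variances):
    # x1, x2: 1D input arrays; weights, means, variances: length-Q arrays
    tau = x1[:, None] - x2[None, :]                    # pairwise lags tau = x - x'
    K = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, variances):
        K += w * np.exp(-2 * np.pi**2 * v * tau**2) * np.cos(2 * np.pi * mu * tau)
    return K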

The kernel can be extended to D-dimensional inputs \mathbf{x} \in \mathbb{R}^D. We define \boldsymbol{\tau} = \mathbf{x} - \mathbf{x}'. Each component q now requires a weight w_q, a mean frequency vector \boldsymbol{\mu}_q \in \mathbb{R}^D, and a spectral variance vector \mathbf{v}_q \in \mathbb{R}^D (assuming a diagonal covariance \mathbf{V}_q = \text{diag}(\mathbf{v}_q)). The multivariate kernel is given by:

Spectral Mixture (SM) Kernel (D-Dimensions)

k_{\text{SM}}(\boldsymbol{\tau}) = \sum_{q=1}^Q w_q \exp(-2\pi^2 \boldsymbol{\tau}^T \mathbf{V}_q \boldsymbol{\tau}) \cdot \cos(2\pi \boldsymbol{\mu}_q^T \boldsymbol{\tau})

This is a sum of Q components, where each component k_q is a product over the D dimensions:

k_q(\boldsymbol{\tau}) = w_q \prod_{d=1}^D \exp(-2\pi^2 v_{qd} \tau_d^2) \cdot \cos(2\pi \mu_{qd} \tau_d)

This structure allows the model to learn different frequencies (e.g., "daily" in dimension 1, "weekly" in dimension 2) and different lengthscales for each component in each input dimension.
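
The product over dimensions translates directly into code. A sketch of the D-dimensional version of the earlier helper, where means and variances are now Q x D arrays:

def sm_kernel_nd(X1, X2, weights, means, variances):
    # X1: (N, D), X2: (M, D); weights: (Q,); means, variances: (Q, D)
    tau = X1[:, None, :] - X2[None, :, :]              # (N, M, D) array of lags
    K = np.zeros(tau.shape[:2])
    for w, mu, v in zip(weights, means, variances):
        comp = np.exp(-2 * np.pi**2 * v * tau**2) * np.cos(2 * np.pi * mu * tau)
        K += w * comp.prod(axis=-1)                    # product over the D dimensions
    return K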

Pattern Discovery and Extrapolation

The most significant feature of the SM kernel is that the parameters \{w_q, \boldsymbol{\mu}_q, \mathbf{v}_q\}_{q=1}^Q are all hyperparameters of the kernel. Just as we optimize the lengthscale l of an RBF kernel by maximizing the marginal log-likelihood (ML-II), we optimize all Q \times (2D + 1) parameters of the SM kernel. This optimization process is the pattern discovery: the GP automatically fits the GMM to the data's "hidden" spectral density.

Example: The CO_2 Dataset

If we apply an SM kernel (e.g., with Q = 2) to the Mauna Loa CO_2 data, the optimization will likely find:

  • Component 1 (Trend): w_1 large, \mu_1 \approx 0, v_1 small (long lengthscale). This component captures the long-term, non-periodic, smooth upward trend.
  • Component 2 (Seasonality): w_2 moderate, \mu_2 \approx 1/\text{year}, v_2 very small (very long lengthscale). This component explicitly discovers and isolates the 1-year annual cycle.

Because the model has explicitly learned the periodic component \cos(2\pi \mu_2 \tau), its predictions will extrapolate this pattern into the future, rather than reverting to the mean. This provides state-of-the-art extrapolation for any time-series data that exhibits quasi-periodic behavior (e.g., climate, finance, robotics).

Figure: Spectral Mixture kernel Gaussian Process regression on the Mauna Loa CO_2 data.

Implementation and Practical Issues

The SM kernel is available in modern probabilistic programming libraries such as GPyTorch.

import torch
import gpytorch


class SpectralMixtureGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, num_mixtures):
        super(SpectralMixtureGP, self).__init__(train_x, train_y, likelihood)

        self.mean_module = gpytorch.means.ConstantMean()

        # Define the Spectral Mixture Kernel
        # num_mixtures is Q
        # ard_num_dims is D (here, 1D input)
        self.covar_module = gpytorch.kernels.SpectralMixtureKernel(
            num_mixtures=num_mixtures,
            ard_num_dims=1
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


# Example usage
# Assume train_x, train_y are 1D tensors
likelihood = gpytorch.likelihoods.GaussianLikelihood()
Q = 4  # We choose to search for 4 components
model = SpectralMixtureGP(train_x, train_y, likelihood, num_mixtures=Q)

# !! CRITICAL STEP: Initialization !!
# The SM loss landscape is highly multi-modal.
# Random initialization will fail. We must initialize
# from the data's frequency spectrum (periodogram).
model.covar_module.initialize_from_data(train_x, train_y)

# Now, we train the model as usual (see the training-loop sketch below)
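
For completeness, a standard GPyTorch training loop for this model could look like the following sketch (the learning rate and iteration count are illustrative choices):

# Standard ML-II training: maximize the marginal log-likelihood
model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(200):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)      # negative marginal log-likelihood
    loss.backward()
    optimizer.step()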

A Critical Caveat: Initialization

The SM kernel's greatest strength (its flexibility) is also its greatest weakness. With Q \times (2D + 1) hyperparameters, the optimization landscape for the marginal log-likelihood is highly multi-modal. A naive or random initialization of the parameters \{w_q, \boldsymbol{\mu}_q, \mathbf{v}_q\} will almost certainly get stuck in a poor local optimum. This kernel is not "plug-and-play" like an RBF kernel. To use it successfully, one must initialize the parameters intelligently. Libraries like GPyTorch and GPy provide helper functions (e.g., initialize_from_data) that perform the following steps:

  1. Compute the empirical spectral density of the data using a periodogram (e.g., via a Fast Fourier Transform, FFT).
  2. Fit a GMM to this empirical spectrum.
  3. Use the parameters of the fitted GMM (the weights, means, and variances) as the starting values for the kernel’s hyperparameters.

This provides the optimizer with a "good guess" that is already close to a reasonable solution, allowing it to fine-tune the parameters to the true maximum of the marginal log-likelihood.
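
A simplified, hypothetical version of this procedure (assuming evenly spaced 1D data, and using scipy and scikit-learn rather than the library's internal helper) might look like this:

import numpy as np
from scipy.signal import periodogram
from sklearn.mixture import GaussianMixture

def init_sm_hyperparameters(x, y, Q):
    # 1. Empirical spectral density via the periodogram (assumes uniform spacing)
    fs = 1.0 / (x[1] - x[0])
    freqs, power = periodogram(y, fs=fs)
    # 2. Fit a GMM to frequencies sampled in proportion to their spectral power
    probs = power / power.sum()
    samples = np.random.choice(freqs, size=5000, p=probs).reshape(-1, 1)
    gmm = GaussianMixture(n_components=Q).fit(samples)
    # 3. Use the GMM parameters as starting values for the kernel
    weights = gmm.weights_ * y.var()              # w_q, scaled by the signal variance
    means = gmm.means_.ravel()                    # mu_q
    variances = gmm.covariances_.ravel()          # v_q
    return weights, means, variances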

Takeaways

The Spectral Mixture kernel is a powerful, expressive tool for Gaussian Process modeling, effectively moving the problem from kernel selection to kernel learning.

Advantages:
  • Automated Pattern Discovery: Learns complex structures (trends, periodicities) automatically.
  • Excellent Extrapolation: Can project quasi-periodic patterns, unlike RBF.
  • Universal Approximator: Can approximate any stationary kernel given enough components Q.
  • Interpretable: Learned parameters \boldsymbol{\mu}_q, \mathbf{v}_q reveal the data's spectral properties.

Disadvantages:
  • Difficult Optimization: Highly multi-modal loss landscape.
  • Initialization is Critical: Requires smart initialization from a periodogram; random initialization will fail.
  • Many Hyperparameters: Q(2D+1) parameters can lead to overfitting if Q is too high or data is scarce.
  • Computationally Slower: More complex kernel calculation than RBF.

When faced with data that exhibits complex, unknown, or quasi-periodic patterns, the SM kernel is one of the most powerful tools in the modern GP toolbox, provided it is used with a proper initialization strategy.

References

[1] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006 (available online).

[2] A. G. Wilson and R. P. Adams. "Gaussian Process Kernels for Pattern Discovery and Extrapolation." Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.