In Bayesian learning, the posterior probability density of a model parameter is estimated from the likelihood function and the prior probability of the parameter. The posterior probability density estimate is refined as more evidence becomes available. However, any non-trivial Bayesian model requires the computation of an intractable integral to obtain the probability density function (PDF) of the evidence. Markov Chain Monte Carlo (MCMC) is a well-known algorithm that solves this problem by directly generating the samples of the posterior distribution without computing this intractable integral. We present a novel perspective of the MCMC algorithm which views the samples of a probability distribution as a dynamical system of Information Theoretic particles in an Information Theoretic field. As our algorithm probes this field with a test particle, it is subjected to Information Forces from other Information Theoretic particles in this field. We use Information Theoretic Learning (ITL) techniques based on Rényi’s α-Entropy function to derive an equation for the gradient of the Information Potential energy of the dynamical system of Information Theoretic particles. Using this equation, we compute the Hamiltonian of the dynamical system from the Information Potential energy and the kinetic energy. The Hamiltonian is used to generate the Markovian state trajectories of the system.
- Hamiltonian Monte Carlo (HMC)
- information theoretic learning
- Kernel density estimator (KDE)
- Markov chain Monte Carlo
- Parzen window
- Rényi’s entropy
- information potential
Bayesian learning involves estimating the PDF of a model parameter from the likelihood function and the prior probability of the parameter. Bayesian inference incorporates the concept of belief where the parameter estimate is refined as more data or evidence becomes available. The posterior PDF of the model parameter with the PDF of the evidence denoted as , is expressed by the following well-known Bayes’ equation:
is the integral of the probability of all possible values of weighted by the likelihood function:
This is an intractable integration for most non-trivial Bayesian inference problems and makes it impossible to compute the posterior probability. The MCMC algorithm described in  provides a solution to this problem by directly generating samples of the posterior PDF without computing this intractable integral. The shape of the posterior PDF and other statistics can be inferred from these samples.
The MCMC algorithm requires knowledge of a function that is proportional to the unknown posterior PDF. It uses this function to generate sample proposals of the unknown PDF. Usually, this function is the product of the likelihood function and the prior probability. In practical applications, one often encounters a system whose outputs are observable, but the process within the system that generated these outputs are unknown. We present a novel perspective on the MCMC method to solve these types of practical problems, where instead of generating the samples of the unknown PDF, it uses the samples of the unknown distribution to estimate the PDF. In this chapter we use the Hamiltonian MCMC (HMC) method described in [2, 3, 4] and ITL concepts to show how the samples of the unknown distribution can be viewed as Information Theoretic particles of a dynamical system. The sample space of the given probability distribution is explored by computing trajectories corresponding to the state transition of this dynamical system. The evolution or state transition of the dynamical system is governed by equations which use the total energy or the Hamiltonian of the system of Information Theoretic particles. Each such particle has an inherent Information Potential by virtue of its position with respect to the other particles of the system. The system of Information Theoretic particles creates an Information Field which enables each particle to exert an Information Force on the other particles. We use ITL techniques  based on Rényi’s α-Entropy function to derive an equation for the gradient of the Information Potential energy of this dynamical system. This equation is one of the main contributions of our work and it is used to compute the Hamiltonian of the system to explore the probability space of the Information Theoretic particles.
In this work, we implement an iterative PDF estimator of an unknown sample distribution, using the HMC method. At every iteration of the estimator, the HMC generates samples such that the mutual information between the generated samples and the given unknown distribution is large. To do this, it uses the Information Potential, the Information Force and the kinetic energy of an Information Theoretic “probe” particle. To compute the Information Potential and the Information Force, the algorithm uses a non-parametric Kernel Density Estimator (KDE). The bandwidth of the KDE determines how close the generated samples are from the unknown sample distribution. At the end of each iteration, the Kullback–Leibler (K–L) divergence of the samples generated by the estimator from the given distribution is computed. The iteration continues until the K-L divergence falls below a specified threshold. We have derived an equation to adapt the kernel bandwidth for each iteration, based on the invariant point theorem. Before starting the next iteration, this equation is used to adapt the kernel bandwidth before generating the next set of samples.
An important application of our algorithm is in machine learning where sometimes the dataset is either too large to fit in the memory of a computer or too small to obtain an accurate inference model. The dataset can be resampled to the desired size using the PDF estimator and the HMC equations derived in this chapter.
The sections in this chapter are organized in the following manner: In Section 2 we review the MCMC algorithm. Section 3 provides an overview of the Hamiltonian MCMC algorithm. Rényi’s Entropy and the concept of Information Theoretic particles are introduced in Section 4. In Section 5 we show how the Hamiltonian MCMC algorithm can be used with Information Theoretic particles and derive a key equation for the system potential gradient. Section 6 describes a method to iteratively estimate the PDF of the target distribution using HMC. In this section we derive an equation to adapt the Information Potential energy estimator bandwidth for each iteration. The simulation results of the HMC algorithm on a system of Information Theoretic particles are listed in Section 7 and we summarize our conclusions in Section 8 of this chapter.
2. Review of the MCMC algorithm
The core principle underlying MCMC techniques is that an ergodic, reversible Markov chain reaches a stationary state. MCMC models the sampling from a distribution as an ergodic and reversible Markov process. When this process reaches a stationary state, the probability distribution of the states of the Markov chain becomes invariant and matches the given probability distribution. The sampling operation in the MCMC is a Markov process that satisfies the following detailed balance equation:
In the detailed balance equation, and are the stationary probability distribution of being in states and respectively and are a sequence of random variables at discrete time indices . The Monte Carlo part of the MCMC algorithm is used to generate random “proposal” samples from a known probability distribution . The proposal sample for the next time step of the MCMC algorithm is dependent on the current proposal sample and the transition probability for the new sample is enforced by an acceptance function. The proposal distribution is usually symmetric to ensure the reversibility of the Markov chain:
Symmetric distributions like the Gaussian distribution or the Uniform distribution centered around the current sample value can be used to generate the proposal sample. There are cases where asymmetric distributions are used but we will focus on symmetric distributions to illustrate our algorithm, without any loss of generality.
To lay the groundwork for the HMC, we review the simple Metropolis-Hastings (MH) MCMC  in this section. The simplest MH algorithm is the Random-Walk MH which uses a symmetrical proposal distribution. It comprises of the following 3 parts:
Generate a proposal sample for the posterior probability from a known symmetric distribution. The new proposal sample is based on the current proposal sample: . For example, if is a Gaussian distribution, it is centered at sample to generate sample
Calculate the acceptance probability by passing this sample through the posterior density function using:E5
Accept the candidate sample with probability or reject it with probability where is defined in (Eq. (8))
If the proposal density function is symmetric, we have:
The acceptance function is derived as follows:
It is evident from (Eq. (7)) that since the acceptance function is a ratio of the posterior probability, the intractable integral to compute the value of is completely bypassed. The acceptance probability of a sample proposal of the MH-MCMC is:
The transition probability of each state of the Markov chain is defined by the acceptance probability. In the stationary state, the product of the Markov chain state probability and the transition probability matrix remains stationary and matches the posterior PDF of the model parameter. The sample points generated by this MCMC in the stationary state of the Markov chain are therefore the sample points of the posterior PDF.
3. The Hamiltonian MCMC algorithm
Instead of the random-walk method of the Metropolis-Hastings algorithm, this MCMC technique uses Hamiltonian dynamics to sample from the posterior PDF. The random-walk method of the Metropolis-Hastings algorithm is inefficient and converges slowly to the target posterior distribution. Instead of randomly generating “proposal” samples from a known probability distribution, the Hamiltonian method uses the dynamics of a physical system to generate these samples. This enables the system to explore the target posterior probability space more efficiently, which in turn results in faster convergence compared to random-walk methods. Hamiltonian dynamics is a concept borrowed from statistical mechanics where the energy of a dynamic system changes from potential energy to kinetic energy and back. The Hamiltonian represents the total energy of the system, which for a closed system, is the sum of its potential and kinetic energy.
As described in [2, 3], Hamiltonian dynamics operates on an dimensional position vector and an dimensional momentum vector and the dynamic system is described by the Hamiltonian . The partial derivatives of the Hamiltonian define how the system evolves with time:
Given the state of the system at time , these equations can be used to determine the state of the system at time where . For the time evolution of the dynamical system, we use the following Hamiltonian:
In (10), is the potential energy and is the kinetic energy of the system. The position vector corresponds to the model parameter and the PDF of is the target posterior PDF that we want to estimate. The potential energy of the Hamiltonian system is expressed as the negative log of the probability of :
To relate the Hamiltonian to the target posterior probability, we use a basic concept from statistical mechanics known as the canonical ensemble. If there are several microstates of a physical system contained in the vector and there is an energy function defined for these microstates, then the canonical probability distribution of the microstates is expressed as:
where is the temperature of the system and the variable is a normalizing constant called the partition function. scales the canonical probability distribution such that it sums to one. For a system described by Hamiltonian dynamics, the energy function is:
In MCMC, the Hamiltonian is an energy function of the states of both the position and the momentum . Therefore, the canonical probability distribution of a Hamiltonian system can be expressed as:
This equation shows that and are independent and each have canonical distributions with energy functions and . The probability density of is the posterior probability density of the model parameter and is the product of the likelihood function of given the data and the prior probability of . An important point to note here is that the momentum variable has been introduced in the probability distribution in (Eq. (14)) so that we can use Hamiltonian dynamics. Since is independent of , we can choose any distribution for this variable. In our HMC algorithms use a zero-mean multivariate Gaussian distribution for the momentum vector . The temperature in this discussion on the HMC.
The kinetic energy of the dynamical system for a unit mass is expressed as:
Since the Hamiltonian equations for the time evolution of the system are differential equations, computer simulation of the HMC must discretize time. A popular scheme to implement this discretization is the “Leapfrog” algorithm . The HMC algorithm uses the leapfrog algorithm to update the momentum and the position while computing the trajectory towards the next sample proposal in the distribution. The Leapfrog integrator has 2 main advantages:
It is time reversible. A Leapfrog integration by steps in the forward direction and then in the backward direction results in the same starting position
It is symplectic in nature. In other words, it conserves the energy of dynamical systems
The steps of the Hamiltonian MCMC algorithm are:
At every time step , determine a trajectory of the system potential and kinetic energy. To do that, generate a random value from a standard normal distribution for the momentum variable.
Execute the Leapfrog algorithm to update the position and momentum variables according to the differential equations in (Eq. (16)). This determines the trajectory of the system towards the next sample proposal
Compute the potential and kinetic energy of the system at the beginning of the trajectory and at the end of the proposed trajectory
Calculate the acceptance probability of the new trajectory using the following ratio of probabilities:E17
Generate a random number to accept or reject the proposal
4. Rényi’s entropy and Information Theoretic particles
The concept of Information Theoretic particles comes from Alfréd Rényi’s pioneering work on generalized measures of entropy and information . At the core of Rényi’s work is the concept of generalized mean or the Kolmogorov-Nagumo (K-N) mean [8, 9, 10]. For numbers , the K-N mean is expressed as:
where, is the K-N function. This function is continuous and strictly monotonic implying that it has an inverse. In the general theory of means, the quasi-linear mean of a random variable which takes the values with probabilities is defined as:
From the theorem on additivity of quasi-linear means , if is a K-N function and is a real constant, then:
if and only if is either linear or exponential.
4.1 Rényi’s entropy
Consider a random variable which takes the values with probabilities . The amount of information generated when takes the value is given by the Hartley  information measurement function :
The expected value of yields the expression for Shannon’s entropy :
For to satisfy the additivity property of independent events, it must satisfywhere is a constant. From (Eq. (20)), this implies that (linear) or (exponential). Setting reduces (Eq. (23)) to the linear mean and yields Shannon entropy equation. Substituting and the corresponding inverse function in (Eq. (23)) yields the expression for Rényi’s entropy:
Rényi’s entropy equation is therefore a general expression for entropy and comprises of a family of entropies for different values of the parameter . Shannon’s entropy is a special case of Rényi’s entropy in the limit as . The argument of the logarithm function in (Eq. (24)) is called the Information Potential. The -Information Potential is expressed as:
The Information Potential in (Eq. (25)) can be written as the expected value of the PDF of the sample distribution raised to :
For in (Eq. (24)), we get Rényi’s quadratic entropy, which has the useful property that it allows us to compute the entropy directly from the samples. The equations for Rényi’s Quadratic Entropy (QE) and Quadratic Information Potential (QIP) are obtained by substituting in (Eqs. (26) and (27)):
The QIP is therefore the expected value of the PDF of the given data samples.
4.2 Rényi’s quadratic information potential (QIP) estimator
From (Eq. (28)), it is evident that to compute the QIP we need to know the PDF of the given data samples. In practical applications an analytical expression of the PDF is rarely available. Therefore, the QIP computation involves a non-parametric estimator of the PDF directly from the samples . The Parzen-Rosenblatt window estimator [15, 16] is a non-parametric way to estimate the PDF of a random variable from its sample values. This estimator places a kernel function with its center at each of the samples. The resulting output values are averaged over all the samples to estimate the PDF. The laws governing the interaction of the Information Theoretic particles is defined by the shape of the kernel. We use a Gaussian kernel, since this kernel when placed over the samples, behaves like an Information Theoretic field whose strength decays with increasing distance between the samples. Just like a charge in space creates an electric field, the samples of a probability distribution behave like Information Particles with unit charge. Information particles exert Information Forces on other particles through this Information Theoretic field.
For scalar samples , the Parzen window PDF estimator with a Gaussian kernel is expressed as:
where is the following standard univariate Gaussian kernel:
is the kernel bandwidth of the estimator and it must be carefully chosen to obtain an accurate and unbiased estimate of the PDF. The Parzen window estimator of a multivariate PDF for vector samples of dimension is expressed as:
where is the following standard multivariate Gaussian kernel:
is the dimension of the input vector , is the covariance matrix and is the determinant of the covariance matrix. For the multivariate PDF case, the kernel bandwidth must be carefully chosen to obtain an accurate and unbiased estimate of the PDF.
Rényi’s quadratic entropy for a continuous random variable is expressed as:
The equation for the QE estimator shows that we can compute the QE estimate directly from the samples of a distribution without knowing its PDF, by applying the Parzen-Rosenblatt kernel on these samples. From (Eqs. (28) and (34)), the QIP estimator can be expressed as:
4.3 Information potential energy and the information force of information theoretic particles
The total QIP energy estimate of the system is given by (Eq. (36)). The QIP energy estimate of sample due to the Information Potential field of a single sample is:
The Quadratic Information Potential energy estimate of scalar sample in the Information Field created by all the samples , for is defined as the average of taken over all the samples :
If the samples are dimensional vectors, then the Quadratic Information Potential energy estimate of vector sample in the Information Potential Field created by all vector samples , for is defined as:
To obtain the Quadratic Information Force estimate on scalar sample due to the Information Potential field of sample , we take the derivative of the Quadratic Information Potential energy estimate:
The Quadratic Information Force on scalar sample in the Information Potential Field created by all the samples , for is defined as the average of taken over all the samples :
If the samples are dimensional vectors, then the Quadratic Information Force on vector sample in the Information Potential Field created by all samples , for is defined as:
5. Hamiltonian MCMC with information theoretic particles
The expression for the potential energy in the Hamiltonian function (Eq. (11)) is similar to the expression for Rényi’s quadratic entropy (Eq. (28)). This is consistent with the principles of statistical mechanics where the entropy is related to the dissipation of the potential energy of the system. Based on this intuition from statistical mechanics, we replace the PDF of the position vector in (Eq. (11)) with the QIP energy estimator as follows:
The change in momentum of the Information Theoretic particle in the dynamical system is equal to the negative potential energy gradient defined in (Eq. (16)). This can be expressed in terms of the QIP energy estimator as:
From the above expression, we derive the expression for the Hamiltonian system’s negative potential gradient in terms of the Information Potential and the Information Force as follows:
This result shows that the gradient of the potential energy of the Hamiltonian system of Information Particles is just the Information Force estimate normalized by the Information Potential energy estimate. This also shows that the Information Force vector influences the trajectory of sample proposals in the HMC algorithm. This equation is one of the important contributions of our work. Our simulation of the HMC of a dynamical system of Information Theoretic particles uses this potential energy gradient equation to evolve the system over time.
5.1 Quality of the information potential energy estimator
As described in , the Information Potential energy estimator is a kernel estimator of the 2-norm of the underlying PDF of the Information Particles. Just like a PDF estimator, we can define metrics to describe the quality of the Information Potential energy estimator. The Mean Integrated Square Error (MISE) is an important metric used to assess the quality of an estimator. This is expressed as:
The bias and variance of the Information Potential estimator can be derived as follows:
Since the Gaussian kernel is symmetric under the expectation operation:
In the above equation is the dummy variable of integration. Let This implies that . Substituting this in (Eq. (51)), we get:
When is small, we can write the Taylor series expansion of as:
Substituting this in (Eq. (52)), we get:
This result implies that as the kernel bandwidth the bias of the Information Potential energy estimator for sample reduces at the rate of . From the above equation it is also evident that the main reason for the bias is the second derivative of the true Information Potential energy (i.e., the rate of curvature of the true PDF of the samples). In other words, if the true PDF of the samples has a sharp spike, the bias of the Information Potential energy estimator will increase. The Information Potential energy estimator tends to smooth out sharp curvatures or spikes in the PDF which increases bias. The amount of smoothness is governed by the bandwidth parameter .
Let This implies that . Substituting this in (Eq. (55)), we get:
When is small, we can write the Taylor series expansion of as:
Substituting this in (Eq. (56)), we get:
This result shows that as the number of samples and kernel bandwidth , the variance of the Information Potential energy estimator for the sample reduces at the rate of . However, as , the variance of the estimator increases. The result also shows that the variance of the estimator is large where the value of the Information Potential energy (i.e., true probability of the sample) is also large. This happens when there are many Information Particles closer together.
5.2 The Kernel bandwidth parameter and the information potential energy estimator bias-variance trade-off
We have shown that the Gaussian kernel bandwidth directly influences the bias and variance of the Information Potential energy estimator. This in turn affects the sample distribution of the PDF estimate generated by the Hamiltonian MCMC. From (Eq. (54)) it is evident that the bias of the estimator reduces when we decrease the kernel bandwidth . However, (Eq. (58)) clearly shows that the decreasing increases the variance of the estimator. Therefore, we must choose an optimum bandwidth which minimizes both the systematic error (bias) and the random error (variance) of the Information Potential energy estimator. An iterative algorithm to converge to the optimum kernel bandwidth is described in the following section.
5.3 Computational complexity of the information potential energy estimator
From (Eqs. (38) and (41)) it may appear that the complexity of computing the Information Potential is . However, as described in , the Information Potential can be written as a symmetric positive Gramm Matrix which can be approximated using the incomplete Cholesky decomposition (ICD) as an matrix where . Using this technique, the time complexity for computing the Information Potential reduces to and the space complexity reduces to .
6. Maximum-likelihood iterative algorithm to adapt the kernel bandwidth of the information potential energy estimator
There are many iterative kernel bandwidth adaptation techniques available in the literature. We present a simple iterative technique to illustrate how MCMC with Hamiltonian of Information Theoretic Particles can be used to adjust the bandwidth parameter of the iterative PDF estimator. Here, we have chosen to minimize the Kullback–Leibler (K-L) divergence between the samples of the estimated PDF and the target sample distribution as the criteria for adapting the kernel bandwidth of the Information Potential energy and Information Force estimator. As described in , this is equivalent to maximizing the likelihood that the estimated PDF samples output by the MCMC, has the same distribution as the target samples.
The ML estimate of the optimum kernel bandwidth for vector Information Particle samples is the solution to the following log-likelihood maximization problem:
Using (Eq. (41)) in the summation of the above equation, we get:
To maximize the above equation, we take the derivative and equate it to 0. This gives us the following update equation for scalar Information Theoretic particles:
In the above equation, is the kernel bandwidth at iteration . It is updated with the result of the right-hand side of the equation obtained at time . This kernel bandwidth update equation (Eq. (61)) is in the form of a fixed-point (or invariant point) equation. This equation is like the equation in  except for the factor of . For vector Information Theoretic particles, the kernel bandwidth update equation is:
In this equation, is the kernel bandwidth matrix and can have unequal elements along its diagonal or non-zero off-diagonal elements. If the kernel bandwidth matrix is constrained to an identity matrix multiplied by a scaling factor, the kernel bandwidth matrix update equation can be expressed as:
From the fixed- or invariant-point theorem, the range over which the fixed-point bandwidth update equations will converge to a unique solution is:
In the above equation, are information particles from the target sample distribution and is the column vector of all the target information particles. From the fixed-point theorem, this fixed-point equation will converge to a unique solution if .
7. Simulation results
The potential energy surface, which is the plot of (Eq. (41)), of a Hamiltonian system of Information Theoretic particles for a bivariate Gaussian distribution is shown in Figure 1. From this figure it is evident that the potential energy surface of the Hamiltonian system has larger values when the Information Theoretic particles are sparse and is lowest at the bottom of the bowl-shaped surface where the particles have the highest density.
The momentum variable of the HMC algorithm occasionally moves the “probe” particle to a higher energy level but the Hamiltonian system has the tendency to fall back to its lowest energy level along the bowl-shaped surface. As a result, the HMC tends to sample the given target distribution more often where the density of the Information Theoretic particles is the largest.
Figure 2 shows the potential energy gradient of the same bivariate Gaussian distribution. This is the plot of (Eq. (47)) for this distribution. Each surface in this figure is one component of the potential energy gradient. Each surface tilts towards the corresponding mean value of the bivariate Gaussian distribution. The figure shows that the potential energy gradient of the Hamiltonian system is lowest near the mean of the distribution and is highest further away from the mean. The time evolution trajectory of the Hamiltonian system lies on this surface.
The iterative PDF estimate of a bivariate Gaussian distribution with using MCMC with 3 different kernel bandwidths is shown in Figure 3.
From Figure 3, it is evident that the MCMC algorithm based on the Hamiltonian of Information Theoretic particles accurately estimates the PDF of the target distribution. The sample points generated by the HMC algorithm covers most of the target samples in this figure. This figure shows that our intuition of comparing the Entropy to the system’s potential and also using the Information Potential in the derivation of the potential gradient (Eq. (47)) of the Hamiltonian system of Information Theoretic particles, was correct.
Our HMC algorithm using Information Theoretic particles also works well for Gaussian mixture distributions. Figure 4 shows that our MCMC algorithm using the Hamiltonian of Information Theoretic particles can be used to iteratively estimate the PDF of different multivariate distributions.
Figure 5 shows that the contour plot of the estimated PDF matches closely to the target PDF. The corresponding samples generated by the HMC algorithm traverses the two clusters of the bivariate Gaussian mixture distribution and covers most of the samples of the target distribution.
We have proposed a novel perspective on the MCMC method where we used it to iteratively estimate the PDF of a given target sample distribution. We have shown that the samples of a probability distribution can be viewed as Information Particles in an Information Field. These particles have Information Potential energy and are subject to Information Forces by virtue of their position in the field. The concept of Information Potential energy fits perfectly within the framework of the Hamiltonian of a dynamical system. We have derived an important result that the gradient of the potential energy of the Hamiltonian system of Information Particles is just the Information Force estimate normalized by the Information Potential energy estimate.
Our simulation results show that our intuition of comparing Rényi’s Quadratic Entropy equation with the Hamiltonian potential energy equation to derive the equation for the potential gradient of a dynamical system of Information Theoretic particles was correct. Using this equation, we were able to accurately estimate univariate and multivariate PDFs. Based on the fixed- or invariant-point theorem, we also derived an equation to iteratively update the bandwidth parameter of the Information Potential and Information Force estimators.
In machine learning applications the dataset is sometimes resampled to the appropriate size before starting the learning operation. Our algorithm can be used to view the data samples as Information Theoretic particles and resample it using the HMC described in this chapter.