Open access peer-reviewed chapter

Markov Chain Monte Carlo in a Dynamical System of Information Theoretic Particles

Written By

Tokunbo Ogunfunmi and Manas Deb

Submitted: 09 September 2021 Reviewed: 13 September 2021 Published: 04 November 2021

DOI: 10.5772/intechopen.100428

From the Edited Volume

The Monte Carlo Methods - Recent Advances, New Perspectives and Applications

Edited by Abdo Abou Jaoudé

Chapter metrics overview

244 Chapter Downloads

View Full Metrics

Abstract

In Bayesian learning, the posterior probability density of a model parameter is estimated from the likelihood function and the prior probability of the parameter. The posterior probability density estimate is refined as more evidence becomes available. However, any non-trivial Bayesian model requires the computation of an intractable integral to obtain the probability density function (PDF) of the evidence. Markov Chain Monte Carlo (MCMC) is a well-known algorithm that solves this problem by directly generating the samples of the posterior distribution without computing this intractable integral. We present a novel perspective of the MCMC algorithm which views the samples of a probability distribution as a dynamical system of Information Theoretic particles in an Information Theoretic field. As our algorithm probes this field with a test particle, it is subjected to Information Forces from other Information Theoretic particles in this field. We use Information Theoretic Learning (ITL) techniques based on Rényi’s α-Entropy function to derive an equation for the gradient of the Information Potential energy of the dynamical system of Information Theoretic particles. Using this equation, we compute the Hamiltonian of the dynamical system from the Information Potential energy and the kinetic energy. The Hamiltonian is used to generate the Markovian state trajectories of the system.

Keywords

  • Hamiltonian Monte Carlo (HMC)
  • information theoretic learning
  • Kernel density estimator (KDE)
  • Markov chain Monte Carlo
  • Parzen window
  • Rényi’s entropy
  • information potential

1. Introduction

Bayesian learning involves estimating the PDF of a model parameter from the likelihood function and the prior probability of the parameter. Bayesian inference incorporates the concept of belief where the parameter estimate is refined as more data or evidence becomes available. The posterior PDF of the model parameter θ with the PDF of the evidence X denoted as PX, is expressed by the following well-known Bayes’ equation:

PθX=PXθPθPXE1

PX is the integral of the probability of all possible values of θ weighted by the likelihood function:

PX=θPXθPθE2

This is an intractable integration for most non-trivial Bayesian inference problems and makes it impossible to compute the posterior probability. The MCMC algorithm described in [1] provides a solution to this problem by directly generating samples of the posterior PDF without computing this intractable integral. The shape of the posterior PDF and other statistics can be inferred from these samples.

The MCMC algorithm requires knowledge of a function that is proportional to the unknown posterior PDF. It uses this function to generate sample proposals of the unknown PDF. Usually, this function is the product of the likelihood function and the prior probability. In practical applications, one often encounters a system whose outputs are observable, but the process within the system that generated these outputs are unknown. We present a novel perspective on the MCMC method to solve these types of practical problems, where instead of generating the samples of the unknown PDF, it uses the samples of the unknown distribution to estimate the PDF. In this chapter we use the Hamiltonian MCMC (HMC) method described in [2, 3, 4] and ITL concepts to show how the samples of the unknown distribution can be viewed as Information Theoretic particles of a dynamical system. The sample space of the given probability distribution is explored by computing trajectories corresponding to the state transition of this dynamical system. The evolution or state transition of the dynamical system is governed by equations which use the total energy or the Hamiltonian of the system of Information Theoretic particles. Each such particle has an inherent Information Potential by virtue of its position with respect to the other particles of the system. The system of Information Theoretic particles creates an Information Field which enables each particle to exert an Information Force on the other particles. We use ITL techniques [5] based on Rényi’s α-Entropy function to derive an equation for the gradient of the Information Potential energy of this dynamical system. This equation is one of the main contributions of our work and it is used to compute the Hamiltonian of the system to explore the probability space of the Information Theoretic particles.

In this work, we implement an iterative PDF estimator of an unknown sample distribution, using the HMC method. At every iteration of the estimator, the HMC generates samples such that the mutual information between the generated samples and the given unknown distribution is large. To do this, it uses the Information Potential, the Information Force and the kinetic energy of an Information Theoretic “probe” particle. To compute the Information Potential and the Information Force, the algorithm uses a non-parametric Kernel Density Estimator (KDE). The bandwidth of the KDE determines how close the generated samples are from the unknown sample distribution. At the end of each iteration, the Kullback–Leibler (K–L) divergence of the samples generated by the estimator from the given distribution is computed. The iteration continues until the K-L divergence falls below a specified threshold. We have derived an equation to adapt the kernel bandwidth for each iteration, based on the invariant point theorem. Before starting the next iteration, this equation is used to adapt the kernel bandwidth before generating the next set of samples.

An important application of our algorithm is in machine learning where sometimes the dataset is either too large to fit in the memory of a computer or too small to obtain an accurate inference model. The dataset can be resampled to the desired size using the PDF estimator and the HMC equations derived in this chapter.

The sections in this chapter are organized in the following manner: In Section 2 we review the MCMC algorithm. Section 3 provides an overview of the Hamiltonian MCMC algorithm. Rényi’s Entropy and the concept of Information Theoretic particles are introduced in Section 4. In Section 5 we show how the Hamiltonian MCMC algorithm can be used with Information Theoretic particles and derive a key equation for the system potential gradient. Section 6 describes a method to iteratively estimate the PDF of the target distribution using HMC. In this section we derive an equation to adapt the Information Potential energy estimator bandwidth for each iteration. The simulation results of the HMC algorithm on a system of Information Theoretic particles are listed in Section 7 and we summarize our conclusions in Section 8 of this chapter.

Advertisement

2. Review of the MCMC algorithm

The core principle underlying MCMC techniques is that an ergodic, reversible Markov chain reaches a stationary state. MCMC models the sampling from a distribution as an ergodic and reversible Markov process. When this process reaches a stationary state, the probability distribution of the states of the Markov chain becomes invariant and matches the given probability distribution. The sampling operation in the MCMC is a Markov process that satisfies the following detailed balance equation:

πiPXt1=iXt=j=πjPXt1=jXt=ii,jE3

In the detailed balance equation, πi and πj are the stationary probability distribution of being in states i and j respectively and X0,X1,X2,Xt are a sequence of random variables at discrete time indices 0,1,2,t1,t,. The Monte Carlo part of the MCMC algorithm is used to generate random “proposal” samples from a known probability distribution QX. The proposal sample for the next time step of the MCMC algorithm is dependent on the current proposal sample and the transition probability for the new sample is enforced by an acceptance function. The proposal distribution is usually symmetric to ensure the reversibility of the Markov chain:

Qxtxt1=Qxt1xtE4

Symmetric distributions like the Gaussian distribution or the Uniform distribution centered around the current sample value can be used to generate the proposal sample. There are cases where asymmetric distributions are used but we will focus on symmetric distributions to illustrate our algorithm, without any loss of generality.

To lay the groundwork for the HMC, we review the simple Metropolis-Hastings (MH) MCMC [6] in this section. The simplest MH algorithm is the Random-Walk MH which uses a symmetrical proposal distribution. It comprises of the following 3 parts:

  1. Generate a proposal sample for the posterior probability from a known symmetric distribution. The new proposal sample is based on the current proposal sample: xproposalQxixi1. For example, if QX is a Gaussian distribution, it is centered at sample xi1 to generate sample xi

  2. Calculate the acceptance probability by passing this sample through the posterior density function using:

    PθX=1ZPXθPθwhereZ=θPZθPθE5

  3. Accept the candidate sample with probability α or reject it with probability 1α where α is defined in (Eq. (8))

If the proposal density function is symmetric, we have:

Qxi1xproposal=Qxproposalxi1E6

The acceptance function is derived as follows:

E7

It is evident from (Eq. (7)) that since the acceptance function is a ratio of the posterior probability, the intractable integral to compute the value of Z is completely bypassed. The acceptance probability of a sample proposal of the MH-MCMC is:

α=min1PX=xproposalθPX=xi1θE8

The transition probability of each state of the Markov chain is defined by the acceptance probability. In the stationary state, the product of the Markov chain state probability and the transition probability matrix remains stationary and matches the posterior PDF of the model parameter. The sample points xi generated by this MCMC in the stationary state of the Markov chain are therefore the sample points of the posterior PDF.

Advertisement

3. The Hamiltonian MCMC algorithm

Instead of the random-walk method of the Metropolis-Hastings algorithm, this MCMC technique uses Hamiltonian dynamics to sample from the posterior PDF. The random-walk method of the Metropolis-Hastings algorithm is inefficient and converges slowly to the target posterior distribution. Instead of randomly generating “proposal” samples from a known probability distribution, the Hamiltonian method uses the dynamics of a physical system to generate these samples. This enables the system to explore the target posterior probability space more efficiently, which in turn results in faster convergence compared to random-walk methods. Hamiltonian dynamics is a concept borrowed from statistical mechanics where the energy of a dynamic system changes from potential energy to kinetic energy and back. The Hamiltonian represents the total energy of the system, which for a closed system, is the sum of its potential and kinetic energy.

As described in [2, 3], Hamiltonian dynamics operates on an N dimensional position vector q and an N dimensional momentum vector p and the dynamic system is described by the Hamiltonian Hqp. The partial derivatives of the Hamiltonian define how the system evolves with time:

dqidt=Hpii=1,2,,Ndpidt=HqiE9

Given the state of the system at time t, these equations can be used to determine the state of the system at time t+T where T=1,2,3,. For the time evolution of the dynamical system, we use the following Hamiltonian:

Hqp=Uq+KpE10

In (10), Uq is the potential energy and Kp is the kinetic energy of the system. The position vector q corresponds to the model parameter and the PDF of q is the target posterior PDF that we want to estimate. The potential energy of the Hamiltonian system is expressed as the negative log of the probability of q:

Uq=logPqE11

To relate the Hamiltonian Hqp to the target posterior probability, we use a basic concept from statistical mechanics known as the canonical ensemble. If there are several microstates of a physical system contained in the vector θ and there is an energy function Eθ defined for these microstates, then the canonical probability distribution of the microstates is expressed as:

pθ=1ZeEθTE12

where T is the temperature of the system and the variable Z is a normalizing constant called the partition function. Z scales the canonical probability distribution such that it sums to one. For a system described by Hamiltonian dynamics, the energy function is:

Eθ=Hqp=Uq+KpE13

In MCMC, the Hamiltonian is an energy function of the states of both the position q and the momentum p. Therefore, the canonical probability distribution of a Hamiltonian system can be expressed as:

Pqp=1ZeHqpT=1ZeUq+KpT=1ZexpUqTexpKpTE14

This equation shows that q and p are independent and each have canonical distributions with energy functions Uq and Kp. The probability density of q is the posterior probability density of the model parameter θ and is the product of the likelihood function of θ given the data D and the prior probability of θ. An important point to note here is that the momentum variable p has been introduced in the probability distribution in (Eq. (14)) so that we can use Hamiltonian dynamics. Since p is independent of q, we can choose any distribution for this variable. In our HMC algorithms use a zero-mean multivariate Gaussian distribution for the momentum vector p. The temperature T=1 in this discussion on the HMC.

The kinetic energy of the dynamical system for a unit mass is expressed as:

Kp=12pTpE15

On applying the Hamiltonian partial derivatives in (9) to the definition of the HMC in (10) we get the following differential equations which describe the time evolution of the dynamical system:

dqdt=Hp=Uq+Kpp=p12pTp=pdpdt=Hq=Uq+Kpq=UqqE16

Since the Hamiltonian equations for the time evolution of the system are differential equations, computer simulation of the HMC must discretize time. A popular scheme to implement this discretization is the “Leapfrog” algorithm [4]. The HMC algorithm uses the leapfrog algorithm to update the momentum and the position while computing the trajectory towards the next sample proposal in the distribution. The Leapfrog integrator has 2 main advantages:

  1. It is time reversible. A Leapfrog integration by N steps in the forward direction and then in the backward direction results in the same starting position

  2. It is symplectic in nature. In other words, it conserves the energy of dynamical systems

The steps of the Hamiltonian MCMC algorithm are:

  1. At every time step t, determine a trajectory of the system potential and kinetic energy. To do that, generate a random value from a standard normal distribution for the momentum variable.

  2. Execute the Leapfrog algorithm to update the position and momentum variables according to the differential equations in (Eq. (16)). This determines the trajectory of the system towards the next sample proposal

  3. Compute the potential and kinetic energy Uqt1Kpt1of the system at the beginning of the trajectory and at the end UqproposedKpproposed of the proposed trajectory

  4. Calculate the acceptance probability of the new trajectory using the following ratio of probabilities:

    β=min1PqproposalpproposalPqt1pt1=min11ZexpUqproposalexpKpproposal1ZexpUqt1expKpt1=min1expUqt1+Kpt1Uqproposal+KpproposalE17

  5. Generate a random number uUniform01 to accept or reject the proposal

ifβ>uthenqtqproposal//accept the proposed trajectoryelseqtqt1//reject the proposed trajectoryendif
Advertisement

4. Rényi’s entropy and Information Theoretic particles

The concept of Information Theoretic particles comes from Alfréd Rényi’s pioneering work on generalized measures of entropy and information [7]. At the core of Rényi’s work is the concept of generalized mean or the Kolmogorov-Nagumo (K-N) mean [8, 9, 10]. For numbers x1,x2,xN, the K-N mean is expressed as:

ψ11Ni=1NψxiE18

where, ψ. is the K-N function. This function is continuous and strictly monotonic implying that it has an inverse. In the general theory of means, the quasi-linear mean of a random variable X which takes the values x1,x2,xN with probabilities p1,p2,pN is defined as:

EψX=Xψ=ψ1k=1NpkψxkE19

From the theorem on additivity of quasi-linear means [11], if ψ. is a K-N function and cis a real constant, then:

ψ1k=1Npkψxk+c=ψ1k=1Npkψxk+cE20

if and only if ψ. is either linear or exponential.

4.1 Rényi’s entropy

Consider a random variable X which takes the values x1,x2,xN with probabilities p1,p2,pN. The amount of information generated when X takes the value xk is given by the Hartley [12] information measurement function Ixk:

Ixk=log21pkbitsE21

The expected value of Ixk yields the expression for Shannon’s entropy [13]:

HX=k=1NpkIxk=k=1Npklog21pkE22

Rényi replaced the linear mean in (Eq. (22)) with the quasi-linear mean in (Eq. (19)) to obtain a generalized measure of information:

HψX=ψ1k=1Npkψlog21pkE23

For HψX to satisfy the additivity property of independent events, it must satisfyX+cψ=Xψ+c where c is a constant. From (Eq. (20)), this implies that ψx=cx (linear) or ψx=c21αx (exponential). Setting ψx=cx reduces (Eq. (23)) to the linear mean and yields Shannon entropy equation. Substituting ψx=c21αx and the corresponding inverse function ψ1=11αlog2 in (Eq. (23)) yields the expression for Rényi’s αentropy:

HαX=11αlog2k=1Npkαα>0andα1E24

Rényi’s αentropy equation is therefore a general expression for entropy and comprises of a family of entropies for different values of the parameter α. Shannon’s entropy is a special case of Rényi’s entropy in the limit as α1. The argument of the logarithm function in (Eq. (24)) is called the Information Potential. The α-Information Potential is expressed as:

VαX=k=1NpkαE25

Substituting (Eq. (25)) in (Eq. (24)), we get the following expression for Rényi’s entropy in terms of the Information Potential:

HαX=11αlog2VαXE26

The Information Potential in (Eq. (25)) can be written as the expected value of the PDF of the sample distribution raised to α1:

VαX=k=1Npkα=k=1Npkpkα1=Epkα1E27

For α=2 in (Eq. (24)), we get Rényi’s quadratic entropy, which has the useful property that it allows us to compute the entropy directly from the samples. The equations for Rényi’s Quadratic Entropy (QE) and Quadratic Information Potential (QIP) are obtained by substituting α=2 in (Eqs. (26) and (27)):

H2X=112log2V2X=log2V2XwhereV2X=Epk21=EpkE28

The QIP is therefore the expected value of the PDF of the given data samples.

4.2 Rényi’s quadratic information potential (QIP) estimator

From (Eq. (28)), it is evident that to compute the QIP we need to know the PDF of the given data samples. In practical applications an analytical expression of the PDF is rarely available. Therefore, the QIP computation involves a non-parametric estimator of the PDF directly from the samples [14]. The Parzen-Rosenblatt window estimator [15, 16] is a non-parametric way to estimate the PDF of a random variable from its sample values. This estimator places a kernel function with its center at each of the samples. The resulting output values are averaged over all the samples to estimate the PDF. The laws governing the interaction of the Information Theoretic particles is defined by the shape of the kernel. We use a Gaussian kernel, since this kernel when placed over the samples, behaves like an Information Theoretic field whose strength decays with increasing distance between the samples. Just like a charge in space creates an electric field, the samples of a probability distribution behave like Information Particles with unit charge. Information particles exert Information Forces on other particles through this Information Theoretic field.

For scalar samples x1,x2,xN, the Parzen window PDF estimator with a Gaussian kernel is expressed as:

p̂x=1Ni=1NGσxxiE29

where Gσu is the following standard univariate Gaussian kernel:

Gσu=12πσ2exp12uσ2E30

σ is the kernel bandwidth of the estimator and it must be carefully chosen to obtain an accurate and unbiased estimate of the PDF. The Parzen window estimator of a multivariate PDF for vector samples x1,x2,xN of dimension d is expressed as:

p̂x=1Ni=1NGCxxiE31

where GCu is the following standard multivariate Gaussian kernel:

GCu=12πdCexp12uTC1uE32

d is the dimension of the input vector u, C is the d×d covariance matrix and C is the determinant of the covariance matrix. For the multivariate PDF case, the kernel bandwidth C must be carefully chosen to obtain an accurate and unbiased estimate of the PDF.

Rényi’s quadratic entropy for a continuous random variable is expressed as:

H2X=logp2xdxE33

Substituting p̂x from (Eq. (29)) for px in the above equation as described in [5], we get the following equation for the QE estimator:

H2̂X=log1N2i=1Nj=1NGσ2xjxiE34

where:

Gσ2u=12πσ22exp12uσ22E35

The equation for the QE estimator shows that we can compute the QE estimate directly from the samples of a distribution without knowing its PDF, by applying the Parzen-Rosenblatt kernel on these samples. From (Eqs. (28) and (34)), the QIP estimator can be expressed as:

V̂2X=1N2i=1Nj=1NGσ2xjxiE36

4.3 Information potential energy and the information force of information theoretic particles

The total QIP energy estimate of the system is given by (Eq. (36)). The QIP energy estimate of sample xj due to the Information Potential field of a single sample xi is:

V̂2xjxi=Gσ2xjxiE37

The Quadratic Information Potential energy estimate of scalar sample xj in the Information Field created by all the samples xiR, for i=1,2,N is defined as the average of V̂2xjxi taken over all the samples xi:

V̂2xj=1Ni=1NGσ2xjxi=1N12πσ2i=1Nexp12xjxiσ22E38

If the samples are d dimensional vectors, then the Quadratic Information Potential energy estimate of vector sample xj in the Information Potential Field created by all vector samples xiRd, for i=1,2,N is defined as:

V̂2xj=1Ni=1NG2CxjxiE39

where:

G2C=12πdC2di=1Nexp12xjxiT2C1xjxiE40

From (Eqs. (39) and (40)) we can re-write the QIP energy estimate for vector samples of d dimensions as:

V̂2xj=1N12πdC2di=1Nexp12xjxiT2C1xjxiE41

To obtain the Quadratic Information Force estimate on scalar sample xj due to the Information Potential field of sample xi, we take the derivative of the Quadratic Information Potential energy estimate:

F̂2xjxi=xjV̂2xjxi=xjGσ2xjxi=xj12πσ2exp12xjxiσ22=12πσ2exp12xjxiσ22122σ2xjxjxi2=12πσ2exp12xjxiσ22122σ22xjxi
F̂2xjxi=12σ212πσ2exp12xjxiσ221xjxi=12σ212πσ2exp12xjxiσ22xixj=12σ2Gσ2xjxixixjE42

The Quadratic Information Force on scalar sample xj in the Information Potential Field created by all the samples xiR, for i=1,2,N is defined as the average of F̂2xjxi taken over all the samples xi:

F̂2xj=1N2σ2i=1NGσ2xjxixixj=12Nσ212πσ2i=1Nexp12xjxiσ22xixjE43

If the samples are d dimensional vectors, then the Quadratic Information Force on vector sample xj in the Information Potential Field created by all samples xiRd, for i=1,2,N is defined as:

F̂2xj=12dNCi=1NG2Cxjxixixj=12dNC12πdC2d×i=1Nexp12xjxiT2C1xjxixixjE44
Advertisement

5. Hamiltonian MCMC with information theoretic particles

The expression for the potential energy in the Hamiltonian function (Eq. (11)) is similar to the expression for Rényi’s quadratic entropy (Eq. (28)). This is consistent with the principles of statistical mechanics where the entropy is related to the dissipation of the potential energy of the system. Based on this intuition from statistical mechanics, we replace the PDF of the position vector q in (Eq. (11)) with the QIP energy estimator as follows:

Uqj=logPqj=logV̂2qjE45

The change in momentum of the jth Information Theoretic particle in the dynamical system is equal to the negative potential energy gradient defined in (Eq. (16)). This can be expressed in terms of the QIP energy estimator as:

dpdt=dUqjdqj=dlogPqjdqj=dlogV̂2qjdqjE46

From the above expression, we derive the expression for the Hamiltonian system’s negative potential gradient in terms of the Information Potential and the Information Force as follows:

dlogV̂2qjdqj=ddqjlog1N12πdΣ2d×i=1Nexp12qjqiT2Σ1qjqi
dlogV̂2qjdqj=ddqjlog1N12πdΣ2dddqjlogi=1Nexp12qjqiT2Σ1qjqi=1i=1Nexp12qjqiT2Σ1qjqi×i=1Nddqjexp12qjqiT2Σ1qjqi=1i=1Nexp12qjqiT2Σ1qjqi×12dΣi=1Nexp12qjqiT2Σ1qjqiqiqj=F̂2qjV̂2qjE47

This result shows that the gradient of the potential energy of the Hamiltonian system of Information Particles is just the Information Force estimate normalized by the Information Potential energy estimate. This also shows that the Information Force vector influences the trajectory of sample proposals in the HMC algorithm. This equation is one of the important contributions of our work. Our simulation of the HMC of a dynamical system of Information Theoretic particles uses this potential energy gradient equation to evolve the system over time.

5.1 Quality of the information potential energy estimator

As described in [5], the Information Potential energy estimator is a kernel estimator of the 2-norm of the underlying PDF of the Information Particles. Just like a PDF estimator, we can define metrics to describe the quality of the Information Potential energy estimator. The Mean Integrated Square Error (MISE) is an important metric used to assess the quality of an estimator. This is expressed as:

MISEV̂2qj=EV̂2qjV2qj2dq=EV̂2qjEV̂2qj2dq+EV̂2qjV2qj2dq=VarianceV̂2qjdq+Bias2V̂2qjdqE48

The bias and variance of the Information Potential estimator can be derived as follows:

Bias:

EV̂2qjV2qj=E1Ni=1NGσ2qjqiV2qj=EGσ2qjqiV2qjE49

Since the Gaussian kernel is symmetric under the expectation operation:

Gσ2qjqi=Gσ2qiqjE50

Substituting this in (Eq. (49)) and using the definition of Gσ2 from (Eq. (35)):

EV̂2qjV2qj==EGσ2qiqjV2qj=1σ2EGqiqjσ2V2qj=1σ2Gsqjσ2V2sdsV2qjE51

In the above equation sis the dummy variable of integration. Let y=sqjσ2.This implies that dy=dsσ2. Substituting this in (Eq. (51)), we get:

EV̂2qjV2qj=GyV2qj+σ2ydyV2qjE52

When σ2 is small, we can write the Taylor series expansion of V2qj+σ2y as:

V2qj+σ2y=V2qj+σ2yV2qj+122σ2y2V2qj+oσ2E53

Substituting this in (Eq. (52)), we get:

EV̂2qjV2qj=GyV2qj+σ2yV2qj+σ2y2V2qj+oσ2dyV2qj=V2qjGydy+σ2V2qjyGydy+σ2V2qjy2Gydy+oσ2V2qj=V2qj1+σ2V2qj0+σ2V2qjy2Gydy+oσ2V2qj=σ2V2qjy2Gydy+oσ2E54

This result implies that as the kernel bandwidth σ0 the bias of the Information Potential energy estimator for sample qj reduces at the rate of Oσ2. From the above equation it is also evident that the main reason for the bias is the second derivative of the true Information Potential energy (i.e., the rate of curvature of the true PDF of the samples). In other words, if the true PDF of the samples has a sharp spike, the bias of the Information Potential energy estimator will increase. The Information Potential energy estimator tends to smooth out sharp curvatures or spikes in the PDF which increases bias. The amount of smoothness is governed by the bandwidth parameter σ.

Variance:

EV̂2qj2EV̂2qj2=EV̂2qj21NV2qj+Bias2=EV̂2qj2+ON1=E1Ni=1NGσ2qjqi2+ON1=E1Ni=1NGσ2qiqj2+ON1=1NEGσ2qiqj2+ON1=12Nσ2G2sqjσ2V2sds+ON1E55

Let y=sqjσ2.This implies that dy=dsσ2. Substituting this in (Eq. (55)), we get:

EV̂2qj2EV̂2qj2=12G2yV2qj+σ2ydy+ON1E56

When σ2 is small, we can write the Taylor series expansion of V2qj+σ2y as:

V2qj+σ2y=V2qj+σ2yV2qj+oσE57

Substituting this in (Eq. (56)), we get:

EV̂2qj2EV̂2qj2=12G2yV2qj+σ2yV2qj+oσds+ON1=12V2qjG2y+o12E58

This result shows that as the number of samples N and kernel bandwidth σ, the variance of the Information Potential energy estimator for the sample qj reduces at the rate of O12. However, as σ0, the variance of the estimator increases. The result also shows that the variance of the estimator is large where the value of the Information Potential energy V2qj (i.e., true probability of the sample) is also large. This happens when there are many Information Particles closer together.

5.2 The Kernel bandwidth parameter and the information potential energy estimator bias-variance trade-off

We have shown that the Gaussian kernel bandwidth σ directly influences the bias and variance of the Information Potential energy estimator. This in turn affects the sample distribution of the PDF estimate generated by the Hamiltonian MCMC. From (Eq. (54)) it is evident that the bias of the estimator reduces when we decrease the kernel bandwidth σ. However, (Eq. (58)) clearly shows that the decreasing σ increases the variance of the estimator. Therefore, we must choose an optimum bandwidth which minimizes both the systematic error (bias) and the random error (variance) of the Information Potential energy estimator. An iterative algorithm to converge to the optimum kernel bandwidth is described in the following section.

5.3 Computational complexity of the information potential energy estimator

From (Eqs. (38) and (41)) it may appear that the complexity of computing the Information Potential is ON2. However, as described in [5], the Information Potential can be written as a symmetric positive Gramm Matrix which can be approximated using the incomplete Cholesky decomposition (ICD) as an N×D matrix where DN. Using this technique, the time complexity for computing the Information Potential reduces to OND2 and the space complexity reduces to OND.

Advertisement

6. Maximum-likelihood iterative algorithm to adapt the kernel bandwidth of the information potential energy estimator

There are many iterative kernel bandwidth adaptation techniques available in the literature. We present a simple iterative technique to illustrate how MCMC with Hamiltonian of Information Theoretic Particles can be used to adjust the bandwidth parameter of the iterative PDF estimator. Here, we have chosen to minimize the Kullback–Leibler (K-L) divergence between the samples of the estimated PDF and the target sample distribution as the criteria for adapting the kernel bandwidth of the Information Potential energy and Information Force estimator. As described in [17], this is equivalent to maximizing the likelihood that the estimated PDF samples output by the MCMC, has the same distribution as the target samples.

The ML estimate of the optimum kernel bandwidth CML for vector Information Particle samples qj is the solution to the following log-likelihood maximization problem:

CML=argmaxCj=1NlogV̂qjCE59

Using (Eq. (41)) in the summation of the above equation, we get:

j=1NlogVqjC=j=1Nlog1N1i=1ijNG2Cqjqi=j=1Nlog1N112πd2dC×i=1ijNexp12qjqiT2C1qjqiE60

To maximize the above equation, we take the derivative and equate it to 0. This gives us the following update equation for scalar Information Theoretic particles:

σt+12=12NN1j=1N1V̂qji=1ijNGσ2qjqiqjqi2tE61

In the above equation, σt+1 is the kernel bandwidth at iteration t+1. It is updated with the result of the right-hand side of the equation obtained at time t. This kernel bandwidth update equation (Eq. (61)) is in the form of a fixed-point (or invariant point) equation. This equation is like the equation in [18] except for the factor of 1/2. For vector Information Theoretic particles, the kernel bandwidth update equation is:

Ct+1=12NN1j=1N1V̂qji=1ijNG2CqjqiqjqiqjqiTtE62

In this equation, C is the kernel bandwidth matrix and can have unequal elements along its diagonal or non-zero off-diagonal elements. If the kernel bandwidth matrix is constrained to an identity matrix multiplied by a scaling factor, the kernel bandwidth matrix update equation can be expressed as:

Ct+1=12NN1j=1N1Vqji=1ijNG2Cqjqiqjqi2tE63

From the fixed- or invariant-point theorem, the range over which the fixed-point bandwidth update equations will converge to a unique solution is:

minqjqi2¯2TraceEqqTE64

In the above equation, qi,qj are information particles from the target sample distribution and q is the column vector of all the target information particles. From the fixed-point theorem, this fixed-point equation will converge to a unique solution if fσ2<1.

Advertisement

7. Simulation results

The potential energy surface, which is the plot of (Eq. (41)), of a Hamiltonian system of Information Theoretic particles for a bivariate Gaussian distribution is shown in Figure 1. From this figure it is evident that the potential energy surface of the Hamiltonian system has larger values when the Information Theoretic particles are sparse and is lowest at the bottom of the bowl-shaped surface where the particles have the highest density.

Figure 1.

Potential energy surface of the Hamiltonian system of a bivariate Gaussian μ=56Σ=3004 distribution of Information Theoretic particles.

The momentum variable of the HMC algorithm occasionally moves the “probe” particle to a higher energy level but the Hamiltonian system has the tendency to fall back to its lowest energy level along the bowl-shaped surface. As a result, the HMC tends to sample the given target distribution more often where the density of the Information Theoretic particles is the largest.

Figure 2 shows the potential energy gradient of the same bivariate Gaussian distribution. This is the plot of (Eq. (47)) for this distribution. Each surface in this figure is one component of the potential energy gradient. Each surface tilts towards the corresponding mean value μ=56 of the bivariate Gaussian distribution. The figure shows that the potential energy gradient of the Hamiltonian system is lowest near the mean of the distribution and is highest further away from the mean. The time evolution trajectory of the Hamiltonian system lies on this surface.

Figure 2.

Vector components of the potential energy gradient of the Hamiltonian system of a bivariate Gaussian μ=56Σ=3004 distribution of Information Theoretic particles.

The iterative PDF estimate of a bivariate Gaussian distribution with μ=56,Σ=3004 using MCMC with 3 different kernel bandwidths is shown in Figure 3.

Figure 3.

The left-hand side figure shows the iterative PDF estimate of a bivariate Gaussian distribution μ=56Σ=3004 with the MCMC method using the Hamiltonian of Information Theoretic particles. The right-hand side figure shows that the samples generated by the HMC method mostly overlaps the samples of the target distribution.

From Figure 3, it is evident that the MCMC algorithm based on the Hamiltonian of Information Theoretic particles accurately estimates the PDF of the target distribution. The sample points generated by the HMC algorithm covers most of the target samples in this figure. This figure shows that our intuition of comparing the Entropy to the system’s potential and also using the Information Potential in the derivation of the potential gradient (Eq. (47)) of the Hamiltonian system of Information Theoretic particles, was correct.

Our HMC algorithm using Information Theoretic particles also works well for Gaussian mixture distributions. Figure 4 shows that our MCMC algorithm using the Hamiltonian of Information Theoretic particles can be used to iteratively estimate the PDF of different multivariate distributions.

Figure 4.

Iterative estimation of the PDF of a bivariate Gaussian mixture distribution with the MCMC method using the Hamiltonian of Information Theoretic particles.

Figure 5 shows that the contour plot of the estimated PDF matches closely to the target PDF. The corresponding samples generated by the HMC algorithm traverses the two clusters of the bivariate Gaussian mixture distribution and covers most of the samples of the target distribution.

Figure 5.

Contour plots of the target PDF and the estimated PDF of the bivariate Gaussian mixture distribution. Samples generated by the MCMC algorithm using the Hamiltonian of Information Theoretic particles.

Advertisement

8. Conclusion

We have proposed a novel perspective on the MCMC method where we used it to iteratively estimate the PDF of a given target sample distribution. We have shown that the samples of a probability distribution can be viewed as Information Particles in an Information Field. These particles have Information Potential energy and are subject to Information Forces by virtue of their position in the field. The concept of Information Potential energy fits perfectly within the framework of the Hamiltonian of a dynamical system. We have derived an important result that the gradient of the potential energy of the Hamiltonian system of Information Particles is just the Information Force estimate normalized by the Information Potential energy estimate.

Our simulation results show that our intuition of comparing Rényi’s Quadratic Entropy equation with the Hamiltonian potential energy equation to derive the equation for the potential gradient of a dynamical system of Information Theoretic particles was correct. Using this equation, we were able to accurately estimate univariate and multivariate PDFs. Based on the fixed- or invariant-point theorem, we also derived an equation to iteratively update the bandwidth parameter of the Information Potential and Information Force estimators.

In machine learning applications the dataset is sometimes resampled to the appropriate size before starting the learning operation. Our algorithm can be used to view the data samples as Information Theoretic particles and resample it using the HMC described in this chapter.

References

  1. 1. R. Neal, “Probabilistic Inference using Markov Chain Monte Carlo methods, Technical Report CRG-TR-93-1,” Department of Computer Science, University of Toronto, Toronto, 1993
  2. 2. R. Neal, “An improved acceptance procedure for the hybrid Monte Carlo algorithm, “Journal of Computational Physics, vol. 111, no. 1, pp. 194–203, 1994
  3. 3. M. Betancourt, “A Conceptual Introduction to Hamiltonian Monte Carlo, “arXiv: 1701.02434 [stat.ME], 2018
  4. 4. R. Neal, “MCMC using Hamiltonian Dynamics, “in Handbook of Markov Chain Monte Carlo, CRC Press, 2011, pp. 113–162
  5. 5. J. Principe, Information Theoretic Learning, New York: Springer, 2010
  6. 6. W. K. Hastings, “Monte Carlo Sampling Methods Using Markov Chains and Their Applications, “Biometrika, vol. 57, no. 1, pp. 97–109, 1970
  7. 7. A. Rényi, “On measures of entropy and information, “Proceedings of the 4th Berkeley symposium on math, statistics and probability, vol. 1, pp. 547–561, 1961
  8. 8. Z. Makó and Z. P. M. D. 7. 4. & Páles, “On the equality of generalized quasiarithmetic means.,” Publicationes Mathematicae, Debrecen, vol. 72, pp. 407–440, 2008
  9. 9. A. N. Kolmogorov, “Sur la notion de la moyenne,” Atti Accad. Naz. Lincei. Rend. 12:9, pp. 388–391, 1930
  10. 10. M. Nagumo, “Über eine klasse von mittelwerte,” Japanese Journal of Mathematics, vol. 7, pp. 71–79, 1930
  11. 11. G. H. Hardy, L. J. E. and P. G., Inequalities, Cambridge, 1934
  12. 12. R. V. L. Hartley, “Transmission of Information,” Bell System Technical Journal, vol. 7, p. 535, 1928
  13. 13. C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, p. 535, 1928
  14. 14. M. Deb and T. Ogunfunmi, “Using Information Theoretic Learning techniques to train neural networks,” in 51st Asilomar conference on signals, systems and computers, 2017
  15. 15. E. Parzen, “On Estimation of a Probability Density Function and Mode,” Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065–1076, 1962
  16. 16. M. Rosenblatt, “Remarks on some Nonparametric Estimates of a Density Function,” Annals of Mathematical Statistics, vol. 27, no. 3, pp. 832–837, 1956
  17. 17. T. Cover and J. Thomas, Elements of Information Theory, Wiley & Sons, 2012
  18. 18. J. M. Leiva-Murillo and A. Artés-Rodriguez, “Fixed point algorithm for finding the optimal covariance matrix in kernel density modeling,” IEEE International Conference on Acoustics, Speech and Signal Procesing, vol. 5, 2006

Written By

Tokunbo Ogunfunmi and Manas Deb

Submitted: 09 September 2021 Reviewed: 13 September 2021 Published: 04 November 2021