Posterior distribution for the number of time correlation decay channels for a polymer solution of polyethylene glycol of a given molecular weight, as measured in a neutron spin echo (NSE) experiment and collected at a fixed momentum transfer.
The rapidly improving performance of inelastic scattering instruments has prompted tremendous advances in our knowledge of the high-frequency dynamics of disordered systems, while also imposing new demands on data analysis and interpretation. This ongoing effort is likely to soon reach an impasse unless new data modeling protocols are developed. This need stems from the increasingly detailed information sought in typical line shape measurements, which often touches or crosses the boundaries imposed by the limited experimental accuracy. Given this scenario, the risk of biased and over-parametrized data modeling represents a concrete threat to further advances in the field. Aware of the severity of the problem, we illustrate here the new hopes brought to this area by Bayesian inference methods. With reference to recent literature results, we demonstrate the superior ability of these methods to provide a probabilistic and evidence-based modeling of experimental data. Most importantly, this approach enables hypothesis testing involving competing line shape models and is intrinsically equipped with antidotes against the risk of over-parametrization, as it naturally enforces the Occam maximum parsimony principle, which favors intrinsically simple models over overly complex ones.
- inelastic X-ray scattering
- inelastic neutron scattering
- Bayes analysis
- MCMC methods
- model choice
In the last decade, a large number of inelastic neutron and X-ray scattering measurements have focused on the study of the collective atomic dynamics of disordered systems [1, 2, 3, 4, 5]. Although, across the years, the analysis of the line shapes reported in these measurements has seldom benefited from the support of a Bayesian inference analysis, the need for this statistical tool is becoming increasingly urgent. As a general premise, it is worth stressing that a scattering measurement somehow resembles a microscope pointed at the dynamics, whose "focus" can be adjusted by a suitable choice of the momentum ℏQ and the energy ℏω exchanged between the particle beam and the target sample in the scattering event, where ℏ is the reduced Planck constant, Q is the wave vector transfer, and ω is the angular frequency. Specifically, upon increasing Q the probe "perceives" the response of the system as an average over smaller and smaller distances and over times including a decreasing number of elementary microscopic events, e.g., mutual interatomic collisions. The observable accessed by these spectroscopic methods is the spectrum associated with density fluctuations, either spontaneous or scattering induced. When quasi-macroscopic distances are probed, i.e., in the Q → 0 limit, the detail of the atomic structure is lost, and the target sample is perceived as a continuous medium, whose dynamic behavior is recorded as an average over many elementary events. Since mass is conserved over macroscopic scales, at these distances the liquid density tends to become a constant of motion, i.e., a time invariant. For this reason, quasi-macroscopic density fluctuations relax very slowly to equilibrium, and collective density oscillations are correspondingly very long-lived. The typical spectral signature of this so-called hydrodynamic behavior is a very sharp triplet reflecting the quasi-conserved nature of hydrodynamic density fluctuations.
A striking example of such a sharp triplet shape is provided in panel A of Figure 1, where the spectrum of liquid argon at the triple point is reported, as measured by Brillouin visible light scattering (BVLS).
One could guess that such a sharp spectral shape does not leave any room for interpretative doubts, also considering that the limiting hydrodynamic spectral profile is exactly known, being analytically treatable starting from the application of mass, momentum, and energy conservation laws. Although these statements appear partly true, the very concept of "interpretative doubt" sounds grossly ill-defined before spelling out explicitly the accuracy required of the interpretation one alludes to. Despite its pioneering nature, the quality of the measurement in panel A seems certainly adequate for a precise determination of the side-peak position, but probably less so for a detailed analysis of the spectral tails, which are dominated by the slowly decaying resolution wings. Nonetheless, such a shape might still appear a more encouraging candidate for a line shape analysis than its counterpart reported in panel B of Figure 1, which features broad and loosely resolved spectral features, besides a definitely poorer count statistics. Given that the latter result is fairly prototypical of terahertz spectroscopic measurements on simple disordered systems, one might wonder why, thus far, the analysis of these measurements has failed to benefit from Bayesian inference methods as routine line shape analysis tools. Aside from hardly justifiable initial mistrust, a likely explanation is that only recently have these spectroscopic techniques transitioned to a mature age in which the very detection of collective modes in amorphous systems can no longer be considered a discovery in itself, and detailed analyses of the spectral shape are more and more common and required. Again, the take-home message of this course of events is that the pivotal issue is the adequacy of a given measurement to provide the sought-for information, rather than the quality of the measurement in itself.
The imbalance between an unavoidably limited experimental performance and the rapidly increasing interpretative needs dramatically enhances the risk of "good faith overinterpretations," which represent a lethal threat to the progress of knowledge.
Listen, the data talk! Every time we need to proceed with a data analysis, we could be induced, or even tempted, on the basis of our prior knowledge or intuition, to somehow suggest to the data what they should tell us about the properties of the system we are investigating. Being driven by acquired knowledge is not necessarily a wrong attitude; it is actually a natural demeanor which effectively drives the cognitive process and the progress of knowledge. However, it can become deceiving if we do not have well-consolidated insight into the system under investigation and the observed data are not accurate enough or barely informative. In such cases, in fact, it is highly probable that we just adapt a model to the data, one that fits them as well as many other possible models, with the only advantage of delivering results and solutions we feel more at ease with, as they confirm our prior beliefs. This model, of course, can be really plausible and reasonably pondered, and the solution adopted can accidentally be the right one; however, a robust method to quantify how much we can trust such a solution, either in itself or in comparison with alternative ones, would be desirable. We surely want to avoid an aprioristic reliance on a model, which might coerce the data into confirming certain results, preventing them from providing new insights into the investigated system.
When dealing with neutron or X-ray scattering, the statistical accuracy of spectral acquisition is the primary concern. For the most varied reasons, e.g., relating to the scattering properties of the sample, the integration time, or the count rate of the measurement, the achieved count statistics may either be adequate for a rigorous data analysis or, as often happens, not as good as we would like it to be. In the latter case, the experimental data might not be accurate enough to tell us everything about the physical problem under scrutiny. They could tell us something, but not everything! This is why we need a solid inferential method capable of extracting the maximum amount of information from the data acquired and possibly providing us with a quantitative probabilistic evaluation of the different models that are compatible with the data at hand. Especially when nothing or very little is known about a specific sample or system, the point is, given the observed data, how plausible is a specific model? What is the precision of the conclusions drawn from this model? Are there other possible interpretations of the data at hand? To what extent are different models and interpretations supported by the observed data?
A Bayesian inferential approach provides answers to all these questions on a probabilistic basis, along with a sound criterion to integrate any prior knowledge in the process of data analysis. Bayesian inference, in fact, recognizes the importance of including prior knowledge in the analysis. When we do have well-established prior knowledge about a sample property or a general law a physical phenomenon must comply with, it would be insane and pointless not to use this information. Such prior knowledge, in fact, can protect us from the risk of making mistakes in the description of experimental data, hence in their interpretation. In the Bayesian framework, prior knowledge takes the form of probability statements so that different probabilities, ranging from zero to one, can be attributed to competing explanations of the data. In this way, less probable explanations are not excluded a priori but simply given a smaller prior probability. The a priori probability of different explanations is then updated, through the Bayes theorem, based on the new information provided by the data. The results of this analysis, thus, assume the form of posterior probabilities. On this basis, one can easily establish which model is most supported by both data and prior knowledge, as well as the posterior probabilities of alternative models and of their parameters, which provides a basis to appreciate the precision of the estimates. In addition, Bayesian methods naturally embody the Occam's razor principle, thus favoring simpler models over unnecessarily complex ones. Last but not least, Bayesian estimation algorithms are generally less affected by the presence of local optima in the parameter space and are not sensitive to the starting values used to initialize the estimation process.
The aim of this chapter is to illustrate how Bayesian inference can be used in X-ray and neutron scattering applications. The Bayesian approach proposed here is implemented through an estimation algorithm, which makes use of Markov chain Monte Carlo (MCMC) methods [9, 10] integrated, where appropriate, with a reversible jump (RJ) extension. This Bayesian method has already been successfully applied in a series of Brillouin inelastic neutron scattering works, as well as in inelastic X-ray scattering ones [13, 14] and, very recently, in the description of the time correlation function decay in the time domain as measured by neutron spin echo scattering [15, 16]. The rest of the work is organized as follows: Section 2 provides a motivating example; Section 3 reviews the Bayes theorem and discusses its different components, as well as some advantages inherent in the Bayesian method; Section 4 applies Bayesian inference to X-ray and neutron scattering spectroscopy with special emphasis on model choice, parameter estimation, and results interpretation.
2. An example: searching for differences
Depending on the problem at hand, our approach to data analysis can be very different. Imagine that we want, as a toy or teaching example, to measure either the neutron or the X-ray scattering spectrum of a system whose spectrum is well-known and whose interpretation is unanimously agreed upon. For instance, we aim at extracting the phonon dispersion curve from the thoroughly measured spectral density of a given sample. In our replica of past measurements, it is possible that the proper discernment of the excitation lines is hampered by both the coarse instrumental resolution and the limited statistical accuracy. The poor quality of the data could prevent us from easily identifying the spectral features (peaks, bumps, shoulders) already measured and characterized by others. For instance, it could be overly difficult to establish how many excitations are present in the spectra. Unless we deliberately reject the conclusions previously reached by other scientists, it is natural to enforce a line shape modeling well established in the spanned kinematic range and to verify ex post whether the resulting spectral features are consistent with those known from the literature.
More often, we face a different problem, as we want to measure for the first time a certain system on which we might not have previous knowledge. Alternatively, we could have prior knowledge about that same system, yet in different thermodynamic or environmental conditions— for instance, a liquid either in bulk or confinement geometries—and possible effects of these peculiar conditions are under scrutiny. Changes could also be very small, and, since detecting them is the focus of our research, it is essential to take the most impartial and noninvasive approach. In this second situation, it would be desirable not to rely too heavily on previous results when choosing the model and to allow the measurement to reveal possible new features.
The two situations mentioned above notably differ in the amount of information available on the system before analyzing the data. In the first case, we have a complete knowledge of the system, while, in the second case, this knowledge is partial or even entirely lacking. In this second situation, a traditional approach would consist in either best fitting a model we deem adequate for the data, e.g., one well assessed for the considered sample, albeit only in different thermodynamic or environmental conditions, or fitting competing models to establish the best performing one based on agreed-upon criteria, e.g., the chi-square value. Following the first path, we hinge too much on a specific model and on previous partial knowledge, thus jeopardizing the chance of new findings. On the other hand, the second path would be less coercive, at the cost of completely ignoring previous partial knowledge. In addition, the model chosen would simply be the one providing the best fit, but no assessment could be made of the plausibility of this or any other fitting model, based on the measured data. A Bayesian approach to data analysis would, instead, allow us to assign a different prior probability to the different models (accounting for the uncertainty of the available information on the system) and then revise these probabilities in the light of the data to deliver the posterior probability of each model, conditional on the data at hand.
3. Bayesian inference
3.1 The Bayes theorem
The Bayes theorem stems from the theorem of compound probability and from the definition of conditional probability. If we consider two events A and B, the compound probability theorem states that the probability of the two events occurring simultaneously is given by:

P(A ∩ B) = P(B|A) P(A), (1)

where P(B|A) is the probability of observing B, once A has been observed. Obviously, if A and B are independent, so that the occurrence of one of them does not affect the probability of occurrence of the other one, the compound probability theorem reduces to:

P(A ∩ B) = P(A) P(B). (2)

From Eq. (1), we immediately get:

P(A|B) = P(B|A) P(A) / P(B), (3)

which is nothing else than the Bayes theorem.
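As a minimal numerical illustration of the Bayes theorem above, the following sketch updates the probability of a proposition A after observing an event B; all probabilities are hypothetical, chosen purely for illustration:

```python
# Hypothetical probabilities for a proposition A ("the sample is in phase 1")
# and an observed event B ("a side peak is detected"); numbers are made up.
p_A = 0.3             # prior probability of A
p_B_given_A = 0.8     # probability of observing B when A holds
p_B_given_notA = 0.2  # probability of observing B when A does not hold

# Marginal probability of B (law of total probability)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes theorem: the prior 0.3 is updated by the observation of B
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))  # 0.632
```

The observation of B roughly doubles the plausibility of A, which is exactly the kind of evidence-driven update the theorem formalizes.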
Let us now consider A as the ensemble of the parameters of a certain model (or class of models) we choose to describe the experimental data. In a slightly different notation, let this ensemble be denoted, from now on, as the vector θ, where each vector component is a parameter. Notice that a component of θ might also be associated with a parameter that designates a particular model among several proposed ones. Also, consider B as the entire set of experimental data. Let us indicate this dataset with the vector y = (y_1, …, y_n), with n being the sample size. In this new notation, the Bayes theorem reads as:

P(θ|y) = P(y|θ) P(θ) / P(y), (4)

where P(θ|y) is the posterior distribution of the parameters given the observed data; P(θ) is the prior distribution of the parameters before observing the data; P(y|θ) is the likelihood of the data, i.e., the probability of observing the data conditional on a certain parameter vector; and P(y) is the marginal probability of the data, which plays the role of a normalizing constant so that Eq. (4) has a unit integral over the variable θ. The different elements of Eq. (4) are thoroughly discussed in the following sections.
3.2 The prior distribution
Let us consider the different elements of Eq. (4), starting with the prior distribution (or simply prior) P(θ). This is the distribution function elicited for the parameters, given the information at our disposal before data collection. Using a slightly more redundant notation, the prior can be explicitly denoted as P(θ|I), where I represents the a priori information. This prior probability includes all the prior knowledge (or lack of it) we might have, and it can be more or less informative depending on the amount of information on the problem under investigation. Using the same explicit notation, the Bayes formula in Eq. (4) can be rewritten as:

P(θ|y, I) = P(y|θ, I) P(θ|I) / P(y|I). (5)
Just to make a few examples, it might be possible that a certain parameter θ_j included in the model is known, or already measured or somehow evaluated independently, and its value is θ_j*. In this case, we can assume that the parameter takes the specific value θ_j* with probability equal to one. Otherwise, if we want to be less coercive, we can adopt for the parameter a Gaussian prior centered on θ_j* and with a variance opportunely chosen to limit the parameter variability to a certain interval around θ_j*. In this way, values closer to θ_j* will be given a higher a priori probability.
In other situations, the information available on the parameters might be more vague. For example, we might simply know that a certain parameter must be nonnegative or that it must range in a limited interval, as is often the case in neutron scattering measurements hampered by severe kinematic constraints. Nonnegative parameters can be a priori assumed to follow, for example, a truncated Gaussian or a gamma distribution, and, if no other information is available, the prior distribution will be adjusted to allow for a large parameter variability, reflecting the noninformative initial guess. Parameters having random or hardly quantifiable variations within limited windows can be assumed to approximately follow a uniform distribution over such windows. Also, whenever feasible, any mutual entanglement between parameters, as well as any selection, conservation, or sum rule, should be embodied in a usable distribution function complementing our prior knowledge in the cognitive process.
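The prior choices just described can be sketched as simple log-density functions; the parameter names and hyperparameter values below are illustrative assumptions, not prescriptions:

```python
import math

# Sketches of three common prior choices; all names and values are illustrative.

def log_gaussian_prior(theta, mu, sigma):
    """Informative prior centered on an independently known value mu."""
    return -0.5 * ((theta - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def log_gamma_prior(theta, shape, rate):
    """Vague prior for a nonnegative parameter (e.g., a linewidth)."""
    if theta <= 0.0:
        return -math.inf
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(theta) - rate * theta)

def log_uniform_prior(theta, lo, hi):
    """Flat prior over a limited window [lo, hi]."""
    return -math.log(hi - lo) if lo <= theta <= hi else -math.inf

# A value outside the allowed window gets zero prior probability
print(log_uniform_prior(3.0, 0.0, 2.0))  # -inf
```

Working with log densities is numerically convenient, since priors and likelihood then combine by simple addition.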
Notice that, even if it is common practice to assume that the parameters are a priori independently distributed, correlation between them can be naturally induced by the data, through the combination of the likelihood and the prior. Parameters can be a posteriori correlated, even if they are a priori independent.
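This effect can be made concrete with a deliberately artificial toy setup in which a single datum constrains only the sum of two a priori independent parameters; evaluating the posterior on a grid reveals the correlation induced by the data:

```python
import math

# A single datum y = 1.0, measured with sigma = 0.2, informs only the sum
# a + b of two parameters with independent standard normal priors (an
# artificial setup chosen for illustration).
y, sigma = 1.0, 0.2
grid = [-3.0 + 6.0 * i / 120 for i in range(121)]

den = ea = eb = eab = ea2 = eb2 = 0.0
for a in grid:
    for b in grid:
        w = math.exp(-0.5 * (a * a + b * b)                # independent priors
                     - 0.5 * ((y - a - b) / sigma) ** 2)   # likelihood
        den += w
        ea += w * a
        eb += w * b
        eab += w * a * b
        ea2 += w * a * a
        eb2 += w * b * b

ea /= den; eb /= den; eab /= den; ea2 /= den; eb2 /= den
corr = (eab - ea * eb) / math.sqrt((ea2 - ea ** 2) * (eb2 - eb ** 2))
print(round(corr, 2))  # strongly negative, close to -1
```

The likelihood ties the two parameters together along the ridge a + b ≈ y, so the posterior is strongly anticorrelated even though the priors factorize.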
3.3 The likelihood function
The likelihood function is the joint probability of the observed data, conditional on the model adopted and its parameter values. Notice that for continuous data, the likelihood becomes a probability density. Let y = (y_1, …, y_n) be a sample of n data. Each datum y_i can be portrayed as a particular realization of a random variable. In fact, if we had to collect the data again, even under the same experimental conditions, we would obtain a different sample of data. This means that, before collecting the data, the i-th result is to be considered a random variable Y_i. Once the data have been collected, y_i is the particular realization observed of the random variable Y_i, and the sample y is the particular realization observed of the multiple random variable Y = (Y_1, …, Y_n), whose components are independent and identically distributed. Then, the joint (density of) probability of the observed data is the probability that, simultaneously, each variable Y_i takes the value y_i (or takes a value in the interval [y_i, y_i + dy_i]), for i = 1, …, n, i.e., P(Y_1 = y_1, …, Y_n = y_n|θ). Given the independence of the variables Y_i, for i = 1, …, n, and using the compound probability theorem, we obtain:

P(y|θ) = ∏_{i=1}^{n} P(y_i|θ). (6)
The left side of Eq. (6) is the likelihood function for the observed sample y, which depends on the unknown parameter vector θ. If we condition on a particular value of θ, we can compute the probability (or density) of the observed sample, conditional on θ, i.e., the term playing the role of P(B|A) in Eq. (3).
To be more specific, we can consider spectroscopic data. The observable directly accessed by a spectroscopic measurement is the spectrum of the correlation function of density fluctuations, or dynamic structure factor S(Q, ω), which, in a scattering experiment, is a unique function of the energy, ℏω, and the momentum, ℏQ, exchanged between the probe particles and the target sample in the scattering process. One has:

y_i = S(Q, ω_i; θ) + ε_i, for i = 1, …, n, (7)

where S(Q, ω_i; θ) is the model used for the dynamic structure factor, depending on a vector of unknown parameters θ, and ε = (ε_1, …, ε_n) is a vector of random errors, here assumed to be independently and normally distributed, i.e., ε_i ~ N(0, σ_i²), for i = 1, …, n. Notice that, by assuming heteroscedastic errors, we are not imposing any restriction other than normality on the error term. The heteroscedastic model embeds the homoscedastic one, and since the parameters σ_i are estimated from the data, it might reduce to it if the data were compatible with the homoscedasticity constraint.
Under the assumptions above, the likelihood function is:

P(y|θ) = ∏_{i=1}^{n} (2πσ_i²)^{−1/2} exp{−[y_i − S(Q, ω_i; θ)]² / (2σ_i²)}. (8)

Conditional on a certain value of the parameter vector θ (which might also include the variances of the error terms), we can compute S(Q, ω_i; θ) and, thus, P(y|θ).
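As a hedged sketch of such a heteroscedastic Gaussian likelihood, the following code evaluates the logarithm of Eq. (8) for a hypothetical single-Lorentzian stand-in for the dynamic structure factor; the model form, the parameter names (amplitude A, position w0, width G), and all numbers are illustrative assumptions, not the model of the original analyses:

```python
import math

def model(w, A, w0, G):
    # Lorentzian line shape standing in for S(Q, w; theta); illustrative only
    return A * G / math.pi / ((w - w0) ** 2 + G ** 2)

def log_likelihood(y, w, sigma, A, w0, G):
    """Gaussian log-likelihood; each point carries its own error bar sigma_i."""
    ll = 0.0
    for yi, wi, si in zip(y, w, sigma):
        resid = yi - model(wi, A, w0, G)
        ll += -0.5 * (resid / si) ** 2 - math.log(si * math.sqrt(2.0 * math.pi))
    return ll

# With noiseless synthetic data, the likelihood is highest at the true values
ws = [0.0, 0.5, 1.0, 1.5]
ys = [model(w, 1.0, 0.5, 0.2) for w in ws]
print(log_likelihood(ys, ws, [0.1] * 4, 1.0, 0.5, 0.2) >
      log_likelihood(ys, ws, [0.1] * 4, 1.0, 0.7, 0.2))  # True
```

The smaller a point's σ_i, the more its residual weighs in the likelihood, which is exactly how heteroscedasticity enters the analysis.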
3.4 The posterior distribution and its normalizing constant
The term P(θ|y) on the left-hand side of Eq. (4) is the joint posterior distribution of the model parameters, given prior knowledge and measured data, i.e., after data collection. It incorporates both the prior knowledge and the information conveyed by the data, and Bayesian inference completely relies on it. In practice, prior knowledge about the investigated problem is modified by the data evidence (through the likelihood function) to provide the final posterior distribution ( Figure 2 ). Estimates for a single parameter θ_j can be obtained by marginalizing, i.e., by integrating (summing) the posterior over all the other parameters to get P(θ_j|y) = ∫ P(θ|y) dθ_{−j}, where θ_{−j} is the whole parameter vector except θ_j. Then, the mean of P(θ_j|y) is taken as a point estimate of θ_j, while the square root of its variance provides a measure of the estimation error. Also, the probability that the parameter belongs to a certain interval can be inferred from its marginal posterior.
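When the posterior is represented by MCMC draws (Section 3.6), this marginalization reduces to simply discarding the other components of each draw. A toy sketch with hypothetical parameter names and draw values:

```python
import math

# Toy posterior draws for two parameters (names and values are made up);
# in a real analysis these would be the states of an MCMC chain.
draws = [
    {"omega": 1.9, "gamma": 0.40},
    {"omega": 2.1, "gamma": 0.50},
    {"omega": 2.0, "gamma": 0.45},
    {"omega": 2.2, "gamma": 0.50},
]

# Marginalizing over gamma = keeping only the omega component of each draw
omega = [d["omega"] for d in draws]
mean = sum(omega) / len(omega)                                  # point estimate
var = sum((x - mean) ** 2 for x in omega) / (len(omega) - 1)
std = math.sqrt(var)                                            # estimation error

# Posterior probability that omega falls in a given interval
p_in = sum(1 for x in omega if 1.95 <= x <= 2.15) / len(omega)
print(round(mean, 3), round(std, 3), p_in)  # 2.05 0.129 0.5
```

The interval probability in the last step is precisely the kind of statement a frequentist point estimate alone cannot deliver.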
The term in the denominator of Eq. (4),

P(y) = ∫ P(y|θ) P(θ) dθ, (9)

is generally called the marginal likelihood and represents the probability of observing the measured data y, averaged over all possible values of the model parameters. It represents the normalization constant of the posterior distribution, and it is required in the evaluation of P(θ|y). However, in most cases, P(y) does not have a closed analytical expression, as its determination would require the computation of high-dimensional integrals. Hence, the posterior distribution can only be obtained up to a normalizing constant, namely:

P(θ|y) ∝ P(y|θ) P(θ). (10)
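In a single dimension the normalizing constant can still be computed by brute force, which makes its role concrete; the sketch below uses an artificial unnormalized posterior chosen so that the exact answer is known:

```python
import math

def unnorm_post(theta):
    # P(y|theta) P(theta), known only up to a constant; Gaussian-shaped here
    # so that the exact normalizing constant is sqrt(2*pi)
    return math.exp(-0.5 * (theta - 1.0) ** 2)

lo, hi, n = -10.0, 10.0, 20000
dtheta = (hi - lo) / n
thetas = [lo + (i + 0.5) * dtheta for i in range(n)]

# The normalizing constant: a midpoint-rule estimate of the integral
Z = sum(unnorm_post(t) for t in thetas) * dtheta

# Dividing by Z yields a properly normalized posterior density
posterior = [unnorm_post(t) / Z for t in thetas]
print(round(Z, 4))  # close to sqrt(2*pi), i.e., about 2.5066
```

With many parameters the same integral becomes a high-dimensional sum over an exponentially large grid, which is precisely why MCMC methods are needed.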
For this reason, Bayesian inference usually needs to resort to MCMC methods to simulate the joint posterior distribution. MCMC algorithms, in fact, allow one to draw values from distributions known up to a normalizing constant, as is often the case for P(θ|y). Inference is then carried out on the basis of the simulated, rather than analytical, joint posterior distribution. More details on these methods will be given in Section 3.6 (see also Refs. [9, 10]).
To illustrate an interesting point, let us go back to the example considered before, in which we want to analyze spectroscopic data that can be modeled as in Eq. (7) and for which the likelihood is given in Eq. (8). Imagine we have no prior information at all on the parameters of the model, so that the only sensible choice for the prior is a uniform distribution on the parameter space. Then, from Eqs. (8) and (10), it follows that:

P(θ|y) ∝ exp{−(1/2) ∑_{i=1}^{n} [y_i − S(Q, ω_i; θ)]² / σ_i²}, (11)

which implies that the posterior distribution is a multivariate Gaussian. As already mentioned, parameters can be estimated by taking the mean of the posterior distribution, which, for a Gaussian distribution, corresponds to the median, mode, and maximum of the distribution. Therefore, Bayesian parameter estimates are obtained as those values of θ that maximize P(θ|y). This maximization is equivalent to the minimization of the chi-square function ∑_{i=1}^{n} [y_i − S(Q, ω_i; θ)]² / σ_i² and thus provides the same estimates we would obtain through standard fitting procedures. Therefore, whenever no prior information is available, which translates into a uniform prior, and a normal error distribution is assumed, the posterior distribution coincides up to a constant with the classical likelihood function, and Bayesian and classical estimates are equivalent. This result can be extended to the case of an informative prior, for which, again, Bayesian and traditional approaches provide asymptotically the same results. In particular, as the sample size n increases, the posterior distribution of the parameter vector approaches a multivariate normal distribution, which is independent of the prior distribution. These posterior asymptotic results formalize the notion that the importance of the prior diminishes as n increases. Only when n is small is the prior choice an important part of the specification of the model. In such situations it is essential that the prior truly reflects existing and well-documented information on the parameters, so that its use can significantly improve the precision of the estimates.
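The coincidence between the flat-prior posterior maximum and the least-squares solution can be verified numerically; the sketch below uses a hypothetical straight-line model with heteroscedastic errors (all data values are made up):

```python
# Numerical check that, with a flat prior and Gaussian errors, the posterior
# maximum coincides with the least-squares solution. The straight-line model
# y = a * x and all data values are hypothetical.
xs = [1.0, 2.0, 3.0]
ys = [1.1, 1.9, 3.2]
sigmas = [0.1, 0.2, 0.1]

def chi2(a):
    # Weighted sum of squared residuals
    return sum(((y - a * x) / s) ** 2 for x, y, s in zip(xs, ys, sigmas))

# Scan a grid of slopes: the chi-square minimum and the flat-prior
# log-posterior maximum (-chi2/2, up to a constant) occur at the same slope
grid = [i / 1000.0 for i in range(500, 1500)]
a_ls = min(grid, key=chi2)                       # least-squares estimate
a_map = max(grid, key=lambda a: -0.5 * chi2(a))  # flat-prior posterior maximum
print(a_ls, a_ls == a_map)  # the two estimates coincide
```

The agreement is exact here because maximizing exp(−χ²/2) and minimizing χ² are the same optimization problem.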
Despite the asymptotic equivalence, parameters are sometimes much more easily estimated in a Bayesian rather than in a frequentist perspective. Frequentist estimation, in fact, is generally based on least squares or maximum likelihood methods, and this might be a problem in the presence of local optima. If, for example, the starting values of the parameters, needed to initialize the optimization algorithm, are close to a local optimum, the algorithm might be trapped in this suboptimal solution. As a consequence, different starting values might lead to different solutions and, thus, parameter estimates. The Bayesian estimate of a parameter, as stated before, is instead obtained as the mean of its posterior distribution, marginalized with respect to all the other parameters. This estimation procedure does not involve any minimization or maximization; thus, the fitting algorithm does not risk getting trapped in local optima, and the results are independent of the starting values used in the MCMC algorithm that simulates the posterior distribution (see Section 3.6). It might obviously happen that the posterior of one or more parameters is bimodal or multimodal. The presence of different parameter regions with high posterior density might suggest that the data show some evidence in favor of a more complex model, but not enough for this model to have the highest posterior probability. In this case, it is not reasonable to use the mean as a point estimate for the parameters, since it might fall in a low posterior density region, and the mode of the posterior distribution can be used in its place. In such situations of posterior multimodality, it is evident how the whole posterior distribution conveys much richer information than the simple parameter estimate.
3.5 The Occam’s razor principle
Even if Bayesian and classical analysis asymptotically give the same results, Bayesian results always have a probabilistic interpretation, and this is particularly relevant when we need to compare different models and determine, for instance, the number of spectral excitations (in the frequency domain) or the number of relaxations (in the time domain). In addition, the Bayesian method represents a natural implementation of the Occam’s razor [18, 19, 20]: this principle is intrinsic to Bayesian inference and is a simple consequence of the adoption of the Bayes theorem. In model choice problems, in fact, the posterior probabilities of the different models naturally penalize complex solutions with respect to simple ones, thus conforming to the parsimony principle.
To see this, consider Eq. (4), and imagine that the parameter vector also includes a model indicator parameter m, for m = 1, …, M. To make this more explicit, we can rewrite Eq. (4) as P(θ, m|y) = P(y|θ, m) P(θ, m) / P(y). Then, Bayesian model choice simply consists in choosing the model with the highest posterior probability P(m|y). Thus, if the same a priori probability is attributed to the M models, i.e., P(m) = 1/M, the posterior probability is simply proportional to the marginal likelihood:

P(m|y) ∝ P(y|m). (12)
Now, consider for simplicity just two possible models: the first one, denoted as M_1, more complex and characterized by a larger number of parameters, and the second one, denoted as M_2, simpler and characterized by a smaller number of parameters. Clearly, the more complex model is able to generate a much wider range of possible datasets (i.e., datasets for which the model would provide a reasonable fit) than the simpler model. Therefore, the marginal likelihood P(y|M_1) is more dispersed than P(y|M_2) (cf. Figure 28.3 of Ref. ). This implies that datasets in accordance with both M_1 and M_2 have P(y|M_2) > P(y|M_1), while those in accordance with just the more complex model have P(y|M_1) > P(y|M_2). If the two models are a priori given the same probability, for datasets in accordance with both models, the inequality P(M_2|y) > P(M_1|y) holds for the posterior probabilities, determining the choice of the simplest model to represent the data.
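This automatic Occam penalty can be reproduced in a few lines; the sketch below compares the marginal likelihoods of a hypothetical simple model (mean fixed at zero) and a complex one (mean free, with a broad prior) for a single Gaussian datum, with all numbers chosen purely for illustration:

```python
import math

# Toy Occam penalty: a single datum y with unit measurement variance is
# described by two hypothetical models. M2 fixes the mean at 0 (no free
# parameters); M1 lets the mean vary with a broad uniform prior on [-10, 10].

def gauss(y, mu):
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2.0 * math.pi)

y = 0.3  # a datum compatible with both models

# Marginal likelihoods: M2 has none to average over; for M1 the likelihood
# is averaged over the prior on mu (numerical integration on a grid)
evidence_m2 = gauss(y, 0.0)
dmu = 20.0 / 2000
mus = [-10.0 + dmu * i for i in range(2001)]
evidence_m1 = sum(gauss(y, mu) * (1.0 / 20.0) for mu in mus) * dmu

# M1 spreads its predictions over many possible datasets, so M2 wins here
print(evidence_m2 > evidence_m1)  # True
```

Had the datum fallen far from zero (say y = 3), the inequality would reverse: the data would then justify the extra complexity, which is the evidence-based behavior described above.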
3.6 Bayesian computation of model parameters
As already stated, Bayesian inference completely relies on the joint posterior distribution P(θ|y). However, for a complex model, it is often impossible to compute this posterior distribution analytically, and the latter is only known up to a normalizing constant. The MCMC methods allow one to draw values from distributions known up to a normalizing constant and, thus, to obtain the simulated joint posterior distribution. In practice, MCMC methods consist in constructing an ergodic Markov chain ( Figure 3 ) with states θ^(1), …, θ^(T) and a stationary distribution corresponding to the joint posterior distribution. T is the number of states, i.e., the number of updates of the parameter values, and is generally called the number of sweeps of the MCMC algorithm. At each sweep of the algorithm, a new draw of θ from its posterior is obtained by updating all the parameters in turn, drawing each of them from its posterior distribution, conditional on the values of all the other parameters. If this conditional posterior distribution is known, the parameter is updated using a Gibbs sampling step, which simply draws the new value of the parameter from this known distribution. Otherwise, if this conditional posterior distribution is known only up to a normalizing constant, the parameter needs to be updated through a Metropolis-Hastings move. This move is built as follows. Suppose that, at a given sweep of the algorithm, the current value of a certain parameter is θ. A new candidate value θ′ can be drawn from an opportunely chosen proposal distribution q(θ′|θ), which generally depends on the current value θ. The new value is then accepted with a probability equal to min(1, α), where α is given by:

α = [P(y|θ′) P(θ′)] / [P(y|θ) P(θ)] × [q(θ|θ′) / q(θ′|θ)], (13)

where θ′ here denotes the whole parameter vector with the considered parameter replaced by its new value, P(y|θ′) is the likelihood, and P(θ′) is the prior. In other words, α is nothing else but the ratio between the joint posterior distribution calculated with the updated parameter value and the posterior distribution calculated with the current one, multiplied by the ratio between the proposals, q(θ|θ′)/q(θ′|θ). The higher the posterior ratio, the larger α and hence the probability to move to the new parameter value. In practice, to decide whether or not a candidate value is accepted, a random number is drawn from a uniform distribution defined between 0 and 1 and compared with the calculated value of α. If the random number is less than α, the parameter is updated to the new value; otherwise the new value is rejected. The way the acceptance rule in Eq. (13) is built ensures that the resulting Markov chain has the joint posterior distribution as its stationary distribution.
Concerning the proposal distribution, this should be chosen as a distribution from which it is easy to sample. It could be, for instance, a normal distribution centered on the current value of the parameter and with a certain variance, which can be adjusted and used as a tuning parameter. The name alludes to the fact that adjustments of this parameter literally tune the step size of the parameter updates. For a normal proposal distribution, a large variance allows the new value to differ substantially from the current one. However, if we are already in a high posterior density region for the parameter, values far from the current one will fall in low-density regions and be accepted with a very low probability. As a consequence, the algorithm will remain stuck on the same value of the parameter for a long time, causing an inefficient exploration of the parameter space. On the contrary, a small variance will constrain the candidate to be close to the current value. In this case, the new value has a high probability of being accepted, but the algorithm will move slowly and take a long time to reach convergence to the stationary distribution. The tuning parameters can be appropriately chosen so that the algorithm explores the parameter space efficiently. A rule of thumb states that this happens when the acceptance ratio for each parameter is about 30% .
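As an illustration of the scheme just described, a minimal random-walk Metropolis sampler can be sketched in a few lines of Python. The Gaussian target below is only a stand-in for a posterior known up to its normalizing constant, and all names and values are illustrative, not taken from the reference:

```python
import math
import random

def log_target(x):
    # Unnormalized log-density: a stand-in for a posterior known
    # only up to its normalizing constant.
    return -0.5 * (x - 3.0) ** 2

def metropolis_hastings(n_sweeps, x0=0.0, proposal_sd=1.0, seed=42):
    """Random-walk Metropolis sampler; returns the chain and acceptance rate."""
    rng = random.Random(seed)
    chain, x, accepted = [], x0, 0
    for _ in range(n_sweeps):
        x_new = rng.gauss(x, proposal_sd)              # symmetric proposal q
        log_alpha = log_target(x_new) - log_target(x)  # q-ratio cancels here
        if math.log(rng.random()) < log_alpha:         # accept with prob min(1, alpha)
            x, accepted = x_new, accepted + 1
        chain.append(x)
    return chain, accepted / n_sweeps

chain, rate = metropolis_hastings(20_000, proposal_sd=2.4)
post_burn_in = chain[2_000:]
mean = sum(post_burn_in) / len(post_burn_in)
```

The proposal standard deviation plays exactly the role of the tuning parameter discussed above: shrinking it raises the acceptance rate but slows the exploration, enlarging it does the opposite.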
When the parameter vector also includes a model indicator parameter, a further move needs to be considered to update this parameter and to allow the algorithm to explore different models. This move is a reversible jump  step, specifically designed to allow the Markov chain to move between states of different dimensions (since the dimension of the parameter space varies according to the model considered).
As a final remark, consider that, once the MCMC algorithm reaches convergence after a so-called “burn-in” period, the draws not only effectively represent samples from the joint posterior distribution but are also theoretically independent of the starting values of each parameter. A few examples on this point are shown in Table 1 of Ref. . Notice, however, that the time required to reach convergence can vary considerably depending on the data and the prior. For example, peaked unimodal posterior distributions (i.e., highly informative data) generally speed up convergence, as does the availability of substantial prior information, which reduces the size of the effectively accessible parameter space. On the contrary, the presence of many high posterior density regions can hinder and slow down convergence.
4. The Bayesian approach in neutron and X-ray scattering spectroscopy
4.1 Neutron and X-ray Brillouin scattering
One of the models commonly used to analyze either neutron or X-ray scattering data is the so-called damped harmonic oscillator (DHO) profile, which we report here below:
where is the Dirac delta function describing the elastic response of the system modulated by an intensity factor, is the Bose population factor expressing the detailed balance condition, and the term in curly brackets is the sum of a Lorentzian central contribution and the contributions of pairs of peaks, the DHO doublets, symmetrically shifted from the elastic position. The generic j-th DHO is characterized by its undamped oscillation frequency, damping, and intensity factor. The Lorentzian contribution, not necessarily present, accounts for the quasielastic response of the system. We have intentionally expressed the inelastic contribution as an indefinite sum of terms, as the scattering signal from amorphous systems is often poorly structured and the number of inelastic modes contributing to it is often hard to guess. A series of concomitant factors, such as the instrument energy resolution, the limited statistical accuracy, and the intrinsically weak scattering signal, can make the line shape modeling far from straightforward. In a Bayesian perspective, the number of inelastic features can be treated as a parameter to be estimated along with the other model parameters.
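A sketch of the inelastic part of such a model (central Lorentzian plus a sum of DHO doublets) might look as follows; the detailed-balance factor and the elastic delta are omitted for brevity, and the parameter names are ours, not taken from the original:

```python
import numpy as np

def dho_model(omega, I0, gamma0, dho_params):
    """Quasielastic Lorentzian plus a sum of DHO doublets (the term in curly
    brackets of a model like Eq. (14)). dho_params holds one tuple
    (I_j, Omega_j, Gamma_j) per inelastic mode: intensity, undamped
    oscillation frequency, and damping."""
    s = I0 / np.pi * gamma0 / (omega**2 + gamma0**2)   # central Lorentzian
    for I_j, Omega_j, Gamma_j in dho_params:
        # The DHO profile is even in omega, so one term covers both
        # members of the symmetric doublet.
        s += (I_j / np.pi) * (Omega_j**2 * Gamma_j
              / ((omega**2 - Omega_j**2)**2 + (omega * Gamma_j)**2))
    return s

omega = np.linspace(-30.0, 30.0, 601)   # energy grid (arbitrary units)
spectrum = dho_model(omega, I0=1.0, gamma0=1.5,
                     dho_params=[(0.8, 15.0, 4.0)])
```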
To fit the experimental data, the model in Eq. (14) needs to be convolved with the instrument resolution, and it can conceivably sit on top of an unknown linear background. Overall, the final model used to approximate the measured line shape is given by:
where represents the convolution operator. For neutron scattering, the instrument resolution function often has a Gaussian shape; thus the final model reads as:
For IXS, Eqs. (14)–(16) are still formally valid, although the instrument resolution function usually has a slightly more complex shape, which enters the convolution of Eq. (15) either approximated by an analytical model or measured from the signal of an almost elastic scatterer; obviously, in the latter case, the convolution is computed numerically. The final model is further corrupted by an additive Gaussian noise, whose variance can, for instance, be taken proportional to each data point. Thus, the experimental data points are given by:
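The numerical convolution with a Gaussian resolution function and the corruption by additive Gaussian noise can be sketched, for instance, as follows; the grid, resolution width, and noise scale are purely illustrative:

```python
import numpy as np

def convolve_resolution(omega, model, sigma_res):
    """Numerically convolve a model line shape with a Gaussian resolution
    function sampled on the same (uniform) energy grid."""
    d_omega = omega[1] - omega[0]
    kernel_x = np.arange(-5 * sigma_res, 5 * sigma_res + d_omega, d_omega)
    kernel = np.exp(-0.5 * (kernel_x / sigma_res) ** 2)
    kernel /= kernel.sum()                      # preserve integrated intensity
    return np.convolve(model, kernel, mode="same")

omega = np.linspace(-20.0, 20.0, 801)
narrow_line = np.exp(-0.5 * (omega / 0.2) ** 2)       # quasi-elastic line
broadened = convolve_resolution(omega, narrow_line, sigma_res=1.0)

# Additive Gaussian noise with variance proportional to each data point.
rng = np.random.default_rng(0)
noisy = broadened + rng.normal(scale=np.sqrt(0.01 * np.abs(broadened) + 1e-6))
```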
The whole parameter vector for the model in Eq. (17) collects the model indicator and all the line shape, background, and noise parameters, so that its dimension depends on the number of inelastic modes. In a Bayesian perspective, suitable priors need to be chosen for each of its components. For example, the number of modes can safely be assumed uniformly distributed between 1 and a certain maximum value, opportunely fixed so that all models are a priori given the same probability. All parameters attaining only nonnegative values can, instead, be assumed to follow a Gamma distribution or a Gaussian distribution truncated at zero. Finally, the linear background parameters are assumed to follow a normal distribution, centered at zero and with a large variance, to keep the priors weakly informative.
Bayesian inference is, then, based on the joint posterior of the whole parameter vector. However, as mentioned, given the complexity of the model in Eq. (17), the normalizing constant in Eq. (9) cannot be analytically evaluated, and it is necessary to resort to MCMC methods to obtain a simulated joint posterior. Since the parameter space dimension depends on the number of inelastic modes, a step needs to be added to allow the exploration of a parameter space of variable dimension. The updating of this parameter can be implemented according to different types of moves, which, for instance, can enable either the creation (the birth) of a new component in Eq. (14) or the suppression (the death) of an existing one, i.e., the so-called birth-death moves; or they can promote the splitting of an existing component into two components or the combination of two existing components into one (split-combine move). These moves are described in Ref. . In practice, at each step, the algorithm tries to jump to another value of the number of modes (from 1 to 2, from 2 to 1 or 3, from 3 to 2 or 4, and so on). The new value is accepted with an acceptance probability that guarantees the convergence of the algorithm to the joint posterior distribution.
Once the convergence is attained, after a burn-in period, at each sweep , the MCMC algorithm draws a vector:
from the joint posterior . In practice, the output of the algorithm is a matrix of the form:
where each row is a particular draw of the whole parameter vector from its joint posterior, while each column refers to a particular parameter and represents the whole simulated marginal posterior distribution for that parameter, independently of the values observed for all the other parameters. Model choice can then be accomplished by considering the first column of the matrix in Eq. (20), that is, the simulated marginal posterior for the number of modes. This column contains a string of values (e.g., 1, 1, 1, 2, 2, 3, 2, 3, 4, 3, 3, 3, 2, …, 4, 5, 4, 3, 4, 3, 2, …). Therefore, the posterior probability that the number of modes equals a specific value is given by the relative frequency of occurrence of that value in the string, and the model chosen will be the one corresponding to the value with the highest occurrence.
Once a particular model with, let us say, inelastic modes has been chosen, the parameters of this model can be estimated conditionally on . This means that we only need to consider a submatrix of the matrix in Eq. (20), made up of those rows for which the first column is equal to . Then, a certain parameter can be estimated taking the mean (or the mode) of the corresponding column of this sub-matrix, which represents the simulated posterior distribution for , conditionally on the model with modes and marginalized with respect to all the other parameters, i.e., .
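The two steps just described, model choice by relative frequency and conditional parameter estimation on the corresponding submatrix, can be sketched on a toy MCMC output; the synthetic chain below merely mimics the structure of the matrix in Eq. (20) and is not real data:

```python
import numpy as np
from collections import Counter

# Toy MCMC output: first column is the model indicator k drawn at each
# sweep; the second column stands in for one of the parameter columns.
rng = np.random.default_rng(1)
n_sweeps = 5000
k_draws = rng.choice([1, 2, 3], size=n_sweeps, p=[0.15, 0.7, 0.15])
omega1_draws = rng.normal(12.0, 0.5, size=n_sweeps)
chain = np.column_stack([k_draws, omega1_draws])

# Posterior probability of each model = relative frequency of k in the chain.
counts = Counter(chain[:, 0].astype(int))
post_k = {k: c / n_sweeps for k, c in counts.items()}
k_star = max(post_k, key=post_k.get)            # most visited model

# Conditional estimation: keep only the sweeps with k == k*, then average
# the relevant column (its mean, or mode, estimates the parameter).
sub = chain[chain[:, 0] == k_star]
omega1_hat = sub[:, 1].mean()
```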
In assessing convergence, a valuable tool is provided by trace plots, which show the sampled values of a parameter over the sweeps of the algorithm. Ideally, a trace plot should exhibit rapid up-and-down variation with no long-term trends or drifts. If we imagine breaking this plot into a few horizontal sections, the trace within any section should not look much different from the trace in any other section. This indicates that the algorithm has converged to the posterior distribution of the parameters. Other convergence criteria can be found, for example, in Ref. . Figure 4 shows the trace plots of three DHO-mode frequencies the algorithm found, fitting a spectrum relative to IXS data on pure water recently measured (data not published) at room temperature and at a wave vector transfer , after the first 1000 (a), 10,000 (b), and 100,000 (c) sweeps. In plot (a), it can be seen how rapidly the first two frequencies reach their respective high-density regions, while the third has more problems in exploring the parameter space. Plot (b) shows that, after nearly 2000 sweeps, the third frequency also finally starts oscillating around its mean, according to its posterior distribution. Plot (c) illustrates that a burn-in of, for example, 10,000 sweeps is large enough to ensure convergence of the algorithm: the trace plots for the three parameters stabilize well before the end of the burn-in period.
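A crude numerical version of this visual check, comparing the means of a few horizontal sections of a trace, could read as follows; the synthetic trace stands in for a real post-burn-in MCMC output:

```python
import numpy as np

def section_means(trace, n_sections=5):
    """Split a (post-burn-in) trace into consecutive sections and return
    their means: roughly equal section means are a rough, necessary (not
    sufficient) indication of stationarity."""
    sections = np.array_split(np.asarray(trace), n_sections)
    return [s.mean() for s in sections]

rng = np.random.default_rng(7)
trace = rng.normal(23.8, 0.7, size=10_000)   # stationary-looking toy trace
means = section_means(trace)
spread = max(means) - min(means)             # small spread: no visible drift
```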
In Figure 5 , we report an example of Bayesian analysis applied to neutron Brillouin scattering data from liquid gold  at different values of the momentum transfer. In this work, after a proper removal of spurious effects such as background, self-absorption, and multiple scattering, the data look indeed rather structured, so that inferring the number of inelastic components seems rather obvious, and the result confirms the findings of a previous work . Estimates were obtained from sweeps of the algorithm, after a burn-in of sweeps, and the running time of the algorithm was approximately 5–10 minutes per spectrum. We chose a deliberately large burn-in, but convergence was normally achieved within a few hundred sweeps.
Even in this straightforward case, however, additional insights can be obtained from the posterior distributions delivered by the Bayesian inference. For example, in Figure 6 , it can be noticed that, as the value of increases, the posterior probability of also increases. This trend in the discrete distribution for as a function of could possibly convey interesting insights on the actual onset of a second excitation or simply indicate a progressive degradation of the experimental data or, still, suggest that, as the damping becomes more and more effective, the determination of the number of inelastic features becomes more controversial.
To investigate these issues, one can look, for example, at the posterior distributions for the excitation frequency, conditional on the competing values of the number of modes (see Figure 7 for an example on the same data of Figure 5 ). Considering the matrix in Eq. (20), as explained above, all the values in the column referring to the excitation frequency, taken in the rows corresponding to the single-mode model, are draws from its posterior conditional on that model, and a histogram of these values can be used to visualize this marginal conditional posterior. In the same way, the values in the same column, taken in the rows corresponding to the two-mode model, are draws from the posterior of the first frequency conditional on that model, while those in the column referring to the second frequency, in the same rows, represent draws from its own conditional posterior. Figure 7 illustrates, from left to right, these three distributions.
The shape of these posterior distributions provides a measure of the precision with which the parameter is estimated. For example, the first one is well shaped, i.e., unimodal and approximately symmetric, yet quite dispersed. Its mean is equal to 23.8 , but there is a 95% probability that the value lies in the large interval 22.3–25.2 . This large interval tells us that many different values are compatible with the data, signifying that the inelastic mode is largely damped (as also confirmed by the large damping value) and less defined, which reveals the large uncertainty in the estimation of the undamped oscillation frequency of the excitation. If we now look at the posteriors conditional on the two-mode model, we can see that these are much worse shaped, with unreasonably large or small values having nonvanishing probability. Their means are, respectively, 17.6 and 25.5 and lie outside the probability interval obtained for the single-mode model. Therefore, based on these findings, the evolution of the posterior probability of the two-mode model with the momentum transfer seems simply to reveal the increasingly elusive discernment of distinct inelastic features as their damping, or broadening, increases. In practice, at the highest explored momentum transfer (16 nm−1), the oscillation mode becomes so highly damped that it can be fitted equally well either by two distinct DHO peaks or by a (broader) single one in between. At this stage, Occam's razor, naturally built into the Bayesian model choice, comes into play and ultimately privileges the model with only one DHO, as it involves fewer free parameters. Imagine, instead, that the single-mode posterior were bimodal, with the two modes corresponding to the single modes of the two-mode posteriors, respectively, as observed, for instance, in Ref. . In this case, the bimodality would have provided stronger support to the actual presence of two DHOs, suggesting that the finding stemmed only from the scarcity of data. Had this been the case, additional observations would probably have led to privileging a more complex model.
Data discussed in Ref.  provide another example of the efficacy of Bayesian inference in enforcing the parsimony principle. Specifically, we refer to the case of an IXS measurement from a suspension of gold nanoparticles in water, which was analyzed with a model similar to the one in Eq. (14), yet with the DHO terms replaced by Dirac delta functions, due to the extremely narrow width of the measured excitations. For all the explored momentum transfers, the posterior distributions for the number of inelastic modes have a maximum ( Figure 8 ) at a value smaller than the largest allowed one. In particular, we can also observe that the most probable number of modes and the related probability change from one dataset to the other; this partly reflects the physics of the phenomenon under study but also drawbacks of the modeling, such as the limited count statistics and the increasingly intertwined nature of spectral features at high momentum transfer.
As a further remark, we would like to stress again that results from Bayesian inference are always to be interpreted in a probabilistic sense. For instance, we stated before that the oscillation mode lies in a given interval with a probability of 95%. This interval, called a credibility interval, is obtained by sorting the values drawn from the posterior conditional on the chosen model and taking the two values below which we find, respectively, 2.5% and 97.5% of all simulated values. In practice, the values inside the interval are those with the highest density given the observed data and are thus the most credible. Classical confidence intervals, obtained in the frequentist approach, do not have such a probabilistic interpretation. The interpretation of confidence intervals is that, if we imagine repeating the data sampling indefinitely under the same conditions and building a confidence interval at a certain confidence level 1 − α for each of the datasets, then a fraction 1 − α of these confidence intervals will contain the true, fixed value of the parameter. However, we have no guarantee that the single confidence interval, calculated on the basis of the only dataset actually observed, contains the true parameter value. We can only be confident, at the 1 − α level, that it does so, since it comes from a set of intervals, a fraction 1 − α of which do contain the parameter value. In practice, under a frequentist approach, data are random variables and give rise to random intervals that have a specific probability of containing the fixed, but unknown, value of the parameter. The single interval is also fixed and may or may not contain the fixed parameter, but we cannot associate any probability measure to this possibility. In the Bayesian approach, the parameter is random in the sense that we have a prior belief about its value, while the interval can be thought of as fixed, once the data have been observed.
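The construction of an equal-tailed credibility interval from posterior draws can be sketched as follows; the normal draws below merely mimic a posterior sample (with a mean and spread loosely comparable to the example above) and are not the actual data of the reference:

```python
import numpy as np

def credibility_interval(draws, level=0.95):
    """Equal-tailed Bayesian credibility interval from posterior draws:
    sort the simulated values and cut off the (1 - level)/2 tails."""
    lo = 100 * (1 - level) / 2
    return np.percentile(draws, [lo, 100 - lo])

rng = np.random.default_rng(3)
draws = rng.normal(23.8, 0.74, size=50_000)   # stand-in posterior draws
low, high = credibility_interval(draws)
```

Unlike a frequentist confidence interval, this interval supports the direct statement that the parameter lies inside it with 95% posterior probability, given the observed data.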
In summary, the frequentist approach does provide a definition of confidence intervals, which, however, are endowed with a robust probabilistic ground only with respect to the hypothetical space of all possible repetitions of the measurement, not with respect to the unique dataset at hand.
4.2 Bayesian inference in the time domain
Time correlation function decays can be modeled in terms of an expansion of the intermediate scattering function in exponentials, and the aim is often to determine the number of time decay channels that can be envisaged in the relaxation of this function. In Ref. , the dynamics of polymer-coated gold nanoparticles was tackled by neutron spin echo (NSE) scattering and analyzed within a Bayesian approach, with the goal of establishing how many characteristic relaxations were present in a given spin echo time window and whether they could be described by simple or stretched exponentials, or by a combination of the two. The data were assumed to be sampled by the following model:
where is a proportionality constant possibly enabling a data normalization, represents the number of exponential relaxations, is the weight of the j-th component of the exponential mixture, its relaxation time, and its stretching parameter. The remaining terms are random noise contributions, accounting for statistical errors in the measurements. These are assumed to be independent and identically distributed with a normal distribution, whose variance is proportional to the measurement error of the corresponding observation. As a consequence, the likelihood of the data is a product of normal densities.
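A sketch of the model in Eq. (21) might read as follows; the parameter values (one simple and one stretched component) and the time grid are illustrative, not those of the reference:

```python
import numpy as np

def stretched_exp_mixture(t, c, weights, taus, betas):
    """Intermediate scattering function model: a weighted sum of (possibly
    stretched) exponential decays, scaled by a normalization constant c,
    as in a model like Eq. (21)."""
    t = np.asarray(t, dtype=float)
    decay = sum(w * np.exp(-(t / tau) ** beta)
                for w, tau, beta in zip(weights, taus, betas))
    return c * decay

t = np.linspace(0.01, 50.0, 200)              # spin echo times (illustrative units)
model = stretched_exp_mixture(t, c=1.0,
                              weights=[0.6, 0.4],
                              taus=[2.0, 20.0],
                              betas=[1.0, 0.8])  # beta = 1: simple exponential
```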
The number of relaxations is, obviously, unknown, and its determination is of great relevance. Therefore, also in this case, it is considered a stochastic variable to be estimated from the data, conditional on all the other model parameters. Imagine that we have no clue about how many relaxations are necessary to describe the observed behavior of the time correlation function. We are aware, however, that in such a case the risk of over-parametrizing the model is high, and we certainly know that, given the finite time window covered by the experiment and the limited number of experimental data, the number of relaxations should not be too large; otherwise the results could be meaningless, hardly justifiable, and unlikely. Therefore, it seems a priori reasonable to assume a uniform distribution on a discrete set of values up to a small integer, as previously done when dealing with the number of excitations in the energy spectrum. Also, the relaxation times are supposed uniformly distributed on a continuous range of nonnegative values. The prior on the weights is tailored to ensure that the combination of relaxation terms fulfills the required constraints, namely that the weights are nonnegative and sum to one. The natural choice for this prior is, then, a Dirichlet density, which takes values on the standard simplex. A crucial prior is that of the stretching parameter. It is specifically meant, in fact, to discern whether the relaxations in the given time window are simple exponential decays, stretched exponential decays, or a combination of the two. A simple exponential decay corresponds to a stretching parameter equal to 1, and thus a positive probability mass can be assigned to this specific value. The remaining probability can be assigned to values within the interval (0, 1).
Therefore, a reasonable prior for the stretching parameter can be a mixed distribution made up of a probability mass at 1 and a continuous beta density, independently for each component, where the beta density has two shape parameters and the mixture weight denotes our prior support in favor of a stretched, rather than simple, exponential component. Once these weights are assigned to the sums of simple and stretched exponential terms in Eq. (21), other weights will be associated to mixed combinations of these decay terms. In particular, a weight of 0.5 means that the j-th exponential can a priori be either stretched or not with the same probability, for all components. In addition, setting both shape parameters of the beta density to 1 allows one to assume that the stretching parameters are uniformly distributed on the interval (0, 1). This corresponds to an uninformative prior, giving a probability of 0.5 to both a stretched and an unstretched component and, for a stretched component, assigning the same density to any value in (0, 1). Obviously, more informative priors can be chosen, e.g., by assigning different values to the shape parameters, so as to favor, for example, a Zimm or Rouse model specification (see discussion in Ref. ) when dealing with polymer dynamics. A similar prior can be adopted for the proportionality constant, i.e., a mixed distribution made up of a continuous beta density and a probability mass at 1, the latter corresponding to no need for a refinement of the data normalization. Finally, the proportionality constant in the error variance can, for example, be assumed to have a priori a gamma density, so that only nonnegative values are allowed.
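Sampling from such a mixed (spike-and-slab) prior can be sketched as follows; all parameter names are illustrative, and Beta(1, 1) reduces to the uniform distribution on (0, 1), as noted above:

```python
import random

def sample_beta_prior(p_stretch=0.5, a=1.0, b=1.0, rng=random):
    """Draw a stretching exponent from the mixed prior described above:
    with probability 1 - p_stretch return exactly 1 (simple exponential),
    otherwise draw from a Beta(a, b) on (0, 1) (stretched exponential).
    Parameter names are illustrative, not taken from the reference."""
    if rng.random() >= p_stretch:
        return 1.0                      # probability mass at beta = 1
    return rng.betavariate(a, b)        # continuous part on (0, 1)

rng = random.Random(11)
draws = [sample_beta_prior(rng=rng) for _ in range(10_000)]
frac_simple = sum(d == 1.0 for d in draws) / len(draws)
```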
Let us consider one of the datasets in Ref. , representing the time correlation decay of a polymer solution of polyethylene glycol with a molecular weight of (PEG2000), as measured in an NSE scattering experiment and collected at a momentum transfer . Also in this case, we allowed for sweeps of the algorithm and a burn-in of sweeps, resulting in approximately 5–10 minutes of computing time. From the output of the MCMC-RJ algorithm, the values of the discrete posterior distribution of k reported in Table 1 are found.
The most visited model is the one with two exponential functions. The corresponding fit is shown in Figure 9 .
When we model a spectroscopic dataset through a homogeneous mixture, e.g., a linear combination of exponentials, Lorentzians, or DHO functions, the posterior distribution for the number of components always has at least one maximum, unless the data are so scarcely informative that this posterior simply reproduces the prior, which might be uniform. In principle, when jumping to a more complicated model characterized by a larger number of parameters, the prior density assigned to any given parameter configuration tends to decrease, and the likelihood tends to increase. However, according to the Bayes theorem, the posterior for the number of components is computed by averaging the likelihood over all the parameter values (see Eq. (12)). Therefore, models that are under-parametrized will perform poorly on average, since they just cannot fit the data well enough and have a small likelihood, while models that are over-parametrized will also perform poorly on average, because the subset of the parameter space that fits the data well (and where the likelihood is high) becomes tiny compared to the whole volume of the parameter space. This means that adding components to the mixture model increases the posterior only as long as the increment in the likelihood more than compensates for the augmentation of the “wasted” parameter space; overall, the competition of these effects ensures the presence of a maximum in the posterior. It is worth noticing that assuming a model with more free parameters does not necessarily yield a better fit, once the likelihood has saturated. To see this, we report below ( Figure 11 ) the fits obtained with a larger number of relaxation channels. We can observe that the fit with three or more relaxation components is no better than the one most supported by the available data and estimated by the MCMC algorithm. Moreover, it would be hopeless to assign a distinct physical meaning to each of the corresponding characteristic relaxation times.
Let us introduce a quantity resembling the usual χ², namely:
which measures the distance between the experimental data and the best fit determined by the MCMC algorithm, in terms of the number of experimental observations, the experimental data, the best fit values, and the experimental errors. This variable differs from the usual χ², as the model parameters are not estimated by least-squares minimization but are the averages of the corresponding marginal posterior distributions. Nevertheless, we can use this quantity to show what follows. If we calculate the quantity in Eq. (22) for each number of relaxation channels, we get the values reported in Table 2 , which indicate an overall decrease upon increasing the number of exponentials. Actually, this quantity does not strictly decrease with the number of parameters because, as mentioned before, the fit is not calculated with parameter values that minimize it. If, in particular situations, the algorithm faces some difficulties in determining a parameter and its posterior distribution is very broad and slowly decaying, the average of this parameter could be severely affected by the presence of sizable distribution tails. In these cases, the mode of the distribution should be used instead to estimate the parameter.
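A plausible implementation of the quantity in Eq. (22), evaluated here on synthetic data, is sketched below; the normalization by the number of observations follows the description in the text, and the data, errors, and "posterior-mean" fit are all invented for illustration:

```python
import numpy as np

def posterior_mean_chi2(y, y_fit, sigma):
    """Chi-square-like distance of a quantity like Eq. (22): y_fit is the
    model evaluated at the posterior means (or modes) of the parameters,
    not at least-squares estimates, so this quantity need not decrease
    monotonically with model complexity."""
    y, y_fit, sigma = map(np.asarray, (y, y_fit, sigma))
    return np.sum(((y - y_fit) / sigma) ** 2) / len(y)

rng = np.random.default_rng(5)
sigma = np.full(50, 0.05)                      # experimental errors
truth = np.exp(-np.linspace(0.0, 3.0, 50))     # stand-in posterior-mean fit
y = truth + rng.normal(scale=sigma)            # synthetic "measured" decay
chi2 = posterior_mean_chi2(y, truth, sigma)    # close to 1 for a good fit
```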
Nevertheless, Table 2 shows that, even with a distance between experimental and fitted values that effectively decreases as the number of parameters increases, the most probable model from Table 1 is the one with two relaxation channels and not one with more. The effect of the razor is evident. It can also be noted that the fit with two channels is not only the most probable but also the best, in the sense that it is much better than the one with a single channel and no worse than those obtained using a larger number of exponentials. More interestingly, we also observe that, upon increasing the number of parameters, the χ² (or any other measure of the distance between the fitted and the observed data) does not keep decreasing appreciably, because this quantity eventually saturates (and so does the likelihood). The fit with the largest number of channels yields a value not too different from the one obtained with two. Incidentally, as largely discussed in Ref. , the model with two relaxation channels also has a perfectly plausible and consistent physical explanation, which would not be possible if a more complicated model were chosen.
In summary, we have here shown some of the opportunities offered by a Bayesian inference analysis of experimental results and, in particular, of those obtained with spectroscopic methods. As a possible future development, the application of similar methods to the joint analysis of complementary time- or frequency-resolved measurements appears very promising. Also, we can envisage the use of more informative priors implementing the fulfillment of sum rules of the spectra or any other known physical constraint of the measurement. We are confident that, in the long run, these methods will improve the rigor of routine data analysis protocols, supporting a probability-based, unprejudiced interpretation of the experimental outcome.
This work used resources of the National Synchrotron Light Source II, a US Department of Energy (DOE) Office of Science User Facility operated for the DOE Office of Science by Brookhaven National Laboratory under Contract No. DE-SC 0012704. The open access fee was covered by FILL2030, a European Union project within the European Commission’s Horizon 2020 Research and Innovation programme under grant agreement N°731096.
We would like to thank U. Bafile, E. Guarini, F. Formisano, and M. Maccarini for the very stimulating discussions.