Open access peer-reviewed chapter

Bayesian Inference for Regularization and Model Complexity Control of Artificial Neural Networks in Classification Problems

Written By

Son T. Nguyen, Tu M. Pham, Anh Hoang, Linh V. Trieu and Trung T. Cao

Submitted: 26 July 2023 Reviewed: 30 July 2023 Published: 23 November 2023

DOI: 10.5772/intechopen.1002652

From the Edited Volume

Bayesian Inference - Recent Trends

İhsan Ömür Bucak


Abstract

Traditional neural network training is usually based on maximum likelihood to obtain appropriate network parameters, including weights and biases, given the training data. However, if the available data are finite and noisy, maximum likelihood-based training can cause the trained network to overfit the noise. This problem can be overcome by applying Bayesian inference to neural network training, as has been done in various applications. Bayesian inference allows the values of regularization parameters to be found using only the training data. In addition, the Bayesian approach allows different models (e.g., neural networks with different numbers of hidden units) to be compared using only the training data. Neural networks trained with Bayesian inference are also known as Bayesian neural networks (BNNs). This chapter focuses on BNNs for classification problems, with the model complexity of BNNs conveniently handled by a method known as the evidence framework.

Keywords

  • Bayesian inference
  • artificial neural networks
  • classification problems
  • model complexity
  • evidence framework

1. Introduction

Multi-layer perceptron (MLP) neural networks are commonly used in a wide range of neural network applications. The key challenge when developing an MLP neural network is generalization: how well the trained network predicts new, unseen cases. Indeed, insufficient network complexity can result in "underfitting", in which significant structure in the data may be missed. In contrast, excessive network complexity can cause "overfitting", in which the noise in the data is also fitted. The complexity of an MLP neural network depends on the magnitudes of its weights and biases. In addition, the network size, determined by the number of hidden nodes, is also a useful measure of network complexity.

The Bayesian approach has long been used to improve the generalization capabilities of MLP neural networks despite limited or noisy data [1, 2, 3]. Instead of considering only a single set of weights and biases, the Bayesian technique considers a distribution over weights and biases, which can result in good generalization for the trained network. Neural networks trained with Bayesian inference are also known as Bayesian neural networks (BNNs).

Until now, BNNs have been utilized in many important applications: fault diagnosis of power transformers using dissolved gas analysis [4], protein secondary structure prediction [5], hands-free control of power wheelchairs for severely disabled people [6, 7, 8], detection of hypoglycaemia in children with type 1 diabetes [9, 10], fault identification in cylinders [11], near-infrared spectroscopy [12], and electric load forecasting [13].

As BNNs can be effective for various classification problems, this chapter aims at providing a procedure for deploying BNNs for classification with the following items:

  • Exploring the evidence framework, which allows all available data to be used to train the BNNs. This is important when collecting a relatively large dataset is expensive or time-consuming.

  • The traditional gradient descent algorithm, with its fixed step size and simple choice of search direction, usually leads to excessively long training times. This chapter therefore focuses on three advanced optimization algorithms for training BNNs, namely the conjugate gradient, scaled conjugate gradient and quasi-Newton algorithms, which automatically adjust the step size and search direction.

  • As finding the optimal network architecture is often a time-consuming task, this chapter also introduces Bayesian model comparison for choosing the best network architecture. This is done by evaluating the evidence for candidate BNNs with different numbers of hidden nodes; the architecture with the highest evidence is chosen for final use.


2. Bayesian learning for neural networks

The objective of maximum likelihood approaches to network learning is to find a single vector of weights and biases that minimizes a data error function. In contrast, the Bayesian technique considers a probability distribution over the weight and bias space, representing the relative degrees of belief in different vectors of weights and biases. A prior distribution is first set over the weights and biases. Using Bayes' theorem and the observed data, this prior is then converted into a posterior distribution, which can be used to assess the network's predictions for new values of the input variables.

2.1 The prior distribution of weights and biases

The weights and biases of a neural network can be split into G groups. Let W_g denote the number of weights and biases in group g and w_g denote the vector of weights and biases in group g. Conveniently, the prior distribution of the vector of weights and biases in group g can be assumed to be a zero-mean Gaussian distribution:

p(\mathbf{w}_g \mid \xi_g) = \left(\frac{\xi_g}{2\pi}\right)^{W_g/2} \exp\left(-\frac{\xi_g}{2}\,\|\mathbf{w}_g\|^2\right)    (E1)

where ξ_g is the hyperparameter that constrains the weights and biases in group g. The prior distribution in (1) reflects the assumptions that positive and negative weights and biases are equally likely and that their variance is finite.

The prior distribution of all the vectors of weights and biases w is given by:

p(\mathbf{w} \mid \boldsymbol{\psi}) = \frac{1}{Z_W(\boldsymbol{\psi})} \exp\left(-\sum_{g=1}^{G}\xi_g E_{W_g}\right)    (E2)

where E_{W_g} = (1/2)‖w_g‖² is the weight error function corresponding to group g, and ψ is the vector of hyperparameters:

\boldsymbol{\psi} = \left[\xi_1, \ldots, \xi_G\right]^{T}    (E3)

Z_W(ψ) is a normalization constant given by:

Z_W(\boldsymbol{\psi}) = \prod_{g=1}^{G}\left(\frac{2\pi}{\xi_g}\right)^{W_g/2}    (E4)

2.2 The posterior distribution of weights and biases

The weights and biases can be adjusted to their most probable (MP) values given the training data D. We can also compute the posterior weight distribution using Bayes’ theorem in the form below:

p(\mathbf{w} \mid D, \boldsymbol{\psi}) = \frac{p(D \mid \mathbf{w}, \boldsymbol{\psi})\, p(\mathbf{w} \mid \boldsymbol{\psi})}{p(D \mid \boldsymbol{\psi})}    (E5)

Eq. (5) can be expressed in words as follows:

\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}    (E6)

The posterior distribution of the weights and biases can then be written in the following form:

p(\mathbf{w} \mid D, \boldsymbol{\psi}) = \frac{\exp\left(-S(\mathbf{w})\right)}{Z_S(\boldsymbol{\psi})}    (E7)

where Z_S(ψ) is the normalization constant, given by:

Z_S(\boldsymbol{\psi}) = \int \exp\left(-S(\mathbf{w})\right) d\mathbf{w}    (E8)

In Bayesian inference, the most probable vector of weights and biases, w_MP, maximizes the posterior distribution of the weights and biases. This is equivalent to minimizing the cost function S(w) = E_D + Σ_{g=1}^{G} ξ_g E_{W_g}, i.e., the sum of the data error and the weighted weight errors.

2.3 The posterior probability of the hyperparameters

So far, the most probable vector of weights and biases, w_MP, cannot be computed because it depends on the values of the hyperparameters. The hyperparameters must therefore first be determined given the model and the data. Again, we can express the posterior distribution of the hyperparameters using Bayes' theorem as follows:

p(\boldsymbol{\psi} \mid D) = \frac{p(D \mid \boldsymbol{\psi})\, p(\boldsymbol{\psi})}{p(D)}    (E9)

In Eq. (9), the denominator p(D) does not depend on the hyperparameters. The prior distribution of the hyperparameters, p(ψ), is simply assumed to be uniform and can therefore be ignored. To infer the values of the hyperparameters, we thus only need to search for the values that maximize p(D|ψ), which is the quantity called "the evidence" in (5). The evidence can be calculated exactly by evaluating the following integral:

p(D \mid \boldsymbol{\psi}) = \int p(D \mid \mathbf{w}, \boldsymbol{\psi})\, p(\mathbf{w} \mid \boldsymbol{\psi})\, d\mathbf{w}    (E10)

where p(D|w, ψ) is the probability of the data, traditionally called "the dataset likelihood", which has the form:

p(D \mid \mathbf{w}, \boldsymbol{\psi}) = \exp\left(-E_D\right)    (E11)

where E_D = −Σ_{n=1}^{N} Σ_{k=1}^{c} t_k^n ln z_k^n is the cross-entropy data error function, z_k^n is the k-th network output for the n-th training pattern, and t_k^n is the corresponding target.

Substituting (11) and (2) into (10) results in:

p(D \mid \boldsymbol{\psi}) = \int \exp\left(-E_D\right) \frac{1}{Z_W(\boldsymbol{\psi})} \exp\left(-\sum_{g=1}^{G}\xi_g E_{W_g}\right) d\mathbf{w} = \frac{Z_S(\boldsymbol{\psi})}{Z_W(\boldsymbol{\psi})}    (E12)

2.4 The Gaussian approximation to the evidence

To compute Z_S(ψ), the cost function S(w) can be approximated around the most probable weight vector w_MP by the quadratic form:

S(\mathbf{w}) \simeq S(\mathbf{w}_{MP}) + \frac{1}{2}\left(\mathbf{w}-\mathbf{w}_{MP}\right)^{T}\mathbf{A}\left(\mathbf{w}-\mathbf{w}_{MP}\right)    (E13)

where A is the Hessian matrix of the cost function S(w) evaluated at w_MP, given by:

\mathbf{A} = \mathbf{H} + \sum_{g=1}^{G}\xi_g \mathbf{I}_g    (E14)

In (14), H = ∇∇E_D(w_MP) is the Hessian matrix of the data error function E_D evaluated at w_MP, and I_g = ∇∇E_{W_g} is a diagonal matrix with ones in the positions corresponding to the weights and biases in group g and zeros elsewhere. Since the cost function S(w) is the negative logarithm of the posterior probability of the weights (up to an additive constant), Z_S(ψ) in (12) is a Gaussian integral that can be approximated as:

Z_S(\boldsymbol{\psi}) \simeq \exp\left(-S(\mathbf{w}_{MP})\right)\,(2\pi)^{W/2}\left(\det\mathbf{A}\right)^{-1/2}    (E15)

By substituting (15) and (4) into (12), we obtain the evidence as follows:

p(D \mid \boldsymbol{\psi}) = \frac{\exp\left(-S(\mathbf{w}_{MP})\right)(2\pi)^{W/2}\left(\det\mathbf{A}\right)^{-1/2}}{\prod_{g=1}^{G}\left(2\pi/\xi_g\right)^{W_g/2}} = \exp\left(-S(\mathbf{w}_{MP})\right)\left(\det\mathbf{A}\right)^{-1/2}\prod_{g=1}^{G}\xi_g^{W_g/2}    (E16)

Taking the logarithm of (16) gives:

\ln p(D \mid \boldsymbol{\psi}) = -S(\mathbf{w}_{MP}) - \frac{1}{2}\ln\det\mathbf{A} + \sum_{g=1}^{G}\frac{W_g}{2}\ln\xi_g    (E17)

Eq. (17) is the basis for the determination of the hyperparameters.

2.5 Determination of the hyperparameters

The most probable hyperparameters are determined by taking the derivative of (17) with respect to ξ_g (g = 1, …, G):

\frac{\partial}{\partial \xi_g}\ln p(D \mid \boldsymbol{\psi}) = -E_{W_g}(\mathbf{w}_{MP}) + \frac{W_g}{2\xi_g} - \frac{1}{2}\operatorname{tr}\left(\mathbf{A}^{-1}\mathbf{I}_g\right)    (E18)

where E_{W_g}(w_MP) is the weight error function for group g evaluated at the most probable vector of weights and biases, w_MP. Setting (18) to zero gives the following relationship:

2\,\xi_g E_{W_g}(\mathbf{w}_{MP}) = W_g - \xi_g \operatorname{tr}\left(\mathbf{A}^{-1}\mathbf{I}_g\right) = \gamma_g    (E19)

The right-hand side of Eq. (19) defines γ_g, the number of "well-determined" parameters among the weights and biases in group g. Finally, ξ_g is given by:

\xi_g = \frac{\gamma_g}{2 E_{W_g}(\mathbf{w}_{MP})}    (E20)
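
To make the re-estimation step concrete, the sketch below implements Eqs. (19) and (20) with NumPy. The function name, the representation of the weight groups (a list of index arrays) and the way the Hessian A is supplied are illustrative assumptions rather than a fixed interface.

```python
import numpy as np

def update_hyperparameters(A, w_mp, groups, xi):
    """Re-estimate the hyperparameter of each weight group using Eqs. (19)-(20).

    A      : Hessian of the cost function S(w) at w_mp, shape (W, W)
    w_mp   : most probable weight vector, shape (W,)
    groups : list of integer index arrays, one per group g
    xi     : current hyperparameter values, length G
    """
    A_inv = np.linalg.inv(A)
    xi = np.asarray(xi, dtype=float)
    xi_new = np.empty_like(xi)
    for g, idx in enumerate(groups):
        # gamma_g = W_g - xi_g * tr(A^-1 I_g): the well-determined parameters, Eq. (19)
        gamma_g = len(idx) - xi[g] * np.trace(A_inv[np.ix_(idx, idx)])
        # E_Wg = 0.5 * ||w_g||^2: the weight error of group g
        E_Wg = 0.5 * np.sum(w_mp[idx] ** 2)
        xi_new[g] = gamma_g / (2.0 * E_Wg)            # Eq. (20)
    return xi_new
```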

3. Advanced training algorithms for BNNs

Training a BNN is equivalent to minimizing a cost function S(w), which is a highly non-linear function of the weights and biases. The problem of minimizing continuous, differentiable functions of many variables has been widely studied, and many of the resulting methods are directly applicable to neural network training.

Neural network training consists of a sequence of iterative steps. At the m-th step, a step of size α_m is taken along the search direction d_m to update the weights and biases, that is, to reduce the cost function over the weight and bias space:

\mathbf{w}_{m+1} = \mathbf{w}_m + \alpha_m \mathbf{d}_m    (E21)

In the conventional gradient descent algorithm, the step size α_m is fixed for every step and is usually called the learning rate η. The algorithm also makes the simplest choice for d_m, setting it to the negative gradient −g_m:

\mathbf{w}_{m+1} = \mathbf{w}_m - \eta\,\mathbf{g}_m    (E22)

One disadvantage of the gradient descent technique is that it requires a suitable value of the learning rate η to be chosen. If η is too large, the function value may increase. If η is sufficiently small, the function value decreases steadily, but if only very small reductions are made at each step, the limiting sequence may not even be a local minimum. To overcome this drawback, a procedure called "line search" should be used to locate a local minimum of the cost function along the given search direction.

One very simple technique for reducing the number of iterative steps is to add a "momentum" term μ (0 < μ < 1) as follows:

\mathbf{w}_{m+1} = \mathbf{w}_m - \eta\,\mathbf{g}_m + \mu\left(\mathbf{w}_m - \mathbf{w}_{m-1}\right)    (E23)

The use of momentum can significantly improve the performance of gradient descent. However, it also introduces a second parameter whose value must be chosen appropriately. It is therefore worth considering algorithms that choose the step size and search direction automatically. This section presents three advanced optimization algorithms for training BNNs in classification problems: the conjugate gradient, scaled conjugate gradient, and quasi-Newton algorithms, which belong to a class of automated non-linear parameter optimization techniques. In contrast to the gradient descent algorithm, which depends on parameters specified by the user, these algorithms automatically adjust the step size and search direction to obtain fast convergence during network training.
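
For reference, before moving to those algorithms, a minimal sketch of plain gradient descent with a momentum term as in Eqs. (22) and (23) is given below; the gradient function is supplied by the caller, and the parameter values are placeholders.

```python
import numpy as np

def gradient_descent_momentum(grad, w0, eta=0.01, mu=0.9, n_steps=1000):
    """Fixed-step gradient descent with a momentum term, Eqs. (22)-(23).

    grad : callable returning the gradient g_m of the cost at the current weights
    w0   : initial weight vector (NumPy array)
    eta  : learning rate (fixed step size)
    mu   : momentum coefficient, 0 < mu < 1
    """
    w_prev = np.array(w0, dtype=float)
    w = w_prev.copy()
    for _ in range(n_steps):
        g = grad(w)
        w_next = w - eta * g + mu * (w - w_prev)   # Eq. (23)
        w_prev, w = w, w_next
    return w
```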

3.1 Line search

A line search is required by both the conjugate gradient algorithm and the quasi-Newton algorithm. The line search is a procedure for finding the step size α_min that minimizes the cost function S(w) along a given search direction d_m. A line search consists of two stages:

  • Bracketing the minimum: this stage aims to find a triple a < b < c such that S(a) > S(b) and S(c) > S(b). Since the cost function is continuous, this guarantees that a local minimum lies somewhere in the interval (a, c).

  • Locating the minimum itself: since the cost function is smooth and continuous, the minimum can be located by parabolic interpolation. This involves fitting a quadratic polynomial to the cost function through three successive points a < b < c and then moving to the minimum of the parabola.

3.1.1 Bracketing the minimum

Given a bracketing triple (a, b, c), we wish to find a new bracket that is as narrow as possible, as shown in Figure 1. This requires choosing a new trial point x in (a, c) and evaluating the cost function at that point. The position of the new trial point x can be determined using the "golden section search":

Figure 1.

Golden section search: Initial bracket (1, 2, 3) becomes (4, 2, 3), (4, 2, 5), etc.

\frac{c - x}{c - a} = 1 - \frac{1}{\varphi}    (E24)

where φ is called the “golden ratio”.

\varphi = \frac{1}{2}\left(1 + \sqrt{5}\right) = 1.6180339887\ldots    (E25)

If x > b, the new bracketing triple is (a, b, x) if S(x) > S(b), and (b, x, c) if S(x) < S(b). A similar choice is made if x < b. This algorithm is very robust, since no assumption is made about the function being minimized other than continuity.

3.1.2 Locating the minimum itself

As shown in Figure 2, the minimum of the cost function inside the interval (a, c) can be approximated by the point d at the minimum of a parabola fitted through the three points (a, S(a)), (b, S(b)) and (c, S(c)):

Figure 2.

An illustration of the process of parabolic interpolation used to perform line search minimization: A parabola (shown dotted) is fitted to the three points a, b, c. The minimum of the parabola, at d, gives an approximation to the minimum of S(α).

d = b - \frac{1}{2}\,\frac{(b-a)^2\left[S(b)-S(c)\right] - (b-c)^2\left[S(b)-S(a)\right]}{(b-a)\left[S(b)-S(c)\right] - (b-c)\left[S(b)-S(a)\right]}    (E26)

However, Eq. (26) does not guarantee that d always lies inside the interval (a, c). Also, if b is already at or near the minimum of the interpolating parabola, the new point d will be very close to b, which causes slow convergence when b is not yet close to the local minimum. To avoid these problems, parabolic interpolation must be combined with the golden section search when locating the local minimum.
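
The two line-search stages can be combined as in the sketch below, which brackets a minimum with the golden-section rule of Eqs. (24) and (25) and then refines it with one parabolic-interpolation step from Eq. (26). A production implementation would iterate these steps and guard against the degenerate cases noted above; the function names and initial bracket are assumptions for illustration.

```python
import numpy as np

PHI = 0.5 * (1.0 + np.sqrt(5.0))                 # golden ratio, Eq. (25)

def bracket_minimum(f, a, b):
    """Grow (a, b) into a triple a < b < c with f(a) > f(b) and f(c) > f(b).

    Assumes f(a) > f(b) for the initial pair, i.e. the function initially decreases.
    """
    c = b + (PHI - 1.0) * (b - a)
    while f(c) < f(b):
        a, b = b, c
        c = b + (PHI - 1.0) * (b - a)
    return a, b, c

def parabolic_minimum(f, a, b, c):
    """One parabolic-interpolation estimate of the minimum inside (a, c), Eq. (26)."""
    fa, fb, fc = f(a), f(b), f(c)
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    return b - 0.5 * num / den

# Example: minimize S(w_m + alpha * d_m) over the step size alpha.
# f = lambda alpha: cost(w + alpha * d)
# a, b, c = bracket_minimum(f, 0.0, 1.0)
# alpha_min = parabolic_minimum(f, a, b, c)
```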

3.2 Conjugate-gradient training algorithm

In the conjugate-gradient training algorithm, the new search direction is chosen so that [14]:

\mathbf{d}_m^{T}\mathbf{A}_m\mathbf{d}_{m-1} = 0    (E27)

where d_m is the search direction at step m, d_{m-1} is the search direction at step m−1, and A_m is the Hessian matrix of the cost function evaluated at w_m. We say that d_m is conjugate to d_{m-1}. If the cost function is quadratic, the step size α_m can be written as:

\alpha_m = -\frac{\mathbf{g}_m^{T}\mathbf{d}_m}{\mathbf{d}_m^{T}\mathbf{A}_m\mathbf{d}_m}    (E28)

According to Eq. (28), determining α_m requires evaluating the Hessian matrix A_m. In practice, however, α_m can instead be found by performing a very accurate line search. The new search direction is then given by:

\mathbf{d}_{m+1} = -\mathbf{g}_{m+1} + \beta_m \mathbf{d}_m    (E29)

where

\beta_m = \frac{\left(\mathbf{g}_{m+1} - \mathbf{g}_m\right)^{T}\mathbf{g}_{m+1}}{\mathbf{g}_m^{T}\mathbf{g}_m}    (E30)

Eq. (30) is called the Polak-Ribiere expression [2, 3].

Finally, the key steps of the algorithm can be summarized as follows (a compact sketch combining them is given after the list):

  1. Choosing an initial weight vector w1.

  2. Evaluating the gradient vector g_1 and setting the initial search direction d_1 = −g_1.

  3. At step m, minimizing S(w_m + α d_m) with respect to α to give w_{m+1} = w_m + α_min d_m.

  4. Testing to see whether the stopping criterion is satisfied.

  5. Evaluating the new gradient vector gm+1.

  6. Evaluating the new search direction using Eqs. (29) and (30).
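
The sketch below combines these steps, using the Polak-Ribiere update of Eq. (30) and delegating the line minimization of step 3 to SciPy's scalar minimizer. The function name, step budget and the simple gradient-norm stopping test are assumptions for illustration, not the exact routine used by the chapter's authors.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient_train(cost, grad, w0, n_steps=200, tol=1e-6):
    """Polak-Ribiere conjugate gradient minimization, Eqs. (21), (29) and (30)."""
    w = np.array(w0, dtype=float)                     # step 1: initial weight vector
    g = grad(w)                                       # step 2
    d = -g                                            # initial search direction d_1 = -g_1
    for _ in range(n_steps):
        # step 3: line search for the step size along d
        alpha = minimize_scalar(lambda a: cost(w + a * d)).x
        w = w + alpha * d
        g_new = grad(w)                               # step 5
        if np.linalg.norm(g_new) < tol:               # step 4: stopping criterion
            break
        beta = g_new @ (g_new - g) / (g @ g)          # Polak-Ribiere, Eq. (30)
        d = -g_new + beta * d                         # Eq. (29), step 6
        g = g_new
    return w
```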

3.3 Scaled conjugate-gradient training algorithm

The use of line search allows the step size in the conjugate gradient algorithm to be chosen without having to evaluate the Hessian matrix. However, the line search itself causes some problems. Every line minimization involves several computationally expensive error function evaluations.

The scaled conjugate gradient algorithm avoids the line-search procedure of the conventional conjugate gradient algorithm [15]. To keep the denominator of (28) positive, so that the value of the cost function does not increase, a multiple of the identity matrix is added to the Hessian matrix in the denominator of (28), giving the step size:

\alpha_m = -\frac{\mathbf{g}_m^{T}\mathbf{d}_m}{\mathbf{d}_m^{T}\mathbf{A}_m\mathbf{d}_m + \lambda_m\|\mathbf{d}_m\|^{2}}    (E31)

where λ_m is a non-negative scale parameter for adjusting the step size α_m. The denominator in (31) can be written as:

\delta_m = \mathbf{d}_m^{T}\mathbf{A}_m\mathbf{d}_m + \lambda_m\|\mathbf{d}_m\|^{2}    (E32)

If δ_m < 0, we can increase λ_m to make δ_m > 0. Let the raised value of λ_m be λ̄_m. The corresponding raised value of δ_m is then given by:

\bar{\delta}_m = \delta_m + \left(\bar{\lambda}_m - \lambda_m\right)\|\mathbf{d}_m\|^{2}    (E33)

To make δ̄_m > 0, we need to choose:

\bar{\lambda}_m = 2\left(\lambda_m - \frac{\delta_m}{\|\mathbf{d}_m\|^{2}}\right)    (E34)

Substituting (34) into (33) gives

\bar{\delta}_m = -\delta_m + \lambda_m\|\mathbf{d}_m\|^{2}    (E35)

δ̄_m is now positive and is used as the denominator in (31) to compute the step size α_m. To find λ_{m+1}, a comparison parameter is first defined as:

\Delta_m = \frac{2\left[S(\mathbf{w}_m) - S(\mathbf{w}_m + \alpha_m\mathbf{d}_m)\right]}{\alpha_m\,\mathbf{d}_m^{T}\mathbf{d}_m}    (E36)

Then the value of λm+1 can be adjusted using the following prescriptions:

\text{If } \Delta_m > 0.75 \text{ then } \lambda_{m+1} = \lambda_m/2    (E37)
\text{If } \Delta_m < 0.25 \text{ then } \lambda_{m+1} = 4\lambda_m    (E38)

In Eq. (31), d_m^T A_m d_m can be approximated as:

\mathbf{d}_m^{T}\mathbf{A}_m\mathbf{d}_m \simeq \frac{\mathbf{d}_m^{T}\left[\nabla S(\mathbf{w}_m + \sigma\mathbf{d}_m) - \nabla S(\mathbf{w}_m)\right]}{\sigma}    (E39)

where σ = σ_0/‖d_m‖ and σ_0 is chosen to be a very small value.

By substituting (39) into (31), we obtain:

\alpha_m = -\frac{\mathbf{g}_m^{T}\mathbf{d}_m}{\mathbf{d}_m^{T}\left[\nabla S(\mathbf{w}_m + \sigma\mathbf{d}_m) - \nabla S(\mathbf{w}_m)\right]/\sigma + \lambda_m\|\mathbf{d}_m\|^{2}}    (E40)

Eq. (40) is used to scale the step size. The new search direction is then determined using the same procedure as in the conjugate gradient algorithm.
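
The core of each scaled conjugate gradient iteration is the scaled step size. The sketch below implements Eqs. (31)-(35), (39) and (40) for a single iteration; the subsequent adjustment of λ_{m+1} via the comparison parameter of Eqs. (36)-(38) is omitted for brevity, and the function signature and σ_0 value are illustrative assumptions.

```python
import numpy as np

def scg_step_size(grad, w, d, lam, sigma0=1e-4):
    """Scaled step size for one SCG iteration, Eqs. (31)-(35) and (39)-(40).

    grad : callable returning the gradient of the cost S(w)
    w    : current weight vector
    d    : current search direction
    lam  : current scale parameter lambda_m (non-negative)
    """
    g = grad(w)
    sigma = sigma0 / np.linalg.norm(d)
    # finite-difference approximation of d^T A d, Eq. (39)
    dAd = d @ (grad(w + sigma * d) - g) / sigma
    delta = dAd + lam * (d @ d)                  # Eq. (32)
    if delta <= 0.0:
        lam_bar = 2.0 * (lam - delta / (d @ d))  # Eq. (34)
        delta = -delta + lam * (d @ d)           # Eq. (35): now positive
        lam = lam_bar
    alpha = -(g @ d) / delta                     # Eqs. (31) and (40)
    return alpha, lam
```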

3.4 Quasi-Newton training algorithm

In the Newton method, the network weights can be updated as:

\mathbf{w}_{m+1} = \mathbf{w}_m - \mathbf{A}_m^{-1}\mathbf{g}_m    (E41)

The vector A_m^{-1} g_m is called the "Newton direction" or the "Newton step". However, evaluating the Hessian matrix can be computationally expensive. Moreover, if the cost function is not quadratic, the Hessian matrix may not be positive definite, in which case a Newton step can increase the value of the cost function.

From Eq. (41), we can form the relationship between the weight vectors at steps m and m+1 as:

\mathbf{w}_{m+1} = \mathbf{w}_m - \alpha_m\mathbf{F}_m\mathbf{g}_m    (E42)

From (42), if α_m = 1 and F_m = A_m^{-1}, we recover the Newton method, while if F_m = I, we recover gradient descent with learning rate α_m.

F_m can be chosen to approximate the inverse of the Hessian matrix. In addition, F_m must be positive definite so that, for small α_m, a descent direction is obtained. In practice, the value of α_m can be found by a line search. Methods of the form of Eq. (42) are known as quasi-Newton methods. The most successful way to compute F_m is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula [2, 3]:

\mathbf{F}_{m+1} = \mathbf{F}_m + \frac{\mathbf{p}\mathbf{p}^{T}}{\mathbf{p}^{T}\mathbf{v}} - \frac{\left(\mathbf{F}_m\mathbf{v}\right)\left(\mathbf{F}_m\mathbf{v}\right)^{T}}{\mathbf{v}^{T}\mathbf{F}_m\mathbf{v}} + \left(\mathbf{v}^{T}\mathbf{F}_m\mathbf{v}\right)\mathbf{u}\mathbf{u}^{T}    (E43)

where p, v and u are defined as:

\mathbf{p} = \mathbf{w}_{m+1} - \mathbf{w}_m    (E44)
\mathbf{v} = \mathbf{g}_{m+1} - \mathbf{g}_m    (E45)
\mathbf{u} = \frac{\mathbf{p}}{\mathbf{p}^{T}\mathbf{v}} - \frac{\mathbf{F}_m\mathbf{v}}{\mathbf{v}^{T}\mathbf{F}_m\mathbf{v}}    (E46)

A highly accurate line search is no longer required here, since the line search is not a critical component of the algorithm. In contrast, in the conjugate gradient algorithm the line search must be performed accurately to ensure that the search direction is set correctly.
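
A sketch of quasi-Newton training with the BFGS update of Eqs. (43)-(46) is given below, maintaining the approximation F to the inverse Hessian and using SciPy's scalar minimizer as the line search. This is an illustrative implementation under those assumptions, not the exact routine used in the chapter.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def bfgs_train(cost, grad, w0, n_steps=200, tol=1e-6):
    """Quasi-Newton training with the BFGS inverse-Hessian update, Eqs. (42)-(46)."""
    w = np.array(w0, dtype=float)
    F = np.eye(w.size)                          # initial approximation F_1 = I
    g = grad(w)
    for _ in range(n_steps):
        d = -F @ g                              # quasi-Newton search direction
        alpha = minimize_scalar(lambda a: cost(w + a * d)).x
        w_new = w + alpha * d
        g_new = grad(w_new)
        p = w_new - w                           # Eq. (44)
        v = g_new - g                           # Eq. (45)
        Fv = F @ v
        u = p / (p @ v) - Fv / (v @ Fv)         # Eq. (46)
        # BFGS update of the inverse-Hessian approximation, Eq. (43)
        F = (F + np.outer(p, p) / (p @ v)
               - np.outer(Fv, Fv) / (v @ Fv)
               + (v @ Fv) * np.outer(u, u))
        w, g = w_new, g_new
        if np.linalg.norm(g) < tol:
            break
    return w
```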


4. Bayesian model comparison

When Bayesian inference is applied to neural network training, the principle of Bayesian model comparison can be used to select the optimal number of hidden nodes, given the training data. Suppose there is a set of neural networks X_i with different numbers of hidden nodes. Using Bayes' theorem, we can write the posterior probability of network X_i, once the training dataset D has been observed, as:

P(X_i \mid D) = \frac{p(D \mid X_i)\,P(X_i)}{p(D)}    (E47)

where P(X_i) is the prior probability of network X_i and the quantity p(D|X_i) is referred to as the "evidence" for X_i. If there is no reason to assign different priors to the candidate neural networks, the relative probabilities of the networks can be compared on the basis of their evidence alone.

The evidence for network X_i can be computed exactly as:

p(D \mid X_i) = \int p(D \mid \mathbf{w}, X_i)\,p(\mathbf{w} \mid X_i)\,d\mathbf{w}    (E48)

If the posterior distribution is sharply peaked in weight space around the most probable weight vector w_MP, then the integral (48) can be approximated as:

p(D \mid X_i) \simeq p(D \mid \mathbf{w}_{MP}, X_i)\,\frac{\Delta\mathbf{w}_{\text{posterior}}}{\Delta\mathbf{w}_{\text{prior}}}    (E49)

where Δw_prior and Δw_posterior are the prior and posterior uncertainties in the weights. The ratio Δw_posterior/Δw_prior is called the Occam factor.

Eq. (49) can be expressed in words as follows:

\text{Evidence} = \text{Best-fit likelihood} \times \text{Occam factor}    (E50)

The best-fit likelihood measures how well the network fits the data and the Occam factor (<1) penalizes the network for having the weight vector w. A network that has a large best-fit likelihood will receive a large contribution to the evidence. However, if the network has many hidden nodes, then the Occam factor will be very small. Therefore, the network with the largest evidence will make a trade-off between needing a large likelihood to fit the data well and a relatively large Occam factor, so that the model is not too complex.

The logarithm of the evidence for a network X_i containing the weight vector w and G hyperparameters ξ_g, evaluated at the most probable weight vector w_MP, is given by:

\ln\text{Ev}(X_i) = -E_D(\mathbf{w}_{MP}) + \ln\text{Occ}(\mathbf{w}_{MP}) + \sum_{g=1}^{G}\ln\text{Occ}\left(\xi_g^{MP}\right)    (E51)

where E_D(w_MP) is the data error evaluated at w_MP, and Occ(w_MP) and Occ(ξ_g^MP) are the Occam factors for the weight vector and the hyperparameters evaluated at w_MP.

4.1 The Occam factor for the weights and biases

The Occam factor for the weights and biases with the Gaussian distributions is given by:

\text{Occ}(\mathbf{w}) = \exp\left(-\sum_{g=1}^{G}\xi_g^{MP}E_{W_g}(\mathbf{w}_{MP})\right)\prod_{g=1}^{G}\left(\xi_g^{MP}\right)^{W_g/2}\left(\det\mathbf{A}\right)^{-1/2}    (E52)

where ξ_g^MP is the most probable value of the hyperparameter for the weights and biases in group g, and E_{W_g}(w_MP) is the weight error for group g evaluated at w_MP.

We now consider a two-layer perceptron network with “tanh” activation functions in the hidden layer. There are two kinds of symmetries in the network:

  • Firstly, by changing the signs of all incoming and outgoing weights of a hidden node, we can obtain an identical mapping from the input nodes to the output nodes. For a network with M hidden nodes, there are 2^M equivalent weight vectors that result in the same mapping from the inputs to the outputs of the network.

  • Secondly, interchanging the values of all incoming and outgoing weights of a hidden node with the corresponding values of the weights associated with another hidden node also results in an identical mapping. For a network with M hidden nodes, there are M! of those permutations.

Therefore, the same minimum value of the cost function S(w) is attained by 2^M M! equivalent weight vectors. Hence, Occ(w) in Eq. (52) should be multiplied by 2^M M! for a fully connected network:

\text{Occ}(\mathbf{w}) = \exp\left(-\sum_{g=1}^{G}\xi_g^{MP}E_{W_g}(\mathbf{w}_{MP})\right)\prod_{g=1}^{G}\left(\xi_g^{MP}\right)^{W_g/2}\left(\det\mathbf{A}\right)^{-1/2}\,2^{M}M!    (E53)

4.2 The Occam factor for the hyperparameters

For the hyperparameter ξ_g, the logarithm of the evidence for (ξ_g, X_i) is assumed to be quadratic in ln ξ_g:

\ln P(D \mid \xi_g, X_i) = \ln P(D \mid \xi_g^{MP}, X_i) - \frac{1}{2}\,\frac{\left(\ln\xi_g - \ln\xi_g^{MP}\right)^{2}}{\sigma_{\ln\xi_g}^{2}}    (E54)

where σ²_{ln ξ_g} is the variance of the distribution of ln ξ_g, which at ξ_g = ξ_g^MP takes the form:

\frac{1}{\sigma_{\ln\xi_g}^{2}} = -\xi_g^{2}\,\frac{\partial^{2}}{\partial\xi_g^{2}}\ln p(D \mid \xi_g, X_i)    (E55)

So

P(D \mid X_i) = \int P(D \mid \xi_g, X_i)\,P(\ln\xi_g \mid X_i)\,d\ln\xi_g = P(D \mid \xi_g^{MP}, X_i)\int \exp\left(-\frac{1}{2}\,\frac{\left(\ln\xi_g - \ln\xi_g^{MP}\right)^{2}}{\sigma_{\ln\xi_g}^{2}}\right)\frac{1}{\ln\Omega}\,d\ln\xi_g = P(D \mid \xi_g^{MP}, X_i)\,\text{Occ}(\xi_g)    (E56)

where

\text{Occ}(\xi_g) = \frac{\sqrt{2\pi}\,\sigma_{\ln\xi_g}}{\ln\Omega}    (E57)

The parameter Ω can be set to a specific value, for example 10^3, reflecting a subjective estimate of the range of the hyperparameter. In practice, however, Ω is only a minor factor, as it is the same for every network. Occ(ξ_g) denotes the Occam factor for the hyperparameter ξ_g.

In Eq. (57), σ²_{ln ξ_g} can be approximated as:

\sigma_{\ln\xi_g}^{2} \simeq \frac{2}{\gamma_g}    (E58)

Substituting (58) into (57) gives:

\text{Occ}(\xi_g) = \frac{\sqrt{4\pi/\gamma_g}}{\ln\Omega}    (E59)

4.3 Combining the terms of evidence

We can now combine the terms of evidence. Firstly, taking the logarithm of (52) gives:

\ln\text{Occ}(\mathbf{w}) = -\sum_{g=1}^{G}\xi_g^{MP}E_{W_g}(\mathbf{w}_{MP}) + \sum_{g=1}^{G}\frac{W_g}{2}\ln\xi_g^{MP} - \frac{1}{2}\ln\det\mathbf{A} + \ln M! + M\ln 2    (E60)

Similarly, taking the logarithm of (59) gives:

\ln\text{Occ}\left(\xi_g^{MP}\right) = \frac{1}{2}\ln\left(\frac{4\pi}{\gamma_g^{MP}}\right) - \ln\ln\Omega    (E61)

Substituting (60) and (61) in (51) gives:

\ln\text{Ev}(X_i) = -S(\mathbf{w}_{MP}) + \sum_{g=1}^{G}\frac{W_g}{2}\ln\xi_g^{MP} - \frac{1}{2}\ln\det\mathbf{A} + \ln M! + M\ln 2 + \sum_{g=1}^{G}\frac{1}{2}\ln\left(\frac{4\pi}{\gamma_g^{MP}}\right) - G\ln\ln\Omega    (E62)

where W_g is the number of weights and biases in group g. The Ω term can be dropped because it is the same for all models and therefore does not affect the relative comparison of the log evidence of different neural networks. Eq. (62) can thus be used conveniently to rank the complexities of different networks [12], and we may expect the neural network with the highest evidence to give the best results on unseen data.
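
A sketch of how Eq. (62) can be evaluated for a trained network is given below, assuming the quantities it needs (the cost at w_MP, the Hessian A, the hyperparameters, their γ_g values and the number of hidden nodes M) have already been computed during training; the constant Ω term is omitted since it cancels when comparing networks.

```python
import numpy as np
from scipy.special import gammaln            # ln(M!) without overflow

def log_evidence(S_mp, A, xi, group_sizes, gamma, n_hidden):
    """Approximate log evidence of a trained BNN, following Eq. (62)."""
    _, logdetA = np.linalg.slogdet(A)
    log_ev = -S_mp - 0.5 * logdetA
    for xi_g, W_g, gamma_g in zip(xi, group_sizes, gamma):
        log_ev += 0.5 * W_g * np.log(xi_g)               # prior terms
        log_ev += 0.5 * np.log(4.0 * np.pi / gamma_g)    # hyperparameter Occam factors
    # symmetry factor for M hidden nodes: ln(M!) + M ln 2
    log_ev += gammaln(n_hidden + 1) + n_hidden * np.log(2.0)
    return log_ev
```

The candidate architecture giving the largest value of this quantity would then be selected, as described above.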


5. A case study on power transformer fault classification using dissolved gas analysis

Power transformers are a common type of electrical equipment in power generation, transmission, and distribution networks. Incipient transformer defects typically give rise to arcing, corona discharges, sparking, and overheating of the insulating materials. Under these stresses the insulating materials may deteriorate or break down, releasing several gases. Analysis of these dissolved gases therefore provides useful information about the insulating materials and the fault conditions involved. Dissolved gas analysis (DGA) of power transformer insulating oil is a widely used method for assessing the health of power transformers. Conventional analysis of dissolved gases is based on gas concentration ratios, such as the Doernenburg ratios, the Rogers ratios, and Duval's triangle method.

5.1 Conventional methods of DGA for power transformer insulating oil

The main causes of gas generation inside an operating power transformer are evaporation, electrochemical and thermal degradation, and decomposition. In the underlying chemical processes, carbon-hydrogen and carbon-carbon bonds are broken. The active hydrogen atoms and hydrocarbon fragments produced by these reactions can combine to form hydrogen (H2), methane (CH4), acetylene (C2H2), ethylene (C2H4), and ethane (C2H6). With cellulose insulation, thermal decomposition or electrical faults can also produce methane (CH4), hydrogen (H2), carbon monoxide (CO), and carbon dioxide (CO2). These gases are commonly referred to as "key gases".

The analysis of DGA results starts with measuring the concentration (in ppm) of each key gas in samples of transformer insulating oil. Once the key gas concentrations exceed their normal ranges, further analysis should be employed to identify any potential transformer defects. These methods involve computing key gas ratios and comparing them with recommended limits. The Doernenburg method uses the ratios CH4/H2, C2H2/C2H4, C2H2/CH4, and C2H6/C2H2, while the Rogers method uses CH4/H2, C2H2/C2H4, and C2H4/C2H6. Tables 1 and 2 display the suggested limits for the Doernenburg and Rogers ratio methods, respectively.

Suggested fault diagnosis | R1 = CH4/H2 | R2 = C2H2/C2H4 | R3 = C2H2/CH4 | R4 = C2H6/C2H2
Thermal decomposition | > 1.0 | < 0.75 | < 0.3 | > 0.4
Partial discharge | < 0.1 | — | < 0.3 | > 0.4
Arcing | > 0.1 and < 1.0 | > 0.75 | > 0.3 | < 0.4

Table 1.

Suggested limits of Doernenburg ratios method.

Suggested fault diagnosis | R1 = CH4/H2 | R2 = C2H2/C2H4 | R5 = C2H4/C2H6
Unit normal | > 0.1 and < 1.0 | < 0.1 | < 1.0
Low-energy density arcing (PD) | < 0.1 | < 0.1 | < 1.0
Arcing, high-energy discharge | 0.1–1.0 | 0.1–3.0 | > 3.0
Low-temperature thermal | > 0.1 and < 1.0 | < 0.1 | 1.0–3.0
Thermal < 700 °C | > 1.0 | < 0.1 | 1.0–3.0
Thermal > 700 °C | > 1.0 | < 0.1 | > 3.0

Table 2.

Suggested limits of Rogers ratios method.

In Duval's triangle approach, the concentrations of three key gases—methane (CH4), acetylene (C2H2), and ethylene (C2H4)—are measured. The relative percentage of each gas is then calculated by dividing its concentration by the sum of the three concentrations. These percentages are plotted in Duval's triangle, as seen in Figure 3, to determine the diagnosis. The regions of the triangle indicate partial discharge (PD), low-energy discharge (D1), high-energy discharge (D2), and thermal faults (T1, T2, and T3).

Figure 3.

Duval’s triangle.

5.2 Results and discussion

BNNs were trained and tested using the IEC TC 10 database. Each input pattern has a corresponding output pattern that describes the fault type for a given diagnostic criterion. In this study, five key combustible gases—hydrogen (H2), methane (CH4), ethylene (C2H4), ethane (C2H6), and acetylene (C2H2)—are used. Five fault types are encoded by output vectors of 0s and 1s, as shown in Table 3. As indicated in Table 4, 81 data samples were used for the training set and 36 data samples for the test set.

Fault type | Output vector
PD | [1 0 0 0 0]^T
D1 | [0 1 0 0 0]^T
D2 | [0 0 1 0 0]^T
T1 & T2 | [0 0 0 1 0]^T
T3 | [0 0 0 0 1]^T

Table 3.

Fault types and corresponding output vectors.

Fault type | Training set samples | Test set samples
PD | 5 | 4
D1 | 18 | 8
D2 | 36 | 12
T1 & T2 | 10 | 6
T3 | 12 | 6
Total | 81 | 36

Table 4.

Datasets from the IEC TC 10 database.

Dissolved gas concentrations of only a few ppm (parts per million) are typical for healthy power transformers, whereas faulty transformers can frequently produce hundreds or tens of thousands of ppm. This wide range makes the dissolved gas measurements difficult to visualize. The most meaningful characteristics of DGA data are captured by the order of magnitude of the concentrations rather than by their absolute values, so taking the logarithm (log10) of the data is a useful transformation for interpreting DGA results.

For use in neural network training, the data also need to be normalized to the range [0, 1] using the following formula:

y_i = \frac{x_i - \min(X)}{\max(X) - \min(X)}    (E63)
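
A minimal sketch of the scaling in Eq. (63), applied column-wise to a feature matrix, is given below; the published study may have used an equivalent routine, so this is just one straightforward realization.

```python
import numpy as np

def minmax_normalize(X):
    """Scale each input feature to the range [0, 1] as in Eq. (63)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)
```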

5.2.1 The network training procedure

Different BNNs with varied numbers of hidden nodes (nodes in the hidden layer) were trained to find the optimal architecture. These networks have the following characteristics:

  1. The magnitudes of the weights on the connections from the input nodes to the hidden nodes, the biases of the hidden nodes, the weights on the connections from the hidden nodes to the output nodes, and the biases of the output nodes were controlled by four hyperparameters ξ_1, ξ_2, ξ_3, and ξ_4.

  2. The number of network inputs is determined by the number of gas ratios used in a particular diagnosis procedure, plus one augmented input with a fixed value of 1.

  3. As indicated in Table 3, there are five outputs, each corresponding to a particular fault category. For each number of hidden nodes, 10 neural networks with different initial weights and biases were trained.

The training procedure was implemented with the following steps (a high-level sketch of this loop is given after the list):

  1. The weights and biases in the four groups were initialized by random draws from zero-mean, unit-variance Gaussians. The hyperparameters were also initialized to small values.

  2. The scaled conjugate gradient technique was used to minimize the cost function.

  3. The values of the hyperparameters were re-estimated according to Eqs. (19) and (20) once the cost function had reached a local minimum.

  4. Steps 2 and 3 were repeated until either the cost function value became smaller than a predefined value or the required number of training iterations had been reached.
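
The sketch below outlines steps 1-4 at a high level. The minimizer, Hessian and hyperparameter-update routines are passed in as callables (for example, a scaled conjugate gradient minimizer and the update_hyperparameters sketch from Section 2.5); this interface is an assumption for illustration, not the exact implementation used in the study.

```python
import numpy as np

def train_bnn(cost, grad, hessian, minimize, update_xi,
              w0, groups, xi0, n_outer=3, cost_tol=1e-4):
    """Evidence-framework training loop: minimize S(w), then re-estimate the xi_g."""
    w = np.array(w0, dtype=float)                    # step 1: initial weights
    xi = np.asarray(xi0, dtype=float)                # step 1: initial hyperparameters
    for _ in range(n_outer):
        # step 2: minimize the cost function for the current hyperparameters
        w = minimize(lambda v: cost(v, xi), lambda v: grad(v, xi), w)
        # step 3: re-estimate the hyperparameters, Eqs. (19)-(20)
        A = hessian(w, xi)
        xi = update_xi(A, w, groups, xi)
        # step 4: stop early if the cost is already small enough
        if cost(w, xi) < cost_tol:
            break
    return w, xi
```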

5.2.2 Power transformer fault classification

Power transformer faults can be categorized according to gas ratios including Doernenburg and Rogers ratios.

5.2.2.1 Doernenburg ratios

The input vector of the network in this case consists of four elements:

\mathbf{x} = \left[\frac{\text{CH}_4}{\text{H}_2},\ \frac{\text{C}_2\text{H}_2}{\text{C}_2\text{H}_4},\ \frac{\text{C}_2\text{H}_2}{\text{CH}_4},\ \frac{\text{C}_2\text{H}_6}{\text{C}_2\text{H}_2}\right]^{T}    (E64)
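
For example, the four ratios in Eq. (64) can be computed directly from the measured gas concentrations in ppm; the function name, dictionary keys and example values below are illustrative assumptions.

```python
def doernenburg_input(gas):
    """Build the four-element Doernenburg input vector of Eq. (64)."""
    return [
        gas["CH4"] / gas["H2"],
        gas["C2H2"] / gas["C2H4"],
        gas["C2H2"] / gas["CH4"],
        gas["C2H6"] / gas["C2H2"],
    ]

# usage with concentrations in ppm (made-up values)
x = doernenburg_input({"H2": 200.0, "CH4": 50.0, "C2H2": 30.0, "C2H4": 60.0, "C2H6": 40.0})
```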

The training set was used to train classification BNNs with various numbers of hidden nodes. For each number of hidden nodes, ten BNNs with different randomly chosen initial weights and biases were trained, and the log evidence was then computed. The networks with two hidden nodes had the highest log evidence, as seen in Figure 4. In addition, Figure 5 shows that the best overall fault classification accuracy was achieved by the same architecture that gives the highest log evidence in Figure 4.

Figure 4.

Log evidence versus the number of hidden nodes (Doernenburg ratios).

Figure 5.

Overall accuracy in relation to the number of hidden nodes (Doernenburg ratios).

Table 5 shows how the four hyperparameters and the number of well-determined parameters change over successive re-estimations. As shown in Table 6, the confusion matrix of the optimized BNN on the unknown input vectors corresponds to an overall classification accuracy of 83.33%.

Period | ξ1 | ξ2 | ξ3 | ξ4 | γ
1 | 0.022 | 0.044 | 0.008 | 0.409 | 18.555
2 | 0.039 | 0.083 | 0.006 | 0.753 | 15.803
3 | 0.061 | 0.134 | 0.005 | 0.865 | 15.451

Table 5.

Changes in four hyper-parameters and the number of well-determined parameters over various hyper-parameter re-estimation times (Doernenburg ratios).

Actual classification \ Predicted classification | PD | D1 | D2 | T1 & T2 | T3
PD | 2 | 0 | 0 | 2 | 0
D1 | 0 | 5 | 3 | 0 | 0
D2 | 0 | 0 | 12 | 0 | 0
T1 & T2 | 0 | 0 | 0 | 5 | 1
T3 | 0 | 0 | 0 | 0 | 6
Overall accuracy (%): 83.33

Table 6.

The BNN’s confusion matrix for categorizing unknown input vectors (Doernenburg ratios).

5.2.2.2 Rogers ratios

The network input vector in this case is formed from four gas ratios:

\mathbf{x} = \left[\frac{\text{CH}_4}{\text{H}_2},\ \frac{\text{C}_2\text{H}_2}{\text{C}_2\text{H}_4},\ \frac{\text{C}_2\text{H}_4}{\text{C}_2\text{H}_6},\ \frac{\text{C}_2\text{H}_6}{\text{CH}_4}\right]^{T}    (E65)

The training set was used to train BNN classifiers with various numbers of hidden nodes. For each number of hidden nodes, ten networks with different randomly chosen initial weights and biases were trained, and the log evidence was evaluated. The networks with two hidden nodes produced the highest log evidence, as seen in Figure 6. According to Figure 7, this network architecture also provides the highest overall fault classification accuracy.

Figure 6.

Log evidence versus the number of hidden nodes (Rogers ratios).

Figure 7.

Overall accuracy in relation to the number of hidden nodes (Rogers ratios).

Table 7 shows how the four hyperparameters and the number of well-determined parameters change over successive re-estimations. The confusion matrix of the optimized BNN is shown in Table 8; its overall accuracy in classifying faults from unknown input vectors is 80.56%. Table 9 compares the recommended-limit and BNN-based DGA techniques on the same dataset. The BNN-based methods clearly outperform the recommended-limit approaches.

Period | ξ1 | ξ2 | ξ3 | ξ4 | γ
1 | 0.026 | 0.012 | 0.009 | 0.268 | 18.645
2 | 0.039 | 0.015 | 0.007 | 0.353 | 16.315
3 | 0.053 | 0.020 | 0.005 | 0.333 | 15.801

Table 7.

Four hyper-parameter variations and the number of well-determined parameters based on hyper-parameter re-estimation times (Rogers ratios).

True classification \ Predicted classification | PD | D1 | D2 | T1 & T2 | T3
PD | 2 | 0 | 0 | 2 | 0
D1 | 0 | 4 | 4 | 0 | 0
D2 | 0 | 0 | 12 | 0 | 0
T1 & T2 | 0 | 0 | 0 | 5 | 1
T3 | 0 | 0 | 0 | 0 | 6
Overall accuracy (%): 80.56

Table 8.

Confusion matrix of the trained BNN for identifying unknown input vectors (Rogers ratios).

Method | Overall accuracy (%)
Doernenburg ratio method with suggested gas ratio limits | 79.48
Doernenburg ratio method with the BNN | 83.33
Rogers ratio method with suggested gas ratio limits | 40.17
Rogers ratio method with the BNN | 80.56

Table 9.

Overall accuracy comparison of the suggested gas ratio limit and BNN-based classification techniques.

This investigation of the Bayesian inference framework for MLP neural network training shows that the regularization parameters (hyperparameters) and a suitable number of hidden nodes can be determined from the training data alone. A BNN with only a few hidden nodes is shown to be suitable for early fault detection in power transformers, with the required number of hidden units depending mainly on the diagnostic criterion under consideration. This study also compares the suggested gas ratio limit-based approaches with BNN-based methods for diagnosing power transformer faults. Future work will compare BNNs with other machine-learning classifiers for power transformer DGA.


6. Conclusions

BNNs extend standard neural networks with posterior inference and can be used to avoid overfitting. Bayesian inference places a probability distribution over the weights of the neural network rather than committing to a single weight vector. To determine the optimal network architecture and ensure good generalization, the standard training approach can be modified to handle the model comparison problem directly. Finally, we can confirm that the Bayesian technique can be used to avoid overfitting, to allow network learning with limited and noisy datasets, and to indicate the level of uncertainty in the network's predictions.


Acknowledgments

The authors would like to extend their heartfelt appreciation to Professor Ian Nabney (University of Bristol, UK) for his assistance with the open-source Netlab software used in this study.


Conflict of interest

The authors declare no conflict of interest.

References

  1. MacKay DJC. The evidence framework applied to classification networks. Neural Computation. 1992;4:720-736
  2. Bishop CM. Neural Networks for Pattern Recognition. New York: Oxford University Press; 1995
  3. Nabney IT. Netlab: Algorithms for Pattern Recognition (Advances in Pattern Recognition). London: Springer; 2002
  4. Thanh SN, Goetz S. Dissolved gas analysis of insulating oil for power transformer fault diagnosis with Bayesian neural network. JST: Smart Systems and Devices. 2022;32(3):61-68
  5. Thanh SN, Johnson CG. Protein secondary structure prediction using an optimized Bayesian classification neural network. In: The 5th International Conference on Neural Computing Theory and Application, Vilamoura, Algarve, Portugal. 20–22 Sep 2013
  6. Thanh SN, Nguyen HT, Taylor PB, Middleton J. Improved head direction command classification using an optimized Bayesian neural network. In: Proceedings of the 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, New York City, New York, USA. 31 Aug–3 Sep 2006. pp. 5679-5682
  7. Thanh SN, Tan HN, Taylor P. Bayesian neural network classification of head movement direction using various advanced optimisation training algorithms. In: Proceedings of the First IEEE/RAS-EMBS International Conference on Biomedical Robotics and Biomechatronics, Pisa, Italy. 20–22 Feb 2006. pp. 1-6
  8. Thanh SN, Tan HN, Taylor P. Hands-free control of power wheelchairs using Bayesian neural networks. In: Proceedings of the IEEE Conference on Cybernetics and Intelligent Systems, Singapore. 1–3 Dec 2004. pp. 745-759
  9. Nguyen HT, Ghevondian N, Thanh SN, Jones TW. Detection of hypoglycaemic episodes in children with type 1 diabetes using an optimal Bayesian neural network algorithm. In: Proceedings of the 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Lyon, France. 23–26 Aug 2007
  10. Nguyen HT, Ghevondian N, Thanh SN, Jones TW. Optimal Bayesian neural-network detection of hypoglycaemia in children with type 1 diabetes using a non-invasive and continuous monitor (HypoMon). In: The American Diabetes Association's 67th Scientific Session, Chicago, Illinois, USA. 22–26 Jun 2007
  11. Marwala T. Scaled conjugate gradient and Bayesian training of neural networks for fault identification in cylinders. Computers & Structures. 2001;79(32):2793-2803
  12. Thodberg HH. A review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks. 1996;7(1):56-72. DOI: 10.1109/72.478392
  13. Tito EH, Zaverucha G, Vellasco M, Pacheco M. Bayesian neural networks for electric load forecasting. In: Proceedings of the 6th International Conference on Neural Information Processing, Denver, Colorado, USA. 29 Nov–4 Dec 1999. pp. 407-411. DOI: 10.1109/ICONIP.1999.844023
  14. Charalambous C. Conjugate gradient algorithm for efficient training of artificial neural networks. In: IEE Proceedings G (Circuits, Devices and Systems). Vol. 139, No. 3. The Institution of Electrical Engineers; 1992. pp. 301-310
  15. Møller MF. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks. 1993;6(4):525-533
