Summary table among some solutions.

## Abstract

Bayesian networks can be built based on knowledge, data, or both. Independent of the source of information used to build the model, inaccuracies might occur or the application domain might change. Therefore, there is a need to continuously improve the model during its usage. As new data are collected, algorithms to continuously incorporate the updated knowledge can play an essential role in this process. In regard to the continuous learning of the Bayesian network’s structure, the current solutions are based on its structural refinement or adaptation. Recent researchers aim to reduce complexity and memory usage, allowing to solve complex and large-scale practical problems. This study aims to identify and evaluate solutions for the continuous learning of the Bayesian network’s structures, as well as to outline related future research directions. Our attention remains on the structures because the accurate parameters are completely useless if the structure is not representative.

### Keywords

- Bayesian network
- structure learning
- continuous learning
- structural adaptation
- structural refinement

## 1. Introduction

Bayesian networks (BNs) are probabilistic graphs used to deal with the uncertainties of a domain [1]. These graphs represent the random variables of this domain and their conditional dependencies. The use of Bayesian networks, also known as Bayesian belief networks, has several points to highlight. Among them, stands out the explicit treatment of uncertainty, the ease of estimating the state of certain variables given some evidence, as well as having support methods for decision analysis and quick responses by the user [2].

The application domains of Bayesian networks have been extensive [3]. A large number of applications are in the field of medicine [4, 5], being one of the most addressed. There are also applications in the field of forecasting [6], control [7], and modeling for human understanding [8]. In the context of software engineering, fields such as project planning [9], risk management [10], and quality management [11] are addressed. Motivated by the extensive application cited, methods to improve the construction of these graphic models have become a focus of research.

A Bayesian network is defined by a directed acyclic graph (DAG) and a set of parameters for this DAG (NPT). Therefore, in order to build a Bayesian network, the definition of both the graph and the NPT must be considered. Several researches are being carried out with the intention of assisting these definitions [12, 13, 14]. However, the solutions proposed are based, for the most part, on the batch process. This process is infeasible in some application domains. Companies, for example, are increasingly storing huge databases with knowledge about their business processes. New knowledge is acquired every time. It is virtually impossible to achieve a highly accurate description of the processes involved without new data being collected or a large amount of data being stored that cannot be analyzed at once. Therefore, the need arose for solutions that continuously incorporate the updated knowledge to prior knowledge.

Ref. [15] investigated the main continuous learning solutions proposed until the development of his study. A comparative analysis between incremental algorithms and an experiment to support this analysis were performed. However, extensions of these, as well as new studies, have since been developed. This chapter aims to describe and analyze existing solutions for continuous learning of Bayesian network structures. A systematic review of the literature is carried out, and the algorithms found are divided into two groups according to their concepts: refinement and structural adaptation. Some guidelines for future research are also described.

## 2. Learning Bayesian networks

In probability theory, a domain

In many domains, there is a high number of variables

The first component,

The chain rule of probability, Eq. (1), can be rewritten as Eq. (2) based on conditional independence rule. Two sets of variables

In Eq. (2),

The second component,

In this case, it is desired to calculate the likelihood of a person having lung cancer given the history of cancer in their family and if this person is a smoker. The node probability tables (NPTs) of the parent nodes represent a prior knowledge of these variables. The NPT of the child node represents the likelihood of a person having cancer given each possible combination of values of the parent nodes.

The goal of the learning process of a Bayesian network is to find a network (or only its structure) that best encodes the joint probability distribution of a domain. Bayesian network learning can be stated as [18]:

**Definition 1 (Bayesian network learning):***Given a data set, infer the topology for the belief network that may have generated the data set together with the corresponding uncertainty distribution.*

The learning problem of Bayesian networks can be decomposed into two subproblems: construct the structure, that is, DAG, and define the NPT [19]. Although there are many studies related to the importance of learning NPTs, this study focuses on the first subproblem described. The accurate parameters are completely useless if the structure is not representative. In [20], the importance of the structure of a network in the independence and relevance relationships between the variables concerned is described. Also, in [20], an analysis about the influence of probabilistic networks in the difficulty of representing the uncertainties present in the domain is presented.

The structure can be constructed from data only using machine learning or search techniques, such as those presented in [12, 21, 22, 23]. To optimize the definition of structure, the data can be enhanced with expert knowledge. One approach is to consult experts about the posterior probabilities of the structure to reduce the search space, such as presented in [13, 24].

In [25, 26], solutions are presented to complement a Bayesian network with the knowledge of domain experts through the addition of new factors in the model. In this way, it is possible to predict rare events, often not represented in the available databases.

Finally, the structure can be defined only according to the knowledge of specialists, where it is assumed that there are no data available before the structure construction process. In this case, the structure can be defined according to the elicited knowledge of one or multiple experts, as presented in [27, 28].

Most of the solutions to previously reported problems, as well as all of the solutions cited so far, operate as a batch process. The batch process (or batch learning) can be summarized in the delivery of a block of knowledge to an algorithm so that it learns a structure. All the knowledge available to date is used during this process. However, it is inevitable that such information will be inaccurate during the modeling of the domain [29]. For example, if knowledge is acquired from domain experts, the lack of communication between this expert and the expert on graphical models may result in errors in the Bayesian network. Similarly, if the network is being built from a data set, the data set may be inappropriate or inaccurate.

On the other hand, it is neither efficient nor, in some cases, possible to always keep the stored data in search of more representative models using batch learning algorithms [30]. To improve the use of data in the learning problem of Bayesian network structures, it’s required solutions that present a continuous process of learning. In the following section, solutions for the continuous learning of the Bayesian network structures are presented.

## 3. Continuous learning of Bayesian networks’ structure

The incentive in the application of processes that realize the learning of Bayesian networks in stages was based, initially, on the observation of the human learning by some researchers. However, the paradigm shift that provided multiple domains generated and stored more and more data also propelled its development. Continuous (or incremental) learning approaches have some widely accepted definitions found in the literature [31].

In [32], the following precise definition was stated.

**Definition 2 (Incremental learner):***A learner**is incremental if**inputs one training experience at a time, does not reprocess any previous experiences, and retains only one knowledge structure in memory.*

In this definition, there are three constraints so that an algorithm can be classified as incremental. In Ref. [32], another definition with a different way of knowledge maintenance was presented.

**Definition 3 (Incremental procedure):***A Bayesian network learning procedure is incremental if each iteration**, it receives a new data instance**and then produces the next hypothesis**. This estimate is then used by to perform the required task on the next instance**, which in turn is used to update the network and so on. The procedure might generate a new model after some number of**instances are collected.*

This definition relaxes the constraints imposed by Definition 2. For Definition 3, an incremental algorithm can be allowed to process at most

**Definition 4 (Incremental algorithm):***An incremental algorithm should meet the following constraints: (i) it must require small constant time per record; (ii) it must be able to build a model using at most one scan of the data; (iii) it must use only a fixed amount of main memory, irrespective of the total number of records it has seen; (iv) it must make a usable model available at any point in time, as opposed to only when it is done with processing the data; and (v) it should produce a model that is equivalent (or nearly identical) to the one that would be obtained by the corresponding batch algorithm.*

Like previous definitions, this definition imposes constraints related to time, memory, and knowledge addressed. The Definition 4 increments the constraint related to the availability of a useful model imposed by Definition 3. Now, a lot due to its application in data streams [34], it is needed to make a usable model available at any point in time, as opposed to only when it is done with processing the data.

Based on the aforementioned definitions of an incremental learning algorithm and in [31], solutions were found that present different learning methodologies. Two groups separate these solutions. The main difference between them is in how they use the acquired knowledge. In one of these groups, denoted by refinement solutions, the data are used according to the knowledge already possessed. This knowledge is maintained in the probabilistic graphic already developed, being only refined with the new data. On the other group, denoted by structural adaptation solutions, the solutions maintain one or more candidate structures and apply to these structures the observations received. This new data set is used to update the sufficient statistics needed to build that candidate structures.

The concepts and information about the type of solutions found in this research are mapped, in an outlined (due to space constraints) and schematic way, in Figure 2.

### 3.1. Methodology

The continuous learning Bayesian networks structure is kept like an open problem in many application domains. In this study, a systematic literature review is used to identify and evaluate solutions for the continuous learning of the Bayesian networks’ structures, as well as to outline related future research directions. A combination of strings was used for title and keyword to identify articles related to continuous learning. Scopus is used as an electronic database.

In the initial search, 4150 items from Scopus were found, but only the first 400 results were checked. This stop was performed because, of these first 400 results, sorted by relevance, a sequence of 150 articles totally unrelated to the search was found. To verify this relationship, three reading steps were performed. Initially, articles were selected considering only the title and abstract. A superficial reading of the remaining articles was then performed. This step consisted of reading and interpreting section titles, figures, graphs, conclusions, and other elements. In the remaining articles, a critical reading was carried out seeking to interpret and analyze the complete text. The following sections present a description of these efforts.

### 3.2. Buntine’s solution

In [35], a theory refinement-based approach has been proposed. The key task of theory refinement is to update the initial partial theory, usually the expert’s prior domain knowledge, as far as new cases produce a posterior knowledge about the space of possible theories, being one of the fundamentals of continuous learning. Given a set of new data and total ordering of the domain variables, the solution updates both the knowledge about the structure and the parameters using different BNs.

For extending and modifying the structure, [35] proposed a batch algorithm that uses the score-and-search–based Bayesian approach. However, using some guidelines presented by the author, it is possible to convert the batch learning into continuous learning process.

The batch algorithm of [35] requires a set of ordered variables

A set of alternative parent sets

The batch algorithm also requires tree parameters

Set the tree parameters

To convert this batch into continuous learning algorithm, [35] describes two situations that vary according to the time available for the update. In the case where there is a short amount of time for updating the BNs, the algorithm only updates the posterior probabilities of the parent lattices. To this, it is necessary to store posterior probabilities and the counters

On the other hand, both structure and posterior updates are updated according to new data. For each variable

The generation of different networks needs to update the posterior probabilities of all alive sets of the lattice. This solution uses sufficient statistics of data that only contains counts of different entries in data instead of data entries, requiring constant time to update sufficient statistics only when new records arrive. Furthermore, [33, 35] performs an additional search over the space of alternative Bayesian networks.

### 3.3. Friedman and Goldschmidt’s solution

Like the previous solution, [34] also addressed the problem of sequential update of the prior domain knowledge. Through the use of sufficient statistics maintained in memory for each network structure at a defined frontier, the knowledge is continuously learned. In this way, this solution provides a method that trades off between accuracy, that is, quality of structure, and storage, that is amount of information about the past observations.

In Ref. [34], three different solutions to sequentially learn BNs have been proposed. Among them, there are two extremes. The naive approach, as it is called, stores all the previously seen data and repeatedly invokes a batch learning procedure after each new observation is recorded. However, despite using as much information as possible, thus increasing the quality of the structure generated, this approach has a high storage cost. In addition, reusing batch learning increases the amount of time and processing spent.

On the other hand, the maximum a posteriori (MAP) probability approach uses a model to store all the information that is considered useful for the next steps in the knowledge update. However, the use of a single model can strongly bias the continuous learning of the model and lose information.

Aware of the disadvantages of previous approaches, [34] presents a new approach, called incremental, which proposes a tradeoff between extremes. The incremental approach does not store all data, unlike the naive approach, and it does not use a single network to represent the prior knowledge, unlike the MAP probability approach. Moreover, it allows flexible choices in the tradeoff between space and quality of the induced networks.

The basic component of this procedure is a module that maintains a set

Two structures can easily keep track by maintaining a slightly larger set of statistics, for example, suppose on the deliberating choice between two structures

Generalizing this discussion, the incremental approach can be applied to any search procedure that can define a search frontier. This frontier, denoted by

When this approach is instantiated with the greedy hill climbing procedure, the frontier

Many scoring functions can be used to evaluate the “fitness” of networks with respect to the training data and then to search for the best network. However, the incremental approach collects different sufficient statistics in different moments of the learning process. Thus, they need to compare Bayesian networks with respect to different data sets. This problem happens because, unlike [35, 34] may consider those structures that, previously, were considered as non-promising (the ones that were out of the frontier).

The two main scoring functions commonly used to learn Bayesian networks, Bayesian scores [37] and minimal description length (MDL) [38], are inappropriate for this problem. In order to overcome this problem, [34] proposed an averaged MDL measure

Analyzing both [35, 34] like hill-climbing searchers, they perform, for each node, operations to increase the score of the resulting structure, without introducing a cycle into the network and based on the assumption that they start from an arc-less network, as can be observed in their pseudo-code. Both stop when performing a single operation cannot increase the network’s score. The difference between [35, 34] is the neighborhood composition. While [35] uses only the addition operator to construct neighbors, [34] uses the addition, reversion, and deletion of an arc.

### 3.4. Roure’s solution

The conversions of the batch learning approaches present in [35, 34] in continuous learning approaches paved the way for popular batch algorithms like the B, K2 [37] and HCMC [39] algorithms to be turned into incremental ones [31].

In Ref. [40], two heuristics to change a batch hill-climbing search (HCS) into an incremental hill-climbing search algorithm based on combining of sufficient statistics with reduced search space have been proposed. In the batch version of HCS algorithm, a search on the space called neighborhood is performed to examine all possible local changes that can be made in order to maximize the scoring function.

Similar to the frontier presented in [34], a neighborhood of a model

The heuristics presented in [40] are based on two main problems: when and which part to update, and how to calculate and store sufficient statistics. The first heuristic is called by traversal operators in correct order (TOCO). TOCO verifies the already learned model and its search path for new data. If the new data alter the search path, then it is worth to update an already learned model. The second heuristic is called by reduced search space (RSS). RSS identified when the current structure needs to be revised. At each step of the search path, it stores top

### 3.5. Lam and Bacchus’s solution

In [41], another continuous learning solution based on an extension of batch solution is presented. The batch solution used as base is presented in [42]; however, it will not be presented here because the new solution is not coupled to their batch algorithm. The proposed extension aims to perform a review of the BN structure incrementally as new data about a subset of variables were available.

This revision is done using the structure of the BN as prior probability under the implicit assumption that the existent network is already a fairly accurate model of the database. This assumption is a way to incorporate domain knowledge into the problem; however, the new refined network structure should be similar to the existing one, skewing the process.

The solution of [41] also proved, like [35] and based on the MDL measure, that if partial network structure of the whole structure gets, by changing its topology, better score to scoring function, then the whole network structure be improved if no cycles are introduced. Based on this, [41] developed an algorithm to update the BN by improving parts of it. This algorithm produces a new partial network structure based on new data set and the existing network using an extension of MDL. It then locally modifies the old structure comparing and changing correspond part according to new partial network.

The source data to algorithm consists of two components: the new data and the existent network structure. Considering the MDL principle states, finding a partial network

A simple way to encode an arc is to describe the source node and the destination node.

In order to learn the local structure, the batch algorithm proposed by [42] or other algorithm using the scoring function for each node of the partial structure is presented by Eq. (3).

With the third term of the equation, [41] avoided using the sufficient statistics of the old data.

After the new partial structure is learned, the review process continues with the attempt to obtain a refined structure of lower total description length with the aid of the existent structure

In an attempt to avoid creating cycles during each subgraph substitution, [41] uses best-first search to find the set of subgraph units that yields the best reduction in description length without generating any cycles. In addition, a list

### 3.6. Shi and Tan’s solution

In [43], an efficient hybrid incremental learning algorithm is proposed. All solutions presented so far are score-and-search–based solutions. This solution consists of a polynomial-time constraint-based technique and a hill-climbing search procedure. In this way, this solution provides a hybrid algorithm that offers considerable computational complexity savings and slightly better model accuracy.

The first fragment that composes the solution is based on a constraint-based technique. The purpose of this technique is to select candidate parent set for each variable on data. For each variable

After this, a heuristic procedure called HeuristicIND is proposed to reduce

### 3.7. Others continuous learning solutions

In [45], an improvement in the refinement process of [34] is presented. In this study, an incremental method for learning Bayesian networks based on evolutionary computing, denoted by IEMA, is developed.

The solutions presented so far assume that a stationary stochastic process produces all knowledge, that is, the ordering of the database is inconsequential. However, in many application domains of BNs, such as financial problems [46], the processes vary according to the time, the data are non-stationary or piecewise stationary distributed, which would reduce the adequacy of the solutions already mentioned. In [47], the assumption on stationary data is relaxed and an incremental learning Bayesian network based on non-stationary data domains is developed.

In [48], the streaming data prediction is addressed. A parallel and incremental solution for learning of BNs from massive, distributed, and dynamically changing data by extending the classical scoring and search algorithm and using MapReduce is presented.

Ref. [37] also presents an algorithm for stream and online data, more precisely, data that are privately and horizontally shared among two or more parties. This algorithm is based on an efficient version of sufficient statistics to learning privacy-preserving Bayesian networks.

In [49], an active and dynamic method of diagnosis of crop diseases has been proposed based on Bayesian networks and incremental learning. To incremental learning, a new algorithm for dynamically updating the Bayesian network-based diagnosis model over time also is proposed.

Ref. [50] transformed the local structure identification part of Max-min Hill-climbing (MMHC) algorithm into an incremental fashion by using heuristics and applied incremental hill-climbing to learn a set of candidates-parent-children for a target variable.

In [51], an incremental algorithm for BN structure learning that can deal with high dimensional domains has been proposed.

In [52], the concept of influence degree is used to describe the influence of new data on the existing BN. A scoring-based algorithm for revising a BN iteratively by hill-climbing search for reversing, adding, or deleting edges has been proposed.

In [53], an approach to incremental structure optimize is presented. Based on a specific method, this approach decomposes the initial network into several subnets created from a junction tree developed using information about the joint probability of the network. With some adaptations, it can be used as a continuous learning algorithm.

It is also important to point out that studies are found that denominate their algorithm as incremental, but these algorithms result in incremental learning of variables or small parts of the network as they are available and do not necessarily generate a new model, as it is constrained in the definition.

### 3.8. Comments and future research

Some comments comparing some solutions have already been made during your descriptions. However, to facilitate the understanding of the methodologies used, some techniques and characteristics were compared below. A summary table among some solutions is presented in Table 1.

Solution | Domain expert | Type of structure changes | Scoring function | Search procedure | CI | Stochastic process |
---|---|---|---|---|---|---|

[36] | Prior knowledge | Adding edges | Any | Any local search | — | Stationary |

[34] | Prior knowledge | Adding, deleting, or reversing edges | Averaged MDL | Any local search | — | Stationary |

[40] | Prior knowledge | Adding, deleting, or reversing edges | Any | Hill-climbing search | — | Stationary |

[41] | Prior knowledge | Adding, deleting, or reversing edges | Extension of MDL | Best-first search | — | Stationary |

[43] | Prior knowledge | Adding, deleting, or reversing edges | — | Hill-climbing search | InfoChi | Stationary |

[47] | Prior knowledge | Adding, deleting, or reversing edges | — | Any local search | — | Non-stationary |

Some features of the solutions were discussed in Table 1. These features are important for the differentiation of the proposal of each algorithm. It is noteworthy that, among the outstanding solutions, none proposes to use the domain specialist as a source of posterior knowledge, only as a source of prior knowledge. This knowledge can be increased over time and can be used to improve the built network. In addition, changes that are not able to be identified only with the use of data, such as adding factors and incomprehensibility of the model, can be identified.

When checking the stochastic process, it is noted that only [47] adopts a non-stationary domain, even considering the increasing diversity of domain. The other quoted solutions kept their focus on stationary domains, more common at the time of their developments. Still considering the domain, the effectiveness of the algorithms is rarely validated in a real domain, even if it is done with experiments that use data coming from real domains.

The local search is a standard present among the search procedures used. Despite their high computational complexity, methods were developed so that this was not a constraint and they then continued to be used. Only one solution made use of conditional independence (CI) tests. Ref. [43] developed a new technique that is used as the basis for CI tests. Considering scoring functions, some solutions leave open the use of any function. However, an adaptation to their applications is necessary in sufficient statistics used in some, for example, in [34]. On the other hand, some solutions use the MDL measure, but already adapted, either to reduce computational complexity or to achieve better results in different data set data. Other features can be addressed in future reviews, such as computational complexity, procedure focus, and the type of application domain, among others.

## 4. Conclusions

This chapter presents a survey based on a systematic literature review of continuous learning solutions of Bayesian network structures. It searches articles with an algorithm or approach considered as incremental according to the definitions presented in the text.

The solutions found can be classified as structural adaptation or refinement. The first group, in short, uses the new data set to maintain sufficient statistics that are used to store the existing knowledge about the domain. The second group uses this new set of data to perform a refinement in the network, based on previous knowledge coupled in the old structure.

Finally, the presence of the posterior knowledge of the domain specialist during the incremental learning process and experiments in real domains for the validation of some solutions findings are in the future works.

## Acknowledgments

The authors would like to thank Federal University of Campina Grande in Brazil for supporting this study.

## Conflict of interest

The author declares that there is no conflict of interest regarding the publication of this chapter.