Exploring the Interrelationship of Risk Factors for Supporting eHealth Knowledge-Based System

In developing regions such as Africa, the physician-to-population ratio is below the World Health Organization (WHO) minimum recommendation. Because of the limited-resource setting, healthcare services fall short on equity of access to health services, sustainable health financing, and quality of healthcare service provision. An efficient and effective teaching, alerting, and recommendation system is required to support the activities of the healthcare service. To alleviate those issues, creating a competitive eHealth knowledge-based system (KBS) will bring substantial benefit. In this study, the Apriori technique is applied to a malaria dataset to explore the degree of association among risk factors. The output of data mining (i.e., the interrelationship of risk factors) is then integrated with knowledge-based reasoning. A nearest neighbor retrieval algorithm (for retrieval) and a voting method (for the reuse task) are used to design and deliver a personalized knowledge-based system.


Introduction
In Africa, there are on average nine hospital beds per 10,000 people, compared with a world average of 27. In sub-Saharan Africa, the physician-to-population ratio is the lowest in the world and below the World Health Organization (WHO) minimum recommendation [1,2]. Countries like Ethiopia have set a strategic plan to improve access and equity to preventive, essential health interventions at the village and household levels to ensure healthcare coverage in rural areas [3,4]. Still, pneumonia, diarrhea, acute upper respiratory tract infection, acute febrile illness, and malaria account for 64% of under-five morbidity [5].
About 68% of the country's total population lives in areas at risk of malaria, and 75% of the country is vulnerable to malaria (defined as areas below 2000 m); those areas are fertile and suitable for agriculture. Malaria accounts for up to 17% of outpatient consultations, 15% of admissions, and 29% of inpatient deaths [6,7]. On the other hand, computer programs that reason with knowledge have existed for more than four decades (e.g., MYCIN was developed in the early 1970s); they use knowledge to assist domain experts and minimize routine activities. To prevent and control the malaria crisis, scholars and responsible bodies have made remarkable efforts by conducting research and implementing strategies and policies [6,7]. A predictive data mining model has been constructed using the Ethiopian WHO malaria database, the meteorological database, and the national mapping database [8].
The model is accurate in determining the occurrence of death, and it is good enough to identify the cases. However, the healthcare system is still looking for a mechanism to address routine healthcare activities, administrative and medical costs, demographic challenges, and equitable health distribution. For instance, the health extension workers (HEWs) assist peripheral health services by bridging the gap between the communities and health facilities [3,4]. Each kebele has two HEWs responsible for providing outreach services. A kebele is the smallest governmental administrative unit and on average has a population of 5000 people. The HEWs teach the community house to house, reaching every person in the kebele, in order to create and promote healthy lifestyles. In all, the healthcare system is searching for a teaching, alerting, and recommendation system to support daily routine activities.
To alleviate those issues, the main goal of this work is to create a competitive eHealth knowledge-based system (KBS), which will bring substantial benefit in low-resource settings. As a case study, we chose malaria (a malaria dataset) because malaria prevention and control at the community level face numerous challenges, including climate conditions (temperature, rainfall), epidemiological and genetic factors, poverty, malaria outbreaks, overprescription for positive results, and so on. Knowing the pattern and interrelationship of risk factors is important for supporting a knowledge-based system as well as for predicting malaria death occurrences and cases. An attempt is made to explore the degree of association between malaria risk factors (related to malaria death occurrence and case identification). Investigating the degree of interrelationship among risk factors will contribute greatly toward eradicating malaria outbreaks. The outcome of the study helps to mitigate severity by investigating the association of risk factors and building a competitive knowledge-based system.
learning, understanding, emotions, consciousness, intuition and creativity, language capacity, etc.). On the one hand, a KBS is advantageous when there is a shortage of experts, when decision-making for problem-solving needs an intelligent assistant, when expertise needs to be stored for future use, and so on. On the other hand, a KBS faces many challenges due to the abstract nature of knowledge and the limitations of cognitive science and other scientific methods [9,10].
Knowledge representation and the inference engine are the two building blocks of a KBS. The knowledge acquired from experts, documents, books, and other resources is organized using knowledge representation. The inference engine takes the knowledge and determines how to use it to solve problems using rule- or case-based reasoning. Rule-based reasoning is a technique that reasons about a problem based on knowledge represented in the form of rules [11]. A case-based system represents situations or domain knowledge in the form of cases, and it uses the case-based reasoning technique to solve new problems or handle new situations [12].
Knowledge-based systems (in the health and medical domain) have had a remarkable effect by providing reliable diagnostic and cost-effective services. Several systems have been implemented in different medical areas like cancer therapy, infections, blood diseases, general internal medicine, glaucoma, and pulmonary function tests [13,14]. Such systems can be designed to exhaustively consider all possible diseases in a domain, which could outperform human experts in achieving a rapid and accurate diagnosis. Integrating and updating domain knowledge with knowledge discovery is relevant to increasing interestingness and user belief (such as by matching discovered patterns with existing knowledge) [15]. An integrated eHealth knowledge-based system built on acquired health knowledge will support users in exchanging knowledge and improve accessibility through data collection, care documentation, and knowledge extraction [16].
Implementing a hybrid (integrated) intelligent system for medical data classification is a good way to produce an effective knowledge-based system [17]. Integration has produced promising results in improving the quality of knowledge-based systems [17,18]. For instance, integrating the result (rules) of the PART classification algorithm with a knowledge-based system has delivered favorable results for the diagnosis and treatment of visceral leishmaniasis [19]. Seera and Lim experimented with fuzzy min-max neural networks to learn incrementally from sample data, a classification and regression tree for prediction, and a random forest model to achieve high classification performance [17]. Kerdprasop and Kerdprasop also tried to automate the data mining model by focusing on the post-data-mining step of automatic knowledge deployment, using induced knowledge and formalized classification rules [20].
However, more work is needed on providing explanatory rules and handling missing data in real-world applications. A mechanism for handling irrelevant rules (results) is also required in the case of an inductive experiment system, and so on. To alleviate those issues, in our case study, we tried to explore the pattern and the interrelationship of risk factors for supporting the knowledge-based system as well as for predicting malaria death occurrences and cases. The result will increase the interestingness of and belief in the eHealth knowledge-based system, which will bring substantial benefit in low-resource settings.

Research aim and objectives
Investigating the potential applicability of exploring the interrelationship of risk factors using data mining to create a competitive eHealth knowledge-based system is the main goal of this work.

Methodology
The Cross Industry Standard Process for Data Mining (CRISP-DM) methodology is adopted to investigate the interrelationship of malaria risk factors. Then, to design the eHealth knowledge-based system, a nearest neighbor retrieval algorithm (for retrieval) and a voting method (for the reuse task) are used. The technique makes it easy to explore relevant cases and provides an opportunity to retrieve partially matching cases [21][22][23][24].
The malaria data is collected from a zonal health facility in each of the 86 zones of Ethiopia.
To understand the problem domain, we used observation, interviewing with experts and data managers and reviewing documents, reports, and literatures. This helps us to select and integrate decisive attributes from different sources. The data selected from the WHO (World Health Organization) database is integrated with the decisive attributes (like temperature, rainfall, and altitude) extracted from the Ethiopian National Meteorological Agency and Mapping Agency in order to find the association of risk factors.
Exploratory data analysis is performed to get familiar with the data and to prepare it for investigating the degree of interrelationship. The data mining task is to find the internal associations between data elements that determine the occurrence of death or cases. To maintain the quality of the data, preprocessing tasks such as data cleaning (handling missing values, noise, and outliers), data integration, and data transformation are performed.
The collected Ethiopian WHO malaria database contains five basic attributes (more than 37,000 records) that provide information about the geographic location and period of coverage. These attributes include country name, region (administrative regions from which malaria information is collected), zone and health facility name, year, and month. The detailed information of the attributes is categorized based on the WHO standards and explicitly represents the detailed information about malaria in each zone of the region across Ethiopia. These categories contain age (less than, equal to, or greater than 5 years), malaria type (P. vivax and P. falciparum), cases (inpatient and outpatient), inpatient cases (cases and deaths), severe anemia (inpatient malaria cases under 5 years and over 5 years), and uncomplicated lab-confirmed malaria under 5 years and over 5 years (P. vivax outpatient cases and P. falciparum outpatient cases). Each attribute is preprocessed and statistically summarized into address, patient profile, weather, and altitude. For example, Table 1 presents the statistical summary of uncomplicated lab-confirmed Plasmodium falciparum malaria in patients under 5 years.
In order to extract hidden patterns and relationships within the data, from the initial dataset, a number of attributes are constructed. As shown in Table 2, from malaria with severe anemia, attributes such as age, malaria type, cases, and malaria visits, as well as the number of cases and deaths, are constructed.
A summary of the datasets compiled for association rule discovery, with their possible nominal values and descriptions, is depicted in Figure 1. The malaria dataset used for this study consists
Table 1. Summary of uncomplicated malaria less than 5 years.
Table 2. Malaria with severe anemia and list of attributes constructed.

Apriori method
The Apriori algorithm (a well-known association rule discovery method) takes a dataset with a list of items that can be easily transformed into transaction form by creating an item for each attribute-value pair that exists in the dataset [25][26][27]. Minimum support and minimum confidence thresholds are also defined to enable the Apriori algorithm to identify frequent items that are strongly associated. Table 3 presents the step-by-step procedure to mine and extract frequent items using the Apriori method.
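The transformation into transaction form can be illustrated with a minimal sketch. The record fields and values below are hypothetical and are not taken from the study's dataset; the point is only that each attribute-value pair becomes one item.

```python
def to_transaction(record):
    """Turn one tabular record into a set of 'attribute=value' items."""
    return frozenset(f"{attr}={value}" for attr, value in record.items())

# Hypothetical records; attribute names and values are illustrative only.
records = [
    {"age": "under5", "visit": "outpatient", "death": "undetermined"},
    {"age": "over5", "visit": "outpatient", "death": "undetermined"},
    {"age": "over5", "visit": "inpatient", "death": "no"},
]

# Each record is now a "basket" of items that Apriori can count.
transactions = [to_transaction(r) for r in records]
```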
Given a support threshold (S), sets of items X that appear in at least S baskets are called frequent item sets. The task is to find all rules on item sets of the form X→Y with minimum support and confidence. For example, an if-then rule about the content of the baskets, {i1, i2,…,ik} → j, means "if a basket contains all of i1,…,ik, then it is likely to contain j." A typical question for Apriori is to "find all association rules with support ≥ S and confidence ≥ C." In general, the support of an association rule is the frequency of occurrence of the set of items it mentions, and the confidence of the rule is the probability of j given i1,…,ik, that is, the number of transactions containing i1,…,ik and j divided by the number of transactions containing i1,…,ik. This measures the strength of the association between i1, i2,…,ik and j.
Table 3. Apriori methods.
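As a small worked example of these two measures, the sketch below computes support and confidence over four made-up baskets. The item names are hypothetical and chosen only to resemble the attributes discussed in this chapter.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) = supp(A and C) / supp(A).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = frozenset(antecedent) | frozenset(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical baskets, for illustration only.
baskets = [
    frozenset({"visit=outpatient", "age=over5", "death=undetermined"}),
    frozenset({"visit=outpatient", "age=over5", "death=undetermined"}),
    frozenset({"visit=outpatient", "age=under5", "death=undetermined"}),
    frozenset({"visit=inpatient", "age=over5", "death=no"}),
]

rule_support = support(baskets, {"visit=outpatient", "death=undetermined"})
rule_confidence = confidence(baskets, {"visit=outpatient"}, {"death=undetermined"})
```

Here three of the four baskets contain both items, so the rule's support is 0.75, and every outpatient basket also contains the consequent, so its confidence is 1.0.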
The key concepts are frequent item sets (the sets of items that have minimum support, denoted by Li for item sets of size i), the a priori property (any subset of a frequent item set must be frequent), and the join operation (to find Lk, a set of candidate k-item sets is generated by joining Lk-1 with itself). Once frequent item sets are obtained, it is straightforward to generate association rules with confidence larger than or equal to a user-specified minimum confidence. Another strength of the Apriori algorithm is that it achieves good performance by reducing the size of the candidate sets considered for frequent k-item sets [28].
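The join and prune steps can be sketched in a few lines of Python. This is a simplified illustration of the level-wise idea, not the Weka implementation used in the study; the function name and the toy baskets are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {item set: support} for all frequent item sets (level-wise search)."""
    n = len(transactions)

    def frequent(candidates):
        # Count each candidate and keep those meeting the support threshold.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    level = frequent(items)  # L1
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: combine frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (a priori property): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)  # Lk
        result.update(level)
        k += 1
    return result

# Toy baskets with single-letter items, for illustration only.
toy_baskets = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("a")]
frequent_sets = apriori(toy_baskets, min_support=0.5)
```

In this toy run the candidate {a, b, c} is pruned without being counted, because its subset {b, c} is not frequent; that pruning is what keeps the candidate sets small.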
A class implementing an Apriori-type algorithm iteratively reduces the minimum support until it finds the required number of rules with the given minimum confidence [29].
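The iterative reduction of minimum support can be sketched as a simple loop. In this illustration, `mine_rules` is a brute-force stand-in for the actual rule miner, and the parameter names and defaults (`upper`, `lower`, `delta`, `max_size`) are assumptions made for the example rather than settings reported in this study.

```python
from itertools import combinations

def mine_rules(transactions, min_support, min_confidence, max_size=3):
    """Brute-force stand-in: rules X -> j that meet both thresholds."""
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    rules = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = supp(itemset)
            if s == 0 or s < min_support:
                continue
            for j in combo:
                conf = s / supp(itemset - {j})
                if conf >= min_confidence:
                    rules.append((itemset - {j}, j, s, conf))
    return rules

def mine_with_decreasing_support(transactions, num_rules, min_confidence,
                                 upper=1.0, lower=0.1, delta=0.05):
    """Lower the support threshold step by step until enough rules are found."""
    steps = int(round((upper - lower) / delta))
    threshold, rules = upper, []
    for step in range(steps + 1):
        threshold = round(upper - step * delta, 10)
        rules = mine_rules(transactions, threshold, min_confidence)
        if len(rules) >= num_rules:
            break
    return threshold, rules

# Toy baskets with single-letter items, for illustration only.
toy_baskets = [frozenset("ab"), frozenset("ab"), frozenset("a"), frozenset("c")]
threshold, rules = mine_with_decreasing_support(
    toy_baskets, num_rules=1, min_confidence=0.9
)
```

With these toy baskets no rule is found until the threshold reaches 0.5, at which point the rule b → a (confidence 1.0) satisfies the request for one rule and the loop stops.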
For mining, Weka (Waikato Environment for Knowledge Analysis), a knowledge discovery tool written in Java, is used. In Weka 3.7.3, if the class association rule (car) property is enabled, class association rules are mined instead of (general) association rules. Class association mining generates rules on patterns that frequently accompany the probable occurrence of malaria cases. In many studies, associative classification has been found to be more accurate than some traditional classification methods, such as C4.5 [25]. Associative classification can search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels. Because association rules explore highly confident associations among multiple attributes, this approach may overcome some constraints introduced by decision tree induction, which considers only one attribute at a time. In all, in association rule mining, finding all the rules that satisfy both a minimum support and a minimum confidence threshold is important so as to generate strong and interesting rules from the frequent patterns.

eHealth knowledge-based system
Knowledge-based systems are computer programs that try to solve problems in a human expert-like fashion by using knowledge about the application domain (knowledge base) and problem-solving techniques (inference method). A rule-based system is an example of a knowledge-based system that uses rules for knowledge representation and rule-based reasoning as its reasoning technique. The rule-based reasoning technique can be combined with other reasoning techniques to make a knowledge-based system more efficient; for example, case-based and rule-based reasoning can be used together. The development of knowledge-based systems in medical areas has made it possible to provide reliable and thorough diagnostic services at minimum cost. Such systems can be designed to exhaustively consider all possible diseases in a domain, which could outperform human experts in achieving a rapid and accurate diagnosis. Several systems have been implemented in different medical areas like cancer therapy, infections, blood diseases, general internal medicine, glaucoma, and pulmonary function tests [13,14].
Figure 2 presents the detailed architecture of the proposed system. In this research, we tried to integrate the output of data mining (i.e., the interrelationship of risk factors) with a knowledge-based system. The Apriori algorithm, applied within the CRISP-DM methodology, is adopted to derive the interrelationship of risk factors, which is then used for knowledge acquisition to develop a knowledge-based system. A nearest neighbor retrieval algorithm (for retrieval) and a voting method (for the reuse task) are used to design the eHealth knowledge-based system. The knowledge is represented in the form "IF a certain situation holds, THEN take a particular action," and the knowledge acquired from the Apriori algorithm (the interrelationship of risk factors) takes the form of rules.
The user starts using the system or initiating queries by selecting profile and address information. Based on the desired location (region and zone), weather information such as altitude, rainfall, and temperature is filled in automatically from an external weather API. Then, similarity matching is performed using the new query to retrieve and recommend the proposed solution. However, if similarity matching is unsuccessful, the voting technique is applied to select the relevant cases. Finally, to select or recommend the solution, the domain expert evaluates and validates the new case. The knowledge-based system will use the validated case for future purposes.
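The retrieve-then-vote flow described above might look roughly like the following sketch. The similarity function, the threshold that decides when matching counts as "unsuccessful," the value of `k`, and the case contents are all assumptions made for illustration; the chapter does not specify them.

```python
from collections import Counter

def similarity(query, case):
    """Share of query attributes whose values match the stored case."""
    matches = sum(case["features"].get(k) == v for k, v in query.items())
    return matches / len(query)

def recommend(query, case_base, k=3, threshold=0.8):
    """Return (solution, method): nearest neighbour first, voting as fallback."""
    ranked = sorted(case_base, key=lambda c: similarity(query, c), reverse=True)
    best = ranked[0]
    if similarity(query, best) >= threshold:
        return best["solution"], "nearest-neighbour"
    # Fallback: majority vote over the solutions of the k most similar cases.
    votes = Counter(c["solution"] for c in ranked[:k])
    return votes.most_common(1)[0][0], "voting"

# Hypothetical case base; features and solutions are illustrative only.
case_base = [
    {"features": {"age": "over5", "visit": "outpatient", "altitude": "mid"},
     "solution": "refer"},
    {"features": {"age": "over5", "visit": "inpatient", "altitude": "low"},
     "solution": "treat"},
    {"features": {"age": "under5", "visit": "inpatient", "altitude": "low"},
     "solution": "treat"},
]

exact = recommend({"age": "over5", "visit": "outpatient", "altitude": "mid"}, case_base)
partial = recommend({"age": "under5", "visit": "outpatient", "altitude": "low"}, case_base)
```

In the described workflow, whichever solution is returned would still go to the domain expert for validation before the new case is stored.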

Experimental results and discussions
The study tried to explore the interrelationship of risk factors for supporting an eHealth knowledge-based system. We used a malaria dataset as a case study to discover the associations among the various malaria risk factors using association rule discovery, and we integrated the result with the eHealth knowledge-based system.

Experimental setup
Both general and class association rules are used to discover interesting association patterns. A total of 120 experiments were executed using the Apriori algorithm (60 experiments using general association rule mining and 60 experiments using class association rule mining), as depicted in Table 4.
The confidence level is the most important parameter for attaining the required objective. With this in mind, the experiments are run at confidence levels of 100, 90, 80, 70, 60, and 50%. Each confidence level is also tested with a lower-bound minimum support of 10-100%. In both scenarios, the upper bound of the minimum support is 100%.
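The count of 120 experiments is consistent with a full grid over these settings if the lower-bound support is varied in steps of 10%, which is an assumption here since the step size is not stated. A minimal sketch of that grid:

```python
from itertools import product

confidence_levels = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
# Assumption: lower-bound support moves from 10% to 100% in steps of 10%.
support_levels = [s / 10 for s in range(1, 11)]
modes = ["general", "class"]

experiments = [
    {"mode": m, "min_confidence": c, "lower_min_support": s, "upper_min_support": 1.0}
    for m, c, s in product(modes, confidence_levels, support_levels)
]

# 6 confidence levels x 10 support levels = 60 runs per mode, 120 in total.
runs_per_mode = {m: sum(e["mode"] == m for e in experiments) for m in modes}
```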

Generated association rules
From the experiment, we observed that class association mining supports the rules generated in general association mining. It also discovers interesting interrelationships (with 100% confidence level) related to the type of visit, age, altitude, temperature, and malaria type. With 100% confidence level and 60% support, outpatient cases are more closely related to the undetermined occurrence of death, specifically when the age group of the malaria patient is greater than 5. On the one hand, the result indicates that occurrence of death is mostly related to outpatient cases rather than inpatient ones. This suggests that health workers give great attention and intensive care to inpatient visits. On the other hand, most outpatient visits are uncomplicated lab-confirmed malaria. However, the occurrence of death is undetermined and probable when the type of malaria visit is outpatient and the age of the patient is greater than 5. This may be because of a lack of qualified health workers and because patients are not properly prescribed, as confirmed by Ndiaye et al. [30] in Senegal, where lay health workers made negative diagnostic tests. Table 5 illustrates the summary of experimental results. It is difficult to determine the occurrence of death for outpatient cases, and the experimental result revealed that the occurrence of death is related to the increase in malaria cases. Roca-Feltrer et al. [31] noted that the increase in malaria cases is related to transmission intensity, seasonality, and age, which leads to a probability of occurrence of deaths. For instance, the experimental result in West Gojjam (specifically in November) supports the probability of occurrence of deaths being related to the increase in malaria cases.

Discussions
Knowing the seasonality of malaria helps to provide proper intervention to eradicate occurrence of death and cases [31]. Roca-Feltrer et al. [32] relate the transmission intensity, seasonality, and age pattern of malaria and confirm that younger age groups are increasingly affected as transmission intensity increases. Our experimental result also confirmed that occurrence of death is undetermined if the altitude is 1500-2000 m and when the age of the person is greater than 5, with confidence levels of 62 and 60%, respectively. This happens because of high transmission intensity. With 100% confidence level, if the type of malaria visit is outpatient and age is below 5, it is difficult to predict occurrence of deaths. Interestingly, occurrence of malaria death is related to severe anemia rather than pregnancy. As discussed by Knoblauch et al. [33], anemia is prevalent in 6- to 59-month-old children, and the association of anemia with a child's age reflects underlying iron requirements, which are related to growth rate; hence iron demand declines with age. Further, the algorithm found (with 100% confidence level) that the type of malaria is unknown for inpatient malaria visits.
Examples of the discovered rules:
• If temperature is between 15 and 20°C, the type of malaria visit is outpatient, and the type of case is uncomplicated lab-confirmed malaria, then the occurrence of death is undetermined.
• If the type of malaria is PV, then occurrence of death is undetermined, and if the type of malaria is PF, then occurrence of death is undetermined.
• If age is under 5 years and the type of malaria visit is outpatient, then occurrence of death is undetermined.
However, some unexpected or interesting interrelationships also emerged; for example, with 100% confidence level, it is difficult to determine the occurrence of death for both PV and PF malaria types. This needs further investigation to verify whether it is unrelated or expected.
In all, the study presents the association of malaria risk factors using climate, elevation, location, type of malaria, type of malaria visits, number of cases, and death attributes. Both general and class association mining are done using the Apriori technique to discover the associations or patterns between the occurrence of deaths and the type of cases and malaria visits. The output of data mining (i.e., the interrelationship of risk factors) is then integrated with knowledge-based reasoning. A nearest neighbor retrieval algorithm (for retrieval) and a voting method (for the reuse task) are used to design and deliver a personalized knowledge-based system.

Evaluations
The evaluation of the result is carried out by combining expert-based and tool-based testing approaches. An overall measure of pattern value, combining novelty, usefulness, and simplicity, is used to measure the interestingness of the interrelationship with respect to a predefined goal. We used multiple measures in the evaluation, such as accuracy, support level, confidence, confidence level, and complexity, with tenfold cross validation. We adopted four measures (sensitivity, specificity, prediction accuracy, and precision) to evaluate the correctness of the interrelationship and validated the system through performance testing. Tenfold cross validation is used in the experiment to estimate the error rate [34,35]. The basic measure is accuracy, which computes the percentage of correctly classified instances in the test set. The accuracy of a test compares how close a new test value is to a value predicted by if-then rules [36].
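For reference, the four measures can be computed from the counts of a binary confusion matrix as sketched below. The counts passed in the example are hypothetical and are not results from this study.

```python
def evaluation_measures(tp, fp, tn, fn):
    """Standard measures from true/false positive and negative counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of actual positives found
        "specificity": tn / (tn + fp),  # share of actual negatives found
        "precision": tp / (tp + fp),    # share of predicted positives that are correct
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts, for illustration only.
measures = evaluation_measures(tp=40, fp=10, tn=45, fn=5)
```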
The interrelationship of risk factors (association rules) was evaluated in terms of the number of rules and the meaning of the patterns generated at different minimum support and confidence thresholds, in order to measure the interestingness of the rules. The minimum support and confidence thresholds varied from 0.1 to 1 and 0.5 to 1, respectively. As depicted in Table 4, at 90% confidence level with minimum support of 60 and 20%, the technique generates 2 and 10 rules, respectively. Furthermore, we investigate the following indicators of the quality of the rule ranking induced by the interestingness measures of the mining algorithm: the average rank of the first rule that covers a test instance and the average rank of the first rule that covers and correctly predicts a test instance.
The performance of the eHealth knowledge-based system prototype is evaluated using test cases. The effectiveness of the retrieval process of the eHealth knowledge-based system is measured using recall and precision, which are useful measures of retrieval performance [34]. Recall is the percentage of relevant cases for the query (new case) that are retrieved, whereas precision is the percentage of retrieved cases that are relevant to the query [34,36,37]. Accuracy is used to measure the performance of the reuse process [34,36].
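A minimal sketch of these two retrieval measures follows. The case identifiers are hypothetical, and the scores are returned as fractions rather than percentages.

```python
def retrieval_scores(retrieved, relevant):
    """Precision and recall of a set of retrieved cases against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical case identifiers, for illustration only.
precision, recall = retrieval_scores(
    retrieved=["c1", "c2", "c3", "c4"],
    relevant=["c1", "c2", "c5"],
)
```

Two of the four retrieved cases are relevant (precision 0.5), and two of the three relevant cases were retrieved (recall of about 0.67).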
The case similarity testing shows that when the query is made up of attribute values that are identical to those of a case from the case base, the global similarity is 1.0. When the attribute values of the query and the case in the case base differ, the global similarity value decreases. Therefore, adding cases to the case base improves the performance of the knowledge-based reasoning system in solving problems (new cases).
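One common way to obtain such a global similarity is a weighted average of per-attribute (local) similarities. The sketch below assumes exact-match local similarity and made-up importance weights; the chapter does not state the actual formula or weights used in the prototype.

```python
def global_similarity(query, case, weights):
    """Weighted share of attributes on which the query and the case agree."""
    total = sum(weights.values())
    matched = sum(
        w for attr, w in weights.items()
        if attr in query and query[attr] == case.get(attr)
    )
    return matched / total

# Hypothetical importance weights and cases, for illustration only.
weights = {"age": 2, "visit": 1, "altitude": 1}
stored = {"age": "over5", "visit": "outpatient", "altitude": "mid"}

identical = global_similarity(dict(stored), stored, weights)
one_mismatch = global_similarity(
    {"age": "over5", "visit": "inpatient", "altitude": "mid"}, stored, weights
)
```

An identical query scores 1.0, while a mismatch on the lower-weighted `visit` attribute reduces the score to 0.75.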
The nearest neighbor algorithm, which is used to develop the retrieval process of the prototype, uses distance to compute the similarity between the query and the cases by representing the cases as N-dimensional vectors. However, the recommendation does not have clear boundaries, as it is subjective and depends on the experience of the domain experts, as tested and adopted in [19,23,24]. In addition, the importance values assigned to the attributes of the case structure are set manually with the help of the domain experts, as no research has been conducted on the importance values of attributes in malaria case management. This could affect the retrieval and reuse performance of the prototype. The system also needs user acceptance testing (measuring usability with the system usability scale) in real-world scenarios to determine whether potential users would like to use the proposed system frequently. With that testing in place, the eHealth knowledge-based system for retrieving relevant cases and proposing solutions can be assessed on user acceptance, accuracy, and domain expert evaluation.

Conclusion and future work
The experimental result presents the association of risk factors (in relation to the occurrence of malaria death and type of case identification in Ethiopia) using climate, elevation, location, type of malaria, type of malaria visits, number of cases, and death attributes. Both general and class association mining are done using the Apriori technique to discover the associations or patterns of risk factors. The results indicate a strong association between occurrence of deaths, type of malaria visits, age, and type of cases. More interestingly, the study finds that malaria deaths are mostly related to severe anemia cases rather than pregnancy. Health institutions therefore have to give great attention to providing the necessary diagnosis and treatment for anemia, especially in regions that are more vulnerable to malaria. The study also provides a significant contribution to designing an optimal strategy in support of the malaria prevention and control program within the country. As future work, it is important to proceed with usability and user acceptance testing of the eHealth knowledge-based system in real time and to perform testing that compares and contrasts the system with domain experts.