Variables defining the SP4 internal case base
Although knowledge-based systems (KBS), and more generally decision support systems (DSS), represent one of the commercial successes resulting from artificial intelligence (AI) research, their developers have repeatedly encountered several problems covering their whole life cycle (Watson, 1997). In this context, knowledge elicitation as well as system implementation, adaptation and maintenance are non trivial issues to be dealt with. With the aim of overcoming these problems, Schank (1982) proposed a revolutionary approach called case-based reasoning (CBR), which is in effect, a model of human reasoning. The idea underlying CBR is that people frequently rely on previous problem-solving experiences when facing up new problems. This assertion may be verified in many day to day problem-solving situations by simple observation or by psychological experimentation (Klein & Whitaker, 1988). Since the ideas underlying case-based reasoning were first established, CBR systems have been found to be successful in a wide range of application areas (Kolodner, 1993; Watson, 1997; Pal et al. 2000). Motivated by the outstanding achievements obtained, some relevant conferences (i.e. ECCBR
In this chapter we present key aspects related with the application of CBR methodology to the construction of adaptable decision support systems. The rest of the chapter is organized as follows: Section 2 introduces an overview about CBR life cycle and combination strategies for constructing hybrid AI systems. Section 3 introduces and covers the main characteristics of four successful decision support systems developed following CBR principles. Finally, Section 4 summarizes the main conclusions and presents the fundamental advantages of adopting this methodology.
2. CBR life cycle and combination strategies
A case-based reasoning system solves new problems by adapting solutions that were used to solve previous problems (Riesbeck & Schank, 1989). The case base holds a number of cases, each of which represents a problem together with its corresponding solution. Once a new problem arises, a possible solution to it is obtained by retrieving similar cases from the case base and studying their recorded solutions. A CBR system is dynamic in the sense that, in operation, cases representing new problems together with their solutions are added to the case base, redundant cases are eliminated and others are created by combining existing cases.
Every time a CBR system copes with a new problem situation, retrieves previously stored cases together with their solutions, matches them against the new problem context, adapts previous outcomes to provide an answer to the new problem and stores the new solution by adding a novel case in the case base. All of these actions are self-contained and can be represented by a cyclic sequence of processes (see Figure 1) in which human interaction may be needed. A typical CBR system is composed of four sequential steps which are called into action each time a new problem is to be solved (Watson, 1997; Kolodner, 1993; Aamodt & Plaza, 1994): (
Each of the steps comprising the CBR life cycle defined in Figure 1 requires a model or method in order to accurately perform its mission (see Figure 2). The purpose of the retrieval step is to search the case base and select one or more previous cases that most closely match the new problem situation, together with their solutions. The selected cases are reused to generate a solution appropriate to the current problem situation. This solution is revised if necessary and finally, the new case (i.e. the problem description together with the obtained solution) is stored in the case base. In the CBR cycle there is normally some human interaction. Whilst case retrieval and reuse may be automated, case revision and retention are often undertaken by human experts. This is a current weakness of CBR systems and one of their major challenges. As showed in Figure 2, the techniques commonly used for implementing the different stages of a typical CBR system include: knowledge-based systems, artificial neural networks (ANN), genetic algorithms (GA), rule-based systems (RBS), qualitative reasoning (QR), fuzzy systems (FS) and constraint satisfaction problems (CSP).
As the core of CBR systems is its memory, cases should therefore accurately represent both problems and their solutions. In this context, cases may be deleted if they are found to produce inaccurate solutions, they may be merged together to create more generalised solutions, and they may be modified, over time, through the experience gained in producing improved solutions. If an attempt to solve a problem fails and it is possible to identify the reason for the failure, then this information should also be stored in order to avoid the same mistake in the future. This corresponds to a common learning strategy employed in human problem-solving. Rather than creating general relationships between problem descriptors and conclusions, as is the case with rule-based reasoning, or relying on general knowledge of the problem domain, CBR systems are able to utilise the specific knowledge of previously experienced, concrete problem situations. A CBR system provides an incremental learning process because each time a problem is solved, a new experience is retained, thus making it available for future reuse.
As design methodology adequate for constructing DSS, case-based reasoning can be used by itself or as part of another intelligent or conventional computing system. From the last years, there has been an increasing interest in the possibility of integrating different AI techniques with the goal of constructing more powerful and accurate hybrid systems. In this context, the work of Soucek (1991) established the IRIS (
Therefore, given their hybrid nature, CBR systems can be easily combined with other alternatives in order to construct robust decision support systems. These approaches include their successful hybridization with expert systems (Vo & Macchion, 1993; Rissland et al. 1993; Medsker, 1995), fuzzy logic (Xu, 1994; Gui, 1993; Dubois et al. 1997), genetic algorithms (Louis et al. 1993; Oppacher & Deugo, 1991), qualitative reasoning (Navinchandra et al. 1991), constraint satisfaction systems (Maher & Zhang, 1993; Hinrichs, 1992), artificial neural networks (Thrift, 1989; Lim et al. 1991; Liu & Yan, 1997; Corchado et al. 2001) and bayesian networks (Shinmori, 1998; Aamodt y Langseth, 1998; Dingsøyr, 1998; Friese, 1999; Langseth et al., 1999) among others.
Furthermore, case-based reasoning can be a particularly appropriate problem-solving strategy when the knowledge required to formulate a rule-based model of the domain is difficult to obtain, or when the number or complexity of rules relating to the problem domain is too large for conventional knowledge acquisition methods. In this sense, according to the work of Aamodt and Plaza (1994) there are five different types of CBR systems, and although they share similar features, each of them is more appropriate for a particular type of problem: (
EBR systems are especially suitable for classification tasks, in which the category of the most similar past case becomes the solution to the new problem. The set of classes directly constitutes the collection of possible solutions applied without modification. IBR systems are a specialization of exemplar-based reasoning for solving problems in which the instances (cases) are usually very simple (e.g. feature vectors). These systems can be completely automated with no user intervention in their whole life cycle. MBR systems supplement previous approaches with the capacity of parallel processing computation. ABR systems are particularly applicable for solving new problems based on past cases from a different domain. In order to properly work, it should be possible to transfer the solution of a source analogue situation to the present target problem. In typical CBR systems cases are assumed to present a certain complexity level (in terms of their internal structure and the information contained), therefore some modification is needed in order to adapt retrieved solutions when applied to a different problem solving context.
3. Practical applications
The decision support systems covered in this chapter come from four different research areas: industrial planning, biomedical domain, oceanographic forecasting and anti-spam filtering. All the implemented applications are fully designed following the CBR paradigm in order to empower their adaptability and accuracy for solving new problems in their respective fields.
For each domain, we first introduce the target problem to be solved together with the main aspects surrounding each particular situation. A clear description of the representation used for defining the case base is presented and the internal architecture governing each system is explained in detail.
3.1. Industrial planning
Production scheduling is one of the most important functions in a production company. As a consequence, in recent decades various methods have been proposed for the modelling and solution of particular scheduling problems (Akyol & Bayhan, 2007). In the particular case of cooperative poultry farms, the accurate coordination of centralized feed supply (production and distribution) between scattered farms is of utmost importance for both the main feed manufacturer and participating farmers.
In such a situation, some key aspects involving the main participants need to be taken into consideration, for example (
In this context, a worthwhile objective for the main feed production plant of any cooperative is to determine, on a daily basis, (
|Farm Id||Alphanumeric||Farm identifier.|
|Shed Id||Numeric||Shed identifier. Each farm may have several sheds. This value differentiates each shed in a given farm.|
|Entry date||dd-mm-yyyy||Date on which the lot arrives at the farm.|
|Final date||dd-mm-yyyy||Date on which the lot is slaughtered.|
|Feed type||Enumeration||Identifies the specific variety of feed.|
|Breed||Enumeration||Specific breed type of the lot.|
|Activity||Enumeration||This variable indentifies the main role of the lot. There are five possible values: |
|Subactivity||Enumeration||Activity specialization. Each activity may have several specific variants. Possible values include: |
|Type||Enumeration||Class of fatten chicken. Animals used on fatten activity are classified into several categories like ’A’, ‘C’, ‘CA’, ‘H’, ‘M’, etc. This attribute is only used on lots belonging to |
|Number of males||Numeric||Number of male animals the lot.|
|Number of females||Numeric||Number of female animals the lot.|
|Number of unsexed||Numeric||Number of animals for which the sex is unknown.|
|Season||Enumeration||Period of the year in which the lot spends the most days on the farm.|
|Kilograms||Numeric||Amount of feed consumed by the lot from its entry to its slaughter.|
Efficient farming of animal husbandry on such an industrial scale relies largely on several variables and limitations that are normally correlated. In such a situation, characterized by multiple scattered farms that depend on a centralized supply centre, there are two main subjects to take into account: (
Given the large number of variables and rules that play an important role in the whole process of correctly supplying feed for all the farms comprising a given cooperative, the proposed decision support system has been organized into three different but complementary subsystems: (
As it can be seen from the operational diagram showed in Figure 3, each subsystem works with unrelated information coming from both existing data and knowledge bases, but also considering the output of the previous stage. In the proposed architecture, each phase solves a different problem following a bottom-up approach.
Subsystem one: each lot of animals consumes several types of feed during each growth stage. The objective of this phase is to identify the variants of feed that a given lot is going to consume before slaughter, taking into consideration similar past cases belonging to different farms and periods. The current output (qualitative information) serves as the basis for the next phase to compute the total amount of feed needed by both a given lot and a given feed type (quantitative information).
Subsystem two: given a new lot and the different types of feed that it will consume before slaughter, it estimates the total consumption (kilograms) by feed type together with the consumption percentage per animal. Several variables influence the output of this phase: number of animals by sex (male and female), week of year, mortality indexes, etc.
Subsystem three: this integrates previous calculations by using an ad hoc planning algorithm for simulating feed consumption and production along with the time. It allows the user to easily visualize production activity though an intuitive and configurable GUI (Graphic User Interface). Incidences in the initial estimated MPS can be reported and accounted for in reconsidering daily production.
The data used for constructed the system case base was provided by a leading Spanish company specialized in animal feed production and delivery. Raw data (from the years 2007 and 2008) was built from client orders, company production logs, information about the number of animals at different farms and truck trips to the clients. A total of 5112 records were stored in the SP4 internal case base after executing the initial operation for detecting and solving existing inconsistencies (procedure 1.A. in Figure 3). In order to increase the confidence of experimental findings, a cross-validation process was carried out in the all of the experiments.
The final developed system used in the experiments carried out (see Figure 4) was set up to forecast the total amount (kilograms) of each feed type consumed by each lot from its entry date to its slaughter. Moreover, further experiments have been carried out to compare the performance of the SP4 system with several other forecasting approaches. These include standard statistical forecasting algorithms, decision trees and the application of neural networks methods. The results obtained from experimentation revealed that the proposed system performed optimally, being able to track the dynamic non-linear trend and seasonality, as well as the numerous interactions between correlated variables.
3.2. Biomedical domain
In recent years DNA microarray technology has become a fundamental tool in genomic research, making it possible to investigate global gene expression in all aspects of human disease (Russo et al. 2003). Microarray technology is based on a database of over 40,000 fragments of genes called expressed sequence tags (ESTs), which are used to measure target abundance using the scanned intensities of fluorescence from tagged molecules hybridized to ESTs. Following the advent of high-throughput microarray technology it is now possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Since the number of examined genes in an experiment runs to the thousands, different data mining techniques have been intensively used to analyze and discover knowledge from gene expression data (Piatetsky-Shapiro & Tamayo 2003). However, having so many fields relative to so few samples creates a high likelihood of finding false positives. This problem is increased if we consider the potential errors that can be present in microarray data.
Bioinformatics and medical informatics are two research fields that serve the needs of different but related communities. Both domains share the common goal of providing new algorithms, methods and technological solutions to biomedical research, and contributing to the treatment and cure of diseases. Although different microarray techniques have been successfully used to investigate useful information for cancer diagnosis at the gene expression level, the true integration of existing methods into day-to-day clinical practice is still a long way off (Sittig et al. 2008). Within this context, case-based reasoning emerges as a suitable paradigm specially intended for the development of biomedical informatics applications and decision support systems, given the support and collaboration involved in such a translational development (Jurisica & Glasgow, 2004).
In addressing the issue of bridging the existing gap between biomedical researchers and clinicians who work in the domain of cancer diagnosis, prognosis and treatment using microarray data, we have developed and made accessible a common interactive framework: the geneCBR decision support system (Glez-Peña et al. 2009a). Our geneCBR system implements a freely available software tool that allows the use of combined techniques that can be applied to gene selection, clustering, knowledge extraction and prediction for aiding diagnosis in cancer research. For biomedical researches, geneCBR expert mode offers a core workbench for designing and testing new techniques and experiments. For pathologists or oncologists, geneCBR diagnostic mode implements an effective and reliable system that can diagnose cancer subtypes based on the analysis of microarray data using CBR architecture. For programmers, geneCBR programming mode includes an advanced edition module for run-time modification of previous coded techniques.
In order to initially construct the knowledge base starting from the available patient’s data showed in Table 2, geneCBR stores the gene expression levels of each microarray sample in its case base (lower part of Figure 5).
|NAME||Alphanumeric||Unique identifier. Denomination of the microarray sample.|
|#Age||Numeric||Age of the patient.|
|#Sex||Enumeration||Possible values are: |
|#FAB/WHO||Alphanumeric||The FAB classification is a morphological characterization of leukemia. The WHO classification incorporates the results of chromosome and genetic research developed during the past 25 years after the FAB classification.|
|#FISH studies||Alphanumeric||FISH studies are used to delineate complex chromosome rearrangements, diagnose microdeletion syndromes, or demonstrate the presence of molecular rearrangements characteristic of certain hematologic malignancies.|
|Gene name ||Alphanumeric||Human gene name or identifier.|
|Gene value ||Numeric||Microarray gene expression value for the associated gene identifier.|
|….||…||The number of gene name-value pairs depends on the microarray type (Affymetrix HG-U133A/B/Plus, etc.).|
|Gene name ||Alphanumeric||Human gene name or identifier.|
|Gene value ||Numeric||Microarray gene expression value for the associated gene identifier.|
|Class||Alphanumeric||Type of disease.|
During the retrieval stage, the original case vectors are transformed into fuzzy microarray descriptors (FMDs). Each FMD is a comprehensible descriptor of the sample in terms of a linguistic label for each gene expression level (central part of Figure 5). This transformation is carried out by means of an accurate fuzzy discretization process (Díaz et al. 2006). Based on the FMD representation created from the case base, a set of fuzzy patterns (FP) is constructed that represents the main characteristics of the a priori known classes (top-left square in Figure 5). Each class in the system is then represented by a FP that holds the fuzzy codification of gene expression levels for those genes that were flagged as relevant for this class. Several FPs are generated from the data in a supervised way, each one representing a group of FMDs for pathological (or generic) specy.
The retrieval stage in the geneCBR system uses the FPs in order to select the most representative genes given a new patient. This phase can be thought of as a gene selection step, in which the aim is to retrieve the list of genes that might be most informative given a new sample to classify. Since it is highly unlikely that all the genes have significant information related to cancer classification and the dimensionality would be too great if all the genes were used, it is necessary to explore an efficient way to obtain the most suitable group of genes. In order to make this selection, our geneCBR system selects those fuzzy patterns from its case base which are the nearest to any new case obtained. Then, for each one of the selected FPs, the geneCBR system computes its associated DFP (a pattern which only includes the genes that are necessary in order to discriminate the novel instance from other different classes). Finally, the selected genes for the new case are obtained by joining together the genes belonging to the DFPs considered.
The adaptation of previous cases in order to solve a new FMD is accomplished in the reuse stage (left bottom square in Figure 5). A growing cell structures (GCS) network (Fritzke, 1993) is trained with the whole case base, only taking the existing cases represented by the genes selected in the previous stage as input. Then, the new FMD is presented to the GCS network and the patients most similar from a genetic point of view are retrieved. Based on this grouping, a proportional weighted voting mechanism is applied that ponders the level of similarity with the new FMD. An initial class is assigned by the geneCBR system from among the existing pathologies (Glez-Peña et al. 2009b).
In the revision stage (right bottom square in Figure 5) the expert is provided with useful data about the decision made by the system. This information contains the selected DFP genes, the grouping made by the GCS network, the weighting assigned to each class and a set of See5 (Quinlan, 2000) classification rules generated from the most similar patients. The expert contrasts the initial prediction given by the system with other external information such as patient karyotype or clinical history in order to ascertain a revised prediction and a final diagnostic.
Every time a new FMD is solved, the internal structure of the geneCBR system is updated (top-right square in Figure 5). The new FMD is associated to its corresponding class and added to the case base. The affected FP is updated and the system marks the most similar patients selected for future classifications. In this stage the geneCBR system changes to edit-mode and the expert is permitted to update patient’s classification taking into account the new knowledge generated.
The geneCBR represents a new translational decision support system that can effectively support the integrative work of programmers, biomedical researches and clinicians working together in a common framework. Figure 6 shows a screenshot of the geneCBR system working in expert mode (specially intended for biomedical researches). The code of the project is freely available under the GPL license and can be obtained at http://www.genecbr.org/.
3.3. Oceanographic forecasting
The oceans of the world form a highly dynamic system for which it is difficult to create mathematical models (Tomczak & Fodfrey, 1994). Red tides are the name for the discolourations caused by dense concentrations of microscopic sea plants, known as phytoplankton. The discolouration varies with the species of phytoplankton, its pigments, size and concentration, the time of day, the angle of the sun and other factors. Red tides usually occur along the North West coast of the Iberian Peninsula in late summer and autumn. The prevailing southerly winds cause cold, nutrient-rich water to rise up from the deeper regions of the ocean to the surface, a process known as upwelling. Swept along with this upwelled water are dinoflagellate cysts, the resting stages of the organism, which lie dormant in the sediments on the sea floor. The high nutrient concentrations in the upwelled water, together with ideal conditions of temperature, salinity and light, trigger the germination of the cysts, so that the dinoflagellates begin to grow and divide. The rapid increase in dinoflagellate numbers, sometimes to millions of cells per liter of water, is described as a bloom of phytoplankton (concentration levels above the 100,000 cells/liter). Concentration of the bloom by wind and currents, as well as the dinoflagellates’ ability to swim to the surface, combines to form a red tide.
In this context, two situations of special interest are those corresponding to the false alarms and the blooms not detected. The former refers to predictions of bloom (concentration of pseudo-nitzschia ≥ 100,000 cell/liter) which do not actually materialize (real concentration ≤ 100,000 cell/liter). The latter, more problematic, occurs when a bloom exists but the model fails to detect it. Another unwelcome situation takes place when the number of predictions exceeds an absolute error of 100,000 cell/liter (labelled as incorrect predictions). In such a situation, in which the rules governing a target process or system are unknown, the prediction of the parameter values that determine the characteristic behaviour of the system can be a problematic task. However, it has been found that a hybrid case-based reasoning system can provide a more effective means of performing such predictions than other connectionist or symbolic techniques (Fdez-Riverola & Corchado, 2003).
Related with this domain we describe the FSfRT (Forecasting System for Red Tides) system, a hybrid model able to accurately forecast the concentrations of pseudo-nitzschia spp, the diatom that produces the most harmful red tides causing amnesic shellfish poisoning (or ASP). Our FSfRT system employs a case-based reasoning model to wrap a growing cell structures network, a radial basis function network (Fritzke, 1994) and a set of Sugeno fuzzy models (Jang et al. 1997) to provide an accurate prediction. Each of these techniques is used at a different stage of the reasoning cycle of the decision support system to retrieve historical data, to adapt it to the present problem and to automatically review the proposed solution.
The forecasting system uses information from two main sources: (i) data coming from several buoys and monitoring net used to create a succession of problem descriptors able to characterize the current forecasting situation and (ii) data derived from satellite images stored on a database. The satellite image data values are used to generate cloud and superficial temperature indices which are then stored with the problem descriptor and subsequently updated during the CBR operation. Figure 7 shows a schematic view of the whole data managed by the FSfRT system.
In order to forecast the concentration of pseudo-nitzschia spp at a given point a week in advance, a problem descriptor is generated on a weekly basis. A problem descriptor consists of a sequence of sampled data values (filtered and pre-processed) recorded from the water mass to which the forecast will be applied. The problem descriptor also contains various other numerical values, including the current geographical location of the sensor buoys and the collection time and date. Every week, the concentration of pseudo-nitzschia spp is added to a problem descriptor forming a new input vector. The problem descriptor is composed of a vector with the variables that characterise the problem recorded over two weeks. The prediction or output of the system is the concentration of pseudo-nitzschia spp one week later, as indicated in Table 3.
The cycle of forecasting operations (which is repeated every week) proceeds as depicted in Figure 8. First a new problem instance is created from the pre-processed data cited above. When a new problem is presented to the system, the GCS neuronal network is used to obtain k more similar cases to the given problem (identifying the class to which the problem belongs). In the reuse phase, the values of the weights and centers of the neural network used in the previous forecast are retrieved from the knowledge base. These network parameters together with the k retrieved cases are then used to retrain the RBF network and to obtain an initial forecast of the concentration of pseudo-nitzschia spp (see Figure 8). During this process the values of the parameters that characterise the network are updated.
|Location||Alphanumeric||Geographical location of the sensor buoy.|
|Date||dd-mm-yyyy||Date on which the measure was made.|
|Time||hh-mm-ss||Time on which the measure was made.|
|Temperature||Numeric||Water temperature (cent. degrees) at different depths.|
|Oxygen||Numeric||Oxygen concentration (milliliters/liter) at different depths.|
|Transmitance||Numeric||Fraction (percentage) of sun light that passes through the sea-water.|
|Fluorescence||Numeric||Sea-water fluorescence (percentage) at different depths.|
|Cloud index||Numeric||Cloud measurement derivate from a geostationary satellite.|
|Recount of diatoms||Numeric||Algae concentration (in cell/liter) at different depths.|
|Pseudo-nitzschia spp||Numeric||Diatom concentration (in cell/liter) at different depths causing harmful algae blooms.|
|Numeric||Diatom concentration (in cell/liter) to be predicted.|
In the revision phase, the initial solution proposed by the RBF neural network is modified according to the responses of the four fuzzy revision subsystems. Each revision subsystem has been created from the RBF network using neurofuzzy techniques (Jin & Sendhoff, 2003). For each class of the GCS neural network a vector of four values is maintained (see Figure 8). This expert’s score vector is initialised with a value of (0.25, 0.25, 0.25, 0.25) and represents the accuracy of each revision subsystem with respect to a class. During revision, the appropriate expert’s score vector is used to ponder the outputs of each fuzzy revision system. Each vector value is associated with one of the four revision subsystems. For each forecasting cycle, the value of the importance vector associated to the most accurate revision subsystem is increased and the other three values are proportionally decreased. This is done in order to give more relevance to the most accurate revision subsystem.
The revised forecast is then retained temporarily in the forecast database. When the real value of the concentration of pseudo nitzschia spp is measured, the forecast value for the variable can then be evaluated, though comparison of the actual and forecast value and the error obtained. A new case, corresponding to this forecasting operation, is then stored in the case base. The forecasting error value is also used to update the importance vector associated with the revision subsystems of the retrieved class.
The FSfRT system was successfully tested using real data collected from years [1992, 2000] coming from geographical area A0 (42º28.90’ N, 8º57.80’ W 61 m). Figure 9 shows a screenshot of the FSfRT interface implemented for oceanographic forecasting.
3.4. Anti-spam filtering
The E-mail service is a computer-based technology built as the result of transforming the old postal delivery in order to use it over networks and Internet. Nowadays, e-mail addresses are present on every business card close to other relevant contact info such as the postal address or the phone number. However, for more than one decade the use of e-mail has been bedeviled by the curse of spamming, so spam is beginning to undermine the integrity of e-mail and even to discourage its use.
In this context, spam is a term used to designate all forms of unsolicited commercial e-mail and can be formally defined as an electronic message satisfying the following two conditions: (i) the recipient's personal identity and context are irrelevant because the message is equally applicable to many other potential recipients and (ii) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent (SpamHaus, 1998).
Due to some attractive characteristics of e-mail (low cost & fast delivery) it actually becomes the main distribution channel of spam contents. Every day e-mail users receive lots of messages containing offers to buy illegal drugs, replicas of Swiss watches, fake jobs, forged university diplomas, etc. This situation has led to a progressive increasing of the spam global ratio in email traffic. During September 2010, the percentage of spam deliveries accounted for about 92 percent of all Internet e-mail traffic (MessageLabs, 2010).
In order to successfully fight against spam (i.e. ideally eliminate it), both theoretical and applied research on spam filtering becomes fundamental. In this context, much valuable research work has been previously carried out (Guzella & Caminhas, 2009) and some relevant conferences have grown up in the field (CEAS, 2010). Moreover, several commercial products have been released and distributed from the software industry to a huge amount of final users with the goal of minimizing spam drawbacks.
With the goal of providing an effective solution we present the SpamHunting system (Fdez-Riverola et al. 2007), an instance-based reasoning e-mail filtering model that outperforms classical machine learning techniques and other successful lazy learner’s approaches in the domain of anti-spam filtering. The architecture of the decision support filter is based on a tuneable enhanced instance retrieval network able to accurately generalize e-mail representations. The reuse of similar messages is carried out by a simple unanimous voting mechanism to determine whether the target case is spam or not. Previous to the final response of the system, the revision stage is only performed when the assigned class is spam whereby the system employs general knowledge in the form of meta-rules.
In order to correctly represent incoming e-mails, a message descriptor (instance) is generated and stored in the e-mail base of the SpamHunting system. This message descriptor contains the sequence of features that better summarize the information contained in the e-mail. For this purpose, we use data from two main sources: (i) information obtained from the header of the e-mail and (ii) those terms that are more representative of the subject, body and attachments of the message. Table 4 summarizes the structure of each instance stored in the SpamHunting e-mail base.
Figure 10 illustrates the life cycle of the IBR SpamHunting system as well as its integration within a typical user environment. In the upper part of Figure 10, the mail user agent (MUA) and the mail transfer agent (MTA) are in charge of dispatching the requests generated by the user. Between these two applications, SpamHunting captures all the incoming messages (using POP3 protocol) in order to identify, tag and filter spam.
|ID||Numeric||Unique message identifier.|
|Return Path||Alphanumeric||Indicates the address that the message will be returned to if one chose to reply.|
|Date||dd-mm-yyyy||Date in which the message was sent.|
|Language||Alphanumeric||Particular tongue of the message.|
|Attached Files||Numeric||Indicates the number of attached files.|
|Content Type||Enumeration||MIME type.|
|Relevant Terms||Numeric||Number of selected features to cluster the message.|
|Total Terms||Numeric||Number of features contained in the message.|
|Frequency-Term Descriptor||Array of feature-frequency pairs||Storing for each feature a measure of their frequency in the message.|
|Class||Enumeration||Message category. Possible values are: |
Whenever SpamHunting receives a new e-mail, the system evolves through the four steps depicted in the lower part of Figure 10 as shadowed rectangles. Initially the system identifies those e-mails that best represent the new incoming message (left upper quadrant in Figure 10), only taking into account the set of messages with the highest number of terms in common. Each one of the previous selected messages contributes with one vote to the final class (left bottom quadrant in Figure 10). The revision stage is only carried out when the system proposes an e-mail as spam. For this purpose, SpamHunting uses previous encoded header meta-rules (right bottom quadrant in Figure 10). Every time the user checks his mailbox and provides feedback to the system about a previous e-mail classification, SpamHunting stores a new instance (or modifies an existing one) in the e-mail base for future use (right upper quadrant in Figure 10).
The retrieval stage is carried out using our enhanced instance retrieval network (EIRN) model. The EIRN network facilitates the indexation of instances and the selection of those that are most similar to the instance-message. The reuse of similar e-mails is carried out by means of the utilization of a unanimous voting mechanism, which generates an initial solution by creating a model with the retrieved instances. In the revision stage the system employs general knowledge in the form of meta-rules that are extracted from the e-mail headers. Finally, the retain (learning) stage is carried out whenever the system classifies an incoming e-mail, updating the knowledge structure of the whole system. The proposed system also takes into account the feedback of the user when it receives an incorrectly classified e-mail (dotted arrow from the user computer to the e-mail base in Figure 10).
One of the most relevant features of our EIRN structure is the possibility of generating different views of the underlying indexed knowledge. Figure 11 shows a screenshot of the software developed for taking advantage of this capacity. As we can observe from Figure 11, the application frame is structured into six panels. First, on the top of the left column, we can find all the relevant terms that the EIRN network uses for indexing existing instances. Under this panel, a section is located that summarizes some statistics referring to the selected terms including: (i) probability of finding the term in the case base (ii) frequency of finding the term in spam e-mails and (iii) probability of finding the term in legitimate messages.
At the top of the right column, we can find all the instances stored in the system memory. Under this panel, an instance statistics panel can be found where all the relevant terms belonging to the selected messages as well as their frequencies are showed.
At the top of the central column, we have placed a plot representing the relevant terms indexed in our EIRN model. The plot panel has been developed with built-in clipboard copy support, save images capability and drag and drop compatibility for use with any image application or word processor. Both selected terms belonging to the left panel or part of a message from the right panel are always highlighted in our graphic representation.
The EIRN viewer module represents each term as a two-dimension point on the plot. The coordinates of each point are computed according to the function selected in the combo boxes placed under the plot. The following measurements are available for each coordinate: (i) the probability of finding the term t in spam e-mails stored in the system memory, p(t | s, K), (ii) the logarithmic form of the previous value, –log2(p(t | s, K)) (iii) the probability of finding the term t in legitimate messages, p(t | l, K), (iv) the logarithmic form of the previous value, –log2(p(t | l, K)), and (v) the probability of finding the term t in the system memory, p(t | K).
The results obtained confirm the idea that instance-based reasoning systems can offer a number of advantages in the spam filtering domain. Spam is a disjoint concept and IBR classification works well in this domain. In addition IBR systems can learn over time simply by updating their memory with new instances of spam or legitimate e-mail. Moreover, it provides seamless learning capabilities without the need for a separate learning process and facilitates extending the learning process over different levels of learning. The code can be freely obtained at http://www.spamhunting.org/.
In this chapter we have presented the CBR paradigm as an appropriate methodology for implementing successful decision support systems together with their adequate application to several domains. The adoption of CBR methodology for constructing decision support systems has several remarkable benefits, allowing us to obtain a more general knowledge of the system and to gain a deeper insight into the logical structure of the problem and its solution. Main properties of these systems include (i) their ability to focus on the problem's essential features, (ii) the possibility of solving problems in domains that are only partially understood, (iii) the competence for providing solutions when no algorithmic method is available and (iv) the potential to interpret and manage open-ended and ill-defined concepts.
The main advantages of case-based reasoning paradigm over other alternative approaches for the implementation of decision support systems are related with the fulfilment of the following characteristics: (i) reduce the knowledge acquisition effort (cases are independent from each other), (ii) require less maintenance effort (partially automatically by adding/deleting cases), (iii) improve problem solving performance through reuse, (iv) makes use of existing data (e.g. in databases), (v) improve over time and adapt to changes in the environment and (vi) present high user acceptance (domain experts and novices understand cases quite easy).
However, the effective utilization of case-based reasoning paradigm also presents inherent limitations and/or drawbacks related to its knowledge representation and life cycle: (i) past cases could be inexistent or difficult to represent, (ii) specific techniques have to be defined from scratch for modifying previous cases or their solutions in order to adapt them to the new situations and (iii) in some scenarios, it could be difficult to maintain case-base efficiency because unused cases need to be forgotten. In such situations, specific solutions have to be defined in order to overcome particular difficulties.
This work has been partially funded by the project InNoCBR (10TIC305014PR) from Xunta de Galicia. M. Reboiro-Jato was supported by a pre-doctoral fellowship from University of Vigo.