Knowledge in Imperfect Data

Data bases collecting a huge amount of information pertaining to real-world processes, for example industrial ones, contain a significant number of data which are imprecise, mutually incoherent, and frequently even contradictory. It is often the case that data bases of this kind often lack important information. All available means and resources may and should be used to eliminate or at least minimize such problems at the stage of data collection. It should be emphasized, however, that the character of industrial data bases, as well as the ways in which such bases are created and the data are collected, preclude the elimination of all errors. It is, therefore, a necessity to find and develop methods for eliminating errors from already-existing data bases or for reducing their influence on the accuracy of analyses or hypotheses proposed with the application of these data bases. There are at least three main reasons for data preparation: (a) the possibility of using the data for modeling, (b) modeling acceleration, and (c) an increase in the accuracy of the model. An additional motivation for data preparation is that it offers a possibility of arriving at a deeper understanding of the process under modeling, including the understanding of the significance of its most important parameters.


A taxonomy of data preparation
Discussing the issue of data preparation the present work uses the terms process, stage, task, and operation. Their mutual relations are diagrammatically represented in Fig. 1 below. The data preparation process encompasses all the data preparation tasks. Two stages may be distinguished within it: the introductory stage and the main stage. The stages of the process involve carrying out tasks, each of which, in turn, involves a range of operations. The data preparation process always starts with the introductory stage. Within each stage (be it the introductory or the main one) different tasks may be performed. In the introductory stage, the performance of the first task (the choice of the first task is dictated by the specific nature of the case under analysis) should always be followed by data cleaning. It is only after data cleaning in the introductory stage that performing the tasks of the main stage may be initiated. In the main stage, the choice of the first task, as well as ordering of the tasks that follow are not predetermined and are dependent on the nature of the case under consideration. As far as the data collected in real-life industrial (production) processes are concerned, four tasks may be differentiated:  data cleaning is used in eliminating any inconsistency or incoherence in the collected data  data integration makes possible integrating data bases coming from various sources into a single two-dimensional table, thanks to which algorithmized tools of data mining can be employed  data transformation includes a number of operations aimed at making possible the building of a model, accelerating its building, and improving its accuracy, and employing, among others, widely known normalization or attribute construction methods  data reduction limits the dimensionality of a data base, that is, the number of variables; performing this task significantly reduces the time necessary for data mining.
Each of the tasks involves one or more operations. An operation is understood here as a single action performed on the data. The same operation may be a part of a few tasks.
In Fig. 1, within each of the four tasks the two stages of the process are differentiated: the introductory stage of data preparation (in the diagram, this represented as the outer circles marked with broken lines) and the main stage of data preparation (represented as inner circles marked with solid lines). These two separate stages, which have been differentiated, have quite different characters and use quite different tools. The first stage -that of the introductory data preparation -is performed just once. Its performance in particular tasks is limited to a single operation. This operation is always followed by the task of data cleaning. The stage of the introductory data cleaning is a non-algorithmized stage. It is not computer-aided and it is based on the knowledge and the experience of the human agent preparing the data.
The second -that is, the main -stage is much more developed. It can be repeated many times at each moment of the modeling and of the analysis of the developed model. The repetition of the data preparation tasks or the change in the tools employed in particular tasks is aimed at increasing the accuracy of the analyses. The order in which the tasks are carried out may change depending upon the nature of the issue under analysis and the form of the collected data. According to the present authors, it seems that it is data cleaning which should always be performed as the first task. However, the remaining tasks may be performed in any order. In some cases different tasks are performed in parallel: for instance the operations performed in the task of data integration may be performed simultaneously with the operations involved in data transformation, without deciding on any relative priority. Depending on the performed operations, data reduction may either precede data integration (attribute selection) or go at the very end of the data preparation process (dimensionality selection). Fig. 1. Tasks carried out in the process of data preparation with the distinction into the introductory and the main stage (Kochanski, 2010) 3. Task of data preparation A task is a separate self-contained part of data preparation which may be and which in practice is performed at each stage of the data preparation process. We can distinguish four tasks (as represented in Fig. 1): data cleaning, data transformation, data integration, and data reduction. Depending upon the stage of the data preparation process, a different number of operations may be performed in a task -at the introductory stage the number of operations is limited. These operations, as well as the operations performed in the main stage will be discussed below, in the sections devoted to particular tasks.
The data preparation process, at the stage of introductory preparation, may start with any task, but after finishing the introductory stage of this task, it is necessary to go through the introductory stage of data cleaning. It is only then when one can perform operations belonging to other tasks, including the operations of the task with which the process of data preparation has started. This procedure follows from the fact that in the further preparation, be it in the introductory or in the main stage, the analyst should have at his disposal possibly the most complete data base and only then make further decisions.

www.intechopen.com
It is only after the introductory stage is completed in all tasks that the main stage can be initiated. This is motivated by the need to avoid any computer-aided operations on raw data.

Data cleaning
Data cleaning is a continuous process, which reappears at every stage of data exploration. However, it is especially important at the introductory data preparation stage, among others, because of how much time these operations take. At this phase, it involves a onetime data correction, which resides in the elimination of all kind of errors resulting from human negligence at the stage of data collection. The introductory data cleaning is a laborious process of consulting the source materials, often in the form of paper records, laboratory logs and forms, measuring equipment outprints, etc., and filling in the missing data by hand. The overall documentation collected at this stage may also be used in the two other operations of this task: in accuracy improvement and in inconsistency removal. As shown in Fig. 2, the data cleaning task may involve three operations, the replacement of the missing or empty values, the accuracy improvement, and the inconsistency removal, which in the main stage employ algorithmized methods:  replacement of missing or empty values employs the methods of calculating the missing or empty value with the use of the remaining values of the attribute under replacement or with the use of all attributes of the data set;  accuracy improvement is based on the algorithmized methods of the replacement of the current value with the newly-calculated one or the removal of this current value;  inconsistency removal most frequently employs special procedures (e.g. control codes) programmed in data collection sheets prior to the data collecting stage.
In what follows, these three operations will be discussed in detail because of their use in the preparation of the data for modeling.
a. The replacement of missing or empty values The case of the absent data may cover two kinds of data: an empty value -when a particular piece of data could not have been recorded, since -for instance -such a value of the www.intechopen.com measured quantity was not taken into consideration at all, and a missing value -when a particular piece of data has not been recorded, since -for instance -it was lost. The methods of missing or empty value replacement may be divided into two groups, which use for the purpose of calculating the new value the subset of the data containing: either  only the values of the attribute under replacement (simple replacement) or  either the specially selected or all the values of the collected set (complex replacement).
The methods of simple replacement include all the methods based on statistical values (statistics), such as, for instance the mean value, the median, or the standard deviation. The new values which are calculated via these methods and which are to replace the empty or the missing values retain the currently defined distribution measures of the set.
The complex replacement aims at replacing the absent piece of data with such a value as should have been filled in originally, had it been properly recorded (had it not been lost). Since these methods are aimed at recreating the proper value, a consequence of their application is that the distribution measures of the replacing data set may be and typically are changed.
In complex replacement the absent value is calculated from the data collected in the subset which contains selected attributes (Fig. 3a) or selected records (Fig. 3b) and which has been created specifically for this purpose. The choice of a data replacement method depends on the properties of the collected data -on finding or not finding correlations between or among attributes. If there is any correlation between, on the one hand, selected attributes and, on the other, the attribute containing the absent value, it is possible to establish a multilinear regression which ties the attributes under analysis (the attribute containing the absent value and the attributes which are correlated with it). In turn, on this basis the absent value may be calculated. An advantage of this method is the possibility of obtaining new limits, that is, a new minimal and maximal value of the attribute under replacement, which is lower or higher than the values registered so far. It is also for this reason that this method is used for the replacement of empty data.
When there is no correlation between the attribute of the value under replacement and the remaining attributes of the set, the absent value is calculated via a comparison with the selected records from the set containing complete data. This works when the absent value is a missing value.
The choice of a data replacement method should take into consideration the proposed data modeling method. Simple replacement may be used when the modeling method is insensitive to the noise in the data. For models which cannot cope with the noise in the data complex data replacement methods should be used.
It is a frequent error commonly made in automatized absent data replacement that no record is made with respect to which pieces of the data are originally collected values and which have been replaced. This piece of information is particularly important when the model quality is being analyzed.
www.intechopen.com X X 1 X 2 X 3 X 4 X 5 X 6 X 7 The operation of accuracy improvement, in the main stage of data preparation, is based on algorithmized methods of calculating a value and either replacing the current value (the value recorded in the database) with the newly-calculated one or completely removing the current value from the base. The mode of operation depends upon classifying the value under analysis as either noisy or an outlier. This is why, firstly, this operation focuses on identifying the outliers within the set of the collected data. In the literature, one can find reference to numerous techniques of identifying pieces of the data which are classified as outliers (Alves & Nascimento, 2002;Ben-Gal, 2005;Cateni et al., 2008;Fan et al., 2006;Mohamed et al., 2007). This is an important issue, since it is only in the case of outliers that their removal from the set may be justified. Such an action is justified when the model is developed, for instance, for the purpose of optimizing an industrial process. If a model is being developed with the purpose of identifying potential hazards and break-downs in a process, the outliers should either be retained within the data set or should constitute a new, separate set, which will serve for establishing a separate pattern. On the other hand, noisy data can, at best, be corrected but never removed -unless we have excessive data at our disposal.
In the literature, there are two parallel classifications of the outlier identification methods. The first employs the division into one-dimension methods and multidimension methods. The second divides the relevant methods into the statistical (or traditional) ones and the ones which employ advanced methods of data exploration. The analysis of one-dimension data employs definitions which recognize as outliers the data which are distant, relative to the assumed metric, from the main data concentration. Frequently, the expanse of the concentration is defined via its mean value and the standard error. In that case, the outlier is a value (the point marked as 1 in Fig. 4a) which is located outside the interval (X m -k* 2SE, X m + k* 2SE), where: SE -standard error, k -coefficient, X m -mean value. The differences between the authors pertain to the value of the k coefficient and the labels for the kinds of data connected with it, for instance the outliers going beyond the boundaries calculated for k = 3,6 (Jimenez- Marquez et al., 2002) or the outliers for k = 1,5 and the extreme data for k = 3 (StatSoft, 2011). Equally popular is the method using a box plot (Laurikkala et al., 2000), which is based on a similar assumption concerning the calculation of the thresholds above and below which a piece of data is treated as an outlier.
For some kind of data, for instance those coming from multiple technological processes and used for the development of industrial applications, the above methods do not work. The data of this kind exhibit a multidimensional correlation. For correlated data a onedimension analysis does not lead to correct results, which is clearly visible from the example in Fig. 4 (prepared from synthetic data). In accordance with the discussed method, Point 1 (marked as a black square) in Fig. 4a would be classified as an outlier. However, it is clearly visible from multidimensional (two-dimensional) data represented in Fig. 4b that it is point 2 (represented as a black circle) which is an outlier. In one-dimension analysis it was treated as an average value located quite close to the mean value.
Commonly used tools for finding one-dimension outliers are statistical methods. In contrast to the case of one-dimension outliers, the methods used for finding multidimensional outliers are both statistical methods and advanced methods of data exploration. Statistical methods most frequently employ the Mahalanobis metric. In the basic Mahalanobis method www.intechopen.com for each vector x i the Mahalanobis distance is calculated according to the following formula (1) (Rousseeuw & Zomeren, 1990): where: T(X) is the arithmetic mean of the data set, C(X) is the covariance matrix.
This method takes into account not only the central point of the data set, but also the shape of the data set in multidimensional space. A case with a large Mahalanobis distance can be identified as a potential outlier. On the basis of a comparison with the chi-square distribution, an outlier can be removed as was suggested in (Filzmoser, 2005). In the literature, further improvements of the Mahalanobis distance method can be found, for example the one called the Robust Mahalanobis Distance (Bartkowiak, 2005). The outlier detection based on the Mahalanobis distance in industrial data, was performed in e.g. (Jimenez-Marquez et al., 2002). As far as data exploration is concerned, in principle all methods are used. The most popular ones are based on artificial neural networks (Alves & Nascimento, 2002), grouping (Moh'd Belal Al-Zgubi, 2009), or visualization (Hasimah et al., 2007), but there are also numerous works which suggest new possibilities of the use of other methods of data mining, for example of the rough set theory (Shaari et al., 2007). An assumption of methodologies employing data exploration methods is that the data departing from the model built with their help should be treated as outliers.
c. Inconsistency removal At the introductory stage inconsistencies are primarily removed by hand, via reference to the source materials and records. At this stage it will also encompass the verification of the attributes of the collected data. It is important to remove all redundancies at this stage. Redundant attributes may appear both in basic databases and in databases created via combining a few smaller data sets. Most often, this is a consequence of using different labels in different data sets to refer to the same attributes. Redundant attributes in merged databases suggest that we have at our disposal a much bigger number of attributes than is in fact the case. An example of a situation of this kind is labeling the variables (columns) referring to metal elements in a container as, respectively, element mass, the number of elements in a container, and the mass of elements in the container. If the relevant piece of information is not repeated exactly, for example, if instead of the attribute the mass of elements in the container what is collected is the mass of metal in the container, then the only problem is increasing the model development time. It is not always the case that this is a significant problem, since the majority of models have a bigger problem with the number of records, than with the number of attributes. However, in the modeling of processes aimed at determining the influence of signal groups, an increase in the number of inputs is accompanied with an avalanche increase in the number of input variable group combinations (Kozlowski, 2009). It should also be remembered that some modeling techniques, especially those based on regression, cannot cope with two collinear attributes (Galmacci, 1996). This pertains also to a small group of matrix-based methods (Pyle, 1999).
At the main stage of data preparation inconsistency removal may be aided with specially designed procedures (e.g. control codes) or tools dedicated to finding such inconsistencies (e.g. when the correlations/interdependencies among parameters are known). www.intechopen.com

Data integration
Because of the tools used in knowledge extraction it is necessary that the data be represented in the form of a flat, two-dimensional table. It is becoming possible to analyze the data recorded in a different form, but the column-row structure -as in a calculation sheet -is the best one (Pyle, 2003). The task of data integration may be both simple and complicated, since it depends on the form in which the data is collected. In the case of synthetic data, as well as in appropriately prepared industrial data collecting systems this is unproblematic. However, the majority of industrial databases develop in an uncoordinated way. Different departments of the same plant develop their own databases for their own purposes. These databases are further developed without taking into consideration other agencies. This results in a situation in which the same attributes are repeated in numerous databases, the same attributes are labeled differently in different databases, the same attributes have different proportions of absent data in different bases, the developed databases have different primary keys (case identifiers), etc. The situation gets even worse when the databases under integration have been developed not within a single plant but in more distant places, for example in competing plants. The introductory stage of data integration focuses on two of them: identification and redundancy removal. Both these operations are closely connected with one another. The first operation -identification -is necessary, since the task of data integration starts with identifying the quantity around which the database sets may be integrated. Also, at the data integration introductory stage the removal of obvious redundancies should be performed. Their appearance may result, for instance, from the specific way in which the industrial production data is recorded. The results of tests conducted in centrally-operated laboratories are used by different factory departments, which store these results independently of one another, in their own databases. The result of data integration performed without introductory analysis may be that the same quantities may be listed in the end-product database a number of times and, moreover, they may be listed there under different labels. A person who knows the data under analysis will identify such redundant records without any problem.

www.intechopen.com
At the main stage, unlike in other data preparation tasks, the integration operations are only performed with the aid of algorithmized tools and are still based on the knowledge about the process under analysis. At this stage, the operations represented in Fig. 5 are performed with the following aims:  identification -serves the purpose of identifying the attributes which could not have been identified at the introductory stage, for instance, in those cases in which the label does not explain anything,  redundancy removal -just like the attribute selection in the data reduction task, which is characterized below, uses algorithmized methods of comparing the attributes which have been identified only in the main stage with the aim of removing the redundant data,  unification -this is performed so that the data collected in different sets have the same form, for instance the same units.
The operation of identification uses methods which make it possible to identify the correlation between the attribute under analysis and other identified process attributes. In this way it is possible to establish, for instance, what a particular attribute pertains to, where the sensor making the recorded measurement could have been installed, etc.
The second performance of the redundancy removal operation pertains only to attributes which have been identified only in the second, main stage of data integration. As far as this second performance is concerned, redundancy removal may be identified with the attribute selection operation from the data reduction task and it uses the same methods as the attribute selection operation.
Unification is very close to the data transformation task. In most cases it amounts to transforming all the data in such a way that they are recorded in a unified form, via converting part of the data using different scales into the same units, for instance, converting inches into millimeters, meters into millimeters, etc. A separate issue is the unification of the data collected in different sets and originating from measurements employing different methods. In such cases one should employ the algorithms for converting one attribute form into another. These may include both analytical formulas (exact conversion) and approximate empirical formulas. When this is not possible, unification may be achieved via one of the data transformation operations, that is, via normalization (Liang & Kasabov, 2003).
The main principle that the person performing the data integration task should stick to is that the integrated database should retain all the information collected in the data sets which have been integrated. Seemingly, the suggestion, expressed in (Witten & Frank, 2005), that data integration should be accompanied by aggregation, is misguided. Aggregation may accompany integration, but only when this is absolutely necessary.

Data transformation
Data transformation introductory stage usually amounts to a single operation dictated by data integration, such as aggregation, generalization, normalization or attribute (feature) construction. Performing aggregation may be dictated by the fact that the data collected in multiple places during the production process may represent results from different periods: www.intechopen.com a different hour, shift, day, week or month. Finding a common higher-order interval of time is necessary for further analyses. The same reasons make necessary the performance of generalization or normalization. Generalization may be dictated, for instance, by the integration of databases in which the same attribute is once recorded in a nominal scale (e.g. ultimate tensile strength -ductile iron grade 500/07) and once in a ratio scale (ultimate tensile strength -UTS = 572 MPa). We can talk about normalization in the introductory stage only when we understand this term in its widest sense, that is, as a conversion of quantities from one range into those of another range. In that case, such a transformation as measurement unit conversion may be understood as normalization at the introductory stage.
Data transformation encompasses all the issues connected with transforming the data into a form which makes data exploration possible. At the introductory stage it involves six operations represented in Fig. 6:  smoothing -this resides in transforming the data in such a way that local data deviations having the character of noise are eliminated. Smoothing encompasses, among others, the techniques such as, for example, binning, clustering, or regression;  aggregation -this resides in summing up the data, most frequently in the function of time, for example from a longer time period encompassing not just a single shift but a whole month;  generalization -this resides in converting the collected data containing the measurements of the registered process quantity into higher-order quantities, for instance via their discretization;  normalization -this resides in the rescaling (adjustment) of the data to a specified, narrow range, for instance, from 0.0 to 1.0;  attribute (feature) construction -this resides in mathematical transformations of attributes (features) with the aim of obtaining a new attribute (feature), which will replace in modeling its constituent attributes;  accommodation -this resides in transforming the data into a format used by a specific algorithm or a tool, for example into the ARFF format (Witten & Frank, 2005).

www.intechopen.com
Smoothing is an operation which follows from the assumption that the collected data are noisy with random errors or with divergence of measured attributes. In the case of industrial (also laboratory) data such a situation is a commonplace, unlike in the case of business data (the dollar exchange rate cannot possibly be noisy). Smoothing techniques are aimed at removing the noise, without at the same time interfering with the essence of the measured attributes. Smoothing methods may be of two kinds: (a) those which focus on comparing the quantity under analysis with its immediate neighborhood and (b) those which analyze the totality of the collected data.
Group (a) includes methods which analyze a specified lag or window. In the former case, the data are analyzed in their original order, while in the latter they are changed in order. These methods are based on a comparison with the neighboring values and use, for instance, the mean, the weighted moving mean, or the median. Group (b) includes methods employing regression or data clustering. The method of loess -local regression could be classified as belonging to either group.
In the first group of methods (a) a frequent solution is to combine a few of techniques into a single procedure. This is supposed to bring about a situation in which further smoothing does not introduce changes into the collected data. A strategy of this kind is resmoothing, in which two approaches may be adopted: either 1) smoothing is continued up to the moment when after a subsequent smoothing the curve does not change or 2) a specified number of smoothing cycles is performed, but this involves changing the window size. Two such procedures, 4253H and 3R2H were discussed in (Pyle, 1999). A method which should also be included in the first group (a) is the PVM (peak -valley -mean) method.
Binning, which belongs to group (a) is a method of smoothing (also data cleaning) residing in the creation of bins and the ascription of data to these bins. The data collected in the respective bins are compared to one another within each bin and unified via one of a few methods, for example, via the replacement with the mean value, the replacement with the median, or the replacement with the boundary value. A significant parameter of smoothing via binning is the selection of the size of bins to be transformed. Since comparing is performed only within the closest data collected in a single bin, binning is a kind of local smoothing of the data.
As was already mentioned above, the second important group of smoothing methods (b) are the techniques employing all the available data. Among others, this includes the methods of regression and of data clustering. Fig.7 represents the smoothing of the laboratory data attribute (ultimate compressive strength) of greensand in the function of moisture content. The data may be smoothed via adjusting the function to the measured data. Regression means the adjustment of the best curve to the distribution of two attributes, so that one of these attributes could serve for the prediction of the other attribute. In multilinear regression the function is adjusted to the data collected in the space which is more than two-dimensional.
An example of smoothing with the use of data clustering techniques was discussed in (Kochanski, 2006). In the data which is grouped the outliers may easily be detected and either smoothed or removed from the set. This procedure makes possible defining the range of variability of the data under analysis. In the case of industrial data the issue of smoothing should be treated with an appropriate care. The removal or smoothing of an outlier may determine the model's capability for predicting hazards, break-downs or product defects. The industrial data often contain quantities which result from the process parameter synergy. The quantity which may undergo smoothing is, for instance, the daily or monthly furnace charge for the prediction of the trend in its consumption, but not the form moisture content for the prediction of the appearance of porosity. In manufacture processes, when different operations are performed in different places and records are collected in separate forms, a frequent situation is the impossibility of integrating the separate cases into a single database. The absence of a single key makes data integration impossible. (Research report, 2005) discusses data integration with the use of aggregation. Data aggregation (represented as parentheses) and data location on the time axis, which makes possible further integration, are diagrammed in Fig. 8.
Generalization in data preparation is an operation which is aimed at reducing the number of the values that an individual attribute may take. In particular, it converts a continuous quantity into an attribute which takes the value of one of the specified ranges. The replacement of a continuous quantity with a smaller number of labeled ranges makes easier grasping the nature of the correlation found in the course of data exploration. Generalization applies to two terms: discretization and notional hierarchy.
Discretization is a method which makes possible splitting the whole range of attribute variation into a specified number of subregions. Discretization methods in the two main classifications are characterized in terms of the direction in which they are performed or in terms of whether or not in feature division they employ information contained in the features other than the one under discretization. In the first case, we talk about bottom-up or top-down methods. In the second case, we distinguish between supervised and unsupervised methods.
The notional hierarchy for a single feature makes possible reducing the number of the data via grouping and the replacement of a numerical quantity, for instance the percentage of carbon in the chemical composition of an alloy, with a higher-order notion, that is, a lowcarbonic and a high-carbonic alloy. This leads to a partial loss of information but, in turn, thanks to the application of this method it may be easier to provide an interpretation and in the end this interpratation may become more comprehensible.
There are many commonly used methods of data normalization (Pyle, 1999;Larose, 2008). The effect of performing the operation of normalization may be a change in the variable range and/or a change in the variable distribution. Some data mining tools, for example artificial neural networks, require normalized quantities. Ready-made commercial codes have special modules which normalize the entered data. When we perform data analysis, we should take into consideration the method which has been employed in normalization. This method may have a significant influence on the developed model and the accuracy of its predictions (Cannataro, 2008;Ginoris et al., 2007;.
The following are the main reasons for performing normalization:  a transformation of the ranges of all attributes into a single range makes possible the elimination of the influence of the feature magnitude (their order 10, 100, 1000) on the developed model. In this way we can avoid revaluing features with high values,  a transformation of the data into a dimensionless form and the consequent achievement of commensurability of a few features makes possible calculating the case distance, for instance with the use of the Euclid metric,  a nonlinear transformation makes possible relieving the frequency congestion and uniformly distributing the relevant cases in the range of features,  a nonlinear transformation makes it possible to take into consideration the outliers,  data transformation makes possible a comparison of the results from different tests (Liang & Kasabov, 2003).
The following may be listed as the most popular normalization methods:  min-max normalization -it resides, in the most general form, in the linear transformation of the variable range in such a way that the current minimum takes the value 0 (zero), while the current maximum takes the value 1 (one). This kind of normalization comes in many variants, which differ from one another, among others, with respect to the new ranges, for instance (-1,1); (-0,5, 0,5);  standarization -it resides in such a transformation in which the new feature has the expectation mean value which equals zero and the variance which equals one. The most frequently used standardization is the Z-score standardization, which is calculated according to the following formula (2): ( 2) where: x -feature under standarization, μ -mean value, σ -standard population deviation;  decimal normalization, in which the decimal separator is moved to the place in which the following equation (3) is satisfied: where: j is the smallest integer value, for which Max (|X 1 |) < 1.
A weakness of the normalization methods discussed above is that they cannot cope with extreme data or with the outliers retained in the set under analysis. A solution to this problem is a linear -nonlinear normalization, for example the soft-max normalization or a nonlinear normalization, for example the logarithmic one (Bonebakker, 2007).
As was mentioned above, the nonlinear normalization makes it possible to include into the modeling data set extreme data or even outliers, as well as to change the distribution of the variable under transformation. The logarithmic transformation equations given in (4a,b) and the tangent transformation equation in (5) were employed in (Olichwier, 2011): where: w, c -coefficients where: k -coefficient, μ -mean value, σ -standard population deviation.
The use of the logarithmic transformation made it possible to locally spread the distributions of the selected parameters. The original skew parameter distribution after the www.intechopen.com transformation approached the normal distribution. One effect of the transformation in question was a visible increase in the prediction quality of the resulting model.
The data in the databases which are under preparation for mining may be collected in many places and recorded in separate sets. As a result of integrating different data sources it is possible to perform their transformation, for instance in the domain of a single manufacture process. The data represented by two or more features may be replaced with a different attribute. A classical example of this is the replacement of the two dates defining the beginning and the end of an operation, which were originally recorded in two different places, with a single value defining the duration of the operation, for instance the day and the hour of molding, as well as the day and the hour of the mould assembly are replaced with the mould drying time. A deep knowledge of the data makes possible using mathematical transformations which are more complex than a mere difference. As a consequence, new, semantically justified feature combinations are created. An example of this was discussed in (Kochanski, 2000).
The knowledge of the properties of the algorithm employed in data mining is a different kind of reason behind the creation of new attributes. In the case of algorithms capable only of the division which is parallel to the data space axis (for instance in the case of the majority of decision trees) it is possible to replace the attributes pertaining to features characterized with linear dependence with a new attribute which is the ratio of the former attributes (Witten & Frank, 2005).
It is possible to perform any other transformations of attributes which do not have any mathematical justification but which follow from the general world knowledge: the names of the days of the week may be replaced with the dates, the names of the elements may be replaced with their atomic numbers, etc.
The last encountered way of creating attributes is the replacement of two attributes with a new attribute which is their product. A new attribute created in this way may have no counterpart in reality.
The common use of accommodation suggests that it should be considered as one of the issues of data transformation. What is used in data mining are databases which have been created without any consideration for their future specific application, that is, the application with the use of selected tools or algorithms. In effect, it is often the case that the collected data format does not fit the requirements of the tool used for mining, especially in a commercial code. The popular ARFF (attribute -relation file format) makes use only of the data recorded in a nominal or interval scale. The knowledge gathered in the data recorded in a ratio scale will be lost as a result of being recorded in the ARFF format.
The calculation spreadsheet Excel, which is popular and which is most frequently used for gathering the industrial data, may also be the cause of data distortion. Depending upon the format of the cell in which the registered quantity is recorded, the conversion of files into the txt format (which is required by a part of data mining programs) may result in the loss of modeling quality. This is a consequence of the fact that a defined cell contains the whole number which was recorded in it but what is saved in file conversion is only that part of this number which is displayed (Fig.9). Fig. 9. Data conversion -the conversion of a file from the xls to the txt format brings about a partial loss of information (the authors' own work)

Data reduction
At the introductory stage of data preparation this task is limited to a single operationattribute selection -and it is directly connected with the expert opinion and experience pertaining to the data under analysis. It is only as a result of a specialist analysis (for example, of brainstorming) that one can remove from the set of collected data those attributes which without question do not have any influence on the features under modeling.
The aim of the main stage data reduction techniques is a significant decrease in the data representation, which at the same time preserves the features of the basic data set. Reduced data mining makes then possible obtaining models and analyses which are identical (or nearly identical) to the analyses performed for the basic data.
Data reduction includes five operations, which are represented in Fig. 10:  attribute selection -this resides in reducing the data set by eliminating from it the attributes which are redundant or have little significance for the phenomenon under modeling;  dimension reduction -this resides in transforming the data with the aim of arriving at a reduced representation of the basic data;  numerosity reduction -this is aimed at reducing the data set via eliminating the recurring or very similar cases;  discretization -this resides in transforming a continuous variable into a limited and specified number of ranges;  aggregation -this resides in summing up the data, most frequently in the function of time, for instance from a longer time period encompassing not just the period of a single shift but e.g. the period of a whole month.
The collected industrial database set may contain tens or even hundreds of attributes. Many of these attributes may be insignificant for the current analysis. It is frequently the case that integrated industrial databases contain redundant records, which is a consequence of the specific way in which the data is collected and stored in a single factory in multiple places.
Keeping insignificant records in a database which is to be used for modeling may lead not only to -often a significant -teaching time increase but also to developing a poor quality model. Of course, at the introductory stage, a specialist in a given area may identify the group of attributes which are relevant for further modeling, but this may be very timeconsuming, especially given that the nature of the phenomenon under analysis is not fully known. Removing the significant attributes or keeping the insignificant ones may make it very difficult, or even impossible, to develop an accurate model for the collected data. Fig. 10. The operations performed in the data reduction task (with the introductory stage represented as the squared area and the main stage represented as the white area) At the main stage, attribute selection is performed without any essential analysis of the data under modeling, but only with the help of selection algorithms. This is supposed to lead to defining the intrinsic (or effective) dimensionality of the collected data set (Fukunaga, 1990) The algorithms in question are classified as either filters or wrappers, depending on whether they are employed as the operation preceding the modeling proper (filters) or as the operation which is performed alternately with the modeling (wrappers).
Attribute selection is performed for four reasons: (a) to reduce the dimensionality of the attribute space, (b) to accelerate the learning algorithms, (c) to increase the classification quality, and (d) to facilitate the analysis of the obtained modeling results (Liu et al., 2003). According to the same authors, it may -in particular circumstances -increase the grouping accuracy.
Dimensionality reduction is the process of creating new attributes via a transformation of the current ones. The result of dimensionality reduction is that the current inventory of attributes X{x 1 ,x 2 ,…,x n } is replaced with a new one Y{y 1 ,y 2 ,…,y m } in accordance with the following equation (6): where F() is a mapping function and m < n. In the specific case Y 1 = a 1 x 1 + a 2 x 2 , where a 1 and a 2 are coefficients (Liu & Motoda, 1998). The same approach, which takes into consideration in its definition the internal dimensionality, may be found in (Van der Maaten et al., 2009). The collected data are located on or in the neighborhood of a multidimensional curve. This curve is characterized with m attributes of the internal dimensionality, but is placed in the ndimensional space. The reduction frees the data from the excessive dimensionality, which is important since it is sometimes the case that m<<n.
The operation of dimensionality reduction may be performed either by itself or after attribute selection. The latter takes place when the attribute selection operation has still left a considerable number of attributes used for modeling. The difference between dimensionality reduction and attribute selection resides in the fact that dimensionality reduction may lead to a partial loss of the information contained in the data. However, this often leads to the increase in the quality of the obtained model. A disadvantage of dimensionality reduction is the fact that it makes more difficult the interpretation of the obtained results. This results from the impossibility of naming the new (created) attributes.
In some publications attribute selection and dimensionality reduction are combined into a single task of data dimensionality reduction 1 (Chizi & Maimon, 2005;Fu & Wang, 2003;Villalba & Cunningham, 2007). However, because of the differences in the methods, means, and aims, the two operations should be considered as distinct.
Discretization, in its broadest sense, transforms the data of one kind into the data of another kind. In the literature, this is discussed as the replacement of the quantitative data with the qualitative data or of the continuous data with the discrete data. The latter approach is the approach pertaining to the most frequent cases of discretization. It should be remembered, however, that such an approach narrows down the understanding of the notion of discretization. This is the case since both the continuous and the discrete data are numerical data. The pouring temperature, the charge weight, the percent element content, etc. are continuous data, while the number of melts or the number of the produced casts are discrete data. However, all the above-mentioned examples of quantities are numbers. Discretization makes possible the replacement of a numerical quantity, for instance, the ultimate tensile strength in MPa with the strength characterized verbally as high, medium, or low. In common practice ten methods of discretization method classification are used. These have been put forward and characterized in (Liu et al., 2002;Yang et al., 2005). A number of published works focus on a comparison of discretization methods which takes into account the influence of the selected method on different aspects of further modeling, for instance, decision trees. These comparisons most frequently analyze three parameters upon which data preparation -discretization exerts influence:  the time needed for developing the model,  the accuracy of the developed model,  the comprehensibility of the model.
As has been widely demonstrated, the number of cases recorded in the database is decisive with respect to the modeling time. However, no matter what the size of the data set is, this time is negligibly short in comparison with the time devoted to data preparation. Because of that, as well as because of a small size, in comparison, for instance, with the business data sets, numerosity reduction is not usually performed in industrial data sets.
The operation of aggregation was first discussed in connection with the discussion of the data transformation task. The difference between these two operations does not reside in the method employed, since the methods are the same, but in the aims with which aggregation is performed. Aggregation performed as a data reduction operation may be treated, in its essence, as the creation of a new attribute, which is the sum of other attributes, a consequence of this being a reduction in the number of events, that is, numerosity reduction.

The application of the selected operations of the industrial data preparation methodology
For the last twenty years, many articles have been published which discuss the results of research on ductile cast iron ADI. These works discuss the results of research conducted with the aim of investigating the influence of the parameters of the casting process, as well as of the heat treatment of the ductile iron casts on their various properties. These properties include, on the one hand, the widely investigated ultimate tensile strength, elongation, and hardness, and, on the other, also properties which are less widely discussed, such as graphite size, impact, or austenite fraction. The results discussed in the articles contain large numbers of data from laboratory tests and from industrial studies. The data set collected for the present work contains 1468 cases coming both from the journal-published data and from the authors' own research. They are characterized via 27 inputs, such as: the chemical composition (characterized with reference to 14 elements), the structure as cast (characterized in terms of 7 parameters), the features as cast (characterized in terms of 2 parameters), the heat treatment parameters (characterized in terms of 4 parameters), as well as via 11 outputs: the cast structure after heat treatment, characterized in terms of the retained austenite fraction and the cast features (characterized in terms of 10 parameters). 13 inputs were selected from the set, 9 of them characterizing the melt chemical composition and 4 characterizing the heat treatment parameters. Also, 2 outputs were selected -the mechanical properties after treatment -the ultimate tensile strength and the elongation. The set obtained in this way, which contained 922 cases, was prepared as far as data cleaning is concerned, in accordance with the methodology discussed above.
Prior to preparation, the set contained only 34,8 % of completely full records, containing all the inputs and outputs. The set contained a whole range of cases in which outliers were suspected.
For the whole population of the data set under preparation outliers and high leverage points were defined. This made possible defining influentials, that is, the points exhibiting a high value for Cook's distance. In Fig 11, which represents the elongation distribution in the function of the ultimate tensile strength, selected cases were marked (a single color represents a single data source). The location of these cases in the diagram would not make it possible to unequivocally identify them as outliers. However, when this is combined with an analysis of the cases with a similar chemical composition and heat treatment parameters, the relevant cases may be identified as such. With the use of the methodology discussed above, the database was filled in, which resulted in a significant decrease in the percentage of the absent data for the particular inputs and outputs. The degree of the absent data was between 0,5 and 28,1% for inputs and 26,9% for the UTS and 30,6% for the elongation. The result of filling in the missing data was that the degree of the absent data for inputs was reduced to the value from 0,5 to 12,8%, and for strength and elongation to, respectively, 1,3% and 0,8%. Fig. 13 represents the proportion of the number of records in the particular output ranges "before" and "after" this replacement operation. Importantly, we can observe a significant increase in the variability range for both dependent variables, that is, for elongation from the level of 0÷15% to the level of 0÷27,5%, and for strength from the level of 585÷1725MPa to the level of 406÷1725MPa, despite the fact that the percentage of the records in the respective ranges remained at a similar level for the data both "before" and "after" the filling operation. Similar changes occurred for the majority of independent variables -inputs. In this way, via an increase in the learning data set, the reliability of the model was also increased. Also, the range of the practical applications of the taught ANN model was increased, via an increase in the variability range of the parameters under analysis.
The data which was prepared in the way discussed above were then used as the learning set for an artificial neural network (ANN). Basing on the earlier research [4], two data sets were created which characterized the influence of the melt chemical composition and the heat treatment parameters (Tab. 1) of ductile cast iron ADI alloys on the listed mechanical properties. The study employed a network of the MLP type, with one hidden layer and with the number of neurons hidden in that layer equaling the number of inputs. For each of the two cases, 10 learning sessions were run and the learning selected for further analysis was the one with the smallest mean square error.
A qualitative analysis of the taught model demonstrated that the prediction quality obtained for the almost twice as numerous learning data set obtained after the absent data replacement was comparable, which is witnessed by a comparable quantitative proportion of errors obtained on the learning data set for both cases (Fig. 14). Fig. 12. The input and output correlation matrix. The blue color represents the positive correlation while the red color represents the negative one; the deformation size correlation, of the circle (ellipsoidality) and the color saturation indicate the correlation strength (Murdoch & Chow, 1996)  A further confirmation of this comes from a comparative analysis of the real results and the ANN answers which was based on correlation graphs and a modified version of the correlation graph, taking into account the variable distribution density ( Fig. 15 and 16). In all cases under analysis we can observe that the ANN model shows a tendency to overestimate the minima and to underestimate the maxima. This weakness of ANNs was discussed in (Kozlowski, 2009). Fig. 15. The distribution of the real results and the predictions of the elongation variable for a) the data set without replacement; b) the data set after replacement The analyzed cases suggest that an ANN model taught on a data set after absent data replacement exhibits similar and at the same high values of the model prediction quality, that is, of the mean square error, as well as of the coefficient of determination R 2 . In the case of predicting A 5 , the more accurate model was the ANN model based on the data set after replacement (for R m the opposite was the case). However, the most important observable advantage following from data replacement and from the modeling with the use of the data set after replacement is the fact that the ANN model increased its prediction range with respect to the extreme output values (in spite of a few deviations and errors in their predictions). This is extremely important from the perspective of the practical model application, since it is the extreme values which are frequently the most desired (for instance, obtaining the maximum value R m with, simultaneously, the highest possible A 5 ), and, unfortunately, the sources -for objective reasons -usually do not spell out the complete data, for which these values were obtained. In spite of the small number of records characterizing the extreme outputs, an ANN model was successfully developed which can make predictions in the whole possible range of dependent variables. This may suggest that the absent data was replaced with appropriate values, reflecting the obtaining general tendencies and correlations. It should also be noted that the biggest prediction errors occur for the low values of the parameters under analysis (e.g.: for A 5 within the range 0÷1%), which may be directly connected with the inaccuracy and the noise in the measurement (e.g.: with a premature break of a tensile specimen resulting from discontinuity or nonmetallic inclusions).
The application of the proposed methodology makes possible a successful inclusion into data sets of the pieces of information coming from different sources, including also uncertain data.
An increase in the number of cases in the data set used for modeling results in the model accuracy increase, at the same time significantly widening the practical application range of the taught ANN model, via a significant increase, sometimes even doubling, of the variability range of the parameters under analysis.
Multiple works suggested that ANN models should not be used for predicting results which are beyond the learning data range. The methodology proposed above makes possible the absent data replacement and therefore, the increase of the database size. This, in turn, makes possible developing models with a significantly wider applicability range.

Conclusion
The proposed methodology makes possible the full use of imperfect data coming from various sources. Appropriate preparation of the data for modeling via, for instance, absent data replacement makes it possible to widen, directly and indirectly, the parameter variability range and to increase the set size. Models developed on such sets are high-quality models and their prediction accuracy can be more satisfactory.
The consequent wider applicability range of the model and its stronger reliability, in combination with its higher accuracy, open a way to a deeper and wider analysis of the phenomenon or process under analysis.