The results after training of the proposed network.

## 1. Introduction

This chapter presents the classification of benign and malignant breast tumor based on Fine Needle Aspiration Cytology (FNAC) and Feed forward Neural Network (FFNN) trained with Resilient Back Propagation algorithm (RBP). Five hundred and sixty nine sets of cell nuclei characteristics obtained by applying image analysis techniques to microscopic slides of FNAC samples of breast biopsy have been used in this study. These data were obtained from the University of Wisconsin Hospitals, Madison. The dataset consist of thirty features which represent the input layer to the FFNN. The FFNN will classify the input features into benign and malignant. The sensitivity, specificity and accuracy were found to be Equation 97.5%, 100% and 98.73% respectively. It can be concluded that RPB network gives fast and accurate classification and it works as promising tool for classification of breast cell nuclei.

Neural Networks (NN) derive their power due to their massively parallel structure, and an ability to learn from experience. They can be used for fairly accurate classification of input data into categories, provided they are previously trained to do so. The accuracy of the classification depends on the efficiency of training. The knowledge gained by the learning experience is stored in the form of connection weights, which are used to make decisions on fresh input ( Al-Timemy et al., 2008 ).

One computer technique under investigation is the Artificial Neural Network (ANN) (Chester, 1993). Neural networks are tools for multivariate analysis that can be used to estimate disease risk. They are able to model complex nonlinear systems with significant variable interactions.

With one million new cases in the world each year, breast cancer is the commonest malignancy in women and comprises 18% of all female cancers. In the United Kingdom, where the age standardized incidence and mortality is the highest in the world, the incidence among women aged 50 approaches two per 1000 women per year, and the disease is the single commonest cause of death among women aged 4050, accounting for about a fifth of all deaths in this age group. There are more than 14000 deaths each year, and the incidence is increasing particularly among women aged 5064, probably because of breast screening in this age group.

Of every 1000 women aged 50, two will recently have had breast cancer diagnosed and about 15 will have had a diagnosis made before the age of 50, giving a prevalence of breast cancer of nearly 2% (McPherson et al., 2000).

Conventional methods of monitoring and diagnosing the diseases rely on detecting the presence of particular features by a human observer. Due to large number of patients in intensive care units and the need for continuous observation of such conditions, several techniques for automated diagnostic systems have been developed in recent years to attempt to solve this problem. Such techniques work by transforming the mostly qualitative diagnostic criteria into a more objective quantitative feature classification problem (Ubeyli, 2007; Kordylewski et al., 2001; Kwak, & Choi, 2002; Ubeyli & Guler, 2005).

Breast cancer may be detected via a cautious study of clinical history, physical examination, and imaging with either mammography or ultrasound. However, definitive diagnosis of a breast mass can only be established through fine-needle aspiration (FNA) biopsy, core needle biopsy, or excisional biopsy (Chester, 1993). Among these methods, FNA is the easiest and fastest method of obtaining a breast biopsy, and is effective for women who have fluid-filled cysts. FNA uses a needle smaller than those used for blood tests to remove fluid, cells, and small fragments of tissue for examination under a microscope (Tingting & Nandi, 2007).

Research works on the Wisconsin diagnosis breast cancer (WDBC) data grew out of the desire of Dr. Wolberg to diagnose breast masses accurately based solely on FNA (Street et al. 1993; Wolberg et al., 1993; Wolberg et al., 1994). They applied image processing techniques to derive the WDBC dataset directly from digital scans of FNA slides (Wolberg et al., 1995). Then they employed machine learning techniques to differentiate benign from malignant samples (Mangasarian, 1995), which could be the earliest study of machine learning application to breast cancer detection. Several attempts have been proposed for diagnosis of the breast cancer that includes FFNN (Setiono, 2002), Radial Basis Function (RBF) network (Subashini et al., 2008), Self organization maps (Tingting & Nandi, 2007), fuzzy classifiers (Schaefera et al., 2008; Reyes et al., 1999; Mousa et al., 2005; Nauck, & Kruse, 1999; Abonyi & Szeifert, 2003), Linear Vector Quantization (LVQ) (Goodman et al., 2002), and Support Vector Mechanics (SVM) (Subashini et al., 2008; Akay, 2009; Polat & Gunes, 2007).

The purpose of this study is to develop a RBP network which will classify the breast biopsy samples into malignant and benign cysts based on the input features of cell nuclei characteristics. This network will act to help in the classification purposes of breast cancer.

## 2. The Classification Problem

The classification problem addressed in this study is the detection of malignant breast tumors from a set of benign and malignant samples, called the WDBC dataset, which was obtained from the University of Wisconsin Hospitals, Madison, available at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29.

Features in this dataset were computed from digitized FNA samples (Wolberg et al., 1993; Wolberg et al., 1994; Mangasarian, 1995), as follows:

After the FNA sample was taken from a breast mass, the material was mounted on a microscope slide and stained to highlight the cellular nuclei. A portion of well differentiated cells was scanned using a digital camera. The image analysis software system Xcyt was used to isolate individual nuclei. An approximate boundary of each nucleus was provided as input and taken to convergence to the exact nuclear boundary, using a semi-automatic segmentation procedure called ‘‘snakes’’. Beginning with a user defined approximate boundary as an initialization, the snake locates the actual boundary of the cell nucleus. In order to evaluate the size, shape and texture of each cell nuclei, the following ten characteristics were derived:

(1) Radius is computed by averaging the length of radial line segments, which are lines from the center of mass of the boundary to each of the boundary points.

(2) Perimeter is measured as the sum of the distances between consecutive boundary points.

(3) Area is measured by counting the number of pixels on the interior of the boundary and adding one-half of the pixels on the perimeter, to correct for the error caused by digitization.

(4) Compactness *(COM)* combines the perimeter and area to give a measure of the compactness of the cell, calculated as

This dimensionless number is minimized for a circle and increases with the irregularity of the boundary.

(5) Smoothness *(SM)* is quantified by measuring the difference between the length of each radial line and the mean length of the two radial lines surrounding it. If this number is small relative to the distance between consecutive boundary points, then the contour is smooth in that region. To avoid the numerical instability associated with small divisors, the following equation is used to calculate the smoothness:

Where

(6) Concavity is captured by measuring the size of any indentations in the boundary of the cell nucleus.

(7) Concave point is similar to concavity, but counts only the number of boundary points lying on the concave regions of the boundary, rather than the magnitude of such concavities.

(8) Symmetry is measured by finding the relative difference in length between pairs of line segments perpendicular to the major axis of the contour of the cell nucleus. The major axis is determined by finding the longest chord, which passes from a boundary point through the center of the nucleus. The segment pairs are then drawn at regular intervals. To avoid numerically unstable results due to extremely small segments, the sums are again divided, rather than summing the quotients,

where

(9) Fractal dimension is approximated using the ‘‘coastline approximation’’ described by Mandelbrot (Mandelbrot, 1997). The perimeter of the nucleus is measured using increasingly larger ‘‘rulers’’. As the ruler size increases, the precision of the measurement decreases, and the observed perimeter decreases. Plotting these values on a log–log scale and measuring the downward slope gives the negative of an approximation to the fractal dimension.

(10) Texture is measured by finding the variance of the gray-scale intensities in the component pixels.

The mean value, standard error, and the extreme (largest or ‘‘worst’’) value of each characteristic were computed for each image, which resulted in 30 features of 569 images, yielding a database of 569 X 30 samples representing 357 benign and 212 malignant cases.

## 3. Feed-Forward Neural Networks

A neural network is a parallel distributed information processing structure consisting of processing elements (neurons) interconnected via unidirectional signal channels called connections. Each processing element has a single output connection that branches into as many collateral connections as desired (Barbălată & Leuştean, 2004).

Neural networks develop information processing capabilities by learning for examples. Learning techniques can be roughly divided into two categories (Haykin, 1994); supervised learning and unsupervised learning.

Supervised learning requires a set of examples for which the desired network response is known. The learning process consists then in adapting the network in a way that it will produce the correct response for the set of examples. The resulting network should then be able to generalize (give a good response) when presented with cases not found in the set of examples.

In unsupervised learning the neural network is autonomous; it processes the data it is presented with, finds out about some of the properties of the data set and learns to reflect these properties in its output. What exactly these properties are, that network can learn to recognize, depends on the particular network model and learning method.

One of the most popular neural network paradigms is the feed-forward neural network. In a feed-forward neural network, the neurons are usually arranged in layers (Rumelhart et al., 1986). A feed-forward neural net is denoted as,

Where:

By convention, the input layer does not count, since the input units are not processing units, they simply pass on the input vector x. Units from the hidden layers and output layer are processing units. Figure 1 gives a typical fully connected 2-layer feed-forward network with a 3X4X3 structure.

Each processing unit has an activation function that is commonly chosen to be the sigmoid function:

The net input to a processing unit j is given by:

Where

The activation value (output) of unit j is given by:

The objective of different supervised learning algorithms is the iterative optimization of a so called error function representing a measure of the performance of the network. This error function is defined as the mean square sum of differences between the values of the output units of the network and the desired target values, calculated for the whole pattern set. The error for a pattern p is given by

Where*p*. The total error is,

Where *P* is the number of the training patterns.

During the training process a set of pattern examples is used, each example consisting of a pair with the input and corresponding target output. The patterns are presented to the network sequentially, in an iterative manner, the appropriate weight corrections being performed during the process to adapt the network to the desired behavior. This iteration continues until the connection weight values allow the network to perform the required mapping. Each presentation of the whole pattern set is named an epoch.

One of the most popular supervised learning algorithms for feed-forward neural networks is Backpropagation (Rumelhart et al., 1986). In this algorithm the minimization of the error function is carried out using a gradient-descent technique. The necessary corrections to the weights of the network for each moment t are obtained by calculating the partial derivative of the network error function in relation to each weight

Where

A momentum may be used with the idea of incorporating in the present weight update some influence of the past iteration. The weight update rule becomes:

Where

## 4. The RBP Algorithm

The algorithm RBP introduced by M. Riedmiller in 1993 is a local adaptive learning scheme, performing supervised batch learning in feed-forward neural networks. The basic principle of RBP is to eliminate the harmful influence of the size of the partial derivative on the weight step. As a consequence, only the sign of the derivative is considered to indicate the direction of the weight update. To achieve this, we introduce for each weight

It is introduced a second learning rule, which determines the evolution of the update-value

Where

In words, the adaptation rule works as follows. Every time the partial derivative of the corresponding weight

Once the update-value for each weight is adapted, the weight-update itself follows a very simple rule: if the derivative is positive (increasing error), the weight is decreased by its update-value, if the derivative is negative, the update-value is added:

However, there is one exception. If the partial derivative changes sign that is the previous step was too large and the minimum was missed, the previous weight-update is reverted:

Due to that ‘backtracking’ weight-step, the derivative is supposed to change its sign once again in the following step. In order to avoid a double punishment of the update-value, there should be no adaptation of the update-value in the succeeding step. In practice this can be done by setting in the

The partial derivative of the total error is given by

Hence, the partial derivatives of the errors must be accumulated for all P training patterns. This means that the weights are updated only after the presentation of all training patterns.

## 5. The Proposed FFNN Design

In the present work, the neural networks are used for the classification purposes. Three issues need to be settled in designing an ANN for a specific application:

Topology of the network;

Training algorithm;

Neuron activation functions.

In our topology, the number of neurons in the input layer is 48 neurons for the ANN classifier. The output layer was determined by the number of classes desired. In our study, the output is either benign or malignant therefore; the output layer consists of one neuron. The hidden layer consists of twenty eight neurons. The general architecture of the proposed network is shown in Figure 2.

Before the training process is started, all the weights are initialized to small random numbers. This ensured that the classifier network is not saturated by large values of the weights. In this experiment, the training set was formed by choosing 79 data sets for the testing process.

The tangent sigmoid (tansig) function was used as the neural activation function. The most important reason for choosing the sigmoid as an activation function for our networks is that the sigmoid function f(x) is differentiable for all values of x, which allows the use of the powerful BP learning algorithm.

The proposed network was trained with all 490 cases (317 benign and 173 malignant cases). These 490 cases are fed to the FFNN with 48 input neurons, one hidden layer of 28 neurons and one output neuron. MATLAB software package version 7 is used to implement the software in the current work. When the training process is completed for the training data (490 cases), the last weights of the network were saved to be ready for the testing procedure. Learning rate is set to 0.5, the output of the network was -1 for the class benign normal and 1 for the class malignant. The training algorithm used for this network is BPA. The performance goal was met at 2600 epochs after a training time of 67 sec.

The testing process is done for 79 cases (40 benign and 39 malignant). These 79 cases are fed to the proposed network and its their output is recorded for calculation of the sensitivity, specificity and accuracy of prediction.

The accuracy of the classification depends on the efficiency of training. The knowledge gained by the learning experience is stored in the form of connection weights, which are used to make decisions on fresh input.

## 6. Results and Discussion

The performance of the algorithm was evaluated by computing the percentages of Sensitivity *(SE),* Specificity *(SP)* and Accuracy *(AC)*. The respective definitions of these parameters are as follows ( Al-Timemy, 2008 ):

Where *TP* is the number of true positives, *TN* is the number of true negatives, *FN* is the number of false negatives, and *FP* is the number of false positives. Since it is interesting to estimate the performance of classifier based on the classification of benign and malignant breast cell nuclei, the true positives *TP*, false positives *FP*, true negatives *TN*, and false negatives *FN* are defined appropriately as shown below:

*FP*: Predicts benign as malignant.

*TP*: Predicts malignant as malignant.

*FN*: Predicts malignant as benign.

*TN*: Predicts benign as begin.

In our study, the output 1 indicates normal case. If the output is 2 this means that the patient may have abnormal kidney function. Sensitivity, specificity and accuracy of prediction have been calculated according to the above formals for all of the testing data (79 cases). Table 1 shows the resulted *SE*, *SP* and *AC* for testing data proposed networks.

No of cases | SE | SP | A C | |

RBP | 79 | 97.5% | 100% | 98.73% |

From the table, the obtained accuracy means that there was only one misidentification. This is regarded a very good and the system is reliable. The results showed that the algorithm can be reliable purposes in the classification purposes.

In practice, the number of neurons in the hidden layer varies according to the specific recognition task and is determined by the complexity and amount of training data available. If too many neurons are used in the hidden layer, the network will tend to memorize the data instead of discovering the features. This will result in failing to classify new input data.

The goal error was 0.01 and the network reached this error after 2600 epochs. Figure 3 displays the error of the training of the network versus epoch’s number.

## 7. Conclusion

In this chapter, Resilient BPA has been implemented for classification of benign from malignant breast tumor. Five hundred and sixty nine sets of cell nuclei characteristics obtained by applying image analysis techniques to microscopic slides of FNAC samples of breast biopsy have been used in the current work. MATLAB software package version 7 is used to implement the software in the current work. These feature vectors which consist of thirty image analysis features each were carried out to generate training and testing of the proposed Neural Network. The accuracy is calculated to evaluate its effectiveness of the proposed network. The obtained accuracy of the network was 98.73% whereas the sensitivity and specificity were found to be Equation 100% and 98.73% respectively. It can be concluded that that the proposed system gives fast and accurate classification of breast tumors.

## Acknowledgments

The authors would like to express their thanks to Dr. William H. Wolberg, W. Nick Street and Olvi L. Mangasarian for providing Breast Cancer Wisconsin (Diagnostic) Data Set used in this work. They are with the University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 USA.