## 1. Introduction

In the real world, data representation is most often imperfect, in the sense that the data may be either incomplete or redundant. Philosophers, logicians and mathematicians have dealt with this problem for a long time. In recent years, propelled by the advent of the computer, the problem of imperfect knowledge has become an important topic for computer scientists engaged in artificial intelligence research, especially those involved with knowledge discovery from databases, expert systems, and pattern recognition.

Our research focuses on rough set theory as a tool for image processing, or more precisely, for image segmentation. Many techniques for image segmentation have been developed over time, including clustering, edge detection, region growing and more advanced techniques that use neural networks. In general, image segmentation techniques can be categorized as supervised or unsupervised. Supervised techniques require previously known truth data for training purposes, while unsupervised techniques have no such requirement.

In this chapter, the classical rough set theory is reviewed in section 2. Particle swarm optimization is then introduced in section 3. The Davies-Bouldin measure for cluster validity is also described in this section. The K-means algorithm is briefly sketched in section 4. Multispectral image classification using rough set theory is discussed in section 5. A hybrid algorithm which combines the K-means algorithm, rough set and particle swarm optimization is given in section 6. Experimental results are shown in section 7. The conclusion and future work then follow.

## 2. Rough Set Theory

Rough set theory [5] is a mathematical tool that deals with the uncertainty of the data. The theory consists of finite sets, equivalence relations and cardinality concepts. As the theory matures and more applications reap the benefits of the concept, an abundance of related theorems and algorithms are being incorporated to extend rough sets theory.

It was introduced by Pawlak in the early 1980s and has been argued to overlap with other theories, such as statistics, evidence theory and fuzzy set theory. Furthermore, rough set is said to complement fuzzy set, a theory introduced earlier by Zadeh. Rough set and fuzzy set were both introduced to deal with imprecise information; however, fuzzy set deals with vagueness, while rough set deals with coarseness. Rough set does not need as much preliminary knowledge about the data, whereas fuzzy set requires knowledge of the possible values in advance. Basically, when using rough set, the data itself is used to come up with the approximation in order to deal with the imprecision within. It can therefore be considered a self-sufficient discipline.

Rough set mainly deals with data analysis in table format. The approach is generally to pre-process the data in the table and then to analyze them. Reducts are extracted with an algorithm and finally rules are generated based on the reducts. Rough set does not support analog values in the table attributes; therefore discretization must be performed in advance in order to evaluate the table. The following subsections will use a simple example to illustrate the concept of rough set theory.

### 2.1. Information Systems

In essence, an information system is a set of objects represented in a data table (an attribute-value system). Each row contains an object and each column represents a measurable attribute for each object. Formally, an information system is a pair A = (U, A) where U is a non-empty finite set of objects representing the universe and A is a non-empty finite set of attributes such that a: U → V_{a} for every a ∈ A. The set V_{a} is the set of values for a.

### 2.2. Decision Systems

If an information system has an additional attribute, namely a decision attribute, then it becomes a decision system. The decision attribute is associated with the object classification outcome, and it may depend on several other attributes. Formally, a decision system is an information system of the form A = (U, A ∪ {d}), where d ∉ A is the decision attribute. A decision attribute called “Class” has been added as shown in Table 2, where M and E denote Manager and Employee, respectively. The table was modified from the original [12].

### 2.3. Indiscernibility

Objects in information and decision systems may be indistinguishable from one another with respect to a set of attributes B ⊆ A. Objects are indiscernible (equivalent) when they have the same value for every attribute in B, and this indiscernibility is an equivalence relation. A binary relation R on a set is an equivalence relation when it is:

Reflexive (a R a for every a).

Symmetric (if a R b, then b R a).

Transitive (if a R b and b R c, then a R c).

For an information system A = (U, A), there is an associated equivalence relation for any set B ⊆ A. The equivalence relation can be formalized as

IND_{A}(B) = {(x, x′) ∈ U × U | a(x) = a(x′) for every a ∈ B}

Referring to table 4, the *IND* relations for *Salary* can be written as shown.

It is impossible to write the *IND* relations for *Salary* until discretization is completed.

### 2.4. Discretization

Discretization is not directly related to rough set theory. It is simply a preprocessing technique. Discretization is associated with information loss. In general, when the discretization is too coarse (i.e. longer intervals), too much information is lost or noise is introduced into the data; however, a coarse discretization is better for the classification of unseen objects. When the discretization is finer (i.e. shorter intervals), less noise exists in the data, but the classification capability for unseen objects may be impaired.

In our decision system table, both Salary and Age need to be discretized. The set of possible Salary and Age values, respectively referred to as *s* and *a* from here on, is given by

The lower and upper bounds of the attribute’s interval are extended to cover possible values. For example, the Age attribute is extended to include likely working ages from age 18 through 65.

The set of values of *s* and *a* in U is

The intervals obtained for *s* are

The intervals obtained for *a* are

Boundary intervals such as [15, 30) and [100, 120) should not be used, since they cannot discern anything in this data set.

The intervals introduce a set of cuts, which are defined as (s, c) where c ∈V_{s} and (a, c) where c ∈ V_{a}. If the cut is taken based on the mid-point of each interval, the set of cuts P obtained for *s* and *a* are respectively

The next step is to find the set of minimal cuts that can discern all of the objects that are needed. It turns out that the problem of finding the irreducible set of cuts P in the decision system is NP-complete while the effort to find the optimal set of cuts P in a decision system is NP-hard [5].

However, there are heuristics that can find a near-optimal set of cuts P in practical time. One of them is the Maximal Discernibility heuristic [1], [5], which is demonstrated here. The algorithm, which begins by constructing table A* from A, is listed in the following steps:

1. Construct table A*. Each column in A* is a Boolean variable that corresponds to a cut, and each row corresponds to a pair of objects with different decision values. If the pair of objects can be discerned by the Boolean variable, then assign the value 1, else assign 0.

2. Choose a column from A* that has a maximal number of 1’s, then delete that column and all the rows which contain a 1 in it.

3. Repeat step 2 until no rows remain.
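The greedy selection in steps 2 and 3 can be sketched in a few lines of Python. This is only an illustrative sketch: the table `A_STAR`, the cut names `c1` to `c3` and the function name `md_heuristic` are hypothetical and do not come from the chapter's example.

```python
def md_heuristic(table):
    """Greedy cut selection. `table` maps each object pair to {cut: 0 or 1}."""
    rows = dict(table)
    columns = sorted({cut for marks in table.values() for cut in marks})
    chosen = []
    while rows and columns:
        # Step 2: pick the column with the most 1's among the remaining rows.
        best = max(columns, key=lambda c: sum(r.get(c, 0) for r in rows.values()))
        if sum(r.get(best, 0) for r in rows.values()) == 0:
            break  # the remaining pairs cannot be discerned by any cut
        chosen.append(best)
        columns.remove(best)
        # Delete every row (pair) that the chosen cut already discerns.
        rows = {pair: r for pair, r in rows.items() if not r.get(best, 0)}
    return chosen


# Hypothetical A* table: four object pairs, three candidate cuts.
A_STAR = {
    ("u1", "u3"): {"c1": 1, "c2": 0, "c3": 0},
    ("u1", "u4"): {"c1": 1, "c2": 1, "c3": 0},
    ("u2", "u3"): {"c1": 0, "c2": 1, "c3": 0},
    ("u2", "u4"): {"c1": 0, "c2": 1, "c3": 1},
}
selected_cuts = md_heuristic(A_STAR)
```

In this toy table the cut `c2` discerns three pairs and is chosen first; `c1` then covers the one remaining pair, so `c3` is never needed.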

The following example clarifies the process of constructing table A* from A mentioned in step 1 of the algorithm. Each cut previously obtained is assigned a Boolean variable, which in turn is used as a condition attribute in table A*.

For example, the cut (s, 40) is assigned its own Boolean variable. Objects u_{1} and u_{3} are paired up since they have different decision values (i.e. M and E).

The resulting table A* created from A is shown in Table 3.

The optimal cuts chosen are (s, 65), (a, 32.5) and (a, 50). These cuts are then used to discretize the decision table (Table 2). The rank of each value can be assigned using the following rules:

If s < 65, value 0 is assigned to s, else assign value of 1.

If a < 32.5, value 0 is assigned to a.

If (32.5 ≤ a < 50), value 1 is assigned to a.

If a ≥ 50, value 2 is assigned to a.

A discretized table can be produced by applying the condition to each analog value in the table.
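As a minimal sketch, the rules above amount to finding the interval that a value falls into. The helper name `discretize` and the sample inputs are assumptions made for illustration; only the cut values 65, 32.5 and 50 come from the text.

```python
from bisect import bisect_right

SALARY_CUTS = [65]        # cut (s, 65)
AGE_CUTS = [32.5, 50]     # cuts (a, 32.5) and (a, 50)


def discretize(value, cuts):
    """Return the index of the interval that `value` falls into.

    A value equal to a cut is placed in the upper interval, which matches
    the rules "s < 65 gives 0, else 1" and "32.5 <= a < 50 gives 1".
    """
    return bisect_right(cuts, value)


# Hypothetical object with salary 70 and age 45.
row = (discretize(70, SALARY_CUTS), discretize(45, AGE_CUTS))
```

Using a sorted list of cuts keeps the function independent of how many cuts an attribute has.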

### 2.5. Lower and Upper Approximations in Rough Set and Accuracy

Let U be a non-empty finite set and R be an equivalence relation on U. The pair A = (U, R) is an approximation space. The equivalence relation R on U leads to a partition of the objects in the universe U. The idea here is to partition the objects that have the same outcome, or in other words, to partition objects that have the same decision attribute. However, this may not always be as easy as stated. There will be objects with the same condition attributes (in the same equivalence class), but different decision attributes. Therefore one cannot define every set precisely.

In cases where the set cannot be defined precisely, it can be approximated. This is where rough set emerges. Let us assume that there is an information system A = (U, A), a set of attributes B ⊆ A, and a set of objects X ⊆ U. Using the set of attributes B, one can approximate X by its:

1. Lower Approximation: the set of objects that can be classified as members of X with certainty. Formally stated as $\underline{B}X = \{x \in U \mid [x]_{B} \subseteq X\}$, where $[x]_{B}$ denotes the equivalence class of x with respect to B.

2. Upper Approximation: the set of objects that can possibly be classified as members of X. Formally stated as $\overline{B}X = \{x \in U \mid [x]_{B} \cap X \neq \emptyset\}$.

Between the lower and upper approximation, one can define the set of objects that cannot be classified into X decisively. This set is also known as the *B-boundary region* of X.
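The lower approximation, upper approximation and boundary region can be computed directly from the equivalence classes. The sketch below uses a hypothetical discretized table in which u4 and u6 share the same condition values; the attribute names `s` and `a`, the table contents and the target set are assumptions for illustration, not the chapter's actual table. The ratio computed at the end is the standard rough set accuracy, the size of the lower approximation divided by the size of the upper approximation.

```python
from collections import defaultdict


def equivalence_classes(table, attrs):
    """Group objects that have identical values on every attribute in attrs."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())


def approximations(table, attrs, target):
    """Return (lower, upper) approximations of the object set `target`."""
    lower, upper = set(), set()
    for block in equivalence_classes(table, attrs):
        if block <= target:      # the whole class lies inside X
            lower |= block
        if block & target:       # the class overlaps X
            upper |= block
    return lower, upper


# Hypothetical discretized decision table (condition attributes only).
TABLE = {
    "u1": {"s": 0, "a": 0}, "u2": {"s": 0, "a": 1}, "u3": {"s": 1, "a": 0},
    "u4": {"s": 1, "a": 2}, "u5": {"s": 0, "a": 2}, "u6": {"s": 1, "a": 2},
}
X = {"u1", "u2", "u4"}           # assumed set of objects with decision M

lower, upper = approximations(TABLE, ["s", "a"], X)
boundary = upper - lower
accuracy = len(lower) / len(upper)
```

Because u4 and u6 are indiscernible but only u4 belongs to X, both end up in the boundary region.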

There is a coefficient that reflects the accuracy of approximation,

where |X| denotes the cardinality of X ≠ ∅. When

For our example, the boundary region would consist of objects *u*_{4} and *u*_{6}, since they cannot be discerned. The lower and upper approximations can be written as

In general, the value of

### 2.6. Reducts

One way to increase computation efficiency is to reduce the size of data by reducing attributes that need to be taken into account. Only attributes that do not contribute to the classification result can be omitted such that the indiscernibility relation remains intact. The set of remaining attributes is the minimal set and is called a reduct.

Although finding the equivalence class is a relatively straightforward computation process, finding reducts with minimal attributes is known to be NP-hard. Fortunately, there are heuristics that allow minimal reducts to be computed in reasonable time.

### 2.7. Discernibility Matrix

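Computing the reducts of an information system A = (U, A) can be started by creating the discernibility matrix. This matrix is a symmetric *n* x *n* matrix, where *n* is the number of objects in U and each entry c_{ij} is defined for a pair of objects x_{i} and x_{j}.

Each cell in the matrix holds the set of attributes on which objects x_{i} and x_{j} differ, that is, c_{ij} = {a ∈ A | a(x_{i}) ≠ a(x_{j})}.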
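A discernibility matrix can be built with a nested comprehension. The three objects and the attribute names below are hypothetical and serve only to show the shape of the result.

```python
def discernibility_matrix(objects, attrs):
    """Entry [i][j] is the set of attributes on which objects i and j differ."""
    n = len(objects)
    return [
        [{a for a in attrs if objects[i][a] != objects[j][a]} for j in range(n)]
        for i in range(n)
    ]


# Hypothetical discretized objects with two condition attributes.
OBJECTS = [
    {"s": 0, "a": 0},
    {"s": 0, "a": 1},
    {"s": 1, "a": 1},
]
matrix = discernibility_matrix(OBJECTS, ["s", "a"])
```

The matrix is symmetric with empty sets on the diagonal, so in practice only the lower triangle needs to be stored.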

### 2.8. Discernibility Functions

Based on the discernibility matrix, a discernibility function can be immediately obtained. It is constructed using Boolean expressions from the discernibility matrix, defined as

where

Once the discernibility function

### 2.9. Decision Rules

When applying rough set for supervised learning, we need to construct a set of rules from the training data, such that new or unseen objects can be separated into known classes.

A basic method for forming the decision rules is begun by finding the reducts of the decision table. Then for each reducts

Rule induction is about deciding which attributes should be included in the antecedent of the rule. The rules obtained can always be minimized further, but doing so may introduce noise and lead to poor classification of unseen objects.

Once the rules are obtained, they can be used to classify objects that were not seen before. The basic steps involved can be outlined as follows [1].

1. Apply the existing rules to the new objects to determine which rules actually fit them.

2. If none of the rules match, then a fallback must be chosen, or the objects are classified as undefined.

3. If more than one rule is applicable, then a negotiation among the rules must be performed to decide which one to use.
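The three steps above can be sketched as follows. The rules are hypothetical, and simple majority voting is assumed here as the negotiation scheme; the text does not prescribe a particular one.

```python
from collections import Counter


def classify(obj, rules, fallback="undefined"):
    """Apply decision rules to one object.

    `rules` is a list of (conditions, decision) pairs, where `conditions`
    maps attribute names to the values required by the rule.
    """
    # Step 1: collect the decisions of every rule whose conditions all hold.
    matches = [
        decision
        for conditions, decision in rules
        if all(obj.get(attr) == value for attr, value in conditions.items())
    ]
    # Step 2: no rule fits, so use the fallback.
    if not matches:
        return fallback
    # Step 3: several rules fit, so let them vote (assumed negotiation scheme).
    return Counter(matches).most_common(1)[0][0]


# Hypothetical rules over discretized attributes s and a.
RULES = [
    ({"s": 1}, "M"),
    ({"a": 0}, "E"),
    ({"s": 1, "a": 2}, "M"),
]
```

For example, `classify({"s": 1, "a": 2}, RULES)` matches two rules that both vote for M, while an object that matches no rule is returned as `"undefined"`.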

For the discernibility function extracted from the decision table 2, we obtain the following sets of decision rules by:

## 3. Particle Swarm Optimization

PSO was originally introduced by Kennedy and Eberhart [21]. The algorithm was inspired by a sociological observation of the behavior of a flock of birds searching for food. Each member of the flock moves with a direction and speed influenced by its own previous state and that of the flock as a whole.

PSO consists of a swarm (collection) of particles searching through the solution space. Each particle holds information that can potentially become the solution. Each particle has a position and a velocity that mutually affect those of other particles. Each particle will adjust its parameters according to the swarm’s best outcome, while still considering its own experience. Therefore, at any instant, the following information is maintained by each particle.

*x*_{i}, the current position of the particle;

*v*_{i}, the current velocity of the particle;

*y*_{i}, the personal best position of the particle (*pbest*), i.e. the best position visited so far by the particle; and

*ŷ*, the global best position of the swarm (*gbest*), i.e. the best position visited so far by the entire swarm.

The search performed by the swarm is either to maximize or minimize the objective function *f(x)*. The personal best position (*pbest*) is obtained by evaluating the following.

The global best position (*gbest*) is obtained by using

After each iteration, the current position (*x*_{i}) and velocity (*v*_{i}) are recalculated using

$$v_{i}(t+1) = \omega\, v_{i}(t) + c_{1} r_{1}(t)\,\bigl(y_{i}(t) - x_{i}(t)\bigr) + c_{2} r_{2}(t)\,\bigl(\hat{y}(t) - x_{i}(t)\bigr)$$

$$x_{i}(t+1) = x_{i}(t) + v_{i}(t+1)$$

where *ω* is the inertia weight, which reflects the memory of previous velocities. *y*_{i}(*t*) – *x*_{i} (cognitive component) represents the particle’s own experience as to where the best solution is. *ŷ*(*t*) – *x*_{i} (social component) represents the direction of the entire swarm towards the best solution. *c*_{1} and *c*_{2} are acceleration constants. *r*_{1}(*t*) and *r*_{2}(*t*) are random numbers between 0 and 1, drawn from the uniform distribution *U*(0,1).
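A minimal PSO loop for minimization might look as follows. The new velocity is the inertia term plus the cognitive and social components, and the position is then moved by that velocity. The parameter values (`w`, `c1`, `c2`, swarm size, search range) and the sphere test function are assumptions chosen for illustration and are not taken from the chapter.

```python
import random


def pso(f, dim, n_particles=20, iters=200, w=0.72, c1=1.49, c2=1.49, seed=0):
    """Minimize f over `dim` dimensions; returns the global best position."""
    rng = random.Random(seed)
    x = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    y = [p[:] for p in x]                  # personal best positions (pbest)
    g = min(y, key=f)[:]                   # global best position (gbest)
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()   # r1(t), r2(t) ~ U(0, 1)
            for d in range(dim):
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (y[i][d] - x[i][d])   # cognitive part
                           + c2 * r2 * (g[d] - x[i][d]))     # social part
                x[i][d] += v[i][d]
            if f(x[i]) < f(y[i]):          # update pbest
                y[i] = x[i][:]
                if f(y[i]) < f(g):         # update gbest
                    g = y[i][:]
    return g


def sphere(p):
    """Simple test objective with its minimum at the origin."""
    return sum(c * c for c in p)


best = pso(sphere, dim=2)
```

For a maximization problem the two comparisons would be reversed, or the objective negated.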

In image classification, the PSO algorithm is used to optimize the objective functions that are mainly to:

Minimize the distance between pixels and cluster means for each cluster.

Maximize the distance between clusters.

In unsupervised training, there is no prior knowledge of the number of clusters. Therefore the cluster validity is determined by the objective functions. In the algorithm, the Davies-Bouldin index is used as the means to evaluate the result of each iteration.

### 3.1. Cluster Validity – Davies-Bouldin Index

The accuracy or validity of the classification results needs to be measured using certain criteria. As a prerequisite, the set of objects needs to possess a natural group structure. In our image classification algorithm outlined in section 6, the Davies-Bouldin (DB) index is used as an aid in parameter tuning. Our objective is to minimize the DB index, since a smaller index value indicates compact and well-separated clusters. The similarity index between two clusters *C*_{i} and *C*_{j} can be expressed as [17]

where *s*_{i} and *s*_{j} are measures of the distance within a cluster, and *d*_{ij} is the distance between clusters *i* and *j*. The *s*_{i} is defined as [17]

where *n*_{i} is the number of pixels in the cluster *C*_{i}. The distance between two clusters *d*_{ij} is defined as [17]

where *l* is the number of clusters and *m* represents the mean distance.

Let *R*_{i} be defined as [17]

Then the DB index is defined as [17]
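The sketch below computes the DB index in its commonly used form, which is assumed here rather than taken from [17]: *s*_{i} is the mean Euclidean distance of the members of cluster *i* to the cluster mean, *d*_{ij} is the Euclidean distance between the means of clusters *i* and *j*, the similarity is (*s*_{i} + *s*_{j}) / *d*_{ij}, *R*_{i} is the largest similarity between cluster *i* and any other cluster, and the index is the average of *R*_{i} over all *l* clusters.

```python
import math


def db_index(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    means = [tuple(sum(c) / len(pts) for c in zip(*pts)) for pts in clusters]
    # s[i]: average distance of the points in cluster i to the cluster mean.
    s = [sum(math.dist(p, m) for p in pts) / len(pts)
         for pts, m in zip(clusters, means)]
    l = len(clusters)
    # r[i]: worst-case similarity between cluster i and any other cluster.
    r = [max((s[i] + s[j]) / math.dist(means[i], means[j])
             for j in range(l) if j != i)
         for i in range(l)]
    return sum(r) / l


# Two compact, well-separated toy clusters (hypothetical data).
near = db_index([[(0, 0), (0, 2)], [(4, 0), (4, 2)]])
far = db_index([[(0, 0), (0, 2)], [(10, 0), (10, 2)]])
```

Moving the second cluster farther away lowers the index, which is why a smaller value is taken to indicate better separation.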

## 4. The K-means algorithm for multispectral image classification

The K-means algorithm is one of the simplest and most efficient unsupervised learning algorithms to solve clustering problems in image segmentation. In this algorithm, random cluster means are assigned and repeatedly modified throughout the process in order to minimize the squared error function. Suppose there are

Upon the completion of the assignment, each new cluster mean is calculated using

where

The weakness of K-means is that it depends on the initial selection of the cluster means and may be trapped in locally optimal results. However, running the algorithm repeatedly with different randomly selected sets of cluster means may offset the problem. In a paper by Hung and Germany [19] it is shown that locally optimal results may also be avoided by assigning the cluster means based on the distribution of patterns in the histogram of an image.
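For reference, the basic K-means loop can be sketched as below, assuming Euclidean distance and a convergence tolerance on the movement of the means. The function signature and the sample points are hypothetical. The initial means are passed in explicitly so that different initializations (random or histogram based) can be tried.

```python
import math


def kmeans(points, means, iters=50, tol=1e-3):
    """Plain K-means. `means` holds the initial cluster means."""
    k = len(means)
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[nearest].append(p)
        # Update step: recompute each mean (an empty cluster keeps its mean).
        new_means = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else means[i]
            for i, pts in enumerate(clusters)
        ]
        shift = max(math.dist(a, b) for a, b in zip(means, new_means))
        means = new_means
        if shift < tol:
            break
    return means, clusters


# Hypothetical two-band pixel values forming two obvious groups.
POINTS = [(0, 0), (0, 2), (2, 0), (2, 2), (10, 10), (10, 12), (12, 10), (12, 12)]
final_means, final_clusters = kmeans(POINTS, [(0, 0), (10, 10)])
```

Running `kmeans` several times with initial means drawn by `random.sample(POINTS, k)` and keeping the best run is one way to reduce the dependence on initialization.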

## 5. Multispectral Image Classification using Rough Set Theory

Multi-spectral images can be analyzed using rough set theory. However, since all the attribute values are analog, the discretization process is required. Multispectral images contain multiple bands, for example the RGB color band.

| Object Index | R | G | B | Class |
|---|---|---|---|---|
| u_{1} | 149 | 148 | 143 | 1 |
| u_{2} | 154 | 155 | 150 | 1 |
| u_{3} | 159 | 160 | 155 | 1 |
| u_{4} | 174 | 171 | 164 | 2 |
| u_{5} | 164 | 161 | 154 | 2 |
| u_{6} | 179 | 183 | 186 | 3 |
| u_{7} | 159 | 165 | 163 | 3 |
| u_{8} | 178 | 184 | 182 | 3 |

The values of the condition attributes are obtained from the image data shown in Figure 1, while the values of the decision attributes are obtained from 'ground truth' data.

Each object has three condition attributes, Red (R), Green (G) and Blue (B) which are associated with a decision attribute. The decision attributes signify the following:

The value of each attribute ranges from 0 to 255, hence the training data from Table 5 can be expressed as:

V_{R}={0, 149, 154, 159, 164, 174, 178, 179, 255}

V_{G}={0, 148, 155, 160, 161, 165, 171, 183, 184, 255}

V_{B}={0, 143, 150, 154, 155, 163, 164, 182, 186, 255}

Based on the above intervals, the following sets of cuts are obtained.

For the R attribute:

(r, 151.5); (r, 156.5); (r, 161.5); (r, 169); (r, 176); (r, 178.5)

For the G attribute:

(g, 151.5); (g, 157.5); (g, 160.5); (g, 163); (g, 168); (g, 177); (g, 183.5)

For the B attribute:

(b, 146.5); (b, 152); (b, 154.5); (b, 159); (b, 163.5); (b, 173); (b, 184)
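The candidate cuts are the midpoints between consecutive distinct values of each band, leaving out the boundary values 0 and 255. A short sketch using the eight training pixels from the table above (the names `PIXELS` and `midpoint_cuts` are illustrative):

```python
# (R, G, B) values of u1 to u8 from the training table.
PIXELS = [
    (149, 148, 143), (154, 155, 150), (159, 160, 155), (174, 171, 164),
    (164, 161, 154), (179, 183, 186), (159, 165, 163), (178, 184, 182),
]


def midpoint_cuts(values):
    """Midpoints between consecutive distinct values, in ascending order."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


cuts = {band: midpoint_cuts([p[k] for p in PIXELS]) for k, band in enumerate("RGB")}
```

Each band with *n* distinct observed values yields *n* − 1 candidate cuts, here six for R and seven each for G and B.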

The optimal set of cuts needs to be selected now. There are many ways to perform the selection. For decision table A = (U, A ∪ {d}), a local method can be used as [1]:

*Input:* The consistent decision table A.

*Output:* The semi-minimal set of cuts D consistent with A.

*Method:* Initialize the binary tree variable T with the empty tree. Label the root by the set of all objects U and fix the status of the root to be unready.

By applying the algorithm above to the image data as shown in Table 6, the following details are derived. For each cut of the R, G and B attributes, we count the pairs of objects from different classes that the cut discerns, and we look for the cut that yields the maximum number of pairs. The search gives us (g, 160.5) as the optimal solution, which yields 15 pairs.

The cut (g, 160.5) divides the set into two, X_{1} = {u_{1}, u_{2}, u_{3}} and X_{2} = {u_{4}, u_{5}, u_{6}, u_{7}, u_{8}}. Notice that X_{1} consists of objects of the same class, so the search ends for X_{1} and continues for X_{2}. Three cuts are found from the R, G and B attributes for X_{2}. All of these cuts, (r, 176), (g, 177) and (b, 173), discern the same number of pairs (4 pairs). We only need to select one, and the one chosen is (r, 176). Again, this cut divides the set into two, Y_{1} = {u_{4}, u_{5}, u_{7}} and Y_{2} = {u_{6}, u_{8}}. Y_{2} consists of objects of the same class, so the search ends for Y_{2} and continues for Y_{1}. The cut that discerns the most pairs in Y_{1} is (r, 161.5).

The cut (r, 161.5) divides Y_{1} into two sets, Z_{1} = {u_{4}, u_{5}} and Z_{2} = {u_{7}}. The search ends since both sets contain objects of the same class.

The set of cuts selected is: (g, 160.5), (r, 176) and (r, 161.5).

It appears that our data set only requires two attributes to be fully discerned. Note that different discretization methods will obtain different results. For example, if a naïve algorithm were used, the B attribute would also be considered in generating the cuts.
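The search described in this section can be reproduced with a short recursive sketch. It assumes that ties between equally good cuts are broken by taking the first cut in R, G, B order, which is consistent with choosing (r, 176) in the walk-through. The function names are illustrative, and this is a simplified reading of the local method rather than the exact procedure of [1].

```python
from itertools import combinations

BANDS = ("R", "G", "B")
# Training pixels from the table in this section: name -> (R, G, B, class).
PIXELS = {
    "u1": (149, 148, 143, 1), "u2": (154, 155, 150, 1),
    "u3": (159, 160, 155, 1), "u4": (174, 171, 164, 2),
    "u5": (164, 161, 154, 2), "u6": (179, 183, 186, 3),
    "u7": (159, 165, 163, 3), "u8": (178, 184, 182, 3),
}


def candidate_cuts():
    """Midpoint cuts for every band, listed in R, G, B order."""
    cuts = []
    for k, band in enumerate(BANDS):
        values = sorted({p[k] for p in PIXELS.values()})
        cuts += [(band, (lo + hi) / 2) for lo, hi in zip(values, values[1:])]
    return cuts


def discerned(names, cut):
    """Number of pairs from different classes that the cut separates."""
    band, c = cut
    k = BANDS.index(band)
    return sum(
        PIXELS[a][3] != PIXELS[b][3] and (PIXELS[a][k] < c) != (PIXELS[b][k] < c)
        for a, b in combinations(names, 2)
    )


def select_cuts(names, cuts):
    """Pick the best cut, split the objects, and recurse on impure halves."""
    if len({PIXELS[n][3] for n in names}) <= 1:
        return []                      # all objects share one class
    best = max(cuts, key=lambda cut: discerned(names, cut))  # first maximum wins
    if discerned(names, best) == 0:
        return []                      # inconsistent data, nothing to discern
    band, c = best
    k = BANDS.index(band)
    left = [n for n in names if PIXELS[n][k] < c]
    right = [n for n in names if PIXELS[n][k] >= c]
    return [best] + select_cuts(left, cuts) + select_cuts(right, cuts)


selected = select_cuts(sorted(PIXELS), candidate_cuts())
```

On this data the sketch selects (g, 160.5), then (r, 176), then (r, 161.5), so only the R and G bands are used.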

Using the cuts, a discretized table is subsequently generated. The asterisk in the B column indicates that it is not needed to discern the classes. This, however, will not be the case when the training set grows larger.

Based on Table 6, the following rules are generated:

## 6. The Hybrid Rough K-means Algorithm and Particle Swarm Optimization for Multispectral Image Classification

The K-means clustering method is categorized as a hard clustering method. Using K-means to classify images that have obscured or blurred boundaries will not bring a satisfactory result. There are many methods proposed to deal with this. The fuzzy C-means [22] and genetic K-means [23] algorithms are two examples.

Rough K-means is a recently proposed method that deals with the coarseness of the information. In gray image classification the challenge is on segmenting the blurred boundaries between clusters. Using rough sets theory, an image can be represented as sets of lower and upper approximation. The rough K-means model for our proposed image segmentation algorithm is adapted from [20].


Each image pixel can be classified into lower or upper approximations. Following basic rough set properties:

A pixel can be part of only one lower approximation

If a pixel is part of a lower approximation, then it is also part of the upper approximation

If a pixel does not belong to any lower approximation, then it belongs to two or more upper approximations.

Applying rough set to K-means requires the formula to include lower and upper approximations. The formula, as shown below, includes the weighting factors *w*_{lower} and *w*_{upper}. Let *v* be a pixel vector and
*i*. Let

and

In order to correctly classify a pixel, the following *classification criteria* are used:

If *T* is not an empty set, then the pixel is classified as an upper approximation of both clusters *i* and *j*.

If *T* is an empty set, the pixel is classified as a lower approximation for cluster *i*. It will also be classified as an upper approximation for cluster *i*.

To summarize, the following are the steps to perform the rough K-means algorithm [26]:

1. Initialize K clusters randomly.

2. Select *w*_{lower} and a threshold value.

3. For each cluster, find *d* using Equation 6.2 and *T* using Equation 6.3.

4. Classify the pixel using the *classification criteria*.

5. Calculate the new cluster center (mean) using Equation 6.1.

6. If every cluster converges, then stop. Otherwise, repeat from step 3.
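One iteration of steps 3 to 5 can be sketched as below. Since the equations are not reproduced here, the sketch assumes the commonly used rough K-means formulation: *T* contains every other cluster whose distance to the pixel exceeds the smallest distance by no more than the threshold, and the new mean is a weighted sum of the mean of the lower approximation and the mean of the boundary pixels, with *w*_{upper} = 1 − *w*_{lower}. The function names and the gray-level sample data are hypothetical.

```python
import math


def rough_assign(pixels, means, threshold):
    """Assign each pixel to lower and/or upper approximations of clusters."""
    k = len(means)
    lower = [[] for _ in range(k)]
    upper = [[] for _ in range(k)]
    for v in pixels:
        d = [math.dist(v, m) for m in means]
        i = min(range(k), key=d.__getitem__)          # nearest cluster
        T = [j for j in range(k) if j != i and d[j] - d[i] <= threshold]
        if T:                      # ambiguous pixel: upper approximations only
            for j in [i] + T:
                upper[j].append(v)
        else:                      # clear pixel: lower and upper of cluster i
            lower[i].append(v)
            upper[i].append(v)
    return lower, upper


def mean(vectors):
    return tuple(sum(c) / len(vectors) for c in zip(*vectors))


def rough_means(lower, upper, means, w_lower):
    """Weighted update of the cluster means (assumed form of the update)."""
    w_upper = 1.0 - w_lower
    new = []
    for low, up, old in zip(lower, upper, means):
        boundary = [v for v in up if v not in low]
        if low and boundary:
            lm, bm = mean(low), mean(boundary)
            new.append(tuple(w_lower * a + w_upper * b for a, b in zip(lm, bm)))
        elif low:
            new.append(mean(low))
        elif boundary:
            new.append(mean(boundary))
        else:
            new.append(old)        # empty cluster keeps its previous mean
    return new


# Hypothetical gray levels: the pixel (5,) sits halfway between both means.
GRAY = [(0,), (1,), (5,), (9,), (10,)]
START = [(0.5,), (9.5,)]
low, up = rough_assign(GRAY, START, threshold=1.0)
updated = rough_means(low, up, START, w_lower=0.75)
```

With a larger `w_lower` the ambiguous boundary pixels pull the means less, and with `w_lower` close to 1 and a threshold close to 0 the update approaches ordinary K-means.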

The parameters involved are *w*_{lower}, *w*_{upper} and the threshold. The sum of *w*_{lower} and *w*_{upper} is always one. These parameters are set manually by trial and error. Since it is not trivial to come up with good parameter values, this is the major disadvantage of this method. In order to adjust these parameters automatically, the algorithm needs to be extended with an automatic tuning mechanism. The PSO algorithm alleviates the limitation by automatically searching for and modifying the parameters during the image segmentation process.

The proposed algorithm that combines the rough K-means and PSO algorithms is outlined as follows [26]:

1. Initialize the mean of each cluster.

2. Initialize a number of particles, where each particle is randomly assigned *w*_{lower} and the threshold.

3. Find the minimum pair of distance of *x* to all clusters, *d*(*x-c*_{i}). Then assign the pixel according to the following criteria.

4. Calculate the DB index of each particle. Save the DB index of each particle and compare it with those of other particles. Find the global best index and tune the lower approximation and thresholds of each particle according to the following guidelines.

   - If the personal best DB index equals the global best DB index, then lower the threshold so that it includes only the pixels that are definitely in the lower approximation.

   - If the personal best DB index is greater than the global best DB index, then adjust *w*_{lower} and the threshold toward the particle with the global best DB index.

5. Calculate the new mean for each cluster.

6. Repeat steps 3, 4 and 5 until all particles converge.

## 7. Experimental Results

To test the effectiveness of the proposed algorithms, multispectral and artificial images were used in our experiments. The original image is processed to obtain the multispectral information. Then the Rough Set Exploration System (RSES) software was used to process the image data [24]. A selected percentage of the image pixels were sampled for training purpose. Finally MATLAB was used to make the results viewable as an image. Experimental results are described in section 7.1. The experiment on the rough K-means algorithm is intended to show the effect of parameter selection on the results of the classification. Experimental results on the algorithm are shown in section 7.2.

### 7.1. Experimental Results on the Rough Set Theory

Due to the size of the table in our training sample (over 80,000 pixels), we need to resort to the decomposition tree feature of the RSES. This feature allows us to break the table into sections no larger than a predefined size. In this case, a size of 500 samples is selected as the maximum size of each leaf in the decomposition tree. These methods are further elaborated in [3] and [4]. During the decomposition process, the table is also discretized. A local method like the one outlined in Section 5 is chosen as the method for selecting the optimal cuts. Each leaf of the decomposition tree contains a set of rules that was dynamically created. The rules are then used to classify the unseen objects. Using RSES, there are two formats of output that the user can select: confusion matrix or classification results in table format.

After applying the rules to the pixels and obtaining the classification result, the reverse process is done using MATLAB to get the classified image. All pixels, including the unclassified ones, are assigned a specific color for visualization. The original image as shown in Figure 1(a) is a terrain image that has land, water and village. After the classification using rough set theory, the classified result is obtained in Figure 1(b). The confusion matrix with an average accuracy of 0.79 is shown in Table 7.

With the parallelepiped classification algorithm [25], the ordering of the classes affects the final result. The experimental results are shown in Figure 2. First we show the result of classification, where the order of classes are 1, 2, and 3 (respectively land, village and water). It is apparent that the RGB spectral signatures for village and water overlap. Since the order of analysis begins with village (2), most pixels of water (3) were classified as village. The confusion matrix for the results in Figure 2 is shown in Table 8 with an average accuracy of 0.6.

Ordering of the classification for classes 1, 2, 3 (land, village and water)

The classification ordering is class 1, 2, and 3 (land, village, and water)

The experiment is repeated for the parallelepiped classifier with a different ordering: classes 1, 3 and 2 (respectively land, water and village). Contrary to the result in Figure 2, now most pixels of the village area are classified as water. The confusion matrix for the result in Figure 3 is shown in Table 9 with an average accuracy of 0.72. The average accuracy increases because the effect of misclassifying village is mitigated by the overall number of its pixels.

The classification ordering is class 1, 3, and 2 (land, water, and village)

The following experiment requires ground truth data for accuracy assessment. The ground truth data for the remotely sensed image was obtained from Dr. Su at the National Central University in Taiwan, while the ground truth data for the artificial images were created using custom software written in Java. The decision rules, which are required for classification of the image, are generated with RSES [24]. The process for remotely sensed images begins by sampling 30% of the image pixels as training data to create decision rules. The process to create decision rules follows the outline in Section 2. After obtaining the rules with RSES, they are used to classify the image. The image consists of approximately 262,000 pixels. Referring to Figure 4(a), class 1 is land, class 2 village and class 3 water. The average accuracy for the classification is 79%. The confusion matrix is shown in Table 10. The most difficult pixels to classify are the village pixels, as indicated by the small value of their true positive rate. The ground truth data and classified result are shown in Figure 4(b) and (c).

The other experiment is performed on the artificial image that consists of several shapes, namely, a cube, a serpentine, two airbrush shapes and a round shape (Figure 5). Similarly, 30 % of the pixels in the image are used for training. After obtaining the decision rules, the image is classified. The artificial image has a total of 10000 pixels. Referring to Table 11, class 1 is the cube, class 2 is the connector of the airbrush images, class 3 is the airbrush images, class 4 is the round shape and class 5 is the background. Some difficulties occur while trying to obtain the ground truth, due to the inherent limitations of the image processing software. The results however, indicate that class 3, the airbrush shapes, has the most incorrectly identified pixels. The total accuracy is still about 99% as shown in Table 11.

### 7.2. Experimental Results of the Hybrid Rough K-Means and PSO

In the experiment shown in Figure 6(b), the parameter *w*_{lower} is set to 0.55 and the threshold to 0.45. The first parameter weights how much the previously calculated mean will affect the new mean and the second parameter adjusts the boundary region. In other words, the second parameter is the criterion limiting whether a pixel should be included in the upper approximation of a class. The higher the value of the threshold, the less constrained the criterion is.

The experiment was done for several different combinations of *w*_{lower} and the threshold value. After careful inspection of the results shown in Figure 6(b) through Figure 6(f), it turns out that a monotonic increase or decrease of *w*_{lower} and the threshold does not guarantee improvement in the classification results. From Figure 6(b) to (c), the accuracy decreased. Although the threshold was reduced, the boundary area between land, village and river actually became blurred. In the result of Figure 6(d) the accuracy improves. From Figure 6(d) to (e) the accuracy decreases again, although not as badly as from Figure 6(b) to (c). The accuracy improves again in the results of Figure 6(f). These are strong indications that the effect of varying the parameters (*w*_{lower} and the threshold) cannot be predicted easily. As a matter of fact, the optimal parameters can only be found empirically. This is exactly the shortcoming of the rough K-means algorithm, and the problem is addressed using PSO to tune the parameters.

Results with similar consistency are obtained for the image in Figure 7. From Figure 7(b) to 7(c), we can see improvement visually. As the parameters change in one direction, the accuracy drops as shown in Figure 7(d). Finally the best value for the experiment is shown in figure 7(f). Looking at those classified results, it may lead us into thinking that increasing the w_{lower} and decreasing the threshold gives a better result. That is not necessarily the case, since doing so means that we are counting on the lower bound more and reducing the threshold value, while at the same time discounting the upper bound. At the extreme, where w_{lower} is almost 1 and threshold is almost zero, roughness is actually removed, and the set becomes crisp. This is also formulated in Equations 6.1, 6.2 and 6.3 earlier.

Experiments using the K-means and rough K-means PSO algorithms are performed. For the comparison, the number of iterations is limited to 50 and the tolerance is set to 0.001. The result shown in Figure 8 is selected from the best outcome of 20 runs of the K-means and rough K-means algorithms. For the rough K-means PSO, 10 particles are used to explore the search space. Comparing the results of the K-means, rough K-means and rough K-means PSO algorithms reveals that, although an improvement can be made, it is on the order of 5 %. It is not very significant, but we should note that the rough K-means PSO achieves the optimal results independent of the initial mean selections.

While running, the algorithm is tuned by keeping track of the DB index and adjusting the PSO particle accordingly to calculate the new mean. Figure 9 shows the DB index tracking for Figure 8(a) and 8(c).

Referring to Figure 9, it is apparent that the K-means algorithm eventually converges and locks into a certain mean value. The rough K-means PSO shows a better capability to search for solutions, because there are about 10 particles that keep track of the global best and adjust their velocity towards the best solution in every iteration. Similar results are obtained from the remaining tests performed. The resulting improvement, however, is not as obvious for the artificial image (Figure 10). Part of the reason is that the artificial image does not have enough roughness. Hence, it is not difficult for K-means to perform well in this case. Figure 11 shows the tracking of the DB index.

For the planet image, there is no ground truth data available. However, the visual inspection reveals improvement. Based on the results shown in Figure 12, we can see that the K-means algorithm actually has some difficulty in the segmentation of the blurred or rough boundaries. The rough K-means PSO however, appears to be able to discern the rough boundaries, and therefore comes up with a much more rounded shape for the planet. The outer shape of the planet appears sharper, more rounded and less distorted. Figure 13 shows the DB index tracking of the planet image.

## 8. Conclusions and Future Work

Image classification and segmentation by applying rough set theory may be approached from two different perspectives: unsupervised or supervised methods. From the experiments, it is generally shown that supervised classification achieves better results than the unsupervised methods. However, it should be noted that unsupervised classification may be preferred because it requires less prior knowledge. The K-means algorithm can be enhanced by using rough set theory for image classification; however, it has a practical limitation by itself, since the parameters (w_{lower} and the threshold value) are difficult to tune manually. To solve this problem, PSO is used for tuning these parameters. The algorithm is tested on several images, and in general its improved noise immunity can be seen in the results, especially when the images have rough boundaries or noisy details. For future work, the ant colony optimization and differential evolution algorithms will be explored for tuning the parameters in the rough K-means algorithm.