Web services clustering is the task of extracting and selecting features from a collection of Web services and forming groups of closely related services. The implementation of novel and efficient algorithms for Web services clustering is relevant for the organization of service repositories on the Web. Having well-organized collections of Web services promotes the efficiency of Web service discovery, search, selection, substitution, and invocation. In recent years, methods inspired by nature and based on biological analogies have been adapted to clustering problems, among them genetic algorithms, evolutionary strategies, and algorithms that imitate the behavior of some animal species. Computation inspired by nature aims at imitating the strategies that nature has developed and adapting them to find the solution of a given problem. In this chapter, we investigate how biologically inspired clustering methods can be applied to clustering Web services and present a hybrid approach for Web services clustering using the Artificial Bee Colony (ABC) algorithm, K-means, and Consensus. This hybrid algorithm was implemented, and a series of experiments was conducted using three collections of Web services. The results show that the approach is adequate and efficient for clustering very large collections of Web services.
- artificial bee colony
- hybrid algorithms
- Web services clustering
- semantic similarity measures
1. Introduction
Web services clustering is the task of selecting and extracting the features of a collection of Web services and discovering the similarities between them to form groups or classes based on those features. The implementation of novel and efficient algorithms for the automatic clustering of Web services is relevant for the organization of large collections of services in private or public networks such as the Internet. Having a directory of Web services organized in groups according to one or more characteristics represents an advantage during the search, selection, invocation, substitution, and composition of Web services.
Methods inspired by nature and based on biological analogies have been adapted to clustering problems, among them genetic algorithms, evolutionary strategies, and algorithms that imitate the behavior of some animal species. Living beings such as animals and plants, and even the climate, exhibit extraordinary, complex, and fascinating natural phenomena. Of particular interest is the intelligent behavior of some animal species that allows them to solve problems and maintain the balance of their surrounding environment.
This is the main idea of computation inspired by nature: to imitate the strategies that nature has developed and adapt them to find the solution of a problem, thus producing a bio-inspired algorithm. This is the motivation for implementing an algorithm that exploits the collective intelligence of a bee colony as an alternative for clustering. This chapter presents an innovative approach for Web services clustering using a hybrid algorithm based on the Artificial Bee Colony, K-means, and Consensus.
The work described in this chapter is part of a research project whose objective is to design and implement a semantic directory of Web services using efficient, fully automated methods to allow the organization, composition, and classification of Web services with a semantic approach.
Figure 1 shows the general architecture of the semantic directory and a methodology to construct and manage the directory. The methodology consists of the following phases:
Public Web service retrieval aims at searching the Internet to find and copy Web service descriptions (WSDL files) into a local file directory. Web service retrieval is executed through crawlers designed specifically to parse Web pages and identify links to WSDL files.
Extraction and analysis of Web services consists of parsing every Web service description file and extracting specific data, such as method names, input and output parameters, and port types, to facilitate automatic invocation. The extracted data is stored in an ontology model. For this phase, we use a tool that transforms WSDL files into an ontological representation.
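The extraction step can be sketched with a small WSDL parser. The following is a minimal illustration using only Python's standard library, assuming WSDL 1.1 namespaces; it is not the tool used in the project, and the sample document is invented:

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

def extract_operations(wsdl_text):
    """Parse a WSDL document and return, per port type, the operation
    names with their input and output message references."""
    root = ET.fromstring(wsdl_text)
    operations = []
    for port_type in root.iter(f"{{{WSDL_NS}}}portType"):
        for op in port_type.iter(f"{{{WSDL_NS}}}operation"):
            inp = op.find(f"{{{WSDL_NS}}}input")
            out = op.find(f"{{{WSDL_NS}}}output")
            operations.append({
                "portType": port_type.get("name"),
                "operation": op.get("name"),
                "input": inp.get("message") if inp is not None else None,
                "output": out.get("message") if out is not None else None,
            })
    return operations

sample = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/">
  <portType name="WeatherPort">
    <operation name="getForecast">
      <input message="tns:getForecastRequest"/>
      <output message="tns:getForecastResponse"/>
    </operation>
  </portType>
</definitions>"""

print(extract_operations(sample))
```

The operation names gathered this way are exactly the elements the similarity phase consumes.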
Web service similarity calculation is an important phase of the methodology, because the classification and clustering of Web services require the calculation of distances between services. In order to calculate similarities, different measures can be implemented and combined to obtain better results.
Web service classification and clustering consists of selecting and extracting the characteristics of a collection of Web services and discovering the similarities between them to form groups or classes based on those characteristics. Therefore, this phase depends on the results of the previous phases. In particular, this chapter describes a hybrid clustering algorithm based on the Artificial Bee Colony algorithm, K-means, and Consensus.
Inference and reasoning represents the supporting mechanisms to exploit and utilize the enriched Web service ontologies.
2. Related work
Web service classification and clustering is a topic addressed from different perspectives, such as statistical, stochastic, and novel approaches based on bio-inspired algorithms. Additionally, semantic approaches to describe, discover, and invoke Web services have been studied to propose novel clustering algorithms. In this section, a review of related work is presented considering two trends: works that address the clustering and classification of Web services, and clustering approaches based on bio-inspired algorithms. Table 1 presents the main characteristics of the related work.
| Work | Input data | Clustering approach | Use of ontologies | Similarity approach | Number of Web services used | Service repository | Benefits | Limitations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Liang et al. | WSDL documents | Incremental K-means algorithm | No | Tree-based structure matching | 352 | — | This approach clusters Web service documents | The similarity approach is not semantic, and the clustering method is not bio-inspired |
| Platzer et al. | WSDL documents | Statistical clustering analysis | No | Euclidean distance | 275 | — | This approach clusters Web service documents | The clustering method is not based on novel bio-inspired algorithms. Similarity measure is not semantic |
| Pop et al. | WSDL documents extracted from OWL-TC4 | Particle swarm and ant-based service clustering | Yes | Semantic similarity by evaluating the Degree of Match (DoM) | 894 | SAWSDL-TC collection | The solution approach is very similar to the approach described in this chapter. The main difference is in the algorithms utilized | The semantic similarity does not use a lexical database to improve similarity measures |
| Du et al. | WSDL documents extracted from OWL-TC4 | Bottom-up hierarchical clustering | No | Semantic similarity based on WordNet | 1075 | OWL-TC4 collection | This work is closely related to the approach presented in this chapter | The clustering method is not based on novel bio-inspired algorithms |
| Wu et al. | WSDL documents | K-means | No | The similarity integrates all the feature measures using a weighted sum | 15,968 | Seekda | This approach clusters Web service documents | The clustering method is not based on bio-inspired algorithms. Similarity measure is not semantic |
| Prakash and Singh | Three real and one synthetic data sets | Genetic Algorithm, Differential Evolution, Particle Swarm Optimization, and Artificial Bee Colony | No | Not specified | Not for Web services | No | The clustering approach is based on novel bio-inspired algorithms | The clustering method is not applied to Web services |
| Sahoo | Data downloaded from the UCI repository | Two-step ABC algorithm | No | Euclidean distance | Not for Web services | UCI repository | The clustering approach is based on novel bio-inspired algorithms | The clustering method is not applied to Web services. Similarity measure is not semantic |
| Kotekar and Kamath | WSDL documents extracted from OWL-TC4 | Cat Swarm Optimization (CSO) | No | Euclidean distance and TF-IDF | 1083 | OWL-TC4 collection | The clustering approach is based on novel bio-inspired algorithms | Similarity measure is not semantic |
In 2009, Liang et al.  proposed a method for Web service categorization considering keywords and semantic relations between elements of the description. Their methodology involves preprocessing WSDL documents, rough clustering by labeling Web services with a class tag, and fine clustering.
In 2009, Platzer et al.  described a scalable approach for clustering very large service repositories. They use a statistical clustering algorithm over an enhanced vector space to support the search for services related to a given query.
In 2012, Pop et al.  presented two approaches for service clustering, one inspired by the behavior of birds and the other by the behavior of ants. They implemented methods to evaluate the semantic similarity between services.
In 2013, Du et al.  presented an approach for clustering Web services based on functional similarity and refinement of clusters using a concept position vector.
In 2014, Wu et al.  presented an approach that consists of three modules: data preprocessing, Web service tag recommendation, and Web services clustering. The first module builds a content vector formed of nouns, verbs, or adjectives. The authors use different features and approaches for similarity computation: for content, the normalized Google distance; for data types and messages, a basic match similarity; and for tag similarity, the Jaccard coefficient.
In 2014, Prakash and Singh  compared the performance of evolutionary algorithms: Genetic Algorithm, Differential Evolution, Particle Swarm Optimization, and Artificial Bee Colony for clustering three real and one synthetic data sets.
In 2017, Sahoo  presented a two-step ABC algorithm for data clustering problems. The authors addressed three known problems of the ABC algorithm: the initial positions of food sources, the solution search equation, and abandoned food locations.
In 2018, Kotekar and Kamath  described a Web services clustering approach based on Cat Swarm Optimization (CSO), which emulates the social behavior of cats in nature.
Automated Web services clustering is useful to facilitate service search, discovery, composition, and substitution. Of particular interest is the representation of Web services through ontologies, because the purpose of this work is the automatic organization of any public or private collection of Web services in ontologies and their semantic enrichment by classification and clustering.
3. Clustering process
Clustering of Web services consists of partitioning the set of Web services in the collection into an appropriate number of clusters based on a similarity measure. Therefore, services in the same cluster are more similar to each other than to services in different clusters .
In this section, the implemented clustering approach is described. The process takes as input a collection of Web services described in the Web Service Description Language (WSDL). The collection is processed with specific parsers to extract the most important data from each service description: the method names and the input and output parameters. The detailed process is described in the following subsections (Figure 2).
3.1 Extracting and parsing
Every Web service description includes the definition of the programming interfaces to be invoked remotely. Figure 3 shows the abstract service interface definition; from it, the elements extracted for similarity calculation are the names of the operations and their associated input and output parameters.
4. Semantic similarity measures
Measuring the similarity between two concepts is not a new topic. Over the last decades, many similarity measures have been reported from different perspectives: syntactic, semantic, contextual, etc. In this work, we use a set of semantic similarity measures based on WordNet.1 Computing the similarity between all Web services in the collection is a process executed in pairs. Let W be the tuple that represents all Web services in the collection as follows:

W = (P, I, O)

where P represents all operation names, I is the set of input parameters, and O is the set of output parameters.
In particular, in this work the similarity measures were applied only to the operation names. Therefore, the similarity calculation takes as input the matrix of all operation names in the collection of Web services.
Eight measures that exploit the WordNet database were used to calculate the semantic similarity between Web service operations. WordNet is a lexical database available online; it is organized into five categories: nouns, verbs, adjectives, adverbs, and function words . Using WordNet for semantic similarity measurement is a good approach in contrast to traditional syntactic similarity approaches, particularly in the case of service operations, since their names normally include a verb indicating the main functionality of the operation method.
Additionally, WNetSS API,2 an application programming interface (API) that implements a large collection of semantic similarity measures (140 methods), is available. A deeper analysis and comparison of similarity measures is out of the scope of this work. Table 2 shows a summary of the semantic similarity measures used.
| Semantic similarity measure | Description | Approach | Advantage |
| --- | --- | --- | --- |
| Banerjee and Pedersen | Addresses word sense disambiguation by counting overlaps between dictionary definitions | Corpus-based approach | It is not a syntactic technique, and it is not dependent on global information |
| Wu and Palmer | Path length to the root node from the least common super-concept of two concepts | Taxonomy-based approach | It is a syntactic-semantic technique |
| Resnik | Evaluates the semantic similarity in a taxonomy based on the notion of information content | Hybrid approach that combines corpus-based statistical methods with knowledge-based taxonomic information | It is a semantic technique |
| Jiang and Conrath | Combines a lexical taxonomy structure with corpus statistical information | Hybrid approach that combines a taxonomy-based approach with a corpus-based approach | Based on the tests reported, this combined approach outperforms other computational models |
| Lin | Uses the amount of information needed to state the commonality between two concepts and the information needed to describe the terms | Information content measure (corpus-based) | It is a universally applicable similarity measure, independent of domain or form of knowledge representation |
| Hirst and St-Onge | States that two lexicalized concepts are semantically close if their synonyms are connected by a path that is not too long and does not change direction frequently | Information content measure that uses lexical chains as context | It is a semantic and context-based technique |
| Leacock and Chodorow | Finds the shortest path length between two concepts and scales that value by the maximum path length in the is-a hierarchy in which they occur | Information content measure that adds topical to local context using a statistical classifier | It is a semantic and context-based technique |
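As an illustration of the taxonomic idea behind the Wu and Palmer measure, the following sketch computes 2 · depth(LCS) / (depth(a) + depth(b)) over a toy is-a hierarchy. In the chapter, WordNet plays the role of the hierarchy; the tiny taxonomy here is invented for the example:

```python
# Toy is-a hierarchy: child -> parent (the root has parent None).
# Invented for illustration; WordNet provides this structure in practice.
PARENT = {
    "entity": None,
    "object": "entity",
    "animal": "object",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "object",
    "car": "vehicle",
}

def ancestors(concept):
    """Path from the concept up to the root, inclusive."""
    path = [concept]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def depth(concept):
    return len(ancestors(concept))  # the root has depth 1

def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_a = ancestors(a)
    # least common super-concept: first ancestor of b also above a
    lcs = next(c for c in ancestors(b) if c in ancestors_a)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("dog", "cat"))  # share 'animal' -> 0.75
print(wu_palmer("dog", "car"))  # share only 'object' -> 0.5
```

The closer the least common super-concept is to the two terms, the higher the score, which is why verbs naming similar operations score well in WordNet.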
With these measures, all service operations are compared, and a set of eight matrices is created with the distances between them. Figure 4 shows an example of the calculation of the eight similarities for the operation names.
5. Artificial Bee Colony algorithm
The Artificial Bee Colony (ABC) algorithm is a nature-inspired, swarm intelligence optimization technique that simulates the foraging behavior of honey bees and has been successfully applied to various practical problems. The ABC algorithm was proposed in 2005 by Karaboga [18, 19]; in this model, the collective intelligence of the bee colony consists of:
Employed foragers are bees assigned (employed) to a particular food source, which they exploit. They carry information about the food source (its distance and direction from the nest and its profitability) and are capable of sharing this information.
Unemployed foragers are bees that are continuously searching for food sources. They are subdivided into scouts, which search the surrounding environment for new food sources, and onlookers, which wait in the nest.
The ABC algorithm has different modes of behavior:
Exploration is the task executed by unemployed bees to find new food sources.
Exploitation is the task executed by employed bees on a food source.
Recruitment is the action by which an unemployed bee joins forager bees to exploit a food source.
Abandonment of a nectar source occurs when a better food source is found.
An important behavior of employed and onlooker bees is their capacity to share information (memory) to choose and adjust the food source value. This value depends on the proximity of the source to the nest and on the richness or concentration of its honey energy . The exchange of information occurs during the waggle dance at the hive. Onlooker foragers watch numerous dances in the dancing area and decide to employ themselves at the most profitable food source. When a recruited onlooker starts searching and locates the food source, it uses its own capability to memorize the location and starts exploiting it; the onlooker forager becomes an employed forager. In the ABC algorithm, the set of possible solutions represents the food sources, and the food source value represents the quality of the solution. A general representation of the ABC algorithm workflow is presented in Figure 5.
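The behaviors above can be sketched as a minimal ABC loop for a generic minimization problem. The sphere function and every parameter value below are illustrative, not the chapter's configuration; the employed, onlooker, and scout phases correspond to the exploitation, recruitment, and exploration behaviors just described:

```python
import random

def abc_minimize(f, dim, bounds, n_sources=10, limit=20, iterations=100, seed=42):
    """Minimal Artificial Bee Colony sketch for minimizing f over a box."""
    rng = random.Random(seed)
    lo, hi = bounds

    def new_source():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    def fitness(x):  # standard ABC fitness: higher is better
        fx = f(x)
        return 1.0 / (1.0 + fx) if fx >= 0 else 1.0 + abs(fx)

    sources = [new_source() for _ in range(n_sources)]
    trials = [0] * n_sources

    def neighbour(i):
        # v_ij = x_ij + phi * (x_ij - x_kj), the usual ABC move
        k = rng.choice([s for s in range(n_sources) if s != i])
        j = rng.randrange(dim)
        v = sources[i][:]
        phi = rng.uniform(-1, 1)
        v[j] = min(hi, max(lo, v[j] + phi * (v[j] - sources[k][j])))
        return v

    def try_improve(i):
        v = neighbour(i)
        if fitness(v) > fitness(sources[i]):
            sources[i], trials[i] = v, 0
        else:
            trials[i] += 1

    best = min(sources, key=f)
    for _ in range(iterations):
        for i in range(n_sources):            # employed phase (exploitation)
            try_improve(i)
        total = sum(fitness(s) for s in sources)
        for _ in range(n_sources):            # onlooker phase (recruitment)
            r, acc = rng.uniform(0, total), 0.0
            for i, s in enumerate(sources):
                acc += fitness(s)
                if acc >= r:
                    try_improve(i)
                    break
        for i in range(n_sources):            # scout phase (abandonment)
            if trials[i] > limit:
                sources[i], trials[i] = new_source(), 0
        best = min(best, min(sources, key=f), key=f)
    return best

sphere = lambda x: sum(v * v for v in x)
best = abc_minimize(sphere, dim=2, bounds=(-5, 5))
print(best, sphere(best))
```

The onlooker phase uses roulette-wheel selection over the fitness values, which is how the "waggle dance" recruitment is usually modeled.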
6. Hybrid algorithm description
A hybrid algorithm was proposed to make the ABC self-adjusting: during each iteration, the number of clusters is decided by incorporating K-means and a Consensus method. In particular, K-means is used to select the elements inside each generated cluster and to decide the centroids for similarity calculations. A solution of the algorithm is represented as a vector of size n (the number of Web services to cluster), where the value at each position indicates the group to which that service belongs.
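To illustrate the representation, here is a small sketch (with hypothetical names and data) of the label vector together with a medoid-style centroid choice over a distance matrix, which is one way the K-means step can pick centroids when only pairwise distances between services are available:

```python
def medoid_centroids(labels, dist):
    """For each cluster in a label vector, choose as centroid the member
    with the smallest total distance to the other members (a medoid,
    since services have pairwise distances rather than coordinates)."""
    clusters = {}
    for idx, g in enumerate(labels):
        clusters.setdefault(g, []).append(idx)
    centroids = {}
    for g, members in clusters.items():
        centroids[g] = min(members, key=lambda i: sum(dist[i][j] for j in members))
    return centroids

# 4 services; the label vector says services 0, 1 form group 0 and 2, 3 form group 1
dist = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
]
labels = [0, 0, 1, 1]
print(medoid_centroids(labels, dist))
```

Each entry of the returned map names one service per group that acts as that group's centroid in later similarity calculations.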
6.1 Objective function
The objective function of this hybrid algorithm is shown in Eq. (4):

min f = Σ_i Σ_{x_i ∈ C_i} d(y_i, x_i)   (4)

where d is the distance between the centroid y_i of cluster i and a service x_i, y_i is the centroid of cluster i, and x_i is one of the services included in cluster i. No group of services should be empty, and there should be no intersection between groups.
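As an illustration, the objective can be evaluated for a candidate label vector given a precomputed distance matrix; the function name and data below are hypothetical:

```python
def clustering_cost(labels, centroids, dist):
    """Sum of distances from each service to the centroid of its group;
    the hybrid algorithm seeks to minimize this value."""
    return sum(dist[centroids[g]][i] for i, g in enumerate(labels))

dist = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.7],
    [0.9, 0.7, 0.0],
]
labels = [0, 0, 1]          # services 0 and 1 in group 0, service 2 in group 1
centroids = {0: 0, 1: 2}    # service 0 is the centroid of group 0, service 2 of group 1
print(clustering_cost(labels, centroids, dist))  # 0.0 + 0.2 + 0.0 = 0.2
```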
6.2 Filtering similarities
The first stage of the hybrid algorithm consists of filtering the eight matrices that contain the similarity information between Web services. The filtering discards values that exceed the limits allowed and established by the similarity measures; as a result of this filtering, new matrices are generated with a 95% degree of certainty in the measurements. Eq. (5) shows the filtering criterion:

μ − 1.96σ ≤ N ≤ μ + 1.96σ   (5)

where X is the resulting average matrix, 1.96 is the table value for a 95% confidence level, σ is the standard deviation, N is an element of the similarity matrix, and μ is the average similarity.
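A sketch of this filtering, under the interpretation that entries outside the 95% band μ ± 1.96σ are discarded; the matrix values and function name are invented for illustration:

```python
from statistics import mean, stdev

def filter_similarities(matrix):
    """Replace with None any similarity outside mu +/- 1.96 * sigma,
    keeping only values inside the 95% confidence band."""
    values = [v for row in matrix for v in row]
    mu, sigma = mean(values), stdev(values)
    low, high = mu - 1.96 * sigma, mu + 1.96 * sigma
    return [[v if low <= v <= high else None for v in row] for row in matrix]

sim = [
    [0.91, 0.88, 0.90],
    [0.87, 0.93, 0.89],
    [0.05, 0.92, 0.90],   # 0.05 falls outside the band and becomes None
]
filtered = filter_similarities(sim)
print(filtered)
```

Discarded positions are then skipped when the average matrix is built in the next step.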
6.3 Food source representation
After the filtering process, the resulting data is stored in an average matrix (the food sources), discarding the positions that contain null or zero information; that is, the average matrix is calculated considering only the values that passed the filter and contribute information to the hybrid algorithm.
6.4 Bee’s representation
Using the average matrix, a set of arrays is generated representing the bees and other important information, such as the number of groups and the centroids. Table 3 shows the structure of the generated solution.
Max group. This hybrid algorithm does not require the user to indicate how many groups to generate; as it iterates, the algorithm determines the number of groups based on the results of the previous iteration and the Consensus method. Initially, the i-th bee generates a random number of groups based on a discrete uniform distribution with limits 2 and N/2; in subsequent iterations, the i-th bee produces a random number γ (based on a normal distribution with the weighted variance of the Max group determined by the colony in the previous iteration), and then simple rounding is applied to γ.
Centroids. Next, the i-th bee must determine the centroids of the γ groups involved in the classification. The centroid sub-vector is formed by N integer elements, where x_ijk is zero if the j-th Web service is not considered a centroid for any group, whereas a nonzero value x_ijk = k indicates that the j-th Web service is the centroid of the k-th group. Initially these values are assigned randomly; later, by applying Eq. (6), the values of the sub-vector are obtained:
| Groups generated | Centroids | β | α | Normalization | Assessment | Max group | Limit |
| --- | --- | --- | --- | --- | --- | --- | --- |
where the terms are, respectively, the new vector, the first vector generated by the algorithm, and a random number between 0 and 1.
β represents the sum of the similarities between the centroids of the groups, while α is the sum of the similarities between the members of each group and their corresponding centroid. The objective is to minimize β and maximize α simultaneously.
For each bee, the assessment function f(x) is calculated; the objective is to obtain the highest f(x) value. During each iteration, the f(x) value is stored in the solution vector.
Normalization is used to determine the quality of the food source found. The bee will abandon the food source if there is a better food source in the near surroundings.
Limit is a counter that indicates the number of iterations the employed bee has been exploiting the current food source. The employed bee is obliged to abandon the food source when a given number g of iterations is reached (Figures 6–8).
7. Experimentation and results
Experiments were carried out with three service collections, which involved grouping 50, 647, and 1027 Web services, respectively. For the creation of the similarity matrices, eight different semantic similarity measures were calculated for each collection of Web services. The resulting similarity matrices were filtered to discard values that exceed the allowed limits.
The parameters used by the algorithm were determined by brute force, resulting in a value of 10, with φ set to 0.8; the limit field included in each vector was initially established at 10.
In order to characterize the behavior of the algorithm, 20 executions were made with 100, 200, and 500 iterations of the algorithm. For each execution, the best value of f(x) found by the bees was recorded. Subsequently, a statistical analysis of the results was carried out. Table 4 shows the average values of f(x) for each of the instances with 100, 200, and 500 iterations, respectively.
| Matrix size | 100 iterations | 200 iterations | 500 iterations |
| --- | --- | --- | --- |
Based on the results shown in Table 4, for the 50-service instance the best values are found with 500 iterations, while for the 647-service instance the best values are obtained with 100 iterations. Finally, for the 1027-service instance, the best values are obtained with 200 iterations.
Figures 9 and 10 show a comparison of the results obtained with 100, 200, and 500 iterations for the instance of 1027 services. The best values of f(x) are produced with 500 iterations; however, the resulting number of groups is more stable and better with 200 iterations.
8. Conclusions
This chapter described a hybrid algorithm for Web services clustering; the approach is based on the ABC optimization algorithm combined with K-means and Consensus. The clustering process starts with the extraction and processing of WSDL documents, followed by the calculation of semantic similarities between pairs of Web service operations. The semantic similarity measures are based on WordNet, combining corpus-based and taxonomic approaches. As a result of the calculations, eight matrices are generated, which form the input data set of the hybrid bio-inspired algorithm.
Biologically inspired algorithms offer advantages over conventional clustering methods, such as the ability to find a near-optimal solution by updating candidate solutions iteratively, and their self-organizing behavior.
The clustering algorithm designed is based on the behavior of bees but was improved by incorporating K-means and Consensus so that the algorithm adjusts itself in each iteration. A series of experiments was conducted with the hybrid ABC algorithm, using the best values found for each adjustable parameter. The experiments were carried out with three collections of 50, 647, and 1027 Web services, and the algorithm was executed with 100, 200, and 500 iterations. The hybrid ABC algorithm has shown good results for Web services clustering.
As future work, we plan to explore more combinations of semantic similarities, as well as to incorporate more information from the service descriptions and the data types included in the XML service definitions.
We also plan to incorporate other bio-inspired (swarm intelligence) algorithms for the classification and clustering of Web services, such as Particle Swarm Optimization (PSO), the Bat Algorithm (BA), and the Bird Swarm Algorithm (BSA).