Open access peer-reviewed chapter - ONLINE FIRST

Bio-Inspired Hybrid Algorithm for Web Services Clustering

By Maricela Bravo, Román A. Mora-Gutiérrez and Luis F. Hoyos-Reyes

Submitted: November 10th, 2018 | Reviewed: February 14th, 2019 | Published: March 19th, 2019

DOI: 10.5772/intechopen.85200


Abstract

Web services clustering is the task of extracting and selecting the features of a collection of Web services and forming groups of closely related services. The implementation of novel and efficient algorithms for Web services clustering is relevant for the organization of service repositories on the Web. Having well-organized collections of Web services improves the efficiency of Web service discovery, search, selection, substitution, and invocation. In recent years, methods inspired by nature using biological analogies have been adapted for clustering problems, among them genetic algorithms, evolutionary strategies, and algorithms that imitate the behavior of some animal species. Computation inspired by nature aims at imitating the steps that nature has developed and adapting them to find a solution to a given problem. In this chapter, we investigate how biologically inspired clustering methods can be applied to clustering Web services and present a hybrid approach for Web services clustering using the Artificial Bee Colony (ABC) algorithm, K-means, and Consensus. This hybrid algorithm was implemented, and a series of experiments were conducted using three collections of Web services. The experimental results show that the solution approach is adequate and efficient for clustering very large collections of Web services.

Keywords

  • artificial bee colony
  • K-means
  • Consensus
  • hybrid algorithms
  • Web services clustering
  • semantic similarity measures

1. Introduction

Web services clustering is the task of selecting and extracting the features of a collection of Web services and discovering the similarities between them to form groups or classes based on those features. The implementation of novel and efficient algorithms for the automatic clustering of Web services is relevant for the organization of large collections of services in private or public networks such as the Internet. Having a directory of Web services organized into groups according to one or more characteristics is an advantage during the search, selection, invocation, substitution, and composition of Web services.

Methods inspired by nature using biological analogies have been adapted for clustering problems, among them genetic algorithms, evolutionary strategies, and algorithms that imitate the behavior of some animal species. Living beings such as animals and plants, and even the climate, exhibit extraordinary, complex, and fascinating natural phenomena. Of particular interest is the intelligent behavior that some animal species display to solve problems and maintain the balance of their surrounding environment.

This is the main idea of computation inspired by nature: to imitate the steps that nature has developed and adapt them to find a solution to a problem, thus producing a bio-inspired algorithm. This is also the motivation for implementing an algorithm that exploits the collective intelligence of a bee colony as an alternative for clustering. This chapter presents an innovative approach for Web services clustering using a hybrid algorithm based on the Artificial Bee Colony, K-means, and Consensus.

The work described in this chapter is part of a research project whose objective is to design and implement a semantic directory of Web services using efficient, fully automated methods to allow the organization, composition, and classification of Web services with a semantic approach.

Figure 1 shows the general architecture of the semantic directory and a methodology to construct and manage the directory. The methodology consists of the following phases:

  1. Public Web service retrieval aims at searching over the Internet to find and copy Web service descriptions (files formatted in WSDL description language) in a local file directory. Web service retrieval is executed through crawlers designed specifically to parse and identify links to files in WSDL.

  2. Extraction and analysis of Web services consists of parsing every Web service description file and the extraction of specific data, such as: method names, input and output parameters, and port types to facilitate the automatic invocation. Extracted data is stored in an ontology model. For this phase, we use a tool which transforms WSDL files into an ontological representation.

  3. Web service similarity calculation is an important phase of the methodology, because classification and clustering of Web services requires the calculation of distances between services. In order to calculate similarities, different measures can be implemented and combined to obtain better results.

  4. Web service classification and clustering consists of selecting and extracting the characteristics of a collection of Web services and discovering the similarities between them to form groups or classes considering those characteristics. Therefore, this phase depends on the results of the previous phases. In particular, in this chapter a hybrid clustering algorithm based on the Artificial Bee Colony algorithm, the K-means, and Consensus is described.

  5. Inference and reasoning represents the supporting mechanisms to exploit and utilize the enriched Web service ontologies.

Figure 1.

Semantic directory of Web services.

2. Related work

Web service classification and clustering is a topic addressed from different perspectives, including statistical, stochastic, and novel approaches based on bio-inspired algorithms. Additionally, semantic approaches to describe, discover, and invoke Web services have been studied to propose novel clustering algorithms. In this section a revision of related work is presented considering two trends: reported works that address clustering and classification of Web services, and clustering approaches based on bio-inspired algorithms. Table 1 presents the main characteristics of related work.

Liang et al. [1]. Input: WSDL documents. Clustering: incremental K-means and bisecting K-means. Ontologies: no. Similarity: tree-based structure matching. Services: 352, from Xmethods.com, Bindingpoint.com, Webservicelist.com, and Xignite.com. Benefit: clusters Web service documents. Limitations: the similarity approach is not semantic, and the clustering method is not bio-inspired.

Platzer et al. [2]. Input: WSDL documents. Clustering: statistical clustering analysis. Ontologies: no. Similarity: Euclidean distance. Services: 275, from Xmethods.com. Benefit: clusters Web service documents. Limitations: the clustering method is not based on novel bio-inspired algorithms, and the similarity measure is not semantic.

Pop et al. [3]. Input: WSDL documents extracted from OWL-TC4. Clustering: particle swarm and ant-based service clustering. Ontologies: yes. Similarity: semantic similarity by evaluating the Degree of Match (DoM). Services: 894, from the SAWSDL-TC collection. Benefit: the solution approach is very similar to the one described in this chapter, differing mainly in the algorithms utilized. Limitation: the semantic similarity does not use a lexical database to improve similarity measures.

Du et al. [4]. Input: WSDL documents extracted from OWL-TC4. Clustering: bottom-up hierarchical clustering. Ontologies: no. Similarity: semantic similarity based on WordNet. Services: 1075, from the OWL-TC4 collection. Benefit: closely related to the approach presented in this chapter. Limitation: the clustering method is not based on novel bio-inspired algorithms.

Wu et al. [5]. Input: WSDL documents. Clustering: K-means. Ontologies: no. Similarity: a weighted sum integrating all the feature measures. Services: 15,968, from Seekda. Benefit: clusters Web service documents. Limitations: the clustering method is not bio-inspired, and the similarity measure is not semantic.

Prakash and Singh [6]. Input: three real and one synthetic data sets. Clustering: Genetic Algorithm, Differential Evolution, Particle Swarm Optimization, and Artificial Bee Colony. Ontologies: no. Similarity: not specified. Services: none (not applied to Web services). Benefit: the clustering approach is based on novel bio-inspired algorithms. Limitation: the clustering method is not applied to Web services.

Sahoo [7]. Input: data sets downloaded from the UCI repository. Clustering: two-step ABC algorithm. Ontologies: no. Similarity: Euclidean distance. Services: none (not applied to Web services). Benefit: the clustering approach is based on novel bio-inspired algorithms. Limitations: the clustering method is not applied to Web services, and the similarity measure is not semantic.

Kotekar and Kamath [8]. Input: WSDL documents extracted from OWL-TC4. Clustering: Cat Swarm Optimization (CSO). Ontologies: no. Similarity: Euclidean distance and TF-IDF. Services: 1083, from the OWL-TC4 collection. Benefit: the clustering approach is based on novel bio-inspired algorithms. Limitation: the similarity measure is not semantic.
Table 1.

Comparison of related work.

In 2009, Liang et al. [1] proposed a method for Web service categorization considering keywords and semantic relations between elements of the description. Their methodology involves preprocessing WSDL documents, rough clustering by labeling Web services with class tags, and fine clustering.

In 2009, Platzer et al. [2] described a scalable approach for clustering very large service repositories. They use a statistical clustering algorithm that enhances a vector space model to support the search for services related to a given query.

In 2012, Pop et al. [3] presented two approaches for service clustering, one inspired by the behavior of birds and the other by the behavior of ants. They implemented methods to evaluate the semantic similarity between services.

In 2013, Du et al. [4] presented an approach for clustering Web services based on functional similarity and refinement of clusters using a concept position vector.

In 2014, Wu et al. [5] presented an approach consisting of three modules: data preprocessing, Web service tag recommendation, and Web services clustering. The first module builds a content vector formed with nouns, verbs, or adjectives. The authors use different features and approaches for similarity computation: for content, the normalized Google distance; for data types and messages, a basic match similarity; and for tag similarity, the Jaccard coefficient.

In 2014, Prakash and Singh [6] compared the performance of evolutionary algorithms: Genetic Algorithm, Differential Evolution, Particle Swarm Optimization, and Artificial Bee Colony for clustering three real and one synthetic data sets.

In 2017, Sahoo [7] presented a two-step ABC algorithm for data clustering problems. The authors addressed three problems of the ABC algorithm: the initial positions of food sources, the solution search equation, and abandoned food locations.

In 2018, Kotekar and Kamath [8] described a Web services clustering approach based on Cat Swarm Optimization (CSO), which emulates the social behavior of cats in nature.

Automated Web services clustering is useful to facilitate service search, service discovery, service composition, and service substitution. Of particular interest is the representation of Web services through ontologies, because the purpose of this work is the automatic organization of any collection (public or private) of Web services in ontologies and their semantic enrichment by classification and clustering.

3. Clustering process

Clustering of Web services consists of partitioning the set of Web services in the collection into an appropriate number of clusters based on a similarity measure, so that services in the same cluster are more similar to each other than to services in different clusters [6].

In this section the clustering approach implemented is described. The process takes as input a collection of Web services formatted according to the Web Service Description Language (WSDL). This collection is processed with specific parsers to extract the most important data from each service description, namely the method names and the input and output parameters. The detailed process is described in the following subsections ( Figure 2 ).

Figure 2.

Clustering process of Web services.

3.1 Extracting and parsing

Every Web service description includes the definition of the programming interfaces to be invoked remotely. Figure 3 shows the abstract service interface definition; from this, the elements extracted for similarity calculation are the name of the operations and their associated input and output parameters.

Figure 3.

Web service interface definition.
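As a sketch of this extraction step, the following parses a toy WSDL fragment (a hypothetical weather service, not one of the chapter's collections) and pulls out operation names with their input and output messages using only the standard library:

```python
import xml.etree.ElementTree as ET

# Toy WSDL fragment (hypothetical service) used only to illustrate extraction.
WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:tns="http://example.org/weather">
  <message name="GetTemperatureRequest"/>
  <message name="GetTemperatureResponse"/>
  <portType name="WeatherPort">
    <operation name="GetTemperature">
      <input message="tns:GetTemperatureRequest"/>
      <output message="tns:GetTemperatureResponse"/>
    </operation>
  </portType>
</definitions>"""

NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

def extract_operations(wsdl_text):
    """Return (operation name, input message, output message) triples."""
    root = ET.fromstring(wsdl_text)
    ops = []
    for op in root.iterfind(".//wsdl:portType/wsdl:operation", NS):
        inp = op.find("wsdl:input", NS)
        out = op.find("wsdl:output", NS)
        ops.append((op.get("name"),
                    inp.get("message") if inp is not None else None,
                    out.get("message") if out is not None else None))
    return ops

print(extract_operations(WSDL))
```

A real parser would also walk the message and type definitions to recover parameter names; this sketch only shows the operation-level extraction.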

4. Semantic similarity measures

Measuring the similarity between two concepts is not a new topic. Over the last decades, many similarity measures have been reported from different perspectives: syntactic, semantic, contextual, etc. In this work, we use a set of semantic similarity measurements based on WordNet.1 Computing similarity between all Web services in the collection is a process executed in pairs. Let W be the tuple that represents the Web services in the collection, as follows:

W = (P, I, O)    (1)

where P represents the set of all operation names, I is the set of input parameters, and O is the set of output parameters.

In particular, in this work the similarity measures were applied only on the operation names. Therefore, the similarity calculation takes as input a matrix of all operation names in the collection of Web services, that is, as follows:

Let P = {p1, p2, p3, …, pn}    (2)
Input Matrix = {(pi, qi) ∈ P × P, 1 ≤ i ≤ n}    (3)
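The input matrix of Eqs. (2)–(3) can be sketched as the set of all operation-name pairs; the operation names below are hypothetical, and since the similarity measures used here are symmetric, only the upper triangle of pairs is generated:

```python
from itertools import combinations

# Hypothetical operation names standing in for a small service collection.
P = ["getWeather", "getForecast", "sendEmail", "getTemperature"]

# Every unordered pair of operation names; each pair is later fed to the
# similarity measures to fill one cell of a similarity matrix.
input_matrix = list(combinations(P, 2))

print(len(input_matrix))  # n*(n-1)/2 pairs for n operations
```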

Eight measures that exploit the WordNet database were used to calculate the semantic similarity between Web service operations. WordNet is a lexical database available online; it is organized into five categories: nouns, verbs, adjectives, adverbs, and function words [9]. Using WordNet for semantic similarity measurement is a better fit than traditional syntactic similarity approaches, specifically in the case of service operations, as their names normally include a verb indicating the main functionality of the operation method.

Additionally, an application programming interface (API) implementing a large collection of semantic similarity measures (140 methods), WNetSS API,2 is available. A deeper analysis and comparison of similarity measures is out of the scope of this work. Table 2 shows a summary of the semantic similarity measures used.

Lesk [10]; Banerjee and Pedersen [11]. Description: addresses word sense disambiguation by counting overlaps between dictionary definitions. Approach: corpus-based. Advantage: it is not a syntactic technique and does not depend on global information.

Wu and Palmer [12]. Description: path length to the root node from the least common super-concept of the two concepts. Approach: taxonomic. Advantage: it is a syntactic-semantic technique.

Resnik [13]. Description: evaluates the semantic similarity in a taxonomy based on the notion of information content. Approach: hybrid, combining corpus-based statistical methods with knowledge-based taxonomic information. Advantage: it is a semantic technique.

Jiang and Conrath [14]. Description: combines a lexical taxonomy structure with corpus statistical information. Approach: hybrid, combining a taxonomic approach with a corpus-based approach. Advantage: based on the tests reported, this combined approach outperforms other computational models.

Lin [15]. Description: uses the amount of information needed to state the commonality between the two concepts and the information needed to describe these terms. Approach: information content measure (corpus-based). Advantage: it is a universally applicable similarity measure, independent of domain or form of knowledge representation.

Hirst and St-Onge [16]. Description: two lexicalized concepts are semantically close if their synonyms are connected by a path that is not too long and does not change direction too frequently. Approach: information content measure that uses lexical chains as a context. Advantage: it is a semantic and context-based technique.

Leacock and Chodorow [17]. Description: finds the shortest path length between two concepts and scales that value by the maximum path length in the is-A hierarchy in which they occur. Approach: information content measure that adds topical to local context using a statistical classifier. Advantage: it is a semantic and context-based technique.

Table 2.

Summary of similarity measurements.

With these measures, all service operations are compared, and a set of eight matrices is created with the distances between them. Figure 4 shows an example of the calculation of the eight similarities for a pair of operation names.

Figure 4.

Example of the calculation of semantic similarities.

5. Artificial Bee Colony algorithm

The Artificial Bee Colony (ABC) algorithm is a nature-inspired, swarm intelligence optimization technique that simulates the foraging behavior of honey bees and has been successfully applied to a variety of practical problems with good results. The ABC algorithm was proposed in 2005 by Karaboga [18, 19]; accordingly, the collective intelligence model of the bee colony consists of:

  1. Employed foragers, which are bees assigned (employed) to a particular food source, which they exploit. They carry information about the food source, its distance and direction from the nest, and the profitability of the source, and they are capable of sharing this information.

  2. Unemployed foragers, which are bees continuously searching for food sources. These unemployed bees are subdivided into scouts, bees that search the surrounding environment for new food sources, and onlookers, bees that wait in the nest.

The ABC algorithm has different modes of behavior:

  1. Exploration is the task executed by unemployed bees to find new food sources.

  2. Exploitation is the task executed by employed bees on a food source.

  3. Recruitment is the process by which unemployed bees are recruited by forager bees to exploit a food source.

  4. Abandonment of a nectar source occurs when a better food source is found.

An important behavior of employed and onlooker bees is their capacity for sharing information (memory) to choose and adjust the value of a food source. This value depends on the proximity to the nest and on the richness or concentration of the energy (honey) [18]. The exchange of information occurs during the waggle dance at the hive. Onlooker foragers watch numerous dances at the dancing area and decide to employ themselves at the most profitable food source. When an onlooker recruit starts searching and locates the food source, it memorizes the location and starts exploiting it; the onlooker forager thus becomes an employed forager. In the ABC algorithm the set of possible solutions represents the food sources, and the food source value represents the quality of the solution. A general representation of the ABC workflow is presented in Figure 5 .

Figure 5.

ABC algorithm general workflow.
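The employed, onlooker, and scout phases described above can be sketched as follows. This is a minimal, generic ABC minimizing a toy objective (the sphere function), not the chapter's hybrid algorithm; all parameter names and values are illustrative:

```python
import random

random.seed(1)

def objective(x):
    # Toy objective: sphere function, minimum 0 at the origin.
    return sum(v * v for v in x)

def abc_minimize(dim=2, n_sources=10, limit=20, iters=200, lo=-5.0, hi=5.0):
    sources = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_sources)]
    trials = [0] * n_sources
    best = min(sources, key=objective)[:]

    def neighbor(i):
        # Move source i toward/away from a different random source (Eq. v = x + phi*(x - x_k)).
        k = random.randrange(n_sources - 1)
        k = k if k < i else k + 1
        j = random.randrange(dim)
        phi = random.uniform(-1, 1)
        cand = sources[i][:]
        cand[j] += phi * (sources[i][j] - sources[k][j])
        return cand

    for _ in range(iters):
        # Employed phase: each bee explores near its own food source.
        for i in range(n_sources):
            cand = neighbor(i)
            if objective(cand) < objective(sources[i]):
                sources[i], trials[i] = cand, 0
            else:
                trials[i] += 1
        # Onlooker phase: better sources attract more bees (roulette selection).
        fits = [1.0 / (1.0 + objective(s)) for s in sources]
        total = sum(fits)
        for _ in range(n_sources):
            r, acc, i = random.uniform(0, total), 0.0, 0
            for i, f in enumerate(fits):
                acc += f
                if acc >= r:
                    break
            cand = neighbor(i)
            if objective(cand) < objective(sources[i]):
                sources[i], trials[i] = cand, 0
            else:
                trials[i] += 1
        # Scout phase: abandon sources exploited too long without improvement.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i] = [random.uniform(lo, hi) for _ in range(dim)]
                trials[i] = 0
        cur = min(sources, key=objective)
        if objective(cur) < objective(best):
            best = cur[:]
    return best

best = abc_minimize()
print(objective(best))
```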

6. Hybrid algorithm description

A hybrid algorithm was proposed to make the ABC self-adjusting, deciding the number of clusters during each iteration by incorporating K-means and a Consensus method. In particular, K-means is used to select the elements inside each generated cluster in order to decide the centroids for similarity calculations. A solution of the algorithm is represented as a vector of size n (the number of Web services to cluster), where each position of the vector holds the group to which the corresponding Web service belongs.

6.1 Objective function

The objective function of this hybrid algorithm is shown in Eq. (4):

Min Σ_{ci ∈ C} Σ_{xi ∈ ci} d(xi, yi)    (4)

where d is the distance between the centroid yi of cluster i and a service xi; yi is the centroid of cluster i; and xi is one of the services included in cluster i. No group of services should be empty, and there should be no intersection between groups.
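Eq. (4) can be sketched directly as code. The distance matrix, assignment, and centroid choices below are hypothetical examples for three services, not data from the chapter:

```python
# Eq. (4) as code: total distance from each service to its cluster centroid.
# `dist` is a hypothetical precomputed distance matrix (e.g. 1 - similarity).
def clustering_cost(assignment, centroids, dist):
    """assignment[j] = cluster of service j; centroids[k] = centroid service of cluster k."""
    return sum(dist[j][centroids[assignment[j]]] for j in range(len(assignment)))

dist = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]
# Services 0 and 1 form cluster 0 (centroid 0); service 2 forms cluster 1 (centroid 2).
print(clustering_cost([0, 0, 1], {0: 0, 1: 2}, dist))
```

The hybrid algorithm searches over assignments and centroids to minimize this quantity.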

6.2 Filtering similarities

The first stage of the hybrid algorithm filters the eight matrices containing the similarity information between Web services. The filtering discards values that exceed the limits established by the similarity measures; as a result, new matrices are generated whose measurements fall within a 95% confidence band. Eq. (5) shows the filtering criterion:

X̄ − 1.96·σ/√N ≤ μ ≤ X̄ + 1.96·σ/√N    (5)

where X̄ is the average of the similarity matrix; 1.96 is the table value for a 95% confidence level; σ is the standard deviation; N is the number of elements of the similarity matrix; and μ is the average similarity.
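One plausible reading of Eq. (5) in code, assuming the band is the 95% confidence interval around the mean and that out-of-band entries are marked as carrying no information; the sample values are invented for illustration:

```python
import statistics

def filter_matrix(values):
    """Keep only similarity values inside the 95% confidence band of Eq. (5);
    out-of-band entries become None (no information)."""
    xbar = statistics.mean(values)
    s = statistics.stdev(values)
    half = 1.96 * s / len(values) ** 0.5
    lo, hi = xbar - half, xbar + half
    return [v if lo <= v <= hi else None for v in values]

vals = [0.50, 0.52, 0.48, 0.51, 0.49, 0.95]   # 0.95 is a hypothetical outlier
print(filter_matrix(vals))
```

In the hybrid algorithm this filter is applied to the entries of each of the eight similarity matrices before the averaging step described next.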

6.3 Food source representation

After the filtering process, all the data obtained are stored in an average matrix (the food sources), discarding the positions that contain null or zero information; that is, the average matrix is calculated considering only those values that passed the filter and were accepted as feasible values contributing information to the hybrid algorithm.

6.4 Bee’s representation

Using the average matrix, a set of arrays is generated representing the bees along with other important information such as the number of groups and the centroids. Table 3 shows the structure of the solution generated.

  1. Max group. This hybrid algorithm does not require the user to indicate how many groups it should generate; as it iterates, the algorithm determines how many groups to generate based on the results of the previous iteration and the Consensus method. Initially, the i-th bee generates a random number of groups drawn from a discrete uniform distribution with limits 2 and N/2; in subsequent iterations, the i-th bee produces a random number γ (drawn from a normal distribution with the weighted variance of the Max group determined by the colony in the previous iteration), and γ is then rounded.

  2. Centroids. Next, the i-th bee must determine the centroids of the γ groups involved in the classification. The centroid sub-vector is formed by N integer elements, where position j is zero if the j-th Web service is not a centroid of any group, while a value k implies that the j-th Web service is the centroid of the k-th group. Initially these values are assigned randomly; in later iterations the values of the sub-vector are obtained by applying Eq. (6):

Groups generated | Centroids | β | α | Normalization | Assessment | Max group | Limit
3 | 2, 23, 13 | 2.001 | 1.068 | 1.55 | 0.301 | 0.363 | 5

Table 3.

Composition of the vector with information of generated groups.

Xnew = round(Xi + ϕ·(Xi − Xs))    (6)

where Xnew is the new vector; Xi is the current vector of the i-th bee; Xs is another randomly selected vector; and ϕ is a random number between 0 and 1.

  3. β represents the sum of the similarities between the centroids of the groups, while α is the sum of the similarities between the members of each group and the corresponding centroid. The objective is to minimize β and maximize α simultaneously.

  4. For each bee the assessment function is calculated as f(x) = 1/β + α; the objective is to obtain the highest f(x) value. During each iteration, the f(x) value is stored in the solution vector.

  5. Normalization is used to determine the quality of the food source found. It is calculated as ni = f(xi) / Σ_{j=1}^{N} f(xj). The bee abandons its food source if there is a better food source in the near surroundings.

  6. Limit is a counter that indicates the number of iterations the employed bee has been exploiting the current food source. The employed bee is forced to abandon the food source after g iterations ( Figures 6 – 8 ).

Figure 6.

ABC algorithm pseudocode.

Figure 7.

Pseudocode of the “Generation of N initial food sources.”

Figure 8.

Pseudocode of “Search for a food source in the neighborhood of i.”
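The per-bee quantities above, the neighborhood move of Eq. (6), the assessment f(x) = 1/β + α, and the normalization, can be sketched as follows; all names and sample numbers are illustrative, not taken from the chapter's experiments:

```python
def neighbor_solution(x_i, x_s, phi):
    """Eq. (6): move solution x_i relative to another solution x_s,
    rounding back to integer group/centroid indices."""
    return [round(a + phi * (a - b)) for a, b in zip(x_i, x_s)]

def assessment(beta, alpha):
    """f(x) = 1/beta + alpha: a small beta (dissimilar centroids) and a large
    alpha (cohesive groups) both raise the score."""
    return 1.0 / beta + alpha

def normalize(scores):
    """n_i = f(x_i) / sum_j f(x_j): relative quality of each food source."""
    total = sum(scores)
    return [s / total for s in scores]

print(neighbor_solution([1, 2, 3], [3, 1, 2], 0.4))
scores = [assessment(2.0, 1.0), assessment(1.0, 2.0)]
print(normalize(scores))
```

The normalized scores play the role of the roulette probabilities with which onlooker bees choose food sources in the pseudocode of Figures 6–8.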

7. Experimentation and results

Experiments were carried out with three service collections, which involved grouping 50, 647, and 1027 Web services, respectively. For the creation of the similarity matrices, eight different semantic similarity measures were calculated for each collection of Web services. The resulting similarity matrices were filtered to discard values that exceed the allowed limits.

The parameters used by the algorithm were determined by brute force, resulting in a population size of 10, the value of ϕ set to 0.8, and the limit value of each vector initially set to 10.

In order to characterize the behavior of the algorithm, 20 executions were made with 100, 200, and 500 iterations of the algorithm. For each of the executions, the best value of f(x) found by the bees was determined. Subsequently, a statistical analysis of the results found was carried out. Table 4 shows the average values of f(x) for each of the instances with 100, 200, and 500 iterations, respectively.

Matrix size | f(x) at 100 iterations | f(x) at 200 iterations | f(x) at 500 iterations
50 | 0.4631 | 0.5011 | 0.5169
647 | 0.4457 | 0.4152 | 0.4228
1027 | 0.4414 | 0.4782 | 0.4542

Table 4.

Average results of collections executing 100, 200, and 500 iterations.

Based on the results shown in Table 4 , it can be affirmed that for the 50-service instance, the best values are found with 500 iterations, while for the instance of 647 services, the best values are obtained with 100 iterations. Finally, for the instance of 1027 services, the best values are obtained with 200 iterations.

Table 5 shows, for each configuration, the best, worst, average, variance, and standard deviation of f(x) over the 20 executions of the algorithm; Table 6 shows the same statistics for the number of clusters.

Collection-iterations | Best | Worst | Mean | Variance | Deviation
50-100 | 0.58936208 | 0.29274988 | 0.45901481 | 0.00593354 | 0.0770295
50-200 | 0.54853117 | 0.44904535 | 0.50052478 | 0.00088301 | 0.02971549
50-500 | 0.57163535 | 0.46554965 | 0.51820976 | 0.00090081 | 0.03001343
647-100 | 0.54947584 | 0.2153186 | 0.44401451 | 0.01027133 | 0.10134758
647-200 | 0.54778745 | 0.20257995 | 0.40923046 | 0.01554565 | 0.12468218
647-500 | 0.54881541 | 0.22262872 | 0.41911288 | 0.01280278 | 0.11314939
1027-100 | 0.55531806 | 0.27378002 | 0.44488931 | 0.00847871 | 0.09207991
1027-200 | 0.56846422 | 0.28919387 | 0.47529505 | 0.00772078 | 0.08786795
1027-500 | 0.66112025 | 0.20302145 | 0.4674717 | 0.00977301 | 0.09885856

Table 5.

Summary with the best results of similarity.

Collection-iterations | Best | Worst | Mean | Variance | Deviation
50-100 | 2 | 10 | 4.31578947 | 6.22105263 | 2.49420381
50-200 | 2 | 4 | 3 | 0.89210526 | 0.94451324
50-500 | 2 | 6 | 3.94736842 | 2.36578947 | 1.53811231
647-100 | 3 | 49 | 18.0526316 | 221.207895 | 14.8730594
647-200 | 2 | 67 | 18.9473684 | 305.831579 | 17.488041
647-500 | 2 | 73 | 23.2105263 | 426.368421 | 20.6486905
1027-100 | 2 | 110 | 31.2631579 | 1066.82895 | 32.6623475
1027-200 | 2 | 69 | 22.1052632 | 355.884211 | 18.8648936
1027-500 | 2 | 123 | 31 | 1179.2105 | 34.3396349

Table 6.

Summary with the best results of clusters.

Figures 9 and 10 show a comparison of the results obtained with 100, 200, and 500 iterations for the instance of 1027 services. It can be seen that the best values of f(x) are produced with 500 iterations; however, the number of groups is more stable and better with 200 iterations.

Figure 9.

Comparison of similarity between the best solutions found with 1027 services.

Figure 10.

Comparison of the number of groups in the best case for 1027 services.

8. Conclusions

This chapter describes a hybrid algorithm for Web services clustering; the approach is based on the ABC optimization algorithm combined with K-means and Consensus. The clustering process starts from the extraction and processing of WSDL documents, followed by the calculation of semantic similarities between pairs of Web service operations. The semantic similarity measurements are based on WordNet, combining corpus-based and taxonomic approaches. As a result of the calculations, eight matrices are generated, which constitute the input data set to the hybrid bio-inspired algorithm.

Biologically inspired algorithms offer advantages over conventional clustering methods, such as the ability to find near-optimal solutions by updating candidate solutions iteratively, and their self-organizing behavior.

The clustering algorithm designed is based on the behavior of the bees but was improved by incorporating K-means and Consensus so that the algorithm adjusts itself in each iteration. A series of experiments were conducted with the hybrid ABC algorithm, using optimal values for each adjustable parameter. The experiments were carried out with three collections with 50, 647, and 1027 Web services, and the algorithm was executed with variants of 100, 200, and 500 iterations. The hybrid ABC algorithm has shown good results for Web services clustering.

As future work, we plan to explore more combinations of semantic similarities, as well as to incorporate more information from the service descriptions and from the data types included in the XML service definitions.

We also plan to incorporate other bio-inspired (swarm intelligence) algorithms for the classification and clustering of Web services, such as Particle Swarm Optimization (PSO), the Bat Algorithm (BA), and the Bird Swarm Algorithm (BSA).

Notes

  1. https://wordnet.princeton.edu/
  2. http://wnetss-api.smr-team.org/


© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

