Conventional lexical-clustering algorithms treat text fragments as a mixed collection of words, with a semantic similarity between them calculated based on the term of how many the particular word occurs within the compared fragments. Whereas this technique is appropriate for clustering large-sized textual collections, it operates poorly when clustering small-sized texts such as sentences. This is due to compared sentences that may be linguistically similar despite having no words in common. This chapter presents a new version of the original k-means method for sentence-level text clustering that is relay on the idea of use of the related synonyms in order to construct the rich semantic vectors. These vectors represent a sentence using linguistic information resulting from a lexical database founded to determine the actual sense to a word, based on the context in which it occurs. Therefore, while traditional k-means method application is relay on calculating the distance between patterns, the new proposed version operates by calculating the semantic similarity between sentences. This allows it to capture a higher degree of semantic or linguistic information existing within the clustered sentences. Experimental results illustrate that the proposed version of clustering algorithm performs favorably against other well-known clustering algorithms on several standard datasets.
Part of the book: Recent Applications in Data Clustering