Abstract
With the rise of social media such as blogs and social networks, interpersonal communication expressed through online reviews has become an increasingly influential source of information for both managers and consumers. In-depth, purchase-related information is now available to marketers, and we can use this new source of information to understand how consumers evaluate products and make decisions about them. Since reviews are text data, new ways of analyzing the data are needed; text mining fills this role, working alongside traditional statistical methods. With these methods, we can examine the contents of reviews and identify the key areas that affect consumers' decision-making.
Keywords
- reviews
- text-mining
- decision-making
- consumers
- analytics
1. Introduction
We have entered a technological era in which people and technology are deeply integrated. It took only 4 years for the Internet to reach 50 million people, compared with 75 years for the telephone. Generation Z in particular has largely grown up using the Internet and social media, and overall more than half of the world's population is now online. The rise of the Internet, and of social media especially, has shifted the way people communicate and interact with each other. With the rise of social media such as blogs and social networks, interpersonal communication expressed through online reviews has become an increasingly influential source of information for both managers and consumers. With the rapid growth of consumer comments on the Internet, in-depth, purchase-related information is available to marketers. The wide availability of numerous, lengthy text-based online reviews provides a treasure trove of information that can potentially reveal a much wider set of variables determining consumers' attitudes toward, and evaluations of, products. There has been extensive research on how to utilize this information. In [1], the authors investigated consumers' usage of online recommendation sources and their influence on online product choices. Later, in [2], Kumar and Benbasat used empirical evidence to demonstrate the influence of recommendations and online reviews on consumers' perceptions of the usefulness and social presence of websites. For firms, online consumer reviews can provide valuable insights and help them improve their products accordingly.
There have also been many research articles (for example, see [3, 4, 5, 6]) that try to identify the variables affecting individuals' decisions to recommend a product or not. By their very nature, these studies can identify only a limited number of such determinant variables. In particular, customer satisfaction has been linked to recommendation to others: in [7], Ladhari et al. identified three drivers (perceived service quality, emotional satisfaction, and image) that are positively related to each other and positively influence loyalty and recommendation. However, almost all previous studies have used numeric variables, so only a limited number of determinants have been examined.
Building on this previous work, we use a text-mining method, rather than the traditional survey method, to identify the important product dimensions that are highly related to quality and thus to consumers' attitudes toward the products.
2. Methodology
In this section, we describe the methods we use to analyze text content. Text mining has become a standard procedure for handling text data, and the detailed process is laid out here step by step for instructional purposes.
Text classification is a supervised learning process that predicts the class of a document based on a set of features describing the document [8]. Unlike in unsupervised learning, the categories are predefined. The prediction model is learned automatically from a training set and can then be used to predict new cases. Text classification applies various machine-learning algorithms to assign text documents to one of the predefined categories. Suppose we have a set of documents, which could be reviews posted on websites by consumers, emails by various users, etc. Each document $j$ is represented by a vector of attribute weights $d_j = (w_{1j}, w_{2j}, \ldots, w_{|T|j})$, where $w_{ij}$ is the weight of term $i$ in document $j$ and $|T|$ is the number of distinct terms.
2.1 Preprocessing
Before applying the learning methods, several preprocessing steps are necessary to get the data into a format ready for analysis. The preprocessing of raw data includes: raw-text tokenization, case conversion, stop-word removal, and stemming.
First, the raw texts are divided into tokens (single words, special symbols, etc.) using whitespace (space, tab, newline character, etc.) as the separator, breaking each review document into tokens. For example, a document such as "The room was beautiful and the staff were friendly!" is split into the tokens "The", "room", "was", "beautiful", and so on. The tokens are then converted to lower case; stop words (common function words such as "the" and "was" that carry little content) are removed; and the remaining tokens are stemmed to their root forms, so that, for example, "beautiful" becomes "beauti" and "friendly" becomes "friendli" [9].
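A minimal sketch of these preprocessing steps in Python, assuming the NLTK toolkit (the chapter does not name its software; run `nltk.download('punkt')` and `nltk.download('stopwords')` once before use):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(review: str) -> list[str]:
    """Tokenize, lowercase, remove stop words and punctuation, then stem."""
    tokens = nltk.word_tokenize(review)      # raw text -> tokens
    tokens = [t.lower() for t in tokens]     # case conversion
    tokens = [t for t in tokens
              if t not in STOP_WORDS and t not in string.punctuation]
    return [STEMMER.stem(t) for t in tokens]

print(preprocess("The room was beautiful and the staff were friendly!"))
# -> ['room', 'beauti', 'staff', 'friendli']
```

Such stems are why truncated forms like "stai", "beauti", and "friendli" appear among the features in Table 1 rather than full words.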
2.2 Indexing
The result so far is a high-dimensional term-by-document matrix in which each cell contains the raw frequency of a term in a document. The rows of the matrix correspond to terms (usually words), and the columns represent documents (reviews, for example). In [10], Spärck Jones showed a significant improvement in retrieval performance from using weighted term vectors. The term weight is often computed as the product of the term frequency (TF) and the inverse document frequency (IDF), following Spärck Jones [11].
The TF measures the frequency of occurrence of an indexed term in a document [12]. The higher the frequency, the more important that term is in characterizing the document. Such frequency of occurrence has long been used to indicate term importance for content representation (for example, see [13, 14, 15]). In our study, the TF was the raw term frequency. However, not every word appears equally often across the whole set of review documents; some words are naturally more frequent than others. The more rarely a term occurs in a document collection, the more discriminating that term is, so the weight of a term should be inversely related to the number of documents in which it appears. The IDF takes this effect into account, and the logarithm of the IDF is taken to dampen the effect of the raw IDF factor.
Finally, the total weight of a term $i$ in document $j$ is

$$w_{ij} = tf_{ij} \times \log\frac{N}{df_i}$$

Here, $tf_{ij}$ is the raw frequency of term $i$ in document $j$, $N$ is the total number of documents in the collection, and $df_i$ is the number of documents that contain term $i$. Mathematically, the logarithm dampens the inverse document frequency so that very rare terms do not dominate the weighting.
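A small numeric sketch of this weighting, using toy token lists in place of real preprocessed reviews (illustrative only):

```python
import math
from collections import Counter

docs = [["room", "beauti", "staff"],
        ["room", "clean"],
        ["staff", "friendli", "staff"]]

N = len(docs)                                  # total number of documents
df = Counter(t for d in docs for t in set(d))  # document frequency df_i

def tfidf(doc):
    tf = Counter(doc)                          # raw term frequency tf_ij
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

for d in docs:
    print(tfidf(d))   # e.g. "staff" in doc 3: 2 * log(3/2) ≈ 0.81
```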
2.3 Multi-word phrases
So far, tokenization gives a term-by-document matrix in which each term is a single word counted by its frequency. In many cases, multi-word phrases are also important because phrases carry more complete contextual information than individual words. The most popular class of features used for text classification is therefore n-grams [16]. Word n-grams include single words (unigrams) and higher-order n-grams such as bi-grams and tri-grams. Word n-grams have been used effectively in various studies; unigrams through tri-grams are typical in text mining, and large n-gram feature sets require subsequent attribute selection to reduce dimensionality [17, 18]. For instance, a sentence such as "front desk staff" yields the unigrams "front," "desk," and "staff," the bi-grams "front desk" and "desk staff," and the tri-gram "front desk staff," as sketched below.
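A minimal n-gram sketch (unigrams through tri-grams), assuming the tokens have already been preprocessed as in Section 2.1:

```python
def ngrams(tokens: list[str], max_n: int = 3) -> list[tuple[str, ...]]:
    """Return all word n-grams of order 1 through max_n."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

print(ngrams(["front", "desk", "staff"]))
# [('front',), ('desk',), ('staff',),
#  ('front', 'desk'), ('desk', 'staff'),
#  ('front', 'desk', 'staff')]
```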
2.4 Dimensionality reduction
So far, this weighted term-by-document matrix is high dimensional because of the many distinct terms. Moreover, it is very sparse, with many zeros, since not all documents contain all terms. Large attribute dimensionality incurs high computational cost and, more seriously, causes over-fitting in many classification methods. We choose the Gini index as our attribute-selection method, since it is based on both the distinguishing ability and the importance of a word.
The Gini index for feature selection was proposed and studied by Aggarwal and Chen [19]. It aims to decide which feature variables are the decision variables for a decision-support application: the key decision variables are identified from the training data and used to predict the decision classes. The training dataset $D_{train}$ contains $n$ reviews, and each review $q$ belongs to a predefined class with label $s$ drawn from the set $\{1, \ldots, k\}$. Overall, we have a $d \times n$ feature-review matrix, with features indexed by $i$ ranging from 1 to $d$ and reviews indexed by $q$ ranging from 1 to $n$. In our case, the label is binary: recommend or not. The Gini index measuring the level of class discrimination among the data points of each feature $i$ is

$$G(i) = \sum_{s=1}^{k} p(s \mid i)^2$$

where $p(s \mid i)$ is the fraction of the reviews containing feature $i$ that belong to class $s$.
We can then use the Gini index to find the key features that are important to the decisions. A larger Gini index indicates a higher discriminating ability of the word, so we set a threshold and keep only attributes with high Gini values. In previous research, the frequency of occurrence of an indexed word has been used to indicate term importance for content representation [13, 14, 15], so we set a second threshold that selects attributes based on frequency. A sketch of this selection follows.
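A sketch of this two-threshold selection; the Gini cutoff of 0.75 is the one used later in Section 3, and the binary presence matrix `X` and integer labels `y` are placeholders for the real data:

```python
import numpy as np

def gini_select(X: np.ndarray, y: np.ndarray,
                gini_threshold: float = 0.75) -> list[int]:
    """X: n-by-d term presence matrix; y: integer class labels (0..k-1).
    Returns the indices of features passing both thresholds."""
    freq = X.sum(axis=0)              # raw occurrence count of each feature
    keep = []
    for i in range(X.shape[1]):
        labels = y[X[:, i] > 0]       # labels of reviews containing feature i
        if labels.size == 0:
            continue
        p = np.bincount(labels) / labels.size   # p(s | i) for each class s
        gini = float((p ** 2).sum())            # higher = more discriminating
        if gini >= gini_threshold and freq[i] >= freq.mean():
            keep.append(i)
    return keep
```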
2.5 Classification technique
Various classification techniques are applied in text mining, such as naïve Bayes, support vector machines (SVM), and decision trees. SVM performs classification more accurately than most other methods in practice, especially for high-dimensional data. The SVM was introduced by Vapnik and Chervonenkis [20] and has been widely used in various areas [16, 21]. SVMs are supervised learning models that classify data into groups. Given a set of training examples, each record marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. In our case, the two categories are: recommend or do not recommend the product to others.
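A minimal classification sketch with scikit-learn (an assumption; the chapter does not name its implementation), using toy reviews and labels as placeholders for the real corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Illustrative placeholders for the preprocessed reviews and binary labels.
reviews = [
    "great room and friendly staff",
    "beautiful pool, would stay again",
    "dirty room and rude staff",
    "too expensive, would not return",
]
recommend = [1, 1, 0, 0]   # 1 = recommend, 0 = not recommend

X_train, X_test, y_train, y_test = train_test_split(
    reviews, recommend, test_size=0.5, stratify=recommend, random_state=0)

# Weight uni- to tri-grams with tf-idf, then fit a linear SVM.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), LinearSVC())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # held-out classification accuracy
```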
2.6 Evaluation criteria
To evaluate the performance of the different classification models, the most common measure, accuracy, is used: the proportion of reviews whose recommend/not-recommend label is predicted correctly.
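Concretely, for the binary recommend/not-recommend task, accuracy is the standard confusion-matrix ratio:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$, $TN$, $FP$, and $FN$ count the true positives, true negatives, false positives, and false negatives on the test set.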
3. Data and analysis results
To illustrate the proposed method, we applied it to examples from two industries, the hotel industry and the clothing industry, to demonstrate its generality.
3.1 Example 1: hotel industry
For the hotel industry, the data were obtained from orbitz.com, one of the leading websites in the US travel industry. On the website, consumers can leave reviews, ratings, and recommendation choices only after they have registered and stayed in the hotel. We collected data for a high-quality hotel in Las Vegas: the five-star Venetian. We chose Las Vegas from among cities across the nation because it is one of the most popular tourist cities in the US and attracts a large number of hotel guests who stay and leave reviews. We picked a five-star hotel because many high-end hotels have been built in Las Vegas to attract visitors, and, given their comparatively low prices relative to other locations, five-star hotels there are very popular among consumers. Figure 1 shows an example of the data.

Figure 1.
Review data.
After preprocessing the raw reviews, we obtain the term (attribute) by document matrix. For each attribute, we calculated the Gini index and selected only those with a Gini value higher than 0.75 [19] and a frequency higher than the average frequency of word appearance. In this way we found the major attributes that are both important and distinguishing in the evaluations of the hotel. The list of features is shown in Table 1.
Feature | Mean | SD |
---|---|---|
room | 0.615905982 | 0.840249787 |
stai | 0.551547381 | 0.82165562 |
staff | 0.400633212 | 0.777731114 |
time | 0.468865486 | 1.014960226 |
beauti | 0.386801959 | 0.830438556 |
locat | 0.369390225 | 0.805449742 |
servic | 0.398964787 | 0.993875946 |
strip | 0.360965236 | 0.937546462 |
restaur | 0.323249126 | 0.818511494 |
casino | 0.358101524 | 1.021703983 |
pool | 0.366315944 | 1.102540841 |
shop | 0.302824825 | 0.8928519 |
experi | 0.319579518 | 0.940387285 |
comfort | 0.283232639 | 0.809203564 |
friendli | 0.271059238 | 0.786200556 |
food | 0.267736792 | 0.85655244 |
bathroom | 0.265218368 | 0.859496882 |
bed | 0.27890465 | 0.937873975 |
show | 0.265786823 | 0.903504893 |
price | 0.229886661 | 0.855276916 |
view | 0.248388399 | 1.00651292 |
luxuri | 0.221286503 | 0.832020881 |
getawai | 0.19446686 | 0.748517432 |
spaciou | 0.177143261 | 0.74316694 |
weekend | 0.185364962 | 0.791939957 |
huge | 0.177202338 | 0.777282242 |
star | 0.189608358 | 0.94537057 |
charg | 0.193639941 | 1.107387953 |
fun | 0.156195461 | 0.769228367 |
close | 0.146094318 | 0.717195468 |
coupl | 0.136324894 | 0.694853203 |
shower | 0.146251093 | 0.843647845 |
expens | 0.123541624 | 0.695041725 |
smoke | 0.176175837 | 1.211246955 |
size | 0.111322238 | 0.638559951 |
smell | 0.137813083 | 0.880136225 |
bar | 0.114751923 | 0.694104255 |
drink | 0.124056133 | 0.760623902 |
Table 1.
Venetian Gini selection of words, mean, and SD (n = 2286).
From the table, we see that 38 features that are both important and distinguishing were extracted from the consumer online reviews. For each feature, we calculated the tf-idf value to reflect its frequency of occurrence, which indicates the importance of the feature for representing the content of the reviews. In the past, the importance of such features was usually determined through consumer surveys.
Next, classification (SVM) was performed using the selected 38 features as the predictive variables. The accuracy was 91.6%. This high accuracy indicates that text reviews can represent consumers' true thinking about the hotel, which can be further used to identify the factors consumers value when they evaluate hotels.
Last, factor analysis was applied using principal axis factoring to identify the underlying factors. The principal axis factoring with a varimax rotation showed 14 factors with an eigenvalue of one or greater. Table 2 also shows the total variance explained by each factor. Specifically, the first factor has an eigenvalue of 3.48, accounting for 21% of the total variance; the second factor has an eigenvalue of 1.74, accounting for 16%. The next five factors have eigenvalues greater than 1.3. The remaining factors had values that were too small (below or close to 1), so we did not include them; eigenvalues greater than 1.0 are normally recommended as the criterion. The first seven factors were therefore retained, as shown in Table 2.
Component | Eigenvalue | % of variance | Cumulative % | SS loadings |
---|---|---|---|---|
1 | 3.48 | 21 | 21 | 1.53 |
2 | 1.74 | 16 | 37 | 1.13 |
3 | 1.57 | 15 | 53 | 1.10 |
4 | 1.51 | 13 | 66 | 0.95 |
5 | 1.41 | 13 | 79 | 0.91 |
6 | 1.32 | 12 | 91 | 0.86 |
7 | 1.30 | 9 | 100 | 0.66 |
Table 2.
Total variance explained.
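A sketch of how such a principal axis factoring with varimax rotation can be run, assuming the third-party `factor_analyzer` package (the chapter does not name its software); `X` is a random placeholder for the real reviews-by-features matrix:

```python
import numpy as np
from factor_analyzer import FactorAnalyzer

X = np.random.rand(200, 38)   # placeholder for the n-by-38 feature matrix

fa = FactorAnalyzer(n_factors=7, method="principal", rotation="varimax")
fa.fit(X)

eigenvalues, _ = fa.get_eigenvalues()   # screen factors: keep eigenvalue > 1
loadings = fa.loadings_                 # 38-by-7 rotated loading matrix
print(eigenvalues[:7].round(2))
print(loadings.round(2))
```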
The factors are labeled: (1) room, (2) value, (3) Las Vegas-specific hotel amenity: casino, (4) other amenities, (5) location, (6) staff, and (7) Las Vegas-specific hotel amenity: entertainment. Among the 38 items, five were deleted for appropriate data reduction before further statistical analysis: as shown in Table 3, AF5 (beautiful), AF13 (experience), AF25 (weekend), AF30 (close), and AF33 (expense) were eliminated because they had no significant loading (factor loading less than 0.20) on any of the factors.
Factor | Feature | Loading |
---|---|---|
1 | AF1-room | 0.46 |
1 | AF14-comfort | 0.52 |
1 | AF17-bathroom | 0.51 |
1 | AF18-bed | 0.58 |
1 | AF22-luxury | 0.28 |
1 | AF24-spacious | 0.20 |
1 | AF32-shower | 0.32 |
1 | AF26-huge | 0.20 |
1 | AF35-size | 0.26 |
2 | AF2-stay | 0.22 |
2 | AF7-service | 0.28 |
2 | AF20-price | 0.21 |
2 | AF27-star | 0.25 |
2 | AF28-charge | 0.38 |
2 | AF37-bar | 0.28 |
2 | AF38-drink | 0.35 |
3 | AF10-casino | 0.52 |
3 | AF34-smoke | 0.71 |
3 | AF36-smell | 0.35 |
4 | AF9-restaurant | 0.45 |
4 | AF12-shop | 0.47 |
4 | AF16-food | 0.31 |
5 | AF6-location | 0.35 |
5 | AF8-strip | 0.70 |
5 | AF21-view | 0.34 |
6 | AF3-staff | 0.60 |
6 | AF15-friendly | 0.63 |
7 | AF4-time | 0.29 |
7 | AF11-pool | 0.32 |
7 | AF19-show | 0.19 |
7 | AF23-getaway | 0.22 |
7 | AF29-fun | 0.23 |
7 | AF31-couple | 0.31 |
Table 3.
Summaries of features and factor loadings.
3.2 Example 2: clothing industry
For the clothing industry, the data were obtained from a website that contains clothing purchase information and consumer reviews.
After preprocessing the raw reviews, we obtain the term (attribute) by document matrix. For each attribute, we calculated the Gini index and selected only those with a Gini value higher than 0.75 [19] and a frequency higher than the average frequency of word appearance. In this way we found the major attributes that are both important and distinguishing in the evaluations of the clothes in each category, and then performed the text classification. The high accuracy (84.9%) indicates that text reviews can represent consumers' true thinking, which can be further used to identify the factors consumers value when they evaluate the clothes.
From the narrowed list of features that are both important and distinguishing, we can perform qualitative diagnostic analysis to identify the determinant attributes for each category and make comparisons across categories.
Last, we conducted factor analysis using principal axis factoring to identify the underlying factors for each category. Table 4 shows the factors for each category and their loading scores.
Factor | Feature | Loading |
---|---|---|
1 | Boot | 0.48 |
1 | Cute | 0.18 |
1 | Denim | 0.38 |
1 | Fall | 0.28 |
1 | Flat | 0.28 |
1 | Jacket | 0.61 |
1 | Jean | 0.44 |
1 | Sandal | 0.40 |
1 | Spring | 0.17 |
1 | Sweater | 0.15 |
1 | Tights | 0.35 |
1 | Winter | 0.24 |
2 | Knee | 0.44 |
2 | Leg | 0.22 |
2 | Length | 0.57 |
2 | Long | 0.35 |
2 | Petite | 0.49 |
2 | Regular | 0.54 |
2 | Short | 0.39 |
2 | Sleeve | 0.17 |
2 | Torso | 0.24 |
3 | Cool | 0.31 |
3 | Day | 0.35 |
3 | Hot | 0.58 |
3 | Light | 0.23 |
3 | Summer | 0.54 |
3 | Warm | 0.26 |
3 | Weather | 0.35 |
4 | Fit | 0.84 |
4 | Medium | 0.18 |
4 | Normal | 0.23 |
4 | Size | 0.71 |
4 | Snug | 0.20 |
4 | True | 0.46 |
4 | Usual | 0.27 |
5 | Form | 0.84 |
5 | Sexy | 0.14 |
6 | Casual | 0.29 |
6 | Classic | 0.12 |
6 | Comfortable | 0.17 |
6 | Dinner | 0.28 |
6 | Elegant | 0.15 |
6 | Event | 0.24 |
6 | Glove | 0.17 |
6 | Heel | 0.28 |
6 | Night | 0.27 |
6 | Occasion | 0.35 |
6 | Party | 0.28 |
6 | Perfect | 0.34 |
6 | Special | 0.31 |
6 | Wedding | 0.33 |
7 | Black | 0.21 |
7 | Blue | 0.48 |
7 | Color | 0.52 |
7 | Dark | 0.36 |
7 | Green | 0.37 |
7 | Navy | 0.27 |
7 | Orange | 0.24 |
7 | Pink | 0.27 |
7 | Red | 0.34 |
7 | White | 0.21 |
Table 4.
Summaries of features and factor loadings for dress.
4. Discussions
In marketing, means-end chain theory is a widely applied conceptual cognitive model suggesting that the consumer decision-making process is a series of cognitive developments linking product attributes, consequences, and values [23]. In the context of product usage, the product itself is the "means" and the value of the product is the "end." Product attributes are retained in the minds of consumers at an abstract level and can influence consumers' evaluation of the product. Means-end theory has been widely used in e-service quality research [7, 24, 25]. Parasuraman et al. [26] applied it as the theoretical foundation to develop and conceptualize e-service quality delivered by websites.
In our study, the means are the attributes of the hotels/clothes extracted from online consumer reviews, while the ends (consequences) are the key areas categorized by factor analysis based on the importance of those attributes, also extracted from the reviews through text mining, as indicated in Figure 2. Through text mining and factor analysis, a combination of new and traditional methods, we are able to identify the key drivers of consumers' decision-making in purchases of two different products.

Figure 2.
Means end chain.
5. Conclusions
A major conclusion of our study is that we can utilize the great volume of online reviews to identify the key aspects of different product categories. Online reviews of products and services are present all over the Internet, and potential consumers value them greatly. Marketers can also gain valuable information from reading these reviews, which predominantly contain text-based information. This can be of great value to marketers: we can form a standardized line of business-analysis procedure that can be applied to many business scenarios and offer insights to organizations, especially for managing products and advertising.
We used text-mining methodology to show that consumers' attitudes can be accurately predicted from review text. In addition to predicting recommendations, marketers would benefit tremendously from identifying the key information in many thousands of reviews.
A framework was developed by which companies can extract this important information from text-based online reviews. Advertisers and marketers would be among the prime beneficiaries once they can glean the appropriate information from such reviews. The identified information can be used directly in advertising or to improve the business.
References
- 1. Senecal S, Nantel J. The influence of online product recommendations on consumers' online choices. Journal of Retailing. 2004;80(2):159-169
- 2. Kumar N, Benbasat I. The influence of recommendations and consumer reviews on evaluations of websites. Information Systems Research. 2006;17(4):425-439
- 3. Lowenstein MW. Customer Retention: An Integrated Process for Keeping Your Best Customers. Milwaukee: ASQC Press; 1995
- 4. Reichheld FF. The one number you need to grow. Harvard Business Review. 2003;81(12):46-54
- 5. Brown TJ, Barry TE, Dacin PA, Gunst RF. Spreading the word: Investigating antecedents of consumers' positive word-of-mouth intentions and behaviors in a retailing context. Journal of the Academy of Marketing Science. 2005;33(2):123-138
- 6. Shabbir H, Palihawadana D, Thwaites D. Determining the antecedent and consequences of donor-perceived relationship quality: A dimensional qualitative research approach. Psychology and Marketing. 2007;24(3):271-293
- 7. Ladhari R. Developing e-service quality scales: A literature review. Journal of Retailing and Consumer Services. 2010;17(6):464-477
- 8. Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys. 2002;34(1):1-47
- 9. Kraaij W, Pohlmann R. Viewing stemming as recall enhancement. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1996. pp. 40-48
- 10. Spärck Jones K. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation. 1972;28(1):11-21
- 11. Spärck Jones K. Index term weighting. Information Storage and Retrieval. 1973;9(11):619-633
- 12. Salton G, Buckley C. Term-weighting approaches in automatic text retrieval. Information Processing and Management. 1988;24(5):513-523
- 13. Baxendale PB. Machine-made index for technical literature: An experiment. IBM Journal of Research and Development. 1958;2(4):354-361
- 14. Luhn HP. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development. 1957;1(4):309-317
- 15. Salton G, McGill MJ. Introduction to Modern Information Retrieval. New York, NY: McGraw-Hill; 1983
- 16. Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing; 2002. pp. 79-86
- 17. Abbasi A, Chen H, Salem A. Sentiment analysis in multiple languages: Feature selection for opinion classification in web forums. ACM Transactions on Information Systems. 2008;26(3):12
- 18. Ng V, Dasgupta S, Arifin SMN. Examining the role of linguistic knowledge sources in the automatic identification and classification of reviews. In: Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, Association for Computational Linguistics; 2006. pp. 611-618
- 19. Aggarwal C, Chen C, Han JW. The inverse classification problem. Journal of Computer Science and Technology. 2010;25(3):458-468
- 20. Vapnik V, Chervonenkis A. A note on one class of perceptrons. Automation and Remote Control. 1964;25:103-109
- 21. Basti E, Kuzey C, Delen D. Analyzing initial public offerings' short-term performance using decision trees and SVMs. Decision Support Systems. 2015;73:15-27
- 22. Morrison DG. On the interpretation of discriminant analysis. Journal of Marketing Research. 1969;6(2):156-163
- 23. Gutman J. Analyzing consumer orientations toward beverages through means-end chain analysis. Psychology and Marketing. 1984;1(3-4):23-43
- 24. Rowley J. An analysis of the e-service literature: Towards a research agenda. Internet Research. 2006;16(3):339-359
- 25. Wolfinbarger M, Gilly MC. eTailQ: Dimensionalizing, measuring and predicting etail quality. Journal of Retailing. 2003;79(3):183-198
- 26. Parasuraman A, Zeithaml VA, Malhotra A. E-S-QUAL: A multiple-item scale for assessing electronic service quality. Journal of Service Research. 2005;7(3):213-233