Open access peer-reviewed chapter

Computational Statistics with Dummy Variables

Written By

Adji Achmad Rinaldo Fernandes, Solimun and Nurjannah

Submitted: 24 July 2021 Reviewed: 01 November 2021 Published: 25 February 2022

DOI: 10.5772/intechopen.101460

From the Edited Volume

Computational Statistics and Applications

Edited by Ricardo López-Ruiz


Abstract

Cluster analysis is a technique commonly used to group objects; further analysis is then carried out on the groups to obtain a model, an approach called cluster integration. This process can be continued with various analyses, including path analysis, discriminant analysis, logistic regression, etc. In this chapter, the authors discuss the reasons for using dummy variables in this type of cluster analysis. Dummy variables are the main way that categorical variables are included as predictors in modeling. With statistical models such as linear regression, one of the dummy variables needs to be excluded; otherwise, the predictor variables are perfectly correlated. Thus, if a categorical variable can take k values, we only need k−1 dummy variables; the k-th variable is redundant and brings no new information. Using more dummy variables than needed is known as the dummy variable trap. The advantage of dummy variables is that they are simple to use and make the decision-making process easier to manage. The novelty of this chapter is the perspective of the dummy variable technique using cluster analysis in statistical modeling. The data used in this study are credit risk assessments from a bank in Indonesia. All analyses were carried out using the software R.

Keywords

  • dummy
  • cluster
  • integrated cluster with logistic regression
  • integrated cluster with discriminant analysis
  • integrated cluster with path analysis

1. Introduction

Cluster analysis is commonly applied to group objects; further analysis can then be carried out on the resulting groups to obtain a model, an approach called cluster integration. Cluster integration can be continued with various analyses, including path analysis, discriminant analysis, logistic regression, etc. Cluster integration with path analysis aims to group homogeneous objects into one group, so that the resulting residual variance is homogeneous, in addition to maximizing the adjusted R2 value. In cluster integration with discriminant analysis, the clusters generated can maximize the accuracy, sensitivity, and specificity of the model. In this chapter, we explain the technical perspective of dummy variables using cluster analysis in statistical modeling, such as regression analysis, path analysis, and discriminant analysis.


2. Why use dummy variables

Dummy variables are numerical variables that represent categorical data, such as gender, race, political affiliation, etc. Technically, a dummy variable is dichotomous: a quantitative variable with a small range that can take only two values. As a practical matter, regression results are easiest to interpret when the dummy variable is constrained to the two values 1 and 0. Typically, 1 represents the presence of a qualitative attribute, and 0 represents its absence. Categorical variables with more than two categories can be represented by a set of dummy variables, with one variable for each category. Numerical variables can also be dummy-coded to explore nonlinear effects. Dummy variables are also known as indicator variables, design variables, contrasts, one-hot coding, and binary basis variables [1].

Dummy variables are the main way that categorical variables are included as predictors in modeling. For example, in a linear regression analysis the response variable might be profit, and the predictor variable employee group. With statistical models such as linear regression, one of the dummy variables needs to be excluded (by convention, the first or the last); otherwise, the predictor variables are perfectly correlated [2].

When defining dummy variables, a common mistake is to define too many. If a categorical variable can take on k values, it is tempting to define k dummy variables, but only k−1 are needed.

The k-th dummy variable is redundant; it carries no new information, and it creates a severe multicollinearity problem for the analysis. Using k dummy variables when only k−1 are needed is known as the dummy variable trap.
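To make the k−1 rule concrete, the following minimal R sketch (with a hypothetical three-level factor) shows how R's model.matrix() automatically drops one category:

```r
# A minimal sketch of k-1 dummy coding in R with a hypothetical
# three-level factor. model.matrix() uses treatment coding by default,
# producing k-1 = 2 dummy columns plus an intercept.
group <- factor(c("A", "B", "C", "B", "A"))  # k = 3 categories
model.matrix(~ group)
# Columns: (Intercept), groupB, groupC -- level "A" is the reference.
# A third dummy for "A" would equal (Intercept) - groupB - groupC,
# i.e., it would be perfectly collinear: the dummy variable trap.
```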

Regression analysis treats all independent variables (X) as numerical. A numeric variable is an interval- or ratio-scale variable whose values can be directly compared, e.g., "10 is double 5" or "3 minus 1 equals 2." However, you may want to include a nominal-scale attribute such as "product brand" or "defect type" in your study. Say you have three types of defects, numbered "1," "2," and "3." In this case, "3 minus 1" means nothing: you cannot subtract defect type 1 from defect type 3. The numbers here merely identify the defect type and have no intrinsic meaning of their own. Dummy variables are created in this situation so that the regression algorithm handles the nominal variable correctly [3].

The main benefit of dummy variables is that they are simple, although there are often better alternative basis functions, such as orthogonal polynomials, effect coding, and splines. Using dummy variables in linear regression analysis has several advantages [4], including:

  1. The prediction of the dependent variable becomes more focused and accurate than in ordinary multiple regression

  2. Because the data are no longer qualitative, the prediction results are easy to interpret

  3. The decision-making process tends to be easier


3. Hierarchical cluster

Cluster analysis (group analysis) is an analytical method that aims to group objects into several groups such that objects within a group are homogeneous (similar) while objects in different groups are heterogeneous (different) [5]. The procedures for group formation in cluster analysis are divided into two types, namely hierarchical and non-hierarchical methods. The hierarchical method is used when there is no prior information about the number of clusters; its main principle is to group objects that have something in common into one group. The non-hierarchical method is used when the number of clusters is known or has been determined in advance [6].

The hierarchical method starts by grouping the two or more objects that are closest together. The process then continues to the object with the next-closest distance, and so on, forming a tree with a hierarchy of levels from the most similar to the most different. The tree formed by this clustering is called a dendrogram and provides deeper insight into the clustering process.

The stages of grouping data using the hierarchical method are [7]:

  1. Determine k as the number of clusters to be formed.

  2. Each data object is initially treated as its own cluster, so that n = N (the number of objects).

  3. Calculate the distance between clusters.

  4. Find the two clusters with the smallest distance between them and merge them (so that n = n − 1).

  5. If n > k, then go back to step 3.

According to [6], there are two approaches to forming groups within the hierarchical method, namely agglomerative hierarchical methods and divisive hierarchical methods. The agglomerative method starts by treating each object as its own cluster. The two clusters with the closest distance are then merged into one, and the process continues until a single cluster containing all objects is formed. A minimal R sketch of this procedure is given below.
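```r
# A sketch of agglomerative hierarchical clustering in R on simulated
# (hypothetical) data, using Euclidean distance and average linkage as
# in this chapter.
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))

d  <- dist(scale(dat), method = "euclidean")  # standardize, then distances
hc <- hclust(d, method = "average")           # average linkage merging
plot(hc)                                      # dendrogram of the merge tree
clusters <- cutree(hc, k = 3)                 # cut the tree into k = 3 groups
table(clusters)                               # cluster sizes
```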


4. Integrated cluster with logistic regression

4.1 Integrated cluster equation model with logistic regression analysis

The model integrating cluster analysis with logistic regression via the dummy variable approach has the same structure as the general multiple linear regression model with dummy variables.

The general model of the integrated cluster with logistic regression analysis can be written as Eq. (1).

$$y_i = \frac{\exp\left(\beta_0 + \sum_{k=1}^{p}\beta_k x_{ki} + \sum_{j=1}^{q} D_j\left(\beta_{j(p+1)} + \beta_{j(p+1)+1}x_{1i} + \cdots + \beta_{(j+1)(p+1)-1}x_{pi}\right)\right)}{1 + \exp\left(\beta_0 + \sum_{k=1}^{p}\beta_k x_{ki} + \sum_{j=1}^{q} D_j\left(\beta_{j(p+1)} + \beta_{j(p+1)+1}x_{1i} + \cdots + \beta_{(j+1)(p+1)-1}x_{pi}\right)\right)} \tag{1}$$

where,

yi: response variable at the i-th observation unit

xki: the k-th predictor variable on the i-th observation unit

βp: the p-th coefficient of the logistic function

Dq: q-th dummy variable

p: number of predictor variables

q: the number of clusters formed minus 1 (i.e., the number of dummy variables)

i: 1, 2, 3, …, n
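To make Eq. (1) concrete, a hedged R sketch of the fit using glm() follows. All names (y, x1, x2, cluster) are hypothetical; the factor `cluster` stands in for labels produced by a prior hierarchical cluster analysis.

```r
# Hedged sketch of fitting the integrated cluster logistic model, Eq. (1).
set.seed(2)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
cluster <- factor(sample(1:3, n, replace = TRUE))  # q + 1 = 3 clusters
y  <- rbinom(n, 1, 0.5)                            # binary response

# R expands the factor `cluster` into q = 2 dummy variables; the formula
# cluster * (x1 + x2) adds the dummy intercepts D_j and the dummy-slope
# interactions D_j * x_k that appear in Eq. (1).
fit <- glm(y ~ cluster * (x1 + x2), family = binomial)
summary(fit)
```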

4.2 Logistic regression analysis assumptions

Several assumptions underlie logistic regression analysis, namely [8]:

  1. Does not assume a linear relationship between the response variables and the predictor variables.

  2. Predictor variables do not have to be normally distributed.

  3. The response variable does not require the assumption of homogeneity for each level of the predictor variable or the variance does not have to be the same in each category.

  4. The measurement scale on the response variable is discrete or binary (success/failure) and the predictor variable does not require an interval measurement scale.

  5. Using probability sampling, which is a sampling technique to provide equal opportunities for each member of the population to be selected as a member of the sample.

  6. Observation variables are measured without errors (valid and reliable measurement instruments) meaning that the variables studied can be observed directly.

4.3 Integrated cluster analysis method with logistic regression analysis

The linkage used in this study is average linkage, and the distance between clusters is measured with Euclidean distance. The number of clusters was determined in advance, namely 2 and 3 groups. The average linkage method is based on the average distance between clusters. The number of members in each cluster for the integrated cluster analysis method with logistic regression analysis is presented in Table 1.

Cluster | Number of members (3 groups) | Number of members (2 groups)
1 | 71 | 93
2 | 15 | 7
3 | 14 | —

Table 1.

Number of members of each cluster, average linkage method, in the integrated cluster analysis method with logistic regression analysis.

Table 1 shows that, in the 3-group solution, there are 71 customers in Cluster 1, 15 in Cluster 2, and 14 in Cluster 3, while in the 2-group solution there are 93 customers in Cluster 1 and 7 in Cluster 2. The best linkage and the model's validity are determined by selecting the model with the largest total adjusted R2, summarized in Table 2.

Model | R2 adjusted of Y1 | R2 adjusted of Y2 | Total R2 adjusted
3 Groups | 0.4258 | 0.8492 | 0.8923
2 Groups | 0.3852 | 0.8129 | 0.8667

Table 2.

Adjusted R2 values for each integrated cluster analysis model with logistic regression analysis.

Based on Table 2, the cluster-integrated logistic regression model with 3 groups has the largest total determination value, so it is the best model compared to the 2-group model. A total determination value of 89.23% is considered very good for describing the model.

Based on Table 2, the cluster-integrated regression analysis with 3 groups gives an adjusted R2 of 0.4258 for Y1, meaning that the variables age, work experience, and loan to value are able to explain 42.58% of the variation in credit collectibility, while the remaining 57.42% is influenced by variables outside the model. For Y2, the cluster-integrated logistic regression with 3 groups gives an adjusted R2 of 0.8492, meaning that the same variables explain 84.92% of the variation, while the remaining 15.08% is influenced by variables outside the model. The coefficient of total determination of the 3-group model is 0.8923, so the model explains 89.23% of the variation in the data while the remaining 10.77% is explained by variables outside the model.

The integrated cluster in logistic regression analysis with 3 groups thus has the highest adjusted R2 value. Comparing the variable averages of each cluster shows that Cluster 2 mostly has the highest averages, so Cluster 2 is labeled high, while Cluster 1 has the lowest averages, so Cluster 1 is labeled low. The average value of each variable per cluster is presented in Table 3.

Variable | Cluster 1: low cluster | Cluster 2: high cluster | Cluster 3: medium cluster
Age (X1) | 39.507 | 37.333 | 38.571
Work experience (X2) | 39.930 | 193.867 | 107.571

Table 3.

Average value of each variable per cluster in the integrated cluster analysis model with logistic regression analysis.

Based on Table 3, the average customer age is about 39 years in the low cluster, 37 years in the high cluster, and 38 years in the medium cluster. The average work experience is about 40 months in the low cluster, 194 months in the high cluster, and 108 months in the medium cluster.

The integrated cluster analysis method with logistic regression analysis with 3 groups separates the data optimally. The resulting model is given in Eq. (2).

$$\pi(x) = \frac{\exp\left(-0.027x_1 - 0.041x_2 + 0.850y_1 - 0.374D_1x_1 - 0.006D_1x_2 + 9.971D_1y_1 + 0.090D_2x_1 + 0.026D_2x_2 - 1.559D_2y_1\right)}{1 + \exp\left(-0.027x_1 - 0.041x_2 + 0.850y_1 - 0.374D_1x_1 - 0.006D_1x_2 + 9.971D_1y_1 + 0.090D_2x_1 + 0.026D_2x_2 - 1.559D_2y_1\right)} \tag{2}$$

Low cluster (D1=0 and D2=0) can be seen in Eq. (3).

$$\pi(x) = \frac{\exp\left(-0.027x_1 - 0.041x_2 + 0.850y_1\right)}{1 + \exp\left(-0.027x_1 - 0.041x_2 + 0.850y_1\right)} \tag{3}$$

High cluster (D1=1 and D2=0) can be seen in Eq. (4).

$$\pi(x) = \frac{\exp\left(-0.401x_1 - 0.047x_2 + 10.821y_1\right)}{1 + \exp\left(-0.401x_1 - 0.047x_2 + 10.821y_1\right)} \tag{4}$$

Medium cluster (D1=0 and D2=1) can be seen in Eq. (5).

$$\pi(x) = \frac{\exp\left(0.063x_1 - 0.015x_2 - 0.709y_1\right)}{1 + \exp\left(0.063x_1 - 0.015x_2 - 0.709y_1\right)} \tag{5}$$
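As an illustration, the cluster-specific equations can be evaluated in R for a single customer profile. The coefficients below are those reported in Eqs. (3)–(5); the profile values are hypothetical.

```r
# Evaluate the fitted probability pi(x) of Eqs. (3)-(5) per cluster.
# plogis() is the inverse-logit, exp(eta) / (1 + exp(eta)).
coefs <- rbind(
  low    = c(x1 = -0.027, x2 = -0.041, y1 =  0.850),
  high   = c(x1 = -0.401, x2 = -0.047, y1 = 10.821),
  medium = c(x1 =  0.063, x2 = -0.015, y1 = -0.709)
)
newx <- c(x1 = 40, x2 = 100, y1 = 1)  # hypothetical customer profile
plogis(coefs %*% newx)                # pi(x) for each cluster
```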

5. Integrated cluster with discriminant analysis

5.1 Discriminant analysis

Discriminant analysis is a multivariate analysis that models the relationship between a categorical response variable and one or more quantitative predictor variables [9]. It can be used as a classification method because it produces a function that can distinguish between groups; the function is formed by maximizing the distance between groups. If the response variable consists of only two groups, the model is called two-group discriminant analysis, whereas if there are more than two categories it is called multiple discriminant analysis. Discriminant analysis has two assumptions that must be met, namely multivariate normality and homogeneity of the covariance matrix.

According to [6], discriminant analysis is included in the multivariate dependence method. The model can be written as in Eq. (6).

$$y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} \tag{6}$$

where,

yi: the response variable is categorical or nominal data on the i-th observation unit

Xpi: the p-th explanatory variable on the i-th observation unit

βp: the coefficient of the p-th discriminant function

i: 1, 2, 3, …, n

5.2 Integration of cluster analysis with discriminant analysis of dummy variable approach

The integration of cluster analysis with discriminant analysis via the dummy variable approach combines the two methods by using dummy variables obtained from the cluster results: the number of clusters formed determines the categories that define the dummy variables.

An integrated cluster model with discriminant analysis can be written in Eq. (7).

yi=β1x1i+β2x2i++βpxpi+D1βp+1x1i+D1βp+2x2i++D1βp+qxpi+D2βp+q+1x1i+D2βp+q+2x2i++D2βp+2qxpi++Dqβp+qq+1x1i+Dqβp+11+2x2i++Dqβp+qqxpiE7

where,

yi: response variable at the i-th observation unit

xpi: the p-th explanatory variable on the i-th observation unit

βp: the coefficient of the p-th discriminant function

Dq: q-th dummy variable

p: number of explanatory variables

q: the number of clusters formed minus 1 (i.e., the number of dummy variables)

i: 1, 2, 3, …, n

If there are 3 explanatory variables and 2 clusters, the integrated cluster model with multiple discriminant analysis can be written as in Eq. (8).

General model:

$$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + D_1\beta_4 x_{1i} + D_1\beta_5 x_{2i} + D_1\beta_6 x_{3i} \tag{8}$$

Cluster 1 (D1=0)

$$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} \tag{9}$$

Cluster 2 (D1=1)

$$y_i = (\beta_1 + \beta_4) x_{1i} + (\beta_2 + \beta_5) x_{2i} + (\beta_3 + \beta_6) x_{3i} \tag{10}$$
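A hedged R sketch of this integrated model using MASS::lda() follows. All names are hypothetical; D1 is the dummy from a prior 2-cluster solution, and the products D1 · x_k supply the extra discriminating variables of Eq. (8).

```r
# Hedged sketch of integrated cluster discriminant analysis, Eq. (8).
library(MASS)
set.seed(3)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
D1 <- rbinom(n, 1, 0.3)  # cluster-2 membership dummy (hypothetical)
y  <- factor(sample(c("current", "non-current"), n, replace = TRUE))

dat <- data.frame(y, x1, x2, x3,
                  d1x1 = D1 * x1, d1x2 = D1 * x2, d1x3 = D1 * x3)
fit <- lda(y ~ x1 + x2 + x3 + d1x1 + d1x2 + d1x3, data = dat)
fit$scaling                                    # discriminant coefficients
table(Actual = y, Prediction = predict(fit)$class)  # confusion matrix
```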

5.3 Model efficiency

Model efficiency can be assessed using three criteria, namely accuracy, sensitivity, and specificity. Accuracy measures how correctly a diagnostic test identifies and excludes a given condition; in other words, it measures the overall goodness of the model. Sensitivity and specificity measure the ability of a diagnostic test to correctly identify objects relative to reality [10]: sensitivity measures the positive group, while specificity measures the negative group. The values of accuracy, sensitivity, and specificity can be obtained from the confusion matrix shown in Table 4.

Actual | Prediction: Z1 | Prediction: Z0
Z1 | a | b
Z0 | c | d

Table 4.

Confusion matrix.

$$\text{Accuracy} = \frac{a+d}{a+b+c+d}, \qquad \text{Sensitivity} = \frac{a}{a+c}, \qquad \text{Specificity} = \frac{d}{b+d} \tag{11}$$
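As an illustration, a small R helper implementing Eq. (11), with hypothetical counts, might look like this:

```r
# Compute the three criteria of Eq. (11) from a 2x2 confusion matrix
# laid out as in Table 4 (rows = actual, columns = prediction).
classification_metrics <- function(a, b, c, d) {
  c(accuracy    = (a + d) / (a + b + c + d),
    sensitivity = a / (a + c),
    specificity = d / (b + d))
}
classification_metrics(a = 40, b = 10, c = 5, d = 45)  # hypothetical counts
```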

5.4 Implementation of integrated cluster with discriminant analysis

As an example, consider secondary data on home-ownership loans obtained from Bank X in Indonesia, where the variables studied are age, credit period, loan to value, and credit collectibility status. The collectibility status consists of two categories, namely current and non-current loans. Age and credit period are measured in units of time, while loan to value is a proportion, so the variables need to be standardized before the data are analyzed.

When using an integrated cluster with discriminant analysis, the first step is to perform a cluster analysis to obtain the dummy variables. Cluster analysis does not require assumption testing because it is an exploratory method. If the cluster analysis yields n clusters, then n − 1 dummy variables are formed. The analysis used here is hierarchical cluster analysis with the average linkage method and Euclidean distance. The number of clusters is determined based on the Silhouette value; the Silhouette values for each candidate number of clusters are shown in Table 5.

Number of clusters | Silhouette value
2 | 0.4491
3 | 0.3915
4 | 0.2912
5 | 0.2811

Table 5.

Cluster analysis silhouette results.

Based on Table 5, the largest Silhouette value occurs at 2 clusters, so the optimal number of clusters is 2. The cluster analysis places 71 customers in cluster 1 and 29 customers in cluster 2. Thus, one dummy variable is formed: it takes the value 1 if the object (customer) belongs to cluster 2, and 0 if the object belongs to cluster 1. A short R sketch of this selection step follows.
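```r
# Sketch of choosing the number of clusters by average silhouette width
# on hypothetical standardized data, using cluster::silhouette().
library(cluster)
set.seed(4)
dat <- scale(matrix(rnorm(200), ncol = 2))  # hypothetical data
d   <- dist(dat, method = "euclidean")
hc  <- hclust(d, method = "average")

sil <- sapply(2:5, function(k)
  mean(silhouette(cutree(hc, k = k), d)[, "sil_width"]))
names(sil) <- 2:5
sil  # choose the k with the largest average silhouette width
```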

After obtaining the dummy variable, the next step is to test the assumptions of discriminant analysis. Multivariate normality was tested using the Shapiro-Wilk test on the predictor variables, and homogeneity of the covariance matrix using the Box M test, which gave a p-value of 0.9917 (> 0.05). It can be concluded that the data meet the assumptions of multivariate normality and homogeneity of the covariance matrix.
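A hedged sketch of these two checks in R follows. shapiro.test() is base R; boxM() here is taken from the heplots package (one of several available implementations), and `X` and `group` are hypothetical stand-ins for the predictors and the collectibility grouping.

```r
# Hedged sketch of the discriminant-analysis assumption checks.
library(heplots)
set.seed(5)
X     <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
group <- factor(sample(c("current", "non-current"), 100, replace = TRUE))

sapply(X, function(v) shapiro.test(v)$p.value)  # per-variable Shapiro-Wilk
boxM(X, group)                                  # homogeneity of covariance
```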

The data are then analyzed using the integrated cluster with discriminant analysis. Based on the analysis carried out, the following integrated cluster discriminant model is obtained:

$$y_i = 0.0838x_{1i} + 0.0606x_{2i} - 0.0241x_{3i} + 0.0569D_1x_{1i} + 0.0358D_1x_{2i} - 0.0752D_1x_{3i} \tag{12}$$

Cluster 1 (D1 = 0)

$$y_i = 0.0838x_{1i} + 0.0606x_{2i} - 0.0241x_{3i} \tag{13}$$

Cluster 2 (D1 = 1)

$$y_i = 0.1407x_{1i} + 0.0964x_{2i} - 0.0993x_{3i} \tag{14}$$

Based on the above equations, the coefficients of age and credit period are positive, meaning that the higher the age and the longer the credit period, the greater the probability that customers in cluster 1 and cluster 2 have current credit collectibility. Conversely, loan to value has a negative coefficient, so an increase in its value increases the probability of non-current credit collectibility. The variable that most influences credit collectibility in both clusters is age, which has the largest discriminant coefficient. The classification accuracy, sensitivity, and specificity of the integrated cluster analysis method with discriminant analysis are shown in Table 6.

Criterion | Percentage
Classification accuracy | 84%
Sensitivity | 84%
Specificity | 16%

Table 6.

Value of classification accuracy, sensitivity, and specificity.

Based on Table 6, the classification accuracy is 84%, meaning the model correctly classifies 84 out of 100 customers. A sensitivity of 84% means that 60 of the 71 customers in the current category are classified correctly by the model, and a specificity of 16% means that 5 of the 29 customers in the non-current category are classified correctly.


6. Regression analysis with dummy variable

6.1 Regression analysis

Regression analysis is a method that describes the magnitude of the relationship between variables. It is divided into two forms: simple regression analysis, which involves one predictor variable and one response variable, and multiple regression analysis, which involves several predictor variables and one response variable. Based on the Gauss-Markov theorem, regression analysis rests on several classical assumptions that must be met: the specified relationship between the variables is correct, the predictor variables are fixed (non-stochastic), the error variance is homogeneous, the errors are non-autocorrelated and normally distributed, and the predictors are non-multicollinear [11].

6.2 Regression analysis with dummy variables

There are many ways to build a regression model with qualitative predictor variables; one of them is regression with dummy variables. A dummy variable is a variable used to obtain an estimator in a regression model involving qualitative predictor variables [12]. The assumptions underlying regression with a dummy variable are the same as those without one, because adding a dummy variable is equivalent to adding an ordinary predictor variable.

There are several rules for coding dummy variables, for example using binary code (0, 1). Suppose a qualitative predictor variable has two categories (category 1 and category 2); the dummy variable can then be defined as in the following equation:

$$D = \begin{cases} 1, & \text{for category 1} \\ 0, & \text{otherwise} \end{cases} \tag{15}$$

The regression model with dummy variables can be expressed in the following equation:

$$Y_i = \beta_0 + \beta_1 D + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon_i \tag{16}$$

where,

Yi: the value of the response variable at the i-th observation

Xi: the value of the i-th predictor variable

β: regression model parameter

εi: random error at the i-th observation

i: index for observations, i = 1, 2, …, n

Dummy variables can be entered into the regression model in three different ways (see the R sketch after this list), namely:

  1. Dummy variable as intercept component

  2. Dummy variable as slope component

  3. Dummy variables as components of intercept and slope
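```r
# Sketch of the three specifications on simulated (hypothetical) data:
set.seed(6)
n  <- 50
D  <- rbinom(n, 1, 0.5)
X2 <- rnorm(n)
Y  <- 1 + 0.5 * D + 0.8 * X2 + rnorm(n)

lm(Y ~ D + X2)     # 1. dummy as intercept component
lm(Y ~ X2 + D:X2)  # 2. dummy as slope component only
lm(Y ~ D * X2)     # 3. dummy in both intercept and slope
```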

6.3 Application of regression analysis with dummy variables

The available data are as follows: Y = willingness to pay; X1 = a dummy variable whose category 1 indicates that incomes within one family are not combined, while category 2 indicates that they are combined; X2 = service quality; X3 = environment; and X4 = fairness.

The regression model formed is Y = b0 + b1D + b2X2 + b3X3 + b4X4 (Figure 1).

Figure 1.

Output R.
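A hedged sketch of the call that would produce output like Figure 1 follows. The data frame `credit` is a hypothetical stand-in for the chapter's bank data, with willingness to pay (WTP), the income dummy D, and X2 (service quality), X3 (environment), and X4 (fairness).

```r
# Hedged sketch of the dummy-variable regression behind Figure 1.
set.seed(7)
credit <- data.frame(WTP = rnorm(100), D = rbinom(100, 1, 0.5),
                     X2 = rnorm(100), X3 = rnorm(100), X4 = rnorm(100))
fit <- lm(WTP ~ D + X2 + X3 + X4, data = credit)
summary(fit)  # coefficient table of the kind shown in Figure 1
```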

Based on the regression analysis performed, the regression model with dummy variables is obtained as follows:

$$Y = 0.54088 + 0.08676D + 0.1579X_2 + 0.4309X_3 + 0.2545X_4 \tag{17}$$

This model makes it possible to quantify the difference in willingness to pay between creditors whose family income is combined and those whose income is not combined.

6.4 Assumptions of regression analysis with dummy variable

6.4.1 Non-multicollinearity

Multicollinearity is a problem in regression in which the predictor variables are correlated with one another. A good regression model has no multicollinearity problem. Multicollinearity can be checked with the VIF value: if VIF < 10, there is no multicollinearity problem. In the data used here, the VIF values for all variables are less than 10, so the assumption of non-multicollinearity is fulfilled.

6.4.2 Normality error

The normality assumption requires that the errors be normally distributed with mean 0 and variance σ2. Normality of the errors can be tested with the Shapiro-Wilk test.

H0: errors are normally distributed

H1: errors are not normally distributed

α=5%

The normality test gives a p-value of 0.91 (greater than 0.05), which means the errors are normally distributed; the normality assumption is therefore met.

6.4.3 Non-autocorrelation

The non-autocorrelation assumption test checks whether the errors of different observations are correlated. If the covariance or correlation between errors is not equal to zero, the assumption is violated. The test can be carried out using the Durbin-Watson method. Based on the Durbin-Watson test, a p-value of 0.6132 was obtained, which means the data meet the non-autocorrelation assumption.

6.4.4 Homoscedasticity

The homoscedasticity assumption states that the error variance remains constant as the mean increases; however, an increase in the mean may also increase the variance, so the homogeneity of variance must be tested. This test ensures that the resulting estimators are efficient. Homoscedasticity can be tested with the Breusch-Pagan method.

Based on the Breusch-Pagan test, a p-value of 0.130 (greater than 0.05) was obtained, which means the data meet the homoscedasticity assumption.
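Collecting the four checks above, a minimal sketch using the car and lmtest packages, applied to a fitted lm object `fit` (for instance, the one from the Figure 1 sketch), could look like this:

```r
# Sketch of the four assumption checks on a fitted lm object `fit`.
library(car)     # vif()
library(lmtest)  # dwtest(), bptest()

vif(fit)                      # non-multicollinearity: want VIF < 10
shapiro.test(residuals(fit))  # normality of errors (Shapiro-Wilk)
dwtest(fit)                   # non-autocorrelation (Durbin-Watson)
bptest(fit)                   # homoscedasticity (Breusch-Pagan)
```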

6.5 Parameter significance test

  1. Simultaneous test

    H0: β1=β2=β3=β4=0

    H1: there is at least one βi0

    α=5%

    Based on the analysis, a p-value of 0.000 is obtained, which means at least one regression coefficient is significant.

  2. Partial test

    H0: βi=0

    H1: βi0

    α=5%

Based on the analysis, it was found that three regression coefficients have a p-value of less than 0.05. The three regression coefficients are the coefficients of the variables X2 (Quality of Service), X3 (Environment), and X4 (Fairness). This means that Service Quality, Environment, and Fairness have a significant effect on Willingness to Pay.

6.6 Model interpretation

The model obtained, which fulfills all the assumptions of regression analysis with dummy variables, is as follows:

$$y = 0.54088 + 0.08676D + 0.1579x_2 + 0.4309x_3 + 0.2545x_4 \tag{18}$$

With this model, the difference in willingness to pay between creditors whose family income is combined and those whose income is not combined can be determined. The coefficient of the dummy variable is 0.08676, meaning that when the incomes of creditors in one family are combined, willingness to pay is greater than when incomes are not combined. The estimated regression coefficient for X2 (service quality) is 0.1579, meaning that the better the bank's service quality, the greater the willingness to pay. The estimated coefficient for X3 (environment) is 0.4309, meaning that the better the creditor's environmental conditions, the greater the willingness to pay. Similarly, the estimated coefficient for X4 (fairness) is 0.2545, meaning that the fairer the bank institution, the more interested creditors are in paying.


7. Conclusion

The use of cluster analysis in statistical modeling greatly facilitates capturing the diversity of objects, so that objects with the same characteristics can be grouped together. This is useful in classification methods such as discriminant analysis, because objects within one group are more homogeneous while diversity between groups is high. The novelty of this chapter is therefore the perspective of the dummy variable technique in which the number of categories of the dummy variable is determined by the number of clusters formed from the cluster analysis. This can then be carried into statistical modeling, helping researchers divide objects into several groups according to the characteristics of each object while minimizing the diversity within each group.


Conflict of interest

The authors declare no conflict of interest.

References

  1. Stattrek.com. Dummy Variables in Regression. 2021. Available from: https://stattrek.com/multiple-regression/dummy-variables.aspx
  2. Displayr.com. What are Dummy Variables? 2019. Available from: https://www.displayr.com/what-are-dummy-variables/
  3. Skrivanek S. The Use of Dummy Variables in Regression Analysis. Powell, OH: MoreSteam, LLC; 2009
  4. Artaya IP. Analisa Regresi Linier Berganda Metode Dummy Banyak Kriteria. 2019. DOI: 10.13140/RG.2.2.13471.41122
  5. Fernandes AAR. Metode Statistika Multivariat Pemodelan Persamaan Struktural (SEM) Pendekatan WarPLS. Malang: UB Press; 2017
  6. Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. Upper Saddle River, NJ: Prentice-Hall; 2007
  7. Gudono. Analisis Data Multivariat Edisi Pertama. Yogyakarta: BPFE; 2011
  8. Tatham RL, Hair JF, Anderson RE, Black WC. Multivariate Data Analysis. New Jersey: Prentice Hall; 1998
  9. Wong HB, Lim GH. Measures of Diagnostic Accuracy: Sensitivity, Specificity, PPV and NPV. Proceedings of Singapore Healthcare. 2011;20(4):316-318
  10. Gujarati D. Ekonometri Dasar: Terjemahan Sumarno Zein. Jakarta: Erlangga; 2003
  11. Nawari. Analisis Regresi dengan MS Excel 2007 dan SPSS 17. Jakarta: PT Elex Media Komputindo; 2010
  12. Le Cessie S, Van Houwelingen JC. Logistic regression for correlated binary data. Journal of the Royal Statistical Society: Series C (Applied Statistics). 1994;43(1):95-108
