Table 1. Summary information of sample data

## 1. Introduction

In actual test construction, predicting the test score and response-time distributions has become important (Ueno, 2006; Ueno, 2007). The following situations exemplify this importance:

In the case of absolute grading in qualifying examinations, the qualifying score is selected in advance. The test score and response-time distributions must be approximately constant so that among tests, the degree of difficulty does not differ.

In selection tests such as the entrance examinations at national centres, examinees can choose among subjects such as mathematics A or B and science A, B or C. The score and response-time distributions must be approximately the same across the subject choices so that the degree of difficulty of the test does not differ between them.

In selection tests such as entrance examinations, examinees can also choose which large problem to answer within a test. In this case, the score and response-time distributions must be approximately the same for each large problem so that the degree of difficulty is roughly equal among them.

As these situations show, test-authors should construct tests while predicting the score and response-time distributions in order to keep the difficulty equal. Nevertheless, in many cases test-authors construct tests simply by referring to past tests. Ueno (2006, 2007) observed that the national center for entrance examinations persisted in aiming at a particular average score, while test reliability was seriously impaired in the actual tests.

Recently, with the increasing use of e-testing, test construction utilizing an item bank has become the standard method, allowing predictions of score and response-time distributions during actual test construction. For example, Nagaoka (2000) developed a test score prediction system using a truncated exponential distribution. Ueno (2005b) proposed a web-based computerized testing system with a test construction support system that employs the mixture binomial distribution to predict the score distribution. Meanwhile, the only research that focuses on the score distribution itself is that by Keats and Lord (1962), who applied the beta-binomial distribution as a score distribution.

On the other hand, a lot of research has been done on test response-time distributions. In the educational engineering field, the response curve has been investigated for analyzing response-time data obtained from response analyzers and computer-based testing systems (for example, Nagaoka, 2003; Nagaoka and Wu, 1989; Ueno, 2005a). Moreover, in the field of psychological measurement, Thissen (1983) applied a lognormal distribution to test response-time data, Verhelst et al. (1997) used a gamma distribution for a speed test, and Roskam (1997) applied the Weibull distribution to test response-time.

As described above, the related studies have rarely compared several models from the viewpoint of prediction accuracy for score and response-time distributions. Therefore, we first conducted comparison experiments on prediction models of the test score and response-time distributions using actual data. Because few test score prediction models have been proposed, this paper proposes 1) a mixture beta-binomial distribution, which combines the beta-binomial and mixture binomial distributions, and 2) a score distribution based on item response theory (IRT), using the Rasch model (Rasch, 1966) and the two-parameter logistic model (Lord and Novick, 1968). We compared the proposed models with the binomial distribution (Ueno, 2005b), the mixture binomial distribution (Ueno, 2005b), the beta-binomial distribution, and the truncated exponential distribution (Nagaoka, 2000). In addition, comparison experiments among response-time prediction models were performed: we compared the prediction precision of the normal distribution, the lognormal distribution, the extended gamma distribution (Ueno, 2005a) and the Weibull distribution for test response-time.

The results show that the score distribution based on IRT (two-parameter logistic model) and the extended gamma distribution produced the best prediction precision for score and response-time distributions, respectively. Furthermore, the prediction tool using the score distribution based on IRT (two-parameter logistic model) and the extended gamma distribution was installed in the e-testing construction support system. Some evaluation experiments done using this system are described here. The results show the effectiveness of the proposed system. In addition, evaluation obtained from questionnaires also confirms the effectiveness of this system.

## 2. E-Testing System

An e-testing system (Songmuang and Ueno, 2005) was developed to support test-authors in creating test items, constructing tests, delivering tests, and analyzing test history data. The system consists of the following components: 1) an item authoring system, 2) an e-testing construction support system, 3) a data analysis system, 4) an item bank, 5) a test database, and 6) a test delivery system, as shown in Figure 1. This paper focuses on the e-testing construction support system and the development of a tool to predict test score and response-time distributions.

| Test | Content | Number of items | Number of examinees | Rate of correct answers | Standard deviation (rate of correct answers) | Average response-time (seconds) | Standard deviation (response-time) |
|---|---|---|---|---|---|---|---|
| A | Japanese Language Test | 69 | 91 | 0.445 | 0.198 | 1421.05 | 1366.60 |
| B | Introduction to Computers | 54 | 62 | 0.270 | 0.217 | 659.10 | 569.54 |
| C | Information Theory | 20 | 72 | 0.739 | 0.903 | 204.11 | 284.33 |
| D | Network Technology | 37 | 66 | 0.241 | 0.142 | 622.91 | 614.07 |

## 3. Comparison Experiment of the Prediction Models

### 3.1. Method

In this section, to develop a prediction tool for the e-testing construction support system, we evaluated several score and response-time prediction models using actual data. Table 1 lists the sample data for the experiments, obtained from ‘N’ University. Test A concerned the Japanese Language Proficiency Test. Tests B, C and D focused on information science. The sample data were collected from students who participated in e-learning classes.

The experiment was divided into two parts. Initially, we evaluated the fit of the estimated distributions using the models to the sample data by comparing the difference between the estimated distributions and the sample data. Subsequently, we evaluated the prediction precision of the predicted distributions using the models. In this part, the sample data were randomly divided into halves, which were used to predict the distributions (the training data) and verify the predicted distributions (the validation data). This procedure was repeated 1000 times for each set of test content data.

In both parts of the experiment, the score distributions were evaluated by transforming the estimated/predicted score distributions and the actual data into discrete distributions in ten stages. The response-time distributions were evaluated by transforming both the estimated/predicted response-time distributions and the actual data into cumulative distributions. The difference between the estimated/predicted distributions and the actual data, which indicates the precision of the models, was calculated using the root mean square error (RMSE).
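The evaluation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the equal-width binning into ten stages, the function names, and the `fit_and_predict` callback are assumptions made for the example.

```python
import math
import random

def ten_stage_histogram(scores, max_score, stages=10):
    """Relative frequencies of scores in `stages` equal-width bins over [0, max_score]."""
    counts = [0] * stages
    for s in scores:
        # A score equal to max_score is placed in the last bin.
        k = min(int(s * stages / max_score), stages - 1)
        counts[k] += 1
    n = len(scores)
    return [c / n for c in counts]

def rmse(predicted, observed):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted))

def split_half_rmse(scores, max_score, fit_and_predict, repeats=1000, seed=0):
    """Average RMSE over random half splits of the data.

    fit_and_predict(train_scores) is a hypothetical callback that fits a model
    on the training half and returns its predicted ten-stage distribution.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        shuffled = list(scores)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, valid = shuffled[:half], shuffled[half:]
        predicted = fit_and_predict(train)
        observed = ten_stage_histogram(valid, max_score)
        total += rmse(predicted, observed)
    return total / repeats
```

As a baseline, passing `lambda train: ten_stage_histogram(train, max_score)` as `fit_and_predict` measures how well the training half alone predicts the validation half; a parametric model would replace that callback.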

| Test (fit) | Binomial Distribution | Mixture Binomial Distribution | Beta-binomial Distribution | Mixture Beta-binomial Distribution | Rasch Model | 2-Parameter Logistic Model | Truncated Exponential Distribution |
|---|---|---|---|---|---|---|---|
| A | 0.00356 | 0.00042 | 0.00052 | 0.00041 | 0.00059 | 0.00041 | 0.00052 |
| B | 0.00570 | 0.00235 | 0.00319 | 0.00120 | 0.00160 | 0.00118 | 0.00159 |
| C | 0.00766 | 0.00332 | 0.00321 | 0.00262 | 0.00267 | 0.00192 | 0.00277 |
| D | 0.00464 | 0.00098 | 0.00178 | 0.00060 | 0.00093 | 0.00075 | 0.00079 |

| Test (prediction) | Binomial Distribution | Mixture Binomial Distribution | Beta-binomial Distribution | Mixture Beta-binomial Distribution | Rasch Model | 2-Parameter Logistic Model | Truncated Exponential Distribution |
|---|---|---|---|---|---|---|---|
| A | 0.00346 | 0.00055 | 0.00062 | 0.00056 | 0.00067 | 0.00053 | 0.00062 |
| B | 0.00591 | 0.00239 | 0.00324 | 0.00134 | 0.00168 | 0.00132 | 0.00173 |
| C | 0.00737 | 0.00339 | 0.00331 | 0.00269 | 0.00279 | 0.00179 | 0.00286 |
| D | 0.00422 | 0.00105 | 0.00182 | 0.00073 | 0.00095 | 0.00068 | 0.00089 |

### 3.2. Experiment Results of Score Distribution

In this section, we introduce new prediction models for test score distribution. The mixture beta-binomial distribution, which combined the mixture binomial distribution (Ueno, 2005b) and beta-binomial distribution (Keats and Lord, 1962), is defined as follows:

where $i$ $(1,\dots,m)$ indicates the $i$-th item, $n$ indicates the total number of examinees taking the test, $B(\alpha,\beta)$ is the beta function, and $\alpha$ and $\beta$ are the estimated parameters of each item.
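To make the idea of mixing beta-binomial components concrete, the following sketch evaluates a finite mixture of beta-binomial probability mass functions. It is an illustration under stated assumptions rather than the definition used in this paper: the equal default weights, the function names, and the reading of `n` as the number of trials of each component are all assumptions introduced here.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(x, n, alpha, beta):
    """P(X = x) for a beta-binomial with n trials and shape parameters alpha, beta."""
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return math.exp(log_choose + log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta))

def mixture_beta_binomial_pmf(x, n, params, weights=None):
    """Weighted average of beta-binomial components.

    params is a list of (alpha, beta) pairs, one per component.
    Equal weights are used when none are given (an assumption of this sketch).
    """
    if weights is None:
        weights = [1.0 / len(params)] * len(params)
    return sum(w * beta_binomial_pmf(x, n, a, b) for w, (a, b) in zip(weights, params))
```

With `alpha = beta = 1` a single component reduces to the uniform distribution on `0..n`, which is a convenient sanity check for the implementation.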

Various IRT models exist, but our approach uses the Rasch model (Rasch, 1966) and the two-parameter logistic model (Lord and Novick, 1968). For the IRT, $u_{ij}$ indicates the response of examinee $j$ $(1,\dots,n)$ on item $i$ $(1,\dots,m)$ as follows:

$$u_{ij} = \begin{cases} 1 & \text{if examinee } j \text{ answers item } i \text{ correctly} \\ 0 & \text{otherwise} \end{cases}$$

The probability of a correct answer in the two-parameter logistic model is:

$$P(u_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\{-a_i(\theta_j - b_i)\}}$$

where $\theta_j$ is person $j$'s ability parameter, $a_i$ is the $i$-th item's discrimination parameter, which represents the degree to which the item discriminates between persons in different regions on a latent ability scale, and $b_i$ is the $i$-th item's difficulty parameter. The Rasch model is a special case of the two-parameter logistic model obtained by setting $a_i = 1$. Here, the ability distribution is assumed to be the standard normal distribution. In this paper, we propose a test score distribution based on the IRTs as follows:
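One standard way to obtain a number-correct score distribution from a two-parameter logistic model is sketched below. It is not necessarily the formulation used in this paper: it assumes the logistic form without a scaling constant, builds the conditional score distribution at each ability value with the usual item-by-item recursion (often attributed to Lord and Wingersky), and approximates the integral over the standard normal ability distribution on an evenly spaced grid. The grid limits, the number of grid points, and all function names are assumptions.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct answer (no scaling constant; an assumption)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def conditional_score_dist(theta, items):
    """Distribution of the number-correct score at ability theta.

    items is a list of (a, b) pairs. Items are added one at a time.
    """
    dist = [1.0]  # with zero items the score is 0 with probability 1
    for a, b in items:
        p = p_correct(theta, a, b)
        new = [0.0] * (len(dist) + 1)
        for x, prob in enumerate(dist):
            new[x] += prob * (1.0 - p)  # item answered incorrectly
            new[x + 1] += prob * p      # item answered correctly
        dist = new
    return dist

def score_distribution(items, lo=-6.0, hi=6.0, points=121):
    """Marginal score distribution under a standard normal ability distribution."""
    step = (hi - lo) / (points - 1)
    thetas = [lo + k * step for k in range(points)]
    weights = [math.exp(-0.5 * t * t) for t in thetas]  # unnormalised normal density
    total_w = sum(weights)
    marginal = [0.0] * (len(items) + 1)
    for t, w in zip(thetas, weights):
        for x, prob in enumerate(conditional_score_dist(t, items)):
            marginal[x] += prob * w / total_w
    return marginal
```

Setting every `a` to 1 gives the Rasch case. For a single item with `a = 1` and `b = 0`, symmetry implies that both scores have probability 0.5, which can be used to check the numerical integration.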

Next, the comparison experiments and results are described. The score distribution models used in this experiment were the binomial distribution (Ueno, 2005b), the mixture binomial distribution (Ueno, 2005b), the beta-binomial distribution, the mixture beta-binomial distribution, the truncated exponential distribution (Nagaoka, 2000), and the score distribution based on IRT (Rasch model and two-parameter logistic model). The parameters of the models were estimated using the sample data.

The fit of each estimated score distribution to the actual data was evaluated using RMSE. The results are shown in the top of Table 2. The smallest RMSEs across all models of each test are underlined. According to the results, the score distribution based on IRT (two-parameter logistic model) gave the best fit to the score distributions of tests A, B and C. The best-fitting model for the test D data was the mixture beta-binomial distribution, which is also proposed in this paper.

The prediction precisions of the predicted score distribution models were evaluated using the above-mentioned method. Based on the results in the bottom of Table 2, the score distribution based on IRT (two-parameter logistic model) gave the best results in predicting score distributions of data from tests A, B, C and D.

From the above results, the score distribution based on IRT (two-parameter logistic model), which is proposed in this paper, had the best performance results in both evaluations.

### 3.3. Experiment Results of Response-Time Distribution

The response-time distribution models in this experiment were normal distribution, lognormal distribution, the extended gamma distribution (Ueno, 2005a), and the Weibull distribution. The parameters of these models were estimated using the sample data.
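As an illustration of how one of these models can be fitted and compared with the data on the cumulative scale, the sketch below estimates a lognormal distribution by maximum likelihood and computes the RMSE between the model's cumulative distribution and the empirical one. The extended gamma distribution is not reproduced because its form is given in Ueno (2005a), not here. The function names and the choice to evaluate the error at the observed times are assumptions of this example.

```python
import math

def fit_lognormal(times):
    """Maximum-likelihood estimates (mu, sigma) of a lognormal from positive times."""
    logs = [math.log(t) for t in times]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def lognormal_cdf(t, mu, sigma):
    """Cumulative distribution function of the lognormal at t > 0 (sigma must be > 0)."""
    return 0.5 * (1.0 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2.0))))

def cdf_rmse(times, cdf):
    """RMSE between a model CDF and the empirical CDF, evaluated at the observed times.

    The empirical CDF at the k-th smallest observation is taken as k / n,
    which treats the observations as distinct.
    """
    ordered = sorted(times)
    n = len(ordered)
    squared = 0.0
    for k, t in enumerate(ordered, start=1):
        squared += (cdf(t) - k / n) ** 2
    return math.sqrt(squared / n)
```

A typical call would be `mu, sigma = fit_lognormal(times)` followed by `cdf_rmse(times, lambda t: lognormal_cdf(t, mu, sigma))`; other candidate models can be compared by passing their own CDF.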

The fit of each estimated response-time distribution to the actual data was evaluated using RMSE. The results are shown in the top of Table 3. The smallest RMSEs across all models of each test are underlined. According to these results, the extended gamma distribution gave the best fit to the response-time distributions of tests A and D. The best-fitting model for the test B data was the normal distribution, while that for the test C data was the lognormal distribution. Thus, we could not determine which model offered the best performance in this experiment because no single model fit best across the tests.

| Test (fit) | Normal | Lognormal | Gamma | Weibull |
|---|---|---|---|---|
| A | 0.758 | 1.195 | 0.157 | 0.92 |
| B | 1.043 | 3.167 | 2.812 | 4.266 |
| C | 5.232 | 1.226 | 3.199 | 1.271 |
| D | 0.979 | 1.294 | 0.534 | 1.330 |

| Test (prediction) | Normal | Lognormal | Gamma | Weibull |
|---|---|---|---|---|
| A | 4.481 | 1.961 | 1.032 | 1.802 |
| B | 4.261 | 4.455 | 1.578 | 4.271 |
| C | 4.826 | 2.954 | 1.493 | 1.630 |
| D | 1.007 | 2.080 | 1.002 | 1.220 |

Next, the prediction precisions of the response-time distribution models were evaluated using the above-mentioned method. As can be seen in the bottom of Table 3, the extended gamma distribution produced the best results in predicting the response-time distributions of data from tests A, B, C and D.

From the above results, the extended gamma distribution gave the best results overall: it was the most precise predictor for all four tests and the best-fitting model for tests A and D.

## 4. E-Testing Construction Support System with Score and Time Prediction Tool

Based on the results of the previous section, we employed the score distribution based on IRT (two-parameter logistic model) and the extended gamma distribution as the prediction models for the score and response-time distributions, respectively. The prediction tool is installed in the e-testing construction support system. Figure 2 shows the interface of the developed system.

The top frame of the interface shows the test attributes (test name, test ID, etc.). The left frame displays a list of item IDs and classifies them by difficulty using a colour scheme that changes from green, indicating lower difficulty, to red, indicating higher difficulty. When an item ID is clicked, the content of the item is presented in a pop-up window, as shown in Figure 2. Test-authors select items for the test by clicking the check boxes to the left of the item IDs. The selected items are then registered in the test memory and shown in the central frame of the interface, together with their average response-time and degree of difficulty. Test-authors can change the order of items with the ↑↓ buttons or delete them by clicking the “Delete” button.

The prediction tool for score and response-time distributions utilizes the statistical history data of the items registered in the test memory. The predicted distributions (the score distribution based on IRT (two-parameter logistic model) and the extended gamma distribution) are presented in the bottom frame of the interface. In addition, the predicted average score, the predicted standard deviation in score, the predicted average response-time, and the predicted standard deviation in response-time are presented under each graph. Thus, test-authors can use this prediction tool to decide whether to finish test construction or to replace or add an item.
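The summary values shown under the score graph follow directly from a predicted score distribution. A minimal sketch, assuming the prediction is available as a list `pmf` in which `pmf[x]` is the predicted probability of score `x` (the name and representation are hypothetical):

```python
import math

def mean_and_sd(pmf):
    """Mean and standard deviation of a discrete score distribution pmf[x] = P(score = x)."""
    mean = sum(x * p for x, p in enumerate(pmf))
    variance = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
    return mean, math.sqrt(variance)
```

Recomputing these two values each time an item is added to or removed from the test memory is enough to keep the displayed predictions current.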

## 5. System Evaluation

This section evaluates the effectiveness of the proposed e-testing construction support system. For this experiment, we determined in advance the average score, the score standard deviation, the average response-time, and the response-time standard deviation as the goal attributes of the test. We then compared the differences between these goals and the actual attributes of tests constructed using the following systems: the proposed system (system Ⓐ), system Ⓐ without the prediction tool (system Ⓑ), and system Ⓑ without the display of the difficulty and average response-time of items in the central frame (system Ⓒ).

Thirty Japanese postgraduates of university ‘N’ were randomly assigned to systems Ⓐ, Ⓑ and Ⓒ in groups of ten, and tests were constructed with each system. Seventy items from Level 1 of the Japanese Proficiency Test were used, with response data obtained from 197 examinees. Concretely, the historical data of 97 examinees randomly sampled from the 197 examinees were used for the prediction tool. The RMSEs were calculated using the remaining 100 examinees’ data as the validation data for the tests constructed using systems Ⓐ, Ⓑ and Ⓒ.

The results show that the average RMSE of the tests constructed with system Ⓐ was the lowest for all of the goal attributes. Moreover, Table 4 shows the results of the multiple comparisons (WSD method) between the tests’ RMSEs, where *d* is the difference between the two compared averages, WSD is the test statistic, and * indicates a significant difference at the 5% significance level. According to the results in Table 4, the proposed system differs significantly at the 5% level from the comparison systems in average score, score standard deviation, average response-time, and response-time standard deviation, indicating that the prediction tool of this system was highly effective.

| Average RMSE of | Pair | *d* | WSD | Significant (5%) |
|---|---|---|---|---|
| Average score | Ⓐ-Ⓑ | 2.267 | 2.088 | * |
| Average score | Ⓑ-Ⓒ | 1.368 | 2.564 | |
| Average score | Ⓒ-Ⓐ | 3.635 | 2.811 | * |
| Standard deviation of score | Ⓐ-Ⓑ | 2.946 | 2.651 | * |
| Standard deviation of score | Ⓑ-Ⓒ | 4.930 | 3.256 | * |
| Standard deviation of score | Ⓒ-Ⓐ | 7.876 | 3.570 | * |
| Average response-time | Ⓐ-Ⓑ | 9.387 | 14.962 | * |
| Average response-time | Ⓑ-Ⓒ | 11.054 | 16.360 | |
| Average response-time | Ⓒ-Ⓐ | 20.441 | 18.159 | * |
| Standard deviation of response-time | Ⓐ-Ⓑ | 31.693 | 21.748 | * |
| Standard deviation of response-time | Ⓑ-Ⓒ | 6.363 | 19.890 | |
| Standard deviation of response-time | Ⓒ-Ⓐ | 38.056 | 24.167 | * |

## 6. Questionnaire Analysis

Furthermore, we compared the questionnaire results between test-authors who used the prediction tool and those who did not. The contents of the questionnaires and results are listed in Table 5. The questionnaires consisted of multiple-choice questions with five answers: 1 definitely no, 2 no, 3 not sure, 4 yes, and 5 definitely yes.

The results in Table 5 indicate that the average scores of the questions related to the standard deviation of the score (with and without the prediction tool) are higher than those related to the average of the score. This means that predicting the standard deviation in score of the constructed tests was more difficult than predicting the average score, whether or not the system was used. The same holds for the response-time. However, the average scores of all questions answered by test-authors supported by the prediction tool are lower than those answered by test-authors who were not, that is, they reported less difficulty. Moreover, the difference between the cases where the system was used and not used was significant at the 1% level for all questions. This difference shows that it was easier to predict each goal attribute utilizing the proposed system.

| Question | Without the prediction tool: average score (standard deviation) | With the prediction tool: average score (standard deviation) | t | p-value |
|---|---|---|---|---|
| Was it difficult to predict the average response-time of the constructed test? | 2.844 (1.322) | 1.968 (1.092) | 2.886 | 0.005 |
| Was it difficult to predict the standard deviation of response-time of the constructed test? | 3.969 (0.782) | 2.313 (1.030) | 7.245 | 0.001 |
| Was it difficult to predict the average score of the constructed test? | 3.250 (1.244) | 1.813 (0.931) | 5.232 | 0.001 |
| Was it difficult to predict the standard deviation of score of the constructed test? | 4.125 (0.793) | 2.281 (0.991) | 8.215 | 0.001 |

## 7. Conclusion

We presented an e-testing construction support system with a prediction tool for the test score and response-time distributions. Moreover, 1) the mixture beta-binomial distribution, which combines the beta-binomial distribution and the mixture binomial distribution, and 2) the test score distribution based on item response theory (Rasch model and the two-parameter logistic model) were proposed in this paper. The proposed models were compared with the traditional models (the binomial, the mixture binomial, the beta-binomial, and the truncated exponential distributions). Furthermore, we compared the traditional response-time distributions (the normal, the lognormal, the extended gamma, and the Weibull distributions).

The results indicated that the score distribution based on item response theory (two-parameter logistic model) for test score distribution and the extended gamma distribution for response-time distribution showed the best prediction performance. Using these results, we developed the e-testing construction support system to visualize the test score and response-time distributions for the constructed test. We showed the effectiveness of the system from evaluation experiments.

We compared the difference of the errors between goal attributes and actual values of tests constructed with and without the system using the actual item bank. In addition, the average score, the standard deviation of score, the average response-time, and the standard deviation of response-time as the test goal attributes were decided prior to the evaluation. The results show that errors in goal attributes can be significantly decreased utilizing the proposed system.

Moreover, questionnaire evaluations of the test construction process were compared between test-authors who used this system and those who did not. The results reveal the effectiveness of this system. A future task is to increase the types and amount of data considered when evaluating the prediction models.

## References

1. Keats J. A., Lord F. M. (1962). A theoretical distribution for mental test scores. Psychometrika, 27, 59-72. ISSN 0033-3123.
2. Lord F. M., Novick M. R. (1968). Statistical Theories of Mental Test Scores. Addison-Wesley. ISBN 0-39434-771-4.
3. Nagaoka K. (2000). On Development of Computerized Testing System for Practical Use with Estimation Function. Japan Journal of Educational Technology, 24(1), 63-72. ISSN 0385-5236. (In Japanese)
4. Nagaoka K. (2003). Response Curve of Group Learning. In: Educational technology dictionary, 265-266, Jikkyo Shuppan. ISBN 4-40705-110-8. (In Japanese)
5. Nagaoka K., Wu Y. (1989). Analysis of response time data from computer testing. Japanese Journal of Educational Technology, 12(4), 129-137. ISSN 0385-5236.
6. Rasch G. (1966). An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 19, 49-57. ISSN 0007-1102.
7. Roskam E. E. (1997). Models for speed and time-limit test. In: Handbook of Modern Item Response Theory, W. J. van der Linden and R. K. Hambleton (Eds.), 187-208, Springer, New York. ISBN 0-38794-661-6.
8. Songmuang P., Ueno M. (2005). e-Testing Management System. Proceedings of World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education 2005, 3139-3148.
9. Thissen D. (1983). Timed testing: An approach using item response theory. In: New Horizons in Testing: Latent Trait Test Theory and Computerized Adaptive Testing, D. Weiss (Ed.), 179-203, Academic Press, New York. ISBN 0-12742-780-5.
10. Ueno M. (2005a). On-line data-analysis of e-learning response time using gamma distribution. Japan Journal of Educational Technology, 29(2), 107-117. ISSN 0385-5236.
11. Ueno M. (2005b). Web-based computerized testing system for distance education. Educational Technology Research, 28, 59-69. ISSN 0387-7434.
12. Ueno M. (2006). Introduction to test theory. Trend of University Entrance Examination Research, 23, 63-70. (In Japanese)
13. Ueno M. (2007). Statistical analysis of national center for university entrance examinations (basic information relations). Research Bulletin of National Center for University Entrance Examinations, 36, 71-99. (In Japanese)
14. Verhelst N. D., Verstralen H. H. F. M., Jansen M. G. H. (1997). A logistic model for time-limit tests. In: Handbook of Modern Item Response Theory, W. J. van der Linden and R. K. Hambleton (Eds.), 169-185, Springer, New York. ISBN 0-38794-661-6.