Open access peer-reviewed chapter

Matrix as an Alternative Solution for Evaluating Sentence Reordering Tasks

Written By

Amma Kazuo

Submitted: 19 January 2022 Reviewed: 26 January 2022 Published: 02 May 2022

DOI: 10.5772/intechopen.102868

From the Edited Volume

Matrix Theory - Classics and Advances

Edited by Mykhaylo Andriychuk


Abstract

Although sentence reordering is a popular practice in educational contexts, its scoring method has remained virtually 'all-or-nothing'. The author proposed a more psychologically valid means of partial scoring called MRS (Maximal Relative Sequence), in which a point is counted for each ascending run in the answer sequence, gaps being allowed, so that the final score reflects the length of the longest sequence of ascending elements. This scoring method, together with an additional consideration of recovery distances, was woven into an executable programme and then transplanted to Excel so that it could be used without understanding a programming language. However, the use of Excel was severely limited by the number of columns available. This chapter reviews past practices of evaluating partial scoring of reordering tasks and proposes an alternative solution, LM (Linearity Matrix), also executable on Excel, which consumes far fewer columns and can calculate recovery distances as well as MRS scores. Although LM and MRS are different scoring procedures, they both reflect the psychological complexity of the task involved. Furthermore, LM is versatile in that its adjacency weights are adjustable, making it an extended model of Kendall's tau. Some reflections on practical application are offered, together with future directions for the study.

Keywords

  • reordering
  • partial scoring
  • recovery distance
  • Excel
  • Kendall’s tau

1. Introduction

Sentence reordering is one of the popular tasks in reading comprehension [1, 2]. Regrettably, in the field of language testing, the scoring method in practice has been overwhelmingly 'all or nothing', i.e., one can get a full score only if one's answer matches the correct sequence perfectly. 'All or nothing' is simple enough, but it excludes the idea of partial correctness, which Alderson, Percsich, and Szabo see as unfair [3]. There is no consideration of differences in the test-taker's degree of performance.

Consider first a reordering task as an example of a reading comprehension question. Text 1 is taken from Japan's National Centre Examination (2013), with option [b] being the correct answer [4]. The test-takers were told to choose the correct order of the illustrations of a movie story presented to them.

Which of the following shows the order of the scenes as they appear in the movie?

[a] ACBD
[b] ABCD *
[c] BDAC
[d] BADC

Text 1.

Sample reordering task for reading comprehension. *: correct answer; element codes are rearranged for convenience.

According to the test manual, only [b] gets the point; the other options get nothing. The correct sequence [b] contains three continuous ascending runs, A-B, B-C, and C-D, and three discrete ascending runs, A-C, A-D, and B-D. Of these six pairs, [a] satisfies 5, [c] 3, and [d] 4. By another criterion, [a] is reached by dislocating one element (either C or B) from the correct sequence, [c] by two elements, and [d] by two elements. In either case, the three distractors are gradable in terms of proximity to the correct sequence. The 'all or nothing' scoring method accepts only the perfect answer and ignores these differences in the test-taker's partially formed construct.

What should be sought is a rational evaluation method for partial achievement of item reordering tasks. This paper first reviews some past literature concerning this issue, followed by an overview of the series of alternative measurement methods developed by the author. The core proposal is a new measurement scheme, the Linearity Matrix (LM), which compensates for the shortcomings of present practices while ensuring quicker, lighter-weight processing that non-specialists can handle.


2. Literature overview

Alderson et al. examine four alternative methods in pursuit of fairness and high discriminability: (1) 'Exact matching', (2) 'Previous', (3) 'Next', and (4) 'Edges' [3]. These methods were all at a trial stage, and no clear conclusion was reached; above all, they were designed empirically on the basis of ad hoc assumptions. Of them, (4) 'Edges' is a linear extension of (1) 'Exact matching', while (2) 'Previous' and (3) 'Next' are variations of 'Adjacent matching'. 'Exact matching' requires each element to be located in exactly the same position as in the correct answer. In option [a] of Text 1, A and D get points; options [c] and [d] get no points because no element is in the correct position. In 'Adjacent matching', each of the three adjacent pairs gets 1 point. None of the three distractors [a], [c], and [d] gets a point in our example. But if we had an option

[e] CDAB

the initial pair C-D and the final pair A-B would each get 1 point.

Kendall's coefficient tau was originally proposed as a measure of rank-order correlation [5]. It is defined as:

tau = ([number of concordant pairs] − [number of discordant pairs]) / (number of ways of choosing 2 items from n)

In other words,

τ = 4P / (n(n − 1)) − 1    (1)

where

P is the total number of items ranked after a given item by both rankings, and

n is the number of items

Kendall's tau is a popular tool in evaluating the correspondence between machine-translated passages and human translations [6, 7, 8]. Papineni, Roukos, Ward and Zhu, for example, measured tau for the degree of correspondence between the n-grams of the reference text and those of the target text produced by machine translation [9]. This study, as well as other papers with the same interest, deals with open-ended elements for comparison, where additions and reductions of words and phrases naturally occur; it is therefore irrelevant to the present scope of item reordering, in which the set of elements is closed.
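
As a concrete illustration of Eq. (1), the following minimal Python sketch (mine, added for illustration; the implementations reported in this chapter used Xojo and Excel) counts the concordant pairs P of an answer sequence against the correct order and derives tau:

    from itertools import combinations

    def kendall_tau(correct, answer):
        # P: pairs of elements whose relative order in the answer agrees
        # with the correct sequence (concordant pairs)
        rank = {item: i for i, item in enumerate(correct)}
        n = len(correct)
        p = sum(1 for a, b in combinations(answer, 2) if rank[a] < rank[b])
        return 4 * p / (n * (n - 1)) - 1        # Eq. (1)

    print(round(kendall_tau("ABCDE", "CDABE"), 2))   # P = 6, tau = 0.2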

Bollegala, Okazaki and Ishizuka’s ‘Average continuity’ [10] is a variation of ‘Adjacent matching’:

AC = exp[ (1/(k − 1)) × Σ (n = 2 to k) log(P_n + α) ]    (2)

where

k is the maximum number of continuous elements to be considered for calculation,

α is any small number (e.g., 0.001 in Bollegala et al.'s example),

and ‘precision of n continuous sentences’:

P_n = m / (N − n + 1)    (3)

where

n is the length of continuous elements,

m is the number of continuous elements in the correct order, and

N is the number of elements in the correct sequence.

For example, when evaluating an answer CDABE against a correct sequence ABCDE, N = 5 (length of ABCDE), k = 5, m = 2 (count of CD and AB), and n = 2 (length of CD or AB).

Their method is sensitive to continuously running elements, such as CD or AB, or ABCD in ABCDE, with few disorderly elements. In our previous example of Text 1, none of the distractors contains a sequence of elements long enough to earn a positive P_n, resulting invariably in AC = exp(log α) = α.
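
The following Python sketch is one possible reading of Eqs. (2) and (3) (my paraphrase, not Bollegala et al.'s released code; their exact n-gram counting convention may differ):

    import math

    def average_continuity(correct, answer, alpha=0.001, k=None):
        # P_n: proportion of n-grams of the correct sequence that also occur
        # as continuous runs, in the correct order, inside the answer
        N = len(correct)
        k = N if k is None else k                            # assume k runs up to the sequence length
        total = 0.0
        for n in range(2, k + 1):
            ref_ngrams = [correct[i:i + n] for i in range(N - n + 1)]
            ans_ngrams = {answer[i:i + n] for i in range(len(answer) - n + 1)}
            m = sum(1 for g in ref_ngrams if g in ans_ngrams)
            total += math.log(m / (N - n + 1) + alpha)       # Eq. (3), smoothed by alpha
        return math.exp(total / (k - 1))                     # Eq. (2)

    print(round(average_continuity("ABCDE", "BDACE"), 3))    # no correct bigram at all: AC = exp(log alpha) = alpha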

Below is a simulation of AC over all permutations of five elements (A, B, C, D, E), a sample of 120 sequences, given the correct sequence ABCDE. It is only when the length of the continuous sequence is 4 (i.e., ABCD and BCDE) that the AC value appears reasonable (= 0.407 when α = 0.001); otherwise, the values are generally low (Note 1). This tendency is enhanced when α takes a smaller value (Table 1; Note 2).

Alpha        Mean AC   SD of AC
0.001        0.087     0.057
0.0001       0.043     0.045
0.000001     0.012     0.020
0.00000001   0.004     0.015

Table 1.

Mean and SD of AC scores for all permutations of five elements. (A perfect sequence is excluded as an outlier).

Therefore, AC is not an appropriate measurement tool when shorter continuous sequences (i.e., two or three consecutive elements such as AB and ABC) are not infrequent. Furthermore, AC cannot credit cases where ascending elements are not adjacent (e.g., ADBEC, where AC = 0.050; see Note 3).

Lapata prepared all permutations of the order of eight sentences, calculated the tau value for each text, and compared it with human ratings of comprehensibility [11]. She obtained a significant correlation coefficient of r = 0.45 (N = 64) (p. 478). She also claimed to have confirmed that Kendall's tau was able to predict text cohesion by measuring the reading time of the target texts. Although this was one of the few studies validating the effect of randomised sentence order, the final coefficient value alone is not convincing enough to establish the effect of a particular text disturbance on comprehension. Other measurement methods for evaluating disrupted orders could have been equally significant while concealing substantial linguistic differences beneath the surface. Therefore, the next step of this study would be to analyse how the target sequences of sentences are created and organised.

In conclusion, both 'Exact matching' and 'Adjacent matching' are incomplete and counter to intuition. The problem with 'Exact matching' is that it is sensitive to the absolute location of elements, leaving relative sequence out of consideration. This scoring method may be appropriate for a question in which absolute location is significant, e.g., the topic sentence in a paragraph or the initial cause of sequential events. However, if we consider a sequence of combined cause-effect instances, relative locations should also be rewarded with partial points. In contrast, 'Adjacent matching' places too much weight on local adjacency. Even though [e] gets 2 points, the two pairs are twisted in relative order.


3. Maximal relative sequence

A new measurement method called 'Maximal Relative Sequence' (MRS) was proposed by the author [12, 13, 14, 15]. It is intended to capture the longest possible ascending run within the answer while allowing gaps between adjacent elements. The MRS score is the number of transitions, i.e., the number of elements in the MRS minus 1. In an answer CDABE, for example, for which the correct sequence is ABCDE, the longest possible ascending sequence, or MRS, is either CDE or ABE, and the score is 2. Note that there may be multiple MRSs for a single score. MRS is a special case of Levenshtein distance [16] in the sense that there is no addition or reduction of elements.

MRS is logically and psychometrically endorsed with reference to MED (Minimal Edit Distance) in the following simple relationship.

MRS + MED = full score = (number of elements) − 1    (4)

By MED we mean the minimal number of displacements of elements in the answer sequence required to recover the correct sequence, i.e., the number of displacements separating a given answer sequence from the correct sequence. Thus the MED for BACDE is 1, and that for EDCBA is 4. Take CDABE, for example. It is reachable by dislocating either (C and D) or (A and B) from the correct sequence ABCDE, with a displacement count of two in each case. Since measuring MED means counting the elements subject to displacement (e.g., C and D), counting the intact elements (i.e., A, B, and E), namely the elements of the MRS, is its complement, hence Eq. (4). The more displacement from the correct sequence is involved, the remoter the answer sequence is from the correct sequence, both logically and psychometrically. Thus, measurement by MRS is practically equivalent to measurement by MED, bearing an advantage over 'Exact matching' and 'Adjacent matching' with respect to the cognitive load needed for recovery.
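
In computational terms, the MRS score is the length of the longest ascending subsequence of the answer (gaps allowed) minus one, and MED then follows from Eq. (4). A minimal Python sketch (mine, not the author's Xojo programme):

    def mrs_and_med(correct, answer):
        rank = [correct.index(x) for x in answer]   # answer rewritten as ranks in the correct order
        n = len(rank)
        best = [1] * n                              # best[i]: longest ascending run ending at position i
        for i in range(n):
            for j in range(i):
                if rank[j] < rank[i]:
                    best[i] = max(best[i], best[j] + 1)
        mrs = max(best) - 1
        med = (n - 1) - mrs                         # Eq. (4)
        return mrs, med

    print(mrs_and_med("ABCDE", "CDABE"))            # (2, 2): the MRS is CDE or ABE
    print(mrs_and_med("ABCDE", "EDCBA"))            # (0, 4)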


4. Maximal relative sequence with recovery distance

When it comes to recovery, however, MRS is still a rough indicator of partial achievement. Its major flaw is that it does not consider the precise recovery distance of the elements involved. Two answers BACDE and BCDEA would have the same MRS score of 3 (or 1 displacement of A), but their degrees of distortion from the correct answer are obviously different. The author's MRS + Dist model was an attempt to incorporate the recovery distance, i.e., the total number of elements that the elements subject to recovery have to jump over [17, 18]. The final score is calculated as follows:

Adjusted score = MRS × (1 − Penalty rate)    (5)

where

Penalty rate = Recovery distance / Maximal recovery distance,

Maximal recovery distance = n × (n − 1)/2, and

n is the number of elements in the sequence.

Table 2 shows some sample recovery effects.

Answer   MRS   Elements for recovery   Distance   Penalty rate   MRS + Dist
ABCED    3     E                       1          0.1            3 × (1 − 0.1) = 2.7
ABECD    3     E                       2          0.2            3 × (1 − 0.2) = 2.4
EABCD    3     E                       4          0.4            3 × (1 − 0.4) = 1.8
CDABE    2     C, D                    4          0.4            2 × (1 − 0.4) = 1.2
DCBAE    1     B, C, D                 6          0.6            1 × (1 − 0.6) = 0.4
EDCBA    0     A, B, C, D              10         1              0 × (1 − 1) = 0

Table 2.

Sample recovery effects. (Bold letters indicate elements for recovery or ‘disruptors’).
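
The following Python sketch (again mine, added for illustration) reproduces the MRS + Dist values in Table 2; it uses the fact, developed in Section 5, that the minimal recovery distance equals the number of element pairs standing in the wrong relative order:

    from itertools import combinations

    def mrs_plus_dist(correct, answer):
        rank = {x: i for i, x in enumerate(correct)}
        n = len(correct)
        # MRS: longest ascending subsequence (gaps allowed) minus 1
        seq = [rank[x] for x in answer]
        best = [1] * n
        for i in range(n):
            for j in range(i):
                if seq[j] < seq[i]:
                    best[i] = max(best[i], best[j] + 1)
        mrs = max(best) - 1
        # recovery distance: pairs whose relative order is disrupted (see Section 5)
        distance = sum(1 for a, b in combinations(answer, 2) if rank[a] > rank[b])
        penalty = distance / (n * (n - 1) / 2)      # maximal recovery distance = n(n-1)/2
        return mrs * (1 - penalty)                  # Eq. (5)

    for ans in ["ABCED", "ABECD", "EABCD", "CDABE", "DCBAE", "EDCBA"]:
        print(ans, round(mrs_plus_dist("ABCDE", ans), 2))   # 2.7, 2.4, 1.8, 1.2, 0.4, 0.0 as in Table 2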

The author coded a computer programme in Xojo [19] to enable machine calculation, since calculating the recovery distance seemed beyond ocular calculation. One reason for this complication was the need to handle crossing constraints. A crossing constraint prevents redundant recovery moves. In the case of CDABE, the correct procedure is:

Step 1: CDABE → CABDE   (Distance = 2)
Step 2: CABDE → ABCDE   (Distance = 2)
Total distance = 4

However, if we started with C, the process would incur an unnecessary extra movement:

Step 1: CDABE → DABCE   (Distance = 3)
Step 2: DABCE → ABCDE   (Distance = 3)
Total distance = 6

In the first step, C jumped over D, and in the second step D had to jump back over C. This redundancy occurred because C, when moving right, jumped over an older element, D, reversing their relative order, and the reversal had to be rectified in the second step by making D jump over C. We therefore need a set of constraints:

No element can cross over an older element when moving right.    (6)
No element can cross over a younger element when moving left.    (7)

Figure 1 is a flow chart illustrating the procedure for calculating MRS (and Distance as a supplement). The core mechanism of creating an MRS is to concatenate ascending elements in the answer sequence into possible pairs and to connect the tail of one ascending pair with the head of another. In the case of CDABE, the first seeds are CD, DE, AB, BE and AE. Some of them grow into larger sequences, CDE and ABE. The growth stops here since no more concatenation is possible. These are the MRSs, and the count of concatenation steps is the MRS score (= 2). A full set of code is included in [18].

Figure 1.

Flow chart of MRS algorithm.


5. MRS by Excel

Resorting to a computer programme meant that the protocol turned opaque inside a black box. In order to secure transparency, the author devised an Excel spreadsheet in which the computer programme was transplanted into combinations of Excel functions [20].

Readers can trace the concatenation steps in Sections 2 and 3 in Figure 2.

Figure 2.

Sample Excel sheet of MRS and MRS + Dist.

As for the recovery distance, the author introduced the use of Kendall's tau instead of a special computer programme [21]. Given the definition of tau in (1), P is the total number of 'behind' elements. When the correct sequence is ABCDE and the target sequence is CDABE, for example, the elements behind A in ascending order in the target sequence are B and E (2 elements); behind B comes E (1 element); behind C come D and E (2 elements); behind D comes E (1 element). P, in this case, makes 6. Since P indicates the total number of elements that each element would still have to jump over to produce the complete reverse sequence (EDCBA), the maximum of P is

Pmax = n × (n − 1)/2    (8)

where

n is the number of elements.

The answer CDABE is distant from the correct answer by the dislocation of C and D, involving 4 occasions of jumping over (Table 2), and it is still distant from the complete reverse by an additional 6 occasions of jumping over. This means that the sum of the recovery distance and P is always a constant, Pmax.

ABCDE → CDABE → EDCBA
Items to jump over: 4 + 6 = 10 = Pmax

Therefore recovery distance can be calculated by tau. Figure 2 already includes the column of Distance by tau.

Recovery distance = (1 − τ) × n × (n − 1)/4    (9)
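
A short, self-contained check of Eq. (9) in Python (illustrative only):

    from itertools import combinations

    def recovery_distance(correct, answer):
        rank = {x: i for i, x in enumerate(correct)}
        n = len(correct)
        p = sum(1 for a, b in combinations(answer, 2) if rank[a] < rank[b])   # P of Eq. (1)
        tau = 4 * p / (n * (n - 1)) - 1
        return (1 - tau) * n * (n - 1) / 4

    print(round(recovery_distance("ABCDE", "CDABE"), 2))   # 4.0: C and D each jump over A and B
    print(round(recovery_distance("ABCDE", "EDCBA"), 2))   # 10.0 = Pmax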

6. Linearity matrix

Despite the improved accessibility, a major problem with the Excel calculation is its consumption of columns. If we follow the layout in Figure 2, the number of columns required soon reaches the limit of 16,384 as the size of the sequence increases, which makes n = 12 the largest manageable size. Table 3 shows the total number of columns required for n = 4 to n = 13, including the columns for calculating recovery distance by Kendall's tau.

n   4    5     6     7     8     9      10     11     12      13
C   54   102   198   390   774   1,542  3,078  6,150  12,294  24,582

Table 3.

Total number of columns (C) required by MRS + Dist.

Yet another idea for representing the mechanism of MRS is to make use of a matrix. Table 4 shows the framework of the matrix (Linearity Matrix = LM) in which the relationships between elements are indicated. The value '1' indicates that the row element is correctly followed by the column element; the value is '0' when the relative order is disrupted. In Table 5, for the answer CDABE, the pairs C-A, C-B, D-A and D-B are in the wrong order. The sum of the values is the Linearity Matrix score, representing the wellformedness of the answer sequence.

     B   C   D   E
A    1   1   1   1
B        1   1   1
C            1   1
D                1
E

LM = 10

Table 4.

Linearity matrix for correct answer ABCDE.

     D   A   B   E
C    1   0   0   1
D        0   0   1
A            1   1
B                1
E

LM = 6

Table 5.

Linearity matrix for partially correct answer CDABE.
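
A compact Python sketch of the LM construction (the chapter's own implementation is an Excel sheet; this version is only illustrative) returns the cell values, the LM score, and the count of zero cells, whose interpretation is discussed below:

    def linearity_matrix(correct, answer):
        # cell (i, j), i < j: 1 when the i-th and j-th elements of the answer
        # stand in the correct relative order, otherwise 0
        rank = {x: i for i, x in enumerate(correct)}
        n = len(answer)
        cells = {}
        for i in range(n):
            for j in range(i + 1, n):
                cells[(answer[i], answer[j])] = 1 if rank[answer[i]] < rank[answer[j]] else 0
        lm = sum(cells.values())
        zeros = list(cells.values()).count(0)
        return cells, lm, zeros

    cells, lm, zeros = linearity_matrix("ABCDE", "CDABE")
    print(lm, zeros)                                # 6 and 4, as in Table 5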

The greatest advantage of LM is its efficiency in column consumption. When transplanted to Excel, the core part of LM requires n × (n − 1)/2 columns, resulting in a small total number of columns (Table 6). Whereas MRS + Dist by Excel would in theory need 3,080,634 columns when n = 20, LM requires only 214.

n   4    5    6    7    8    9    10   11   12   13
C   14   19   25   32   40   49   59   70   82   95

Table 6.

Total number of columns (C) required by LM.

An additional advantage of LM is that it can yield the recovery distance by counting the number of zero values. Since a pair marked '1' indicates an ascending run, the sum of the values (= LM) is equivalent to P in Kendall's tau; consequently, the count of zero values represents the recovery distance. In Table 5, either C and D have to jump over A and B, or A and B have to jump over C and D. In either case, the disruption is resolved by removing the elements with zero values.

Furthermore, since zero marks are caused by elements in incorrect positions, these disruptors are the elements subject to displacement if the entire sequence is to be corrected. Without these disruptors, the remaining elements are all arranged in ascending order in any combination. Thus we can identify the elements of an MRS by finding a minimal number of steps that remove the disruptors. In Table 7, we can get B and E as an MRS by removing D, C and A; or B and C by removing D, E and A; or D and E by removing B, C and A. These alternative MRSs are obtainable through different procedures, but the count of removal steps stays the same. Tables 8–10 show the process for the first case.

     B   E   C   A
D    0   1   0   0
B        1   1   0
E            0   0
C                0
A

LM = 3, Distance = 7

Table 7.

Linearity matrix for partially correct answer DBECA.

     B   E   C
D    0   1   0
B        1   1
E            0
C
A

LM = 3, Distance = 3

Table 8.

Linearity matrix after removing disruptor A (step 1).

     E   C
D
B    1   1
E        0
C
A

LM = 2, Distance = 1

Table 9.

Linearity matrix after removing disruptor D (step 2).

     E
D
B    1
E
C
A

LM = 1, Distance = 0

Table 10.

Linearity matrix after removing disruptor C (step 3).

The optimal strategy for removing the disruptors on Excel has not yet been found, but in theory LM provides information about both MRS and recovery distance. By definition, LM counts all pairs in ascending order, whereas MRS picks out the elements that can form the longest sequence. Therefore, in the above sample, DE, BE, and BC are all counted for the LM score (= 3), but only one of them constitutes an MRS. Similarly, CBAED also has an MRS score of 1 (realised as CE, CD, BE, BD, AE, or AD), but LM = 6. The different LM scores of two sequences bearing the same MRS score indicate that their internal structures are different.
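
No Excel procedure for choosing the removal order is offered here, but as one possible approach (not the author's), the following sketch takes a single longest ascending subsequence as the surviving MRS and treats the remaining elements as disruptors, reproducing the end state of Tables 7–10:

    def one_mrs_and_disruptors(correct, answer):
        # keep one longest ascending subsequence; everything else is a disruptor
        rank = {x: i for i, x in enumerate(correct)}
        seq = [rank[x] for x in answer]
        n = len(seq)
        best, prev = [1] * n, [-1] * n
        for i in range(n):
            for j in range(i):
                if seq[j] < seq[i] and best[j] + 1 > best[i]:
                    best[i], prev[i] = best[j] + 1, j
        i = max(range(n), key=lambda k: best[k])
        keep = []
        while i != -1:
            keep.append(answer[i])
            i = prev[i]
        keep.reverse()
        disruptors = [x for x in answer if x not in keep]
        return keep, disruptors

    print(one_mrs_and_disruptors("ABCDE", "DBECA"))
    # (['D', 'E'], ['B', 'C', 'A']) -- one of the equally short solutions listed above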

As another potential extension, LM can vary the weights assigned to ascending pairs. We could give larger weights to closer pairs and smaller weights to remoter pairs (Table 11 for the correct sequence and Table 12 for a sample answer).

     B   C   D   E
A    4   3   2   1
B        4   3   2
C            4   3
D                4
E

Sum = 30

Table 11.

Linearity matrix for correct answer (n = 5) with gradient weights.

     D   A   B   E
C    4   0   0   1
D        0   0   2
A            4   3
B                4
E

Sum = 18

Table 12.

Linearity matrix for partially correct answer (n = 5) with gradient weights.
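
Reading the weights in Tables 11 and 12 as n minus the separation of the two elements within the answer (so that adjacent pairs weigh 4 and the remotest pair weighs 1 when n = 5), a gradient-LM sketch might look as follows (a paraphrase of the tables, not the author's Excel formulas):

    def gradient_lm(correct, answer):
        # an ascending pair scores n minus the distance between its two
        # elements in the answer, so that closer pairs weigh more
        rank = {x: i for i, x in enumerate(correct)}
        n = len(answer)
        score = 0
        for i in range(n):
            for j in range(i + 1, n):
                if rank[answer[i]] < rank[answer[j]]:
                    score += n - (j - i)
        return score

    print(gradient_lm("ABCDE", "ABCDE"))            # 30, as in Table 11
    print(gradient_lm("ABCDE", "CDABE"))            # 18, as in Table 12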

Figure 3 shows the correlation between the scores of gradient LM and those of MRS + Dist. The data were taken from a test of reading comprehension in English as a foreign language for Japanese university students (N = 149). They were asked to reorder eight descriptions of events after watching a video of an expository story. The correlation r = 0.950 and the coefficient of determination R² = 0.902 suggest that LM is a highly reliable alternative to MRS + Dist.

Figure 3.

Plot of scores by gradient LM and MRS + Dist. (The scores are standardised between 0 and 1. The oval represents a density ellipse of probability 0.90).

For a subset of the test-takers (N = 31), internal consistency was compared among eight measurement methods, the reordering task being part of a larger reading comprehension test that included short-answer questions. Table 13 indicates that the LM methods are stable and better capture test-taker performance.

Binary   Exact   Adjacent   MRS     tau     MRS + Dist   LM flat   LM gradient
0.273    0.534   0.479      0.624   0.559   0.560        0.675     0.670

Table 13.

Alpha coefficients of eight measurement methods.

LM is also quicker than MRS in processing a large amount of data. Table 14 shows the processing time taken by four methods when n = 10. The data (size = 1000) were randomly sampled from all permutations of 10 elements. The programmes for the two versions of LM were written in Xojo, the same programming language used for MRS. Distance in MRS + Dist used Kendall's tau formula. Each value is an average of ten trials.

Machine                    MRS     MRS + Dist   LM flat   LM gradient
MacBook Pro (Note 4)       3.90    3.80         0.50      0.60
Mac Pro (Note 5)           38.70   107.10       64.90     62.50
HP250 (Windows) (Note 6)   19.45   15.56        14.45     14.45

Table 14.

Processing time of four measurement methods (s).

It should also be noted that with LM the processing time did not deteriorate as the number of elements increased. Figure 4 summarises the average processing time of the four measurement methods for sequence lengths of n = 3 to n = 12.

Figure 4.

Processing time (s) by sequence length.


7. Applications and limitations

So far we have seen theoretical considerations of evaluating item reordering, apart from the analyses of correspondence among methods based on actual test results. When constructing an evaluation scheme in reality, however, various non-theoretical factors come into play. Take an example from Alderson et al.'s reordering question called the 'Compaq task' [3]. Text 2 is the original; Text 3 is what they regarded as partially correct; Text 4 is an alternatively misplaced sequence of my own invention. Item codes are rearranged for convenience.

(A) A technician at Compaq Computers told of a frantic call he received on the helpline.
(B) It was from a woman whose new computer simply would not work.
(C) She said she'd taken the computer out of the box, plugged it in, and sat there for 20 minutes waiting for something to happen.
(D) The tech guy asked her what happened when she pressed the power switch.
(E) The woman replied, 'What power switch?'

Text 2. Correct sequence.

(A) Same as Text 2
(B) Same as Text 2
(D) The tech guy asked her what happened when she pressed the power switch.
(E) The woman replied, 'What power switch?'
(C) She said she'd taken the computer out of the box, plugged it in, and sat there for 20 minutes waiting for something to happen.

Text 3. Partially correct sequence.

(A) Same as Text 2
(B) Same as Text 2
(D) The tech guy asked her what happened when she pressed the power switch.
(C) She said she'd taken the computer out of the box, plugged it in, and sat there for 20 minutes waiting for something to happen.
(E) The woman replied, 'What power switch?'

Text 4. Incorrect sequence.

Both Text 3 and Text 4 differ from the correct sequence by one dislocation of statement (C). However, while Text 4 is completely unacceptable, Text 3 sounds much more acceptable, if not perfectly so. The past perfect form in (C) of Text 3 refers back to a point that occurred before (D). The one dubious element is that the tech guy's question in (D) is slightly too specific, which could disrupt the natural flow of discourse. More serious is the fact that Text 4 is much less acceptable than Text 3, even though the recovery distance of Text 4 is shorter than that of Text 3. This means that recovery distance alone is not necessarily a predictor of penalty.

This type of task may be called an a priori (or jigsaw) task: test-takers must read the fragments at sight and reconstruct the original passage. Because the fragments contain many linguistic clues such as tense, reference words, and definiteness markers, test-takers can use these clues to connect the fragments. Yet another task type (the a posteriori task) requires test-takers to read or hear the entire passage first and then reconstruct the outline by arranging descriptions in the correct order. Alderson et al.'s 'Queen task' is of this type. In fact, they admit that the a posteriori type might be more appropriate as a measurement tool of reading comprehension (p. 442). Text 6, based on the intact Text 5 [22], is another sample of this type, in which linguistic clues are neutralised as much as possible.

New neighbours

Mr and Mrs Smith married thirty years ago, and they have lived in the same house since then. Mr Smith goes to work at eight o’clock every morning, and he gets home at half past seven every evening, from Monday to Friday.

There are quite a lot of houses in their street, and most of the neighbours are nice. But the old lady in the house opposite Mr and Mrs Smith died, and after a few weeks, a young man and woman came to live in it.

Mrs Smith watched them for a few days from her window and then she said to her husband, 'Bill, the man in that house opposite always kisses his wife when he leaves in the morning, and he kisses her again when he comes home in the evening. Why don't you do that too?'

‘Well,’ Mr Smith answered, ‘I don’t know her very well yet.’

Text 5. Sample original passage.

(1) Mr Smith was a serious man.
(2) An old lady died.
(3) A young couple moved in.
(4) Mrs Smith made repeated observations.
(5) Mrs Smith requested a new action.

Text 6. Outline items for a posteriori reordering.

In this paper, we examined the nature of MRS and LM. Both of them are still at an incubation stage in language testing as well as in other psychometric measurements. The differential weighting of ascending pairs in gradient LM is a proposed model without empirical evidence. There might be clusters of items that should be fixed together: alternating exchanges of talk in a dialogue (as in the Compaq task) are considered an example of high adhesion, whereas some kinds of discourse order may not be as adhesive. Nevertheless, it is meaningful to attempt the application of various measurement methods and to validate psychometric as well as semantic connectivity. For example, flat LM might be suitable for a task of recollecting historical events, because reference to the chronological order of events is relevant to all (or most) pairs. When reconstructing a story, in contrast, MRS or gradient LM might be a better tool, because local connections are considered more important than remote connections, and the wellformedness of the story depends on how much the completed sequence of items looks like a string of stories. Finally, describing MRS by a matrix is space- and time-saving. LM is like a ripple in a pond: if you observe the wave on the shore, you can detect where the stone was cast.

Figure 5.

Comparison of AC and MRS ratings.

Alpha        MRS     MRS + Dist   LM
0.001        0.524   0.368        0.268
0.0001       0.490   0.343        0.253
0.000001     0.416   0.288        0.218
0.00000001   0.350   0.241        0.186

Table 15.

Correlation of AC with MRS, MRS + Dist, and LM. (A perfect sequence is excluded as an outlier).


Acknowledgments

The author expresses his deep gratitude to Dokkyo University Information Science Research Institute and to the Multivariate Study Group at SAS Institute Japan for their insightful discussions and suggestions.


Conflict of interest

The author declares no conflict of interest.


Notes

  1. The correlation of AC and MRS is 0.524, but 89% of AC scores are smaller than 0.1 while 51% of MRS scores are between 0.5 and 0.6 (Figure 5, created by JMP [23]). See Section 2 for MRS.

  2. When N = 5 (i.e., 5-element sequence), AC is severely affected by the value of α (Table 15). See Sections 2 and 3 for MRS, MRS + Dist, and LM.

  3. While AC = 0.050, MRS + Dist. = 0.350 and LM = 0.667. See Sections 2 and 3 for MRS + Dist and LM.

  4. MacBook Pro (8-core Apple M1, 16GB)/OS11.4/Xojo 2021r1

  5. Mac Pro (2 × 2.4GHz 6-core Intel Xeon, 25GB, 1.3 MHz DDR3)/OS10.14/Xojo 2017r2.1

  6. HP250G7–122 (Intel Core i5-8565U, 1.6GHz, 8GB RAM)/Windows10/Xojo 2021r1

References

  1. Alderson JC, Clapham C, Wall D. Language Test Construction and Evaluation. Cambridge: Cambridge University Press; 1995. pp. 52-53
  2. Alderson JC. Assessing Reading. Cambridge: Cambridge University Press; 2000. pp. 219-221
  3. Alderson JC, Percsich R, Szabo G. Sequencing as an item type. Language Testing. 2000;17(4):423-447
  4. National Centre for University Entrance Examinations. Heisei 25 Nendo Honshiken no Mondai [Examination Questions for Academic Year 2013]. Available from: http://www.dnc.ac.jp/sp/data/shiken_jouhou/h25/jisshikekka/ [Accessed: 08 June, 2019]
  5. Kendall MG. A new measure of rank correlation. Biometrika. 1938;30(1–2):81-93. DOI: 10.1093/biomet/30.1-2.81 [Accessed: 19 January, 2022]
  6. Birch A, Osborne M. Reordering metrics for MT. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, United States: Association for Computational Linguistics; 2011. pp. 1027-1035
  7. Dlougach J, Galinskaya I. Building a reordering system using tree-to-string hierarchical model. Proceedings of COLING 2012; 2013. Available from: https://arxiv.org/abs/1302.3057 [Accessed: 19 January, 2022]
  8. Zechner K. Automatic summarization of open-domain multiparty dialogues in diverse genres. Computational Linguistics. 2002;28(4):447-484
  9. Papineni K, Roukos S, Ward T, Zhu W-J. Bleu: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). Philadelphia: ACL; 2002. pp. 311-318
  10. Bollegala D, Okazaki N, Ishizuka M. A bottom-up approach to sentence ordering for multi-document summarization. Information Processing and Management. 2010;46:89-109
  11. Lapata M. Automatic evaluation of information ordering: Kendall's tau. Computational Linguistics. 2006;32(4):471-484
  12. Amma K. Partial scoring of sequencing tasks. In: Proceedings of the International Meeting of the Psychometrics Society (IMPS 2007); 9–13 July, 2007. Tokyo, Japan: IMPS; 2007
  13. Amma K. Appraisal of partial scoring in sequencing tasks. In: Proceedings of the JACET 46th Annual Convention (the Japan Association of College English Teachers); 6–8 September 2007. Hiroshima, Japan: The Japan Association of College English Teachers; 2007. pp. 108-109
  14. Amma K. Seijo mondai no bubun saitenho^ to sono programming [Partial scoring of sequencing problems and its programming]. Dokkyo Journal of Language Learning and Teaching (Dokkyo University Research Institute of Foreign Language Teaching). 2010;28:1-29
  15. Amma K. Comparison of partial scoring methods in sequencing tasks with reference to internal reliability. In: Proceedings of the JACET 49th Annual Convention (the Japan Association of College English Teachers); 7–9 September 2010. Miyagi, Japan: The Japan Association of College English Teachers; 2010. pp. 160-161
  16. Levenshtein VI. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics-Doklady. 1966;10(8):707-710
  17. Amma K. Partial scoring of sequencing tasks with distance penalty. In: Abstracts of ALTAANZ Conference 2016 (the Association for Language Testing and Assessment of Australia and New Zealand); 17–19 November 2016. Auckland, New Zealand: The Association for Language Testing and Assessment of Australia and New Zealand; 2016. p. 23
  18. Amma K. Partial scoring of reordering tasks with recovery distance as penalty. Journal of Informatics (Dokkyo University Information Science Research Institute). 2018;7:5-23. J-GLOBAL ID 201802283386410791; Reference number 18A0455950
  19. Xojo. Xojo, Inc. Available from: https://www.xojo.com/ [Accessed: 19 January, 2022]
  20. Amma K. Partial scoring of reordering tasks: Maximal relative sequence by Excel. In: Proceedings of 2019 IEEE 2nd International Conference on Information and Computer Technologies (ICICT); 14–17 March 2019. Hawai‘i, USA: ICICT. pp. 19-24. DOI: 10.1109/INFOCT.2019.8711372. ISBN: 978-1-7281-3322-5. Available from: https://ieeexplore.ieee.org/document/8711372 [Accessed: 19 January, 2022]
  21. Amma K. Partial scoring of reordering tasks revisited: Linearity matrix by Excel. In: Proceedings of 2020 IEEE 3rd International Conference on Information and Computer Technologies (ICICT); 9–12 March 2020. Silicon Valley, USA: ICICT. pp. 1-6. DOI: 10.1109/ICICT50521.2020.00008. ISBN: 978-1-7281-7283-5. Available from: https://ieeexplore.ieee.org/document/9092126, https://conferences.computer.org/icict/2020/pdfs/ICICT2020-sQZ4BHZN9WMCBMwB1asUZ/728300a001/728300a001.pdf [Accessed: 19 January, 2022]
  22. Hill LA. Elementary Steps to Understanding. Oxford: Oxford University Press; 1980. p. 30
  23. JMP (version 13.0). Cary, NC: SAS Institute. Available from: https://www.jmp.com/ [Accessed: 07 April, 2022]
